Nvidia Releases Nemotron 70B Model, Claims to Beat GPT-4o and Claude 3.5 Sonnet

In LMSYS’s Arena Hard benchmark, the Llama 3.1 Nemotron 70B model scores 85.0, while GPT-4o gets 79.3 and Claude 3.5 Sonnet gets 79.2. On AlpacaEval and MT-Bench too, Nvidia’s latest model does better than these proprietary models despite its smaller size. Nvidia has not released traditional ML benchmarks for this model.
Apart from that, Nvidia says that Llama 3.1 Nemotron 70B can correctly answer the strawberry question (how many r’s in strawberry?) that has stumped so many LLMs. It doesn’t use additional reasoning tokens like OpenAI o1 models or take advantage of specialized prompting to get the answer right. In my brief testing, the model got it wrong on the first try. However, when I asked the same question again, it correctly answered 3 R’s.
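For reference, the expected answer to the strawberry question is easy to check programmatically. Here is a minimal sketch in Python; the helper name `count_letter` is hypothetical, made up for illustration, and is not part of any Nvidia tooling.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word.

    Hypothetical helper for illustration only.
    """
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    # Ground truth for the question "how many r's in strawberry?"
    print(count_letter("strawberry", "r"))
```

A one-line string operation handles the count trivially. Language models typically process text as multi-character tokens rather than individual letters, which is one commonly cited reason the question trips them up.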
You can test the Llama 3.1 Nemotron 70B model on HuggingFace for free. Developers can also try the hosted inference for free at build.nvidia.com.