Llama 3 Vs Gpt 4 Meta Challenges Openai On Ai Turf

1. Magic Elevator Test

Let’s first run the magic elevator test to evaluate the logical reasoning capability of Llama 3 in comparison to GPT-4. And guess what? Llama 3 surprisingly passes the test whereas the GPT-4 model fails to provide the correct answer. This is pretty surprising since Llama 3 is only trained on 70 billion parameters whereas GPT-4 is trained on a massive 1.7 trillion parameters.
Keep in mind, we ran the test on the GPT-4 model hosted on ChatGPT (available to paid ChatGPT Plus users). It seems to be using the older GPT-4 Turbo model. We ran the same test on the recently-released GPT-4 model (gpt-4-turbo-2024-04-09) via OpenAI Playground, and it passed the test. OpenAI says that they are rolling out the latest model to ChatGPT, but perhaps it’s not available on our account yet.
Winner: Llama 3 70B, and gpt-4-turbo-2024-04-09
Note: GPT-4 loses on ChatGPT Plus

2. Calculate Drying Time

Next, we ran the classic reasoning question to test the intelligence of both models. In this test, both Llama 3 70B and GPT-4 gave the correct answer without delving into mathematics. Good job Meta!
Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

3. Find the Apple

After that, I asked another question to compare the reasoning capability of Llama 3 and GPT-4. In this test, the Llama 3 70B model comes close to giving the right answer but misses out on mentioning the box. Whereas, the GPT-4 model rightly answers that “the apples are still on the ground inside the box”. I am going to give it to GPT-4 in this round.
Winner: GPT-4 via ChatGPT Plus

4. Which is Heavier?

While the question seems quite simple, many AI models fail to get the right answer. However, in this test, both Llama 3 70B and GPT-4 gave the correct answer. That said, Llama 3 sometimes generates wrong output so keep that in mind. Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

5. Find the Position

Next, I asked a simple logical question and both models gave a correct response. It’s interesting to see a much smaller Llama 3 70B model rivaling the top-tier GPT-4 model.
Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

6. Solve a Math Problem

Next, we ran a complex math problem on both Llama 3 and GPT-4 to find which model wins this test. Here, GPT-4 passes the test with flying colors, but Llama 3 fails to come up with the right answer. It’s not surprising though. The GPT-4 model has scored great on the MATH benchmark. Keep in mind that I explicitly asked ChatGPT to not use Code Interpreter for mathematical calculations.
Winner: GPT-4 via ChatGPT Plus

7. Follow User Instructions

Following user instructions is very important for an AI model and Meta’s Llama 3 70B model excels at it. It generated all 10 sentences ending with the word “mango”. GPT-4 could only generate eight such sentences.
Winner: Llama 3 70B

8. NIAH Test

Although Llama 3 currently doesn’t have a long context window, we still did the NIAH test to check its retrieval capability. The Llama 3 70B model supports a context length of up to 8K tokens. So I placed a needle (a random statement) inside a 35K-character long text (8K tokens) and asked the model to find the information. Surprisingly, the Llama 3 70B found the text in no time. GPT-4 also had no problem finding the needle.
Of course, this is a small context, but when Meta releases a Llama 3 model with a much larger context window, I will test it again. But for now, Llama 3 shows great retrieval capability.
Winner: Llama 3 70B, and GPT-4 via ChatGPT Plus

Llama 3 vs GPT-4: The Verdict

In almost all of the tests, the Llama 3 70B model has shown impressive capabilities, be it advanced reasoning, following user instructions, or retrieval capability. Only in mathematical calculations, it lags behind the GPT-4 model. Meta says that Llama 3 has been trained on a larger coding dataset so its coding performance should also be great.
Bear in mind that we are comparing a much smaller model with the GPT-4 model. Also, Llama 3 is a dense model whereas GPT-4 is built on the MoE architecture consisting of 8x 222B models. It goes on to show that Meta has done a remarkable job with the Llama 3 family of models. When the 500B+ Llama 3 model drops in the future, it will perform even better and may beat the best AI models out there.
It’s safe to say that Llama 3 has upped the game, and by open-sourcing the model, Meta has closed the gap significantly between proprietary and open-source models. We did all these tests on an Instruct model. Fine-tuned models on Llama 3 70B would deliver exceptional performance. Apart from OpenAI, Anthropic, and Google, Meta has now officially joined the AI race.