Meta executive denies hyping up Llama 4 benchmark scores – but what can users expect from the new models?
Meta has released its latest AI model, Llama 4, using a "mixture of experts" architecture
A senior figure at Meta has denied claims that the tech giant boosted performance metrics for its new Llama 4 AI model range following rumors online.
Meta’s VP of generative AI, Ahmad Al-Dahle, took to X to address the issue, saying the claims are “simply not true”. The rumors centered on claims that Meta trained its new Llama 4 Maverick and Scout models on “test sets”.
These data sets are used to evaluate model performance after training. Training on them would inflate benchmark scores, making the models appear more capable than they are relative to rival options.
This was further fueled by posts online that the Maverick and Scout models have performed below expectations, particularly when used with different cloud service providers.
Al-Dahle admitted that some early users have experienced “mixed quality”, but said the company expects it will take time for public implementations to be properly tuned and for wrinkles to be ironed out.
“We've also heard claims that we trained on test sets -- that's simply not true and we would never do that,” Al-Dahle said. “Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
“Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.”
The speculation over model performance juicing may have put a dampener on the highly anticipated launch of the Llama 4 model range, which is anchored by a massive "teacher" model called Behemoth.
The new models use a "mixture of experts" architecture, paired with the teacher model, which is designed to overcome scaling challenges such as the cost of simply building ever-bigger models.
In this instance, Behemoth has 288 billion active parameters, two trillion total parameters, and 16 experts; it serves as a teacher model whose knowledge is distilled into the two smaller models, Maverick and Scout.
Controversy aside, Meta appears confident that the new model range will place it in good stead to compete with industry rivals - so what can users expect?
Everything you need to know about the new Llama 4 models
Llama 4 Maverick, to give the model its full name, has 17 billion active parameters, 400 billion total parameters, and 128 experts.
That combination tops competitors GPT-4o and Gemini 2.0 Flash, Meta claimed, while offering comparable results to DeepSeek v3 with fewer active parameters for a "best in class" performance to cost ratio.
Meanwhile, Llama 4 Scout features 17 billion active parameters, 109 billion total parameters, and 16 experts. It fits onto a single Nvidia H100 GPU, meaning it can be run without top-end hardware.
Meta said the Llama lineup will let developers build "more personalized multimodal experiences", noting that it's continuing to offer access to its weights — making it more open than rivals, though some have questioned if that counts as fully open source.
Scout and Maverick are available on Llama.com and Hugging Face, and Llama 4 will power Meta AI products including those in WhatsApp, Messenger, and Instagram. Llama 4 Behemoth isn't being released yet as it's still in training.
"As more people continue to use artificial intelligence to enhance their daily lives, it’s important that the leading models and systems are openly available so everyone can build the future of personalized experiences," Meta said in a blog post.
"We’re introducing Llama 4 Scout and Llama 4 Maverick, the first open-weight natively multimodal models with unprecedented context length support and our first built using a mixture-of-experts (MoE) architecture.
"We’re also previewing Llama 4 Behemoth, one of the smartest LLMs in the world and our most powerful yet to serve as a teacher for our new models."
Rise of the expert models
Meta is attempting to pair the best of scaling massive large language models (LLMs) with more specific expertise that can be run using smaller systems.
The two smaller models, Llama 4 Scout and Llama 4 Maverick, both use 17 billion active parameters. Pairing that with so-called experts - specialised neural networks that are activated only when necessary - improves the models without scaling up compute costs.
This is achieved by activating only the parts of the network needed to answer a query and leaving the rest dormant.
For example, a math question has no need of the systems for science or language. Scout has 16 experts, Maverick features 128, and Behemoth also uses a system of 16 experts.
"Our new Llama 4 models are our first models that use a mixture of experts (MoE) architecture. In MoE models, a single token activates only a fraction of the total parameters," the blog post noted.
"MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model."
Meta explained that for Maverick, a token is sent to a "shared expert" as well as to one of the 128 routed experts.
"As a result, while all parameters are stored in memory, only a subset of the total parameters are activated while serving these models," the post said.
"This improves inference efficiency by lowering model serving costs and latency—Llama 4 Maverick can be run on a single NVIDIA H100 DGX host for easy deployment, or with distributed inference for maximum efficiency."
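Conceptually, the routing scheme Meta describes - every token goes to one always-on shared expert plus one of the routed experts - can be sketched in a few lines of Python. This is an illustrative toy, not Meta's implementation: the expert and router functions below are hypothetical stand-ins for the learned feed-forward layers a real MoE model would use.

```python
# Toy sketch of top-1 mixture-of-experts routing with a shared expert.
# Everything here is hypothetical and simplified for illustration.

def expert(weight):
    # Each "expert" is a toy feed-forward step: scale the token vector.
    return lambda token: [weight * x for x in token]

# One always-on shared expert plus a pool of routed experts.
shared_expert = expert(0.5)
routed_experts = [expert(w) for w in (1.0, 2.0, 3.0, 4.0)]

def router_scores(token):
    # A real router is a learned linear layer; here we fake a score
    # per expert from the token's mean activation.
    mean = sum(token) / len(token)
    return [mean * (i + 1) for i in range(len(routed_experts))]

def moe_layer(token):
    # Route the token to exactly one expert (top-1), so only a
    # fraction of the total parameters is active for this token.
    scores = router_scores(token)
    top = scores.index(max(scores))
    routed_out = routed_experts[top](token)
    shared_out = shared_expert(token)
    # Combine the shared and routed outputs element-wise.
    return [a + b for a, b in zip(shared_out, routed_out)]

print(moe_layer([1.0, 2.0, 3.0]))  # prints [4.5, 9.0, 13.5]
```

The key property is visible in `moe_layer`: all four routed experts exist in memory, but only one runs per token, which is why MoE models can carry huge total parameter counts while keeping per-token compute (and serving cost) low.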
Benchmarks
Meta noted that Maverick would cost between 19 cents and 49 cents per one million input and output tokens, roughly in line with Gemini 2.0 Flash and DeepSeek v3.1, and significantly cheaper than GPT-4o.
At the same time, it claimed to top those rivals on most benchmarks, including image reasoning, coding and language, though DeepSeek v3.1 did pip it on one reasoning benchmark and on LiveCodeBench.
A similar pattern held for Scout, which topped rivals on selected benchmarks.
Benchmarks aside, Meta claimed that Scout offers an "industry leading" context window of 10 million tokens, but AI researcher Simon Willison noted in a blog post that it's not currently possible to actually get more than a fraction of that token window running.
Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.
Nicole is the author of a book about the history of technology, The Long History of the Future.
