Just days after its release, Meta’s Llama 4 Maverick AI model is embroiled in controversy after accusations it manipulated benchmark rankings on LMArena, a popular platform for human-voted AI performance tests. Internal reports reveal Meta submitted a non-public “experimental” version of Llama 4 optimized to “charm” human voters, sparking backlash from the AI community.
Key Allegations
- Bait-and-Switch Tactics: The submitted model, Llama-4-Maverick-03-26-Experimental, produced verbose, emoji-laden responses to win votes, while the public release delivers terse answers.
- Transparency Failures: LMArena criticized Meta for not disclosing the model’s customized design, calling it a breach of “fair, reproducible evaluations”.
- Stock Impact: Despite the scandal, Meta’s shares (META) rose 2% post-launch, fueled by hype around Llama 4’s multimodal capabilities and open-source appeal.
Meta’s Defense
A Meta spokesperson admitted the experimental model was “chat-optimized” but denied training on benchmark test sets:
“We’ve heard claims we trained on test sets—that’s simply not true. Variable performance stems from unstable implementations.”
Meanwhile, Llama 4 Scout (109B total parameters) and Maverick (402B total parameters) are now open-source, featuring:
- Mixture of Experts (MoE): Only 17B parameters active per token, reducing inference costs.
- Multilingual Prowess: Trained on 200+ languages, with 10× more multilingual tokens than Llama 3.
- Bias Mitigation: Meta claims Llama 4 is “dramatically more balanced” politically than predecessors.
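The MoE point above is worth unpacking: a router sends each token to a small subset of experts, so only a fraction of the model’s total parameters does work per token. The sketch below is purely illustrative (the expert count, top-k value, and function names are hypothetical, not Meta’s implementation):

```python
# Illustrative Mixture-of-Experts routing sketch.
# NUM_EXPERTS and TOP_K are hypothetical values, not Llama 4's actual config.

NUM_EXPERTS = 16   # total experts in the MoE layer (assumed)
TOP_K = 1          # experts activated per token (assumed)

def route(token_scores):
    """Return the indices of the TOP_K experts with the highest router scores."""
    ranked = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])
    return ranked[:TOP_K]

# A fake router score per expert for one token:
scores = [0.1, 0.9, 0.3, 0.5] + [0.0] * (NUM_EXPERTS - 4)
active = route(scores)
print(f"{len(active)}/{NUM_EXPERTS} experts active for this token")
```

Because only `TOP_K` experts run per token, active parameters (17B) stay far below total parameters, which is how a 400B-class model can serve queries at a fraction of the dense-model cost.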
What’s Next?
- LMArena will reevaluate rankings using the public Llama 4 model.
- LlamaCon 2025: Meta’s first AI dev conference (April 29) may address the fallout.