$1.7 billion for a ranking

LMArena just closed a Series A round at this insane valuation. $150 million in fresh capital. Four months after launching its commercial business.

Why are investors paying so much for a startup that essentially asks people, "Which AI answer do you prefer?"

The answer is uncomfortable.

LMArena (formerly Chatbot Arena) has become the unofficial Olympics of the AI industry. When a model lands in first place, it makes headlines in every tech news outlet. OpenAI, Google, Anthropic—they're all fighting for this ranking.

The principle is simple: you submit a request. Two anonymous models respond. You choose the better answer. Only then do you see which models they were.
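
Under the hood, those votes have to be turned into scores. The arena's published rankings come from a more elaborate statistical model with confidence intervals; the classic Elo update below is only a minimal sketch of the core idea that a single "A was better" vote nudges two ratings toward each other (the K factor and the starting ratings are illustrative):

```python
# Minimal sketch: turning one "A was better" vote into rating changes.
# The real arena uses a more elaborate statistical model with confidence
# intervals; this classic Elo update only illustrates the principle.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of A over B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Nudge both ratings after one pairwise preference vote."""
    e_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (actual_a - e_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - e_a))
    return new_a, new_b

# One user prefers model A's answer:
print(update(1448.0, 1441.0, a_won=True))  # A gains a few points, B loses them
```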

Sounds fair. But it's only fair up to a point.

🟢 The strength: Real people make decisions, not algorithms. Over 5 million users, 60 million conversations per month. That's a treasure trove of data for AI development.

🔴 The weakness: Users are predominantly developers and tech enthusiasts. The tasks are short and superficial: the typical "ChatGPT prompt" of 2023.

What is systematically missing:

🔴 Deep research with long contexts

🔴 Knowledge-intensive specialized tasks

🔴 Complex, multi-stage analyses

🔴 Document processing and legal reviews

If your use case is complex, the arena rankings will tell you little.

It gets even worse. A research paper has revealed that large providers can test model variants privately and only publish the best ones. Meta is said to have tested 27 variants before the Llama 4 release.

That's not benchmarking. That's optimization based on benchmarking.

Nevertheless, you won't be able to avoid LMArena. When your boss asks, "Which AI model is the best?", they will probably be referring to this ranking.

So here are my three concrete recommendations:

1️⃣ Use LMArena as a guide, not as a basis for decision-making. The ranking shows which models are popular for short, open prompts. It does not show which model solves your specific problem.

2️⃣ Develop your own test scenarios for your use case. If you analyze contracts, review code, or do research, then test exactly that. With real examples, not with prompts from the arena. A minimal sketch of what that can look like follows after this list.

3️⃣ Pay attention to the confidence intervals. A model with Elo 1448 ±9 and one with 1441 ±6 are statistically almost indistinguishable. The top 10 are closer together than the rankings suggest; the quick overlap check below shows why.
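
For recommendation 2, a personal test suite doesn't have to be elaborate. Here is a minimal sketch of the idea; the model names, the `ask_model` call, and the pass criteria are placeholders for whatever API and checks you actually use:

```python
# Sketch of a tiny personal benchmark: your own tasks, your own pass criteria.
# `ask_model`, the model names, and the criteria are placeholders.

test_cases = [
    {"prompt": "Summarize the termination clauses in this contract: ...",
     "must_contain": ["notice period", "termination for cause"]},
    {"prompt": "Review this function for injection vulnerabilities: ...",
     "must_contain": ["parameterized"]},
]

def ask_model(model: str, prompt: str) -> str:
    # Placeholder: wire in whatever provider API you actually use.
    return ""

def run_suite(model: str) -> float:
    """Fraction of your own test cases the model passes."""
    passed = 0
    for case in test_cases:
        answer = ask_model(model, case["prompt"]).lower()
        if all(term in answer for term in case["must_contain"]):
            passed += 1
    return passed / len(test_cases)

for model in ["model-a", "model-b"]:
    print(model, run_suite(model))
```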
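
And for recommendation 3, the arithmetic behind "almost indistinguishable" is simple. A rough check that treats the published ± values as symmetric intervals (a simplification of how the arena's confidence intervals are actually computed):

```python
# Rough check: do two published Elo scores overlap within their margins?
# Treating the +/- values as symmetric intervals is a simplification.

def intervals_overlap(score_a, margin_a, score_b, margin_b):
    return (score_a - margin_a) <= (score_b + margin_b) and \
           (score_b - margin_b) <= (score_a + margin_a)

# The example from recommendation 3: Elo 1448 +/- 9 vs. 1441 +/- 6
print(intervals_overlap(1448, 9, 1441, 6))  # True -> the gap may be noise
```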

✅ LMArena is valuable—as a first filter.

✅ LMArena is dangerous—as the only truth.

Investors aren't paying $1.7 billion for the rankings. They're paying for the millions of human preference judgments, which are perfect training data for new AI models.

You should know this the next time you quote an arena ranking.
