Elo Driven Development
The next breakthrough in AI may come from a playground where two outputs fight and a stranger picks the winner
In January 2026, Arena, the platform formerly known as LMArena, formerly known as LMSYS Chatbot Arena, announced a $150 million Series A led by Felicis and UC Investments, with participation from Andreessen Horowitz, Kleiner Perkins, and Lightspeed. That is on top of the $100 million seed round they closed less than a year earlier. A quarter billion dollars for a platform whose core mechanic is: show two outputs, hide the labels, ask a stranger which one is better.
The Core Claim
Everyone already agrees that data matters. That is not the argument.
The argument is this: the best product for collecting useful model feedback is a blind comparison interface. Not a benchmark suite. Not a multiple-choice exam. A head-to-head fight in front of a human judge who does not know which model produced which answer.
Arena proved this first for text. Over 50 million votes from real users, across 400+ models, spanning text, vision, code, image, video, and search. The platform started as a PhD research experiment at UC Berkeley. It is now, by funding and by influence, one of the most important pieces of AI infrastructure in the world.
Design Arena, built by Arcada Labs, is proving it for interface and visual generation. 2.2 million users picking winners across website design, game dev, 3D modeling, data visualization, logo, video, and more, all rated with the same Elo-based Bradley-Terry system. When Claude Opus 4.6 sits at the top of that leaderboard, it is not because Anthropic said so. It is because strangers on the internet, who did not know which model made which output, clicked on it more often.
The mechanism is the same every time:
Two answers appear.
The user picks one.
The system updates rankings.
The platform learns what people actually prefer.
That loop is the point.
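In code terms, that loop is tiny. Here is a minimal Python sketch of a single iteration, assuming a textbook Elo update; the model names, starting ratings, and K-factor are illustrative, not Arena's actual parameters:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """One blind vote comes in; both ratings move."""
    expected = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected)  # an upset moves ratings a lot, an expected win barely
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1200.0, "model_b": 1200.0}
record_vote(ratings, winner="model_a", loser="model_b")
print(ratings)  # {'model_a': 1216.0, 'model_b': 1184.0}
```

Everything else in this post sits on top of some version of that update rule.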
Why This Product Pattern Wins
Three things, and they compound.
First, brand bias disappears. If users do not know the model name, they judge output quality, not reputation. This matters more than it sounds. There is a well-documented tendency in AI evaluation for people to rate GPT-4 outputs higher when they know it is GPT-4. Remove the label and the rankings shift. Arena’s entire credibility rests on this insight.
Second, binary decisions scale. A/B choices are fast, clear, and cognitively cheap. You do not need an expert to say “this website looks better.” You need a lot of people saying it, quickly, and a good statistical model to aggregate their votes. That is exactly what Elo provides.
Third, evaluation becomes continuous. Instead of waiting for a quarterly benchmark release (which may not even reflect how people actually use models), arena platforms measure quality in real time. A model that ships an update on Tuesday has new signal on Wednesday.
This is why teams trust arena-style rankings. They are not perfect. But they track real preference in real time, which turns out to be far more useful than a static leaderboard that measures performance on a fixed test set.
The Evidence: Follow the Money
Arena’s $250 million in total funding is the most visible signal. But it is not the only one.
The entire training-data and evaluation stack is converging toward the same thesis. Scale AI, the incumbent in post-training data and RLHF, is evolving from raw labeling into workflow-level RL environments. Datacurve runs a “bounty hunter” marketplace where skilled engineers tackle complex code and data tasks for frontier model training. Mercor connects AI labs with domain experts for RLHF. Surge AI focuses on expert labeling for reasoning and long-form tasks. Turing provides large expert workforces for fine-tuning and evals. Invisible turns complex, multi-step human workflows into verifiable work traces that function as de facto training environments.
But here is the key distinction, and it is why this post is not about the training-data ecosystem. Those companies improve models by generating better data. The arena pattern improves models by generating better signal. A blind head-to-head vote does not teach a model what to say. It tells the lab which model already says it better — and by how much. That is a fundamentally different feedback mechanism: not data supply, but preference measurement. The data companies feed the training loop. The arenas close it.
The smart money is flowing into both. But the arenas are the part that scales with users rather than annotators, which is why a voting platform can raise a quarter billion dollars.
Why It Matters for Model Companies
Model labs need a reliable way to answer one practical question: did the last update make users happier, or not?
Traditional benchmarks answer a different question: did the model get better at tasks we defined in advance? The gap between those two questions is where most evaluation failure lives. A model can improve on MMLU, HumanEval, and every public benchmark while regressing on the things users actually care about: tone, helpfulness, creativity, visual coherence, code that runs on the first try.
Blind comparison products close that gap. They measure what people prefer, not what a test suite rewards. That is why every major lab now watches Arena rankings obsessively. It is why OpenAI, Google, Anthropic, and a dozen others submit models under codenames before public release. The arena is their pre-launch focus group, except the sample size is in the millions and the methodology is more rigorous than any internal eval.
But What About the Objections?
“Arena rankings are gameable.” True in theory. Model providers can optimize specifically for arena-style prompts. But in practice, the sheer volume and diversity of users makes this harder than gaming a fixed benchmark. 50 million votes from strangers is a harder target to overfit than 1,000 curated test cases.
“Human preference is noisy and inconsistent.” Also true. But Elo systems are designed precisely for this. They extract a stable signal from many noisy pairwise comparisons. That is literally what the algorithm was built to do, originally for chess, now for language models.
“Blind comparison only works for simple outputs.” This is the strongest objection. For complex, multi-step agent tasks, a simple A/B vote may not capture quality. That is exactly where players like Halluminate are building: richer evaluation environments where the “judgment” is more structured than a single click. The core pattern (human feedback, captured systematically, turned into a ranking signal) still holds. The interface just needs more surface area.
Elo Driven Development
Elo systems are useful here because they do something no static benchmark can: they turn many small pairwise decisions into a ranking that moves with user preference, continuously.
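To make “stable signal from noisy comparisons” concrete, here is a sketch of a Bradley-Terry fit, the model behind Arena-style leaderboards, run on an invented table of blind-vote counts. The counts, iteration budget, and Elo-scale conversion are illustrative assumptions, not anyone's production pipeline:

```python
import math

# wins[a][b] = times model a beat model b in blind votes (invented numbers)
wins = {
    "model_a": {"model_b": 620, "model_c": 700},
    "model_b": {"model_a": 380, "model_c": 560},
    "model_c": {"model_a": 300, "model_b": 440},
}
models = list(wins)
strength = {m: 1.0 for m in models}  # Bradley-Terry strength parameters

# Standard MM iteration: set each strength so the model's expected win count
# matches its observed win count against every opponent.
for _ in range(200):
    updated = {}
    for m in models:
        total_wins = sum(wins[m].values())
        denom = sum(
            (wins[m][o] + wins[o][m]) / (strength[m] + strength[o])
            for o in models if o != m
        )
        updated[m] = total_wins / denom
    mean = sum(updated.values()) / len(updated)  # pin the scale
    strength = {m: s / mean for m, s in updated.items()}

# Express strengths on an Elo-like scale for readability
for m in sorted(models, key=strength.get, reverse=True):
    print(m, round(400 * math.log10(strength[m]) + 1000))
```

Any individual vote in that table is noisy. The fitted strengths are not, and they tighten as votes accumulate. That is the whole trick.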
The development loop for a frontier model increasingly looks like this: train a model, ship it to an arena, watch the Elo rating, identify where it loses, retrain on the gaps, ship again, watch the Elo move. The arena is not just the evaluation layer. It is becoming the development feedback loop itself.
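The “identify where it loses” step is where arena data earns its keep. A hedged sketch of what that slicing might look like, assuming a vote log tagged by prompt category; the log format, categories, and model names are hypothetical:

```python
from collections import defaultdict

# Hypothetical blind-vote log: (prompt_category, winner, loser)
votes = [
    ("code", "our_model", "rival"),
    ("code", "our_model", "rival"),
    ("code", "rival", "our_model"),
    ("creative", "rival", "our_model"),
    ("creative", "rival", "our_model"),
    ("math", "our_model", "rival"),
]

wins, games = defaultdict(int), defaultdict(int)
for category, winner, loser in votes:
    games[category] += 1
    if winner == "our_model":
        wins[category] += 1

# The lowest win-rate buckets become the retraining targets
for category in sorted(games, key=lambda c: wins[c] / games[c]):
    print(f"{category}: {wins[category] / games[category]:.0%} win rate")
```

A real pipeline would slice by modality, language, and prompt cluster rather than three toy categories, but the shape of the loop is the same.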
This is a structural shift.
What Happens if Everyone Behaves Rationally
Start with the big labs. If you are OpenAI, Google, or Anthropic, you now have an external scoreboard that your customers, your investors, and your recruits all watch. You cannot ignore it. The rational move is to treat arena rankings as a first-class product metric, right alongside internal evals and revenue. That means submitting every model update to Arena and Design Arena before launch, building internal tooling that correlates Elo movement with architecture changes, and staffing teams whose job is to close gaps that arena data reveals. Some labs are already doing this. Within a year, all of them will be.
The second-order move is more interesting. If arena rankings drive reputation, and reputation drives enterprise contracts, then labs have an incentive to invest directly in arena infrastructure. Not to manipulate rankings, but to ensure the arenas they depend on remain credible and well-funded. Expect to see labs become strategic partners, data contributors, or even minority investors in arena platforms. Arena’s Series A investor list (a16z, Kleiner Perkins, Lightspeed) already reads like a who’s-who of AI lab backers. That is not a coincidence.
Now consider VCs. The arena pattern has a clear power-law dynamic: the platform with the most users generates the most votes, which produces the most trusted rankings, which attracts more models, which attracts more users. That flywheel favors early movers and makes the category winner-take-most per modality. If you are a VC, the rational play is to identify the emerging arena for each new modality (code, design, video, audio, agentic workflows) and fund the front-runner before the flywheel becomes self-sustaining. Arena and Design Arena are already there for text and visual generation. The next arenas to watch are for agent evaluation, audio, and domain-specific verticals like legal, medical, and finance.
For startup founders, the calculus is different. Building a general-purpose arena to compete with Arena head-on is probably a losing bet. But building a vertical arena, one that serves a specific modality, domain, or user base, is wide open. Design Arena proved this: you do not need to be Arena to matter. You need a focused judge pool, a clear modality, and enough model coverage to make the rankings useful. The founders who understand this will build arenas for niches that the general platforms cannot serve well. Think: an arena for medical summarization judged by physicians. An arena for contract analysis judged by lawyers. An arena for code review judged by senior engineers. Each is a small market on its own, but the ranking data they produce is extraordinarily valuable to any lab building models for those verticals.
The final actor is the training-data ecosystem: the Scale AIs and Datacurves of the world. If arena rankings become the scoreboard that matters, then the value of their work rises in direct proportion. Better training data leads to higher Elo. Every point of Elo translates into deals, press, and recruiting advantage for the lab. That makes the data companies direct beneficiaries of the arena economy, and it means they will increasingly price and position their services around Elo improvement as a deliverable, not just data volume.
Put it all together and you get a new equilibrium: labs compete on arenas, VCs fund the arenas and the data infrastructure around them, founders build vertical arenas for underserved modalities, and data companies sell Elo points instead of labeled rows. The arena becomes the central coordination mechanism for the entire AI improvement stack.
A quarter billion dollars for a voting interface. The market is telling you that the next breakthrough in AI may come from building a better arena.

