Methodology
How battles are run, how votes are counted, and what the data actually means.
1 How battles are built
A single prompt is sent to all five models — Claude, ChatGPT, Gemini, Grok, and Perplexity — at the same time using the same parameters:
- Temperature: 0.8 (same creative range for everyone)
- Max tokens: 400 with a 120-word target (equal length ceiling)
- Identical minimal system prompt — answer directly, never state your name or version
- Current flagship model for each provider — the version standard users actually get
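The shared settings above can be pictured as a single payload copied to every provider. This is a minimal sketch, not the site's actual code: the field names, the `build_requests` helper, and the exact system prompt wording are assumptions made for illustration. Only the provider names and the numeric values come from the list above.

```python
from dataclasses import dataclass

# Provider names come from the text; everything else here is hypothetical.
PROVIDERS = ["Claude", "ChatGPT", "Gemini", "Grok", "Perplexity"]


@dataclass(frozen=True)
class GenerationParams:
    temperature: float = 0.8
    max_tokens: int = 400
    target_words: int = 120
    # Paraphrase of the minimal system prompt; the real wording is not published.
    system_prompt: str = "Answer directly. Never state your name or version."


def build_requests(prompt: str, params: GenerationParams = GenerationParams()) -> dict:
    """Return one request payload per provider, identical for all of them."""
    payload = {
        "prompt": prompt,
        "system": params.system_prompt,
        "temperature": params.temperature,
        "max_tokens": params.max_tokens,
        "target_words": params.target_words,
    }
    # Copy the payload so each provider gets its own dict with equal contents.
    return {provider: dict(payload) for provider in PROVIDERS}
```

Building every request from one frozen parameter object makes it hard for a single provider to drift onto different settings by accident.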
Responses are stored once and never modified. Every voter sees the exact same five outputs. There is no A/B testing, so no voter ever sees a different variant.
2 Display order is shuffled per battle — shared by everyone
Position bias is real: people pick whichever option appears first more often than chance would predict. To counter this, the display order is shuffled with a seeded shuffle tied to the battle itself, so no model owns a position across battles. Crucially, the shuffle is the same for every visitor: today's puzzle is identical for everyone, which is what makes scores comparable and shareable.
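One way to get a shuffle that differs between battles but is identical for every visitor is to seed a random generator from the battle's identifier. The sketch below assumes a string `battle_id` and opaque response IDs. Both names, and the choice of SHA-256 to derive the seed, are illustrative and not confirmed details of the site.

```python
import hashlib
import random


def display_order(battle_id: str, response_ids: list) -> list:
    """Return the response IDs in an order fixed entirely by the battle ID."""
    # Derive a stable integer seed. Python's built-in hash() is salted per
    # process for strings, so it would not give the same order on every server.
    digest = hashlib.sha256(battle_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # Start from a canonical order so the result does not depend on how the
    # caller happened to list the responses.
    order = sorted(response_ids)
    random.Random(seed).shuffle(order)
    return order
```

Because the seed depends only on the battle, two visitors loading the same battle get the same order, while a different battle ID produces an independently shuffled order.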
The model identity behind each response is fixed in our database and stripped before anything is sent to your browser — the answer key never leaves the server until you lock in.
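Keeping the answer key on the server can be sketched as two separate views of the same stored battle: a public one without model names, and a reveal that is only produced after the vote is locked in. The record layout and the function names `public_payload` and `reveal` are hypothetical.

```python
from typing import Optional


def public_payload(battle: dict) -> dict:
    """Build the browser-facing view of a battle with model names removed."""
    return {
        "battle_id": battle["battle_id"],
        "prompt": battle["prompt"],
        # Only the opaque response ID and the text are exposed.
        "responses": [
            {"response_id": r["response_id"], "text": r["text"]}
            for r in battle["responses"]
        ],
    }


def reveal(battle: dict, has_locked_in: bool) -> Optional[dict]:
    """Return the response_id -> model mapping only after the voter locks in."""
    if not has_locked_in:
        return None
    return {r["response_id"]: r["model"] for r in battle["responses"]}
```

Building the public view from an explicit list of allowed fields, instead of deleting the `model` field from a copy, means a field added to the stored record later does not leak by default.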
3 One vote per person per battle
Votes are deduplicated by browser fingerprint — a random ID generated on first visit and stored in your browser's localStorage. No account, no email, no cookies.
The fingerprint is never sent to any third party. It exists solely to prevent the same person from voting twice on the same battle.
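The one-vote rule amounts to keying each vote on the pair of battle and fingerprint and refusing a second write for the same pair. The in-memory `VoteStore` class below is a hypothetical stand-in for whatever storage the site actually uses.

```python
class VoteStore:
    """Accepts at most one vote per (battle, fingerprint) pair."""

    def __init__(self) -> None:
        # Maps (battle_id, fingerprint) -> chosen response or "all bad".
        self._votes = {}

    def cast(self, battle_id: str, fingerprint: str, choice: str) -> bool:
        """Record the vote and return True, or return False for a duplicate."""
        key = (battle_id, fingerprint)
        if key in self._votes:
            return False  # same browser already voted on this battle
        self._votes[key] = choice
        return True

    def count(self, battle_id: str) -> int:
        """Number of accepted votes for one battle."""
        return sum(1 for (b, _fp) in self._votes if b == battle_id)
```

The same fingerprint can still vote on other battles, since the key includes the battle ID.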
Coordinated voting (e.g. an AI company's employees all voting for their own model) is a real risk for any open platform. Our current mitigation is fingerprint deduplication. We publish the raw vote counts so anomalies are visible.
4 How win rate and vote share are calculated
Win rate
The model with the most votes in a battle wins that battle. Ties don't count as a win for anyone. Win rate = battles won ÷ battles appeared in. Ranges from 0% to 100%.
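As a worked illustration of the win-rate rule, the sketch below takes one dictionary of vote counts per battle and applies the tie rule described above. The input shape and the function name `win_rates` are assumptions for the example.

```python
from collections import defaultdict


def win_rates(battles: list) -> dict:
    """Compute battles won / battles appeared in for each model.

    `battles` holds one dict per battle, mapping model name -> vote count.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for counts in battles:
        if not counts:
            continue  # nothing to score in an empty battle
        for model in counts:
            appearances[model] += 1
        top = max(counts.values())
        leaders = [m for m, v in counts.items() if v == top]
        if len(leaders) == 1:  # a tie counts as a win for no one
            wins[leaders[0]] += 1
    return {m: wins[m] / appearances[m] for m in appearances}
```

For example, with three battles where model A wins one, ties one, and loses one, A's win rate is 1/3, because the tied battle still counts as an appearance.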
Vote share
Votes received by a model ÷ total votes cast in battles featuring that model. Measures how often people actively chose it, not just whether it won the battle.
"All bad" votes
Recorded and published but excluded from both metrics. A high "all bad" rate in a category is a meaningful signal in its own right.
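Vote share and the "all bad" rate can be sketched together. This example assumes that "excluded" means "all bad" votes are dropped from the vote-share denominator, which is one reading of the rule above and not a confirmed detail. The `ALL_BAD` key and the function names are hypothetical.

```python
from collections import defaultdict

ALL_BAD = "all_bad"  # hypothetical key for "all bad" votes in a battle's counts


def vote_shares(battles: list) -> dict:
    """Votes received by a model / model votes cast in battles featuring it."""
    received = defaultdict(int)
    cast = defaultdict(int)
    for counts in battles:
        # Assumption: "all bad" votes are left out of the denominator.
        model_votes = {m: v for m, v in counts.items() if m != ALL_BAD}
        total = sum(model_votes.values())
        for model, votes in model_votes.items():
            received[model] += votes
            cast[model] += total
    return {m: received[m] / cast[m] for m in cast if cast[m] > 0}


def all_bad_rate(battles: list) -> float:
    """Fraction of all votes, including "all bad", that were "all bad"."""
    bad = sum(counts.get(ALL_BAD, 0) for counts in battles)
    total = sum(sum(counts.values()) for counts in battles)
    return bad / total if total else 0.0
```

With two battles, `{"A": 6, "B": 2, "all_bad": 2}` and `{"A": 1, "B": 1}`, model A's vote share is 7/10 and the "all bad" rate is 2/12.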
5 What this measures — and what it doesn't
What it measures:
- Human preference: which output people actually liked, blind to brand
- Relative quality across a specific set of prompts and task types
- Crowd consensus at scale — wisdom of the room, not one reviewer's taste
What it doesn't measure:
- Factual accuracy — we don't fact-check outputs
- Reasoning or coding correctness — outputs are judged on presentation, not execution
- Generalised capability — results reflect the categories and prompts we've run
- Safety or alignment properties
Treat leaderboard rankings as directional signals about human preference on this task set, not as definitive capability rankings.
6 Data access
Aggregate results are public on the leaderboard. Individual vote records are anonymised (no user identity is ever stored — only fingerprint hashes and vote choices).
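A stored vote record of the kind described above might look like the following sketch. The field names are hypothetical, and the use of unsalted SHA-256 is an assumption; the text only says that fingerprint hashes are kept, not how they are computed.

```python
import hashlib


def vote_record(battle_id: str, fingerprint: str, choice: str) -> dict:
    """Build the stored record: a fingerprint hash and the choice, nothing else."""
    # Assumption: SHA-256 over the raw fingerprint; the real scheme is unknown.
    fp_hash = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    return {
        "battle_id": battle_id,
        "fingerprint_hash": fp_hash,
        "choice": choice,
    }
```

Because the hash is deterministic, it can still serve as the deduplication key while the raw browser ID stays out of the stored data.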
For research access, bulk data exports, or partnership enquiries, reach out via the FAQ contact link.