Social human benchmark

Methodology

How battles are run, how votes are counted, and what the data actually means.

1 How battles are built

A single prompt is sent to all five models — Claude, ChatGPT, Gemini, Grok, and Perplexity — at the same time using the same parameters:

Temperature: 0.8 (same creative range for everyone)
Max tokens: 400 with a 120-word target (equal length ceiling)
Identical minimal system prompt — answer directly, never state your name or version
Current flagship model for each provider — the version standard users actually get

Responses are stored once and never modified. Every voter sees the exact same five outputs. There is no A/B testing: no voter is ever shown a variant.
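As a rough illustration of "same parameters for everyone", the sketch below builds one request per provider from a single shared template. The class and function names are hypothetical, and the system prompt wording is an assumption paraphrased from the list above. Only the temperature, token ceiling, and provider list come from this page. No real provider API is called.

```python
from dataclasses import dataclass

# Provider list as named on this page.
PROVIDERS = ("Claude", "ChatGPT", "Gemini", "Grok", "Perplexity")

# Assumed wording: the page only says "answer directly, never state your
# name or version" and mentions a 120-word target.
SYSTEM_PROMPT = (
    "Answer directly in about 120 words. Never state your name or version."
)


@dataclass(frozen=True)  # frozen: a request cannot be altered after creation
class BattleRequest:
    provider: str
    prompt: str
    system_prompt: str = SYSTEM_PROMPT
    temperature: float = 0.8   # same creative range for everyone
    max_tokens: int = 400      # equal length ceiling


def build_requests(prompt: str) -> list[BattleRequest]:
    """Return one request per provider, differing only in the provider name."""
    return [BattleRequest(provider=p, prompt=prompt) for p in PROVIDERS]
```

Because every field except `provider` comes from the shared defaults, any difference between the five responses is down to the model rather than its settings.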

2 Display order is shuffled per battle — shared by everyone

Position bias is real: people pick whichever option appears first more often than chance would predict. To counter this, the display order is shuffled with a seeded shuffle tied to the battle itself, so no model owns a position across battles. Crucially, the shuffle is the same for every visitor: today's puzzle is identical for everyone, which is what makes scores comparable and shareable.

The model identity behind each response is fixed in our database and stripped before anything is sent to your browser. The answer key never leaves the server until you lock in your vote.
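A seeded shuffle of this kind can be sketched as follows. This is a minimal illustration, not the site's actual code: the function names, the SHA-256 seed derivation, and the A to E position labels are all assumptions. The point is that the order depends only on the battle ID, and that the payload sent to the browser carries no model names.

```python
import hashlib
import random


def display_order(battle_id: str, models: list[str]) -> list[str]:
    """Return the models in a shuffled order that depends only on battle_id."""
    # Derive an integer seed from the battle ID so every visitor gets
    # the same order for the same battle.
    digest = hashlib.sha256(battle_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Start from a canonical order so the caller's input order cannot leak in.
    order = sorted(models)
    rng.shuffle(order)
    return order


def public_payload(battle_id: str, responses: dict[str, str]) -> list[dict[str, str]]:
    """Build what the browser receives: position labels and text only."""
    order = display_order(battle_id, list(responses))
    return [
        {"label": chr(ord("A") + i), "text": responses[model]}
        for i, model in enumerate(order)
    ]
```

The mapping from label back to model stays on the server and would only be revealed after a vote is recorded.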

3 One vote per person per battle

Votes are deduplicated by browser fingerprint — a random ID generated on first visit and stored in your browser's localStorage. No account, no email, no cookies.

We do not share the fingerprint with, or transmit it to, any third party. It exists solely to prevent the same person from voting twice on the same battle.

Coordinated voting (e.g. an AI company's employees all voting for their own model) is a real risk for any open platform. Our current mitigation is fingerprint deduplication. We publish the raw vote counts so anomalies are visible.
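The deduplication rule can be sketched as a store keyed by battle and hashed fingerprint. Everything here is illustrative: `VoteStore`, `new_fingerprint`, and the use of SHA-256 are assumptions, and `uuid4` merely stands in for the random ID that the browser would generate and keep in localStorage.

```python
import hashlib
import uuid


def new_fingerprint() -> str:
    """Stand-in for the random ID the browser creates on first visit."""
    return uuid.uuid4().hex


class VoteStore:
    """In-memory sketch: at most one vote per (battle, fingerprint hash)."""

    def __init__(self) -> None:
        self._votes: dict[tuple[str, str], str] = {}

    @staticmethod
    def _hash(fingerprint: str) -> str:
        # Only the hash is kept, never the raw ID.
        return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()

    def cast(self, battle_id: str, fingerprint: str, choice: str) -> bool:
        """Record a vote; return False if this fingerprint already voted here."""
        key = (battle_id, self._hash(fingerprint))
        if key in self._votes:
            return False
        self._votes[key] = choice
        return True

    def raw_counts(self, battle_id: str) -> dict[str, int]:
        """Raw vote counts for one battle, as would be published."""
        counts: dict[str, int] = {}
        for (bid, _), choice in self._votes.items():
            if bid == battle_id:
                counts[choice] = counts.get(choice, 0) + 1
        return counts
```

Note that this only stops repeat votes from the same browser profile. Clearing localStorage or switching browsers yields a new ID, which is why publishing raw counts matters as a second check.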

4 How win rate and vote share are calculated

Win rate

The model with the most votes in a battle wins that battle. Ties don't count as a win for anyone. Win rate = battles won ÷ battles appeared in. Ranges from 0% to 100%.

Vote share

Votes received by a model ÷ total votes cast in battles featuring that model. Measures how often people actively chose it, not just whether it topped the battle.

"All bad" votes

Recorded and published but excluded from both metrics. A high "all bad" rate in a category is a meaningful signal in its own right.
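The three rules above can be combined in a short sketch. It rests on two assumptions that the definitions imply but do not spell out: a tied battle still counts as an appearance for every model in it, and "all bad" votes are dropped from the vote share denominator as well as the numerator. The function name and the `"all_bad"` key are hypothetical.

```python
from collections import defaultdict

ALL_BAD = "all_bad"  # hypothetical key for "all bad" votes


def leaderboard(battles: list[dict[str, int]]) -> dict[str, dict[str, float]]:
    """Compute win rate and vote share from per-battle vote counts."""
    wins: dict[str, int] = defaultdict(int)
    appearances: dict[str, int] = defaultdict(int)
    votes_for: dict[str, int] = defaultdict(int)
    votes_seen: dict[str, int] = defaultdict(int)

    for counts in battles:
        # "All bad" votes are excluded from both metrics.
        model_counts = {m: v for m, v in counts.items() if m != ALL_BAD}
        total = sum(model_counts.values())
        top = max(model_counts.values(), default=0)
        leaders = [m for m, v in model_counts.items() if v == top]

        for model, votes in model_counts.items():
            appearances[model] += 1
            votes_for[model] += votes
            votes_seen[model] += total

        # A tie produces no winner; a battle with zero votes has none either.
        if len(leaders) == 1 and top > 0:
            wins[leaders[0]] += 1

    return {
        m: {
            "win_rate": wins[m] / appearances[m],
            "vote_share": votes_for[m] / votes_seen[m] if votes_seen[m] else 0.0,
        }
        for m in appearances
    }
```

For example, with `[{"Claude": 6, "Grok": 3, "all_bad": 1}, {"Claude": 2, "Grok": 2}]`, Claude wins the first battle and the second is a tie, so Claude's win rate is 1 ÷ 2 = 50% and its vote share is 8 ÷ 13. The numbers are made up purely to show the arithmetic.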

5 What this measures — and what it doesn't

What it measures:

Human preference: which output people actually liked, blind to brand
Relative quality across a specific set of prompts and task types
Crowd consensus at scale — wisdom of the room, not one reviewer's taste

What it doesn't measure:

Factual accuracy — we don't fact-check outputs
Reasoning or coding correctness — outputs are judged on presentation, not execution
Generalised capability — results reflect the categories and prompts we've run
Safety or alignment properties

Treat leaderboard rankings as directional signals about human preference on this task set, not as definitive capability rankings.

6 Data access

Aggregate results are public on the leaderboard. Individual vote records are anonymised (no user identity is ever stored — only fingerprint hashes and vote choices).

For research access, bulk data exports, or partnership enquiries, reach out via the FAQ contact link.
