LLM Model Benchmarks

Most benchmarks measure what models know. We also measure how they feel.

Table columns (after Model):

Reasoning: Weighted composite of graduate-level reasoning (GPQA), broad knowledge, and frontier-difficulty benchmarks. Scale 0–100.

Coding: Weighted composite of coding benchmarks including LiveCodeBench, TerminalBench, and SciCode. Scale 0–100.

Agentic: How well the model uses tools and follows instructions. Based on PinchBench and IFBench scores. Scale 0–100.

EQ (local): EQ-Bench v3 run locally by us (22-trait rubric, scale 0–100). Measures emotional intelligence including empathy, insight, social dexterity, and warmth. Hover a score for Elo and trait breakdown. Not comparable to the public column: different rubric.

EQ (public): Score as published on the public EQ-Bench v3 leaderboard at eqbench.com (17-trait rubric, scale 0–100). A different rubric from our local column, so the two are not directly comparable and are never averaged or substituted for one another. Blank means no published result exists.

Chat: Arena Elo from blind human A/B preference tests. Higher = humans prefer chatting with this model.

Speed: Output tokens per second. Higher = faster responses.

Cost: Blended cost per 1M tokens (3:1 input-to-output ratio). Lower is better.

Context: Maximum input context window in thousands of tokens.

Model Personalities

Numbers tell you what a model can do. Traits tell you who it is. Our editorial reads are grounded in 22-dimension EQ-Bench v3 personality profiles. Trait scores are 0–20; for traits like sycophancy, green means less of it.

Highest EQ $5.63/M

GPT-5.4

The most emotionally intelligent model tested. Tied for the highest depth of insight (15.8), with the highest correctness (14.8) and exceptionally low sycophancy (3.2).

Correctness 14.8 Insight 15.8 Sycophancy 3.2

Warmest Flagship $10.00/M

Claude Opus 4.6

Highest empathy among flagships, paired with deep insight. Leads on the demonstrated-empathy trait. Premium price, premium presence.

Empathy 14.9 Insight 15.6 Warmth 13.6

Near-Opus, 40% Cheaper $6.00/M

Claude Sonnet 4.6

Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart pick when you want depth without the premium.

Empathy 14.8 Sycophancy 3.6 Subtext 15.5

Most Humanlike $1.50/M

MiMo-V2-Pro

Highest humanlike score of any model tested. Exceptional analytical depth paired with natural conversational feel. A sleeper hit at $1.50/M.

Humanlike 15.1 Analytical 18.1 Insight 15.8

Sharpest Social Reader $0.53/M

MiniMax M2.7

Highest theory of mind and subtext identification. Reads between the lines better than models 10x its price. Very low moralising.

Theory of Mind 15.1 Subtext 16.3 Moralising 5.4

Budget Pick $0.15/M

Step 3.5 Flash

Scores 69.25 on EQ — beating models that cost up to 30x more. At fifteen cents per million tokens, the best EQ-per-dollar in the field. No detailed trait breakdown available yet.

EQ 69.25 Traits pending

Safety First $1.69/M

GPT-5.4 Mini

Strongest boundary-setting and safety consciousness of any model. Lowest sycophancy overall. A firm, principled companion — not a people-pleaser.

Boundaries 15.5 Safety 15.2 Sycophancy 2.7

The Enigma $3.00/M

Grok 4.20

Decent v3 score (68.55) but the lowest EQ-Bench Elo (856): it struggles with emotional nuance tests, even though a strong Arena Elo (1491, rank 4) shows humans like chatting with it. Strong subtext reading.

Subtext 15.8 Elo 856 Conversational 10.0

The People-Pleaser FREE

Qwen3.6 Plus

Free is free. But it has the highest sycophancy (6.2) and the lowest EQ score (60.45) of the set. Most likely to tell you what you want to hear rather than what you need to hear. (Benchmark data is from its predecessor, Qwen3.5-397B.)

Sycophancy 6.2 EQ 60.45 Warmth 13.4

Individual trait scores (warmth, empathy, etc.) are 0–20 from EQ-Bench v3. EQ scores are 0–100; Elo rankings vary by benchmark.

Scoring Methodology

Reasoning Score

Weighted average of AA Intelligence Index (3x), GPQA (2.5x), MMLU-Pro (2x), HLE (1.5x), AIME 2025 (1x). Scale 0–100.

Coding Score

Weighted average of AA Coding Index (3x), LiveCodeBench (2x), TerminalBench Hard (2x), SciCode (1x). Scale 0–100.

Agentic Score

Weighted average of PinchBench Best (4x) and PinchBench Avg (2x). Scale 0–100. Measures real-world tool use and task completion.
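The three composite scores above are all plain weighted averages, so a short sketch may make the arithmetic concrete. The weights are copied from the descriptions; the dictionary keys, the function name `weighted_composite`, the choice to raise an error when a benchmark is missing, and the example inputs are assumptions made for illustration, not the site's actual implementation or real results.

```python
# Weights as listed in the score descriptions above.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "reasoning": {
        "aa_intelligence_index": 3.0,
        "gpqa": 2.5,
        "mmlu_pro": 2.0,
        "hle": 1.5,
        "aime_2025": 1.0,
    },
    "coding": {
        "aa_coding_index": 3.0,
        "livecodebench": 2.0,
        "terminalbench_hard": 2.0,
        "scicode": 1.0,
    },
    "agentic": {
        "pinchbench_best": 4.0,
        "pinchbench_avg": 2.0,
    },
}


def weighted_composite(scores, weights):
    """Return the weighted average of benchmark scores (each on a 0-100 scale).

    Assumption: every weighted benchmark must be present. How the site
    handles a missing benchmark is not stated, so this sketch raises.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical inputs for illustration only, not real benchmark results.
example_reasoning = {
    "aa_intelligence_index": 60.0,
    "gpqa": 80.0,
    "mmlu_pro": 85.0,
    "hle": 20.0,
    "aime_2025": 90.0,
}
print(weighted_composite(example_reasoning, WEIGHTS["reasoning"]))  # 67.0
```

Because the weights are divided by their own sum, the result stays on the same 0–100 scale as the inputs, and a benchmark with a 3x weight moves the composite three times as much as one with a 1x weight.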

Blended Cost

Price per 1M tokens, weighted at a 3:1 input-to-output ratio: (3 × input price + output price) / 4. Reflects typical conversational usage patterns.
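As a worked example of the formula above, here is a minimal sketch. The function name `blended_cost` and the input and output prices in the example are hypothetical, chosen only to show the arithmetic.

```python
def blended_cost(input_price, output_price):
    """Blended cost per 1M tokens at a 3:1 input-to-output ratio.

    Both prices are in USD per 1M tokens.
    """
    return (3 * input_price + output_price) / 4


# Hypothetical prices: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
print(blended_cost(3.00, 15.00))  # 6.0
```

Input tokens carry three quarters of the weight, so a model with cheap input and expensive output lands much closer to its input price than a simple average of the two would suggest.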

EQ Score

EQ-Bench v3 score (0–100) measuring emotional intelligence across empathy, insight, social dexterity, warmth, and more. Hover for Elo ranking and trait details.

Chat Score

Arena Elo rating from blind human A/B preference tests. Higher = humans prefer chatting with this model.
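To give a feel for what a rating gap means, the sketch below converts two ratings into an expected preference rate. It assumes the conventional Elo logistic curve with a 400-point scale; the source does not say which rating model or scale the arena uses, so treat this as an approximation. The function name `expected_preference` and the example ratings are hypothetical.

```python
def expected_preference(rating_a, rating_b, scale=400.0):
    """Expected share of head-to-head votes won by model A over model B.

    Assumption: the standard Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Equal ratings mean an even split.
print(expected_preference(1500, 1500))  # 0.5

# Hypothetical 100-point gap: the higher-rated model is preferred
# in roughly 64% of comparisons under this assumption.
print(round(expected_preference(1500, 1400), 2))
```

Under this assumption only the difference between two ratings matters, not their absolute values, which is why Elo figures from different benchmarks cannot be compared directly.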