| Model | Reasoning: weighted composite of graduate-level reasoning (GPQA), broad knowledge, and frontier-difficulty benchmarks. Scale 0–100. | Coding: weighted composite of coding benchmarks including LiveCodeBench, TerminalBench, and SciCode. Scale 0–100. | Agentic: how well the model uses tools and follows instructions, based on PinchBench and IFBench scores. Scale 0–100. | EQ: EQ-Bench v3 score (0–100), measuring emotional intelligence including empathy, insight, social dexterity, and warmth. | Chat: Arena Elo from blind human A/B preference tests. Higher = humans prefer chatting with this model. | Speed: output tokens per second. Higher = faster responses. | Cost: blended cost per 1M tokens (3:1 input-to-output ratio). Lower is better. | Context: maximum input context window in thousands of tokens. |
|---|---|---|---|---|---|---|---|---|
Model Personalities
Numbers tell you what a model can do. Traits tell you who it is. Our editorial reads are grounded in 22-dimension EQ-Bench v3 personality profiles. Trait scores are 0–20; for negative traits like sycophancy, a lower score is better (green means less of it).
GPT-5.4
The most emotionally intelligent model tested. Tied for highest depth of insight (15.8), highest correctness, and exceptionally low sycophancy.
Claude Opus 4.6
Highest empathy among flagships with deep insight. Leads on demonstrated empathy. Premium price, premium presence.
Claude Sonnet 4.6
Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart pick when you want depth without the premium.
MiMo-V2-Pro
Highest humanlike score of any model tested. Exceptional analytical depth paired with natural conversational feel. A sleeper hit at $1.50.
MiniMax M2.7
Highest theory of mind and subtext identification. Reads between the lines better than models 10x its price. Very low moralising.
Step 3.5 Flash
Scores 69.25 on EQ — beating models that cost up to 30x more. At fifteen cents per million tokens, the best EQ-per-dollar in the field. No detailed trait breakdown available yet.
GPT-5.4 Mini
Strongest boundary-setting and safety consciousness of any model. Lowest sycophancy overall. A firm, principled companion — not a people-pleaser.
Grok 4.20
Decent v3 score (68.55) but the lowest EQ-Bench Elo (856). It struggles with emotional nuance tests, even though its strong Arena Elo (1491, rank 4) shows that humans like chatting with it. Strong subtext reading.
Qwen3.6 Plus
Free is free. But it has the highest sycophancy (6.2) and the lowest EQ score (60.45) of the set. Most likely to tell you what you want to hear rather than what you need to hear. (Benchmark data is from its predecessor, Qwen3.5-397B.)
Individual trait scores (warmth, empathy, etc.) are 0–20 from EQ-Bench v3. EQ scores are 0–100; Elo rankings vary by benchmark.
Scoring Methodology
Reasoning Score
Weighted average of AA Intelligence Index (3x), GPQA (2.5x), MMLU-Pro (2x), HLE (1.5x), AIME 2025 (1x). Scale 0-100.
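The weighting above can be sketched as a plain weighted average. This is a minimal illustration, not the site's actual implementation: the function name, the dictionary keys, and the example scores are hypothetical, and it assumes every benchmark score has already been normalised to a 0–100 scale. How missing benchmarks are handled is not stated in the methodology, so this sketch simply raises an error.

```python
# Weights as listed for the Reasoning Score. The key names are hypothetical labels.
REASONING_WEIGHTS = {
    "aa_intelligence_index": 3.0,
    "gpqa": 2.5,
    "mmlu_pro": 2.0,
    "hle": 1.5,
    "aime_2025": 1.0,
}


def weighted_composite(scores, weights):
    """Return the weighted average of benchmark scores (assumed to be on 0-100)."""
    missing = set(weights) - set(scores)
    if missing:
        # Assumption: the methodology does not say how gaps are treated, so fail loudly.
        raise KeyError(f"missing benchmark scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical inputs for illustration only; these are not leaderboard values.
example_scores = {
    "aa_intelligence_index": 60.0,
    "gpqa": 80.0,
    "mmlu_pro": 85.0,
    "hle": 20.0,
    "aime_2025": 90.0,
}
reasoning = weighted_composite(example_scores, REASONING_WEIGHTS)
print(reasoning)  # (180 + 200 + 170 + 30 + 90) / 10 = 67.0
```

The same function would apply to the Coding and Agentic scores by swapping in their listed weights.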
Coding Score
Weighted average of AA Coding Index (3x), LiveCodeBench (2x), TerminalBench Hard (2x), SciCode (1x). Scale 0-100.
Agentic Score
Weighted average of PinchBench Best (4x) and PinchBench Avg (2x). Scale 0-100. Measures real-world tool use and task completion.
Blended Cost
Assumes a 3:1 input-to-output token ratio: (3 × input price + output price) / 4, with both prices quoted per 1M tokens. Reflects typical conversational usage patterns.
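As a quick worked example of the formula, the sketch below computes a blended price from per-1M-token input and output prices. The function name and the prices used are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def blended_cost(input_price, output_price):
    """Blend per-1M-token prices at a 3:1 input-to-output ratio."""
    return (3 * input_price + output_price) / 4


# Hypothetical prices: 1.00 per 1M input tokens, 5.00 per 1M output tokens.
print(blended_cost(1.00, 5.00))  # (3 * 1.00 + 5.00) / 4 = 2.0
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output can still land at a low blended figure.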
EQ Score
EQ-Bench v3 score (0–100) measuring emotional intelligence across empathy, insight, social dexterity, warmth, and more.
Chat Score
Arena Elo rating from blind human A/B preference tests. Higher = humans prefer chatting with this model.
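To give a feel for what a rating gap means, the sketch below uses the standard Elo expected-score formula (base 10, scale 400). It is an assumption that Arena ratings can be read this way; the leaderboard's own fitting procedure may differ, and the function name and the second rating in the example are hypothetical.

```python
def elo_expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even split.
print(elo_expected_score(1491, 1491))  # 0.5

# A 100-point lead (1491 vs. a hypothetical 1391) is about a 64% preference rate.
print(round(elo_expected_score(1491, 1391), 2))  # 0.64
```

Under this reading, small Elo differences between neighbouring models correspond to preference rates only slightly above 50%.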