LLM Model Benchmarks

Most benchmarks measure what models know. We also measure how they feel.

Model Personalities

Numbers tell you what a model can do. Traits tell you who it is. Our editorial reads are grounded in 22-dimension EQ-Bench v3 personality profiles. Trait scores run from 0 to 20; for negative traits such as sycophancy, lower is better, so green means less of it.

Highest EQ $5.63/M

GPT-5.4

The most emotionally intelligent model tested. Tied for the highest depth of insight (15.8), with the highest correctness (14.8) and exceptionally low sycophancy (3.2).

Correctness 14.8 · Insight 15.8 · Sycophancy 3.2

Warmest Flagship $10.00/M

Claude Opus 4.6

Leads the flagships on demonstrated empathy (14.9), backed by deep insight (15.6). Premium price, premium presence.

Empathy 14.9 · Insight 15.6 · Warmth 13.6

Near-Opus, 40% Cheaper $6.00/M

Claude Sonnet 4.6

Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart pick when you want depth without the premium.

Empathy 14.8 · Sycophancy 3.6 · Subtext 15.5

Most Humanlike $1.50/M

MiMo-V2-Pro

Highest humanlike score of any model tested. Exceptional analytical depth paired with natural conversational feel. A sleeper hit at $1.50.

Humanlike 15.1 · Analytical 18.1 · Insight 15.8

Sharpest Social Reader $0.53/M

MiniMax M2.7

Highest theory of mind and subtext identification. Reads between the lines better than models 10x its price. Very low moralising.

Theory of Mind 15.1 · Subtext 16.3 · Moralising 5.4

Budget Pick $0.15/M

Step 3.5 Flash

Scores 69.25 on EQ, beating models that cost up to 30x more. At fifteen cents per million tokens, it offers the best EQ-per-dollar of any paid model in the field. No detailed trait breakdown is available yet.

EQ 69.25 · Traits pending

Safety First $1.69/M

GPT-5.4 Mini

Strongest boundary-setting and safety consciousness of any model. Lowest sycophancy overall. A firm, principled companion — not a people-pleaser.

Boundaries 15.5 · Safety 15.2 · Sycophancy 2.7

The Enigma $3.00/M

Grok 4.20

Decent EQ-Bench v3 score (68.55) but the lowest EQ-Bench Elo (856). It struggles with emotional nuance tests, even though a strong Arena Elo (1491, rank 4) shows that humans like chatting with it. Strong subtext reading.

Subtext 15.8 · EQ-Bench Elo 856 · Conversational 10.0

The People-Pleaser FREE

Qwen3.6 Plus

Free is free. But it has the highest sycophancy (6.2) and the lowest EQ score (60.45) of the set, making it the most likely to tell you what you want to hear rather than what you need to hear. (Benchmark data is from its predecessor, Qwen3.5-397B.)

Sycophancy 6.2 · EQ 60.45 · Warmth 13.4

Individual trait scores (warmth, empathy, etc.) are 0–20 from EQ-Bench v3. EQ scores are 0–100; Elo rankings vary by benchmark.
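
The "EQ-per-dollar" comparison in the Budget Pick card can be roughly checked with the figures shown on the cards. The sketch below assumes that EQ-per-dollar means the EQ score divided by the blended price per million tokens; that definition, the function name, and the choice to treat free models as undefined are assumptions made for illustration, not something the page specifies.

```python
from typing import Optional


def eq_per_dollar(eq_score: float, blended_price: float) -> Optional[float]:
    """EQ points per dollar of blended price (assumed definition).

    Returns None for free models, where the ratio is undefined.
    """
    if blended_price <= 0:
        return None
    return eq_score / blended_price


# (EQ score, blended $/M) pairs copied from the cards above.
cards = {
    "Step 3.5 Flash": (69.25, 0.15),
    "Grok 4.20": (68.55, 3.00),
    "Qwen3.6 Plus": (60.45, 0.00),  # listed as FREE
}

for name, (eq, price) in cards.items():
    ratio = eq_per_dollar(eq, price)
    label = "n/a (free)" if ratio is None else f"{ratio:.1f} EQ points per $"
    print(f"{name}: {label}")
```

Only models whose cards list both an EQ score and a price can be compared this way.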

Scoring Methodology

Reasoning Score

Weighted average of AA Intelligence Index (3x), GPQA (2.5x), MMLU-Pro (2x), HLE (1.5x), AIME 2025 (1x). Scale 0-100.
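
As a minimal sketch of the weighting described above, the Reasoning Score could be computed as follows. The dictionary keys, the function name, and the example inputs are hypothetical; the page does not say how missing benchmarks are handled, so this sketch simply raises an error rather than guessing.

```python
# Weights as listed for the Reasoning Score.
REASONING_WEIGHTS = {
    "aa_intelligence_index": 3.0,
    "gpqa": 2.5,
    "mmlu_pro": 2.0,
    "hle": 1.5,
    "aime_2025": 1.0,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each assumed to be on a 0-100 scale."""
    missing = set(weights) - set(scores)
    if missing:
        # Assumption: refuse to compute rather than invent a fallback rule.
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical inputs, not real benchmark results.
example = {
    "aa_intelligence_index": 60.0,
    "gpqa": 80.0,
    "mmlu_pro": 85.0,
    "hle": 20.0,
    "aime_2025": 90.0,
}
print(weighted_score(example, REASONING_WEIGHTS))  # 67.0
```

Because the weights sum to 10, each point of AA Intelligence Index moves the score by 0.3, while each point of AIME 2025 moves it by 0.1.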

Coding Score

Weighted average of AA Coding Index (3x), LiveCodeBench (2x), TerminalBench Hard (2x), SciCode (1x). Scale 0-100.

Agentic Score

Weighted average of PinchBench Best (4x) and PinchBench Avg (2x). Scale 0-100. Measures real-world tool use and task completion.
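
A short sketch of the 4x/2x weighting, using hypothetical inputs: dividing by the total weight of 6 means the best run counts for two thirds of the score and the average for one third. The function and parameter names are illustrative only.

```python
def agentic_score(pinchbench_best: float, pinchbench_avg: float) -> float:
    """Weighted average with 4x on the best run and 2x on the average run."""
    return (4 * pinchbench_best + 2 * pinchbench_avg) / 6


# Hypothetical values, not real PinchBench results.
print(agentic_score(90.0, 60.0))  # 80.0
```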

Blended Cost

Price per million tokens, blended at a 3:1 input-to-output token ratio: (3 × input price + output price) / 4. Reflects typical conversational usage patterns.
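
The formula can be sketched in a few lines. The prices in the example are hypothetical and are assumed to be in US dollars per million tokens, matching the "/M" figures shown on the model cards.

```python
def blended_cost(input_price: float, output_price: float) -> float:
    """Blended price per million tokens at a 3:1 input-to-output ratio."""
    return (3 * input_price + output_price) / 4


# Hypothetical prices: $2.00/M input, $10.00/M output.
print(blended_cost(2.00, 10.00))  # 4.0
```

Because input tokens carry three quarters of the weight, a model with cheap input and expensive output looks better under this measure than it would for output-heavy workloads.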

EQ Score

EQ-Bench v3 score (0-100) measuring emotional intelligence across traits such as empathy, insight, social dexterity, and warmth.

Chat Score

Arena Elo rating from blind human A/B preference tests. A higher rating means humans more often prefer chatting with this model.
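
To give a feel for what a rating gap means, the sketch below uses the standard Elo expected-score formula. This is an assumption for illustration: the Arena's actual rating procedure may differ in detail, and the opponent rating of 1391 is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of preference votes for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# A model rated 1491 against a hypothetical model rated 100 points lower.
print(round(elo_expected_score(1491, 1391), 2))  # 0.64
```

Under this formula a 100-point gap corresponds to being preferred in roughly 64% of head-to-head comparisons, and equal ratings correspond to 50%.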