# HeartCentered AI — LLM Model Benchmarks
# Generated: 2026-04-07T05:56:53Z
# Full data: https://heartcentered.ai/model-benchmarks/data/model-data.json
# Web UI: https://heartcentered.ai/model-benchmarks/
# Current-generation LLM models ranked by community usage on OpenRouter.
# Pricing is per million tokens. Capabilities derived from API metadata.

## Claude Haiku 4.5
Provider: Anthropic
ID: anthropic/claude-haiku-4.5
Context: 200,000 tokens
Max output: 64,000 tokens
Pricing: $1.00 input / $5.00 output per 1M tokens (blended: $2.00)
Capabilities: tool use, vision, structured output
Scores (0-100): Reasoning: 31, Coding: 35.0, Agentic: 85.7
Speed: 93 tokens/sec, TTFT 0.4s
Benchmarks (Artificial Analysis): intelligence_index: 31.1, coding_index: 29.6, gpqa: 0.646, hle: 0.043, scicode: 0.344, ifbench: 0.42, terminalbench_hard: 0.273, livecodebench: 0.511, aime_25: 0.39, mmlu_pro: 0.8
EQ-Bench v3 Score: 68.65/100
PinchBench: 89.5% best, 78.1% avg (? runs)

## Claude Opus 4.6
Provider: Anthropic
ID: anthropic/claude-opus-4.6
Context: 1,000,000 tokens
Max output: 128,000 tokens
Pricing: $5.00 input / $25.00 output per 1M tokens (blended: $10.00)
Capabilities: tool use, reasoning, vision, web search, structured output
Scores (0-100): Reasoning: 53.9, Coding: 47.6, Agentic: 89.9
Speed: 46.1 tokens/sec, TTFT 1.655s
Available via: Amazon Bedrock, Anthropic, Azure, Google
Benchmarks (Artificial Analysis): intelligence_index: 46.5, coding_index: 47.6, gpqa: 0.84, hle: 0.186, scicode: 0.457, ifbench: 0.446, terminalbench_hard: 0.485
EQ-Bench v3 Score: 71.85/100
EQ-Bench Elo: 1857.8
PinchBench: 93.3% best, 83.1% avg (19 runs)
Description: Anthropic's strongest model for coding and long-running professional tasks. Built for agents that operate across entire workflows.

## Claude Sonnet 4.6
Provider: Anthropic
ID: anthropic/claude-sonnet-4.6
Context: 1,000,000 tokens
Max output: 128,000 tokens
Pricing: $3.00 input / $15.00 output per 1M tokens (blended: $6.00)
Capabilities: tool use, reasoning, vision, web search, structured output
Scores (0-100): Reasoning: 50.4, Coding: 46.4, Agentic: 85.7
Speed: 47.0 tokens/sec, TTFT 0.9s
Available via: Amazon Bedrock, Anthropic, Azure, Google
Benchmarks (Artificial Analysis): intelligence_index: 44.4, coding_index: 46.4, gpqa: 0.799, hle: 0.132, scicode: 0.469, ifbench: 0.412, terminalbench_hard: 0.462
EQ-Bench v3 Score: 71.7/100
EQ-Bench Elo: 1876.8
PinchBench: 88.0% best, 81.1% avg (19 runs)
Description: Anthropic's most capable Sonnet-class model. Frontier performance across coding, agents, and professional work.

## Gemini 3.1 Pro
Provider: Google
ID: google/gemini-3.1-pro-preview-20260219
Context: 1,048,576 tokens
Max output: 65,536 tokens
Pricing: $2.00 input / $12.00 output per 1M tokens (blended: $4.50)
Capabilities: tool use, reasoning, vision, structured output
Scores (0-100): Reasoning: 57.0, Coding: 56, Agentic: 83.5
Speed: 127.0 tokens/sec, TTFT 30.66s
Available via: Google
Benchmarks (Artificial Analysis): intelligence_index: 57.0, note: Coding index unavailable — AA API rate limited during collection
EQ-Bench v3 Score: 68.95/100
EQ-Bench Elo: 1548.7
PinchBench: 86.7% best, 77.0% avg (15 runs)
Description: Google's frontier reasoning model with enhanced software engineering performance, improved agentic reliability, and multimodal input support.

## Gemma 4 31B
Provider: Google
ID: google/gemma-4-31b-it
Context: 262,144 tokens
Max output: 131,072 tokens
Pricing: $0.14 input / $0.40 output per 1M tokens (blended: $0.20)
Capabilities: tool use, reasoning, vision, structured output
Scores (0-100): Reasoning: 52.3, Coding: 38.7, Agentic: 75.6
Speed: 35.9 tokens/sec
Available via: AkashML, Novita, Parasail, Venice
Benchmarks (Artificial Analysis): intelligence_index: 39.2, coding_index: 38.7, gpqa: 0.857, hle: 0.227, scicode: 0.434, ifbench: 0.756, terminalbench_hard: 0.364
EQ-Bench v3 Score: 66.1/100
Description: Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and image input with text output. Features a 256K token context window, configurable thinking/reasoning mode, nat

## GLM 5 Turbo
Provider: Z.ai
ID: z-ai/glm-5-turbo
Context: 202,752 tokens
Max output: 131,072 tokens
Pricing: $1.20 input / $4.00 output per 1M tokens (blended: $1.90)
Capabilities: tool use, reasoning, structured output
Scores (0-100): Reasoning: 55.8, Coding: 36.8, Agentic: 84.9
Speed: 42.0 tokens/sec, TTFT 1.77s
Available via: AtlasCloud, Z.AI
Benchmarks (Artificial Analysis): intelligence_index: 46.8, coding_index: 36.8, gpqa: 0.847, hle: 0.254, scicode: 0.436, ifbench: 0.732, terminalbench_hard: 0.333
EQ-Bench v3 Score: 67.7/100
EQ-Bench Elo: 1631.9
PinchBench: 86.5% best, 81.6% avg (11 runs)
Description: Fast inference model from Z.ai designed for agent-driven environments. Deeply optimized for real-world agent workflows.

## GPT-5.4
Provider: OpenAI
ID: openai/gpt-5.4
Context: 1,050,000 tokens
Max output: 128,000 tokens
Pricing: $2.50 input / $15.00 output per 1M tokens (blended: $5.62)
Capabilities: tool use, reasoning, vision, web search, structured output
Scores (0-100): Reasoning: 57.0, Coding: 57, Agentic: 87.6
Speed: 74.0 tokens/sec, TTFT 152.2s
Available via: OpenAI
Benchmarks (Artificial Analysis): intelligence_index: 57.0, note: Coding index unavailable — AA API rate limited during collection
EQ-Bench v3 Score: 73.2/100
EQ-Bench Elo: 1687.5
PinchBench: 90.5% best, 81.7% avg (17 runs)
Description: OpenAI's latest frontier model, unifying the Codex and GPT lines into a single system with 1M+ context window.

## GPT-5.4 Mini
Provider: OpenAI
ID: openai/gpt-5.4-mini
Context: 400,000 tokens
Max output: 128,000 tokens
Pricing: $0.75 input / $4.50 output per 1M tokens (blended: $1.69)
Capabilities: tool use, reasoning, vision, web search, structured output
Scores (0-100): Reasoning: 48, Coding: 51, Agentic: 56
Speed: 186 tokens/sec
EQ-Bench v3 Score: 68.65/100

## Grok 4.20
Provider: xAI
ID: x-ai/grok-4.20-20260309
Context: 2,000,000 tokens
Max output: 128,000 tokens
Pricing: $2.00 input / $6.00 output per 1M tokens (blended: $3.00)
Capabilities: tool use, reasoning, vision, web search, structured output
Scores (0-100): Reasoning: 48.0, Coding: 42, Agentic: 78.9
Speed: 271.0 tokens/sec, TTFT 10.72s
Available via: xAI
Benchmarks (Artificial Analysis): intelligence_index: 48.0, note: Coding index unavailable — AA API rate limited during collection
EQ-Bench v3 Score: 68.55/100
EQ-Bench Elo: 856.4
PinchBench: 82.4% best, 71.8% avg (18 runs)
Description: xAI's newest flagship model with industry-leading speed and agentic tool calling capabilities. Lowest hallucination rate on market.

## MiMo-V2-Pro
Provider: Xiaomi
ID: xiaomi/mimo-v2-pro
Context: 1,048,576 tokens
Max output: 131,072 tokens
Pricing: $1.00 input / $3.00 output per 1M tokens (blended: $1.50)
Capabilities: tool use, reasoning, structured output
Scores (0-100): Reasoning: 58.2, Coding: 41.4, Agentic: 82.9
Speed: 35.0 tokens/sec, TTFT 2.07s
Available via: Xiaomi
Benchmarks (Artificial Analysis): intelligence_index: 49.2, coding_index: 41.4, gpqa: 0.87, hle: 0.283, scicode: 0.425, ifbench: 0.688, terminalbench_hard: 0.409
EQ-Bench v3 Score: 70.55/100
PinchBench: 83.95% best, 80.7% avg (15 runs)
Description: Xiaomi's flagship foundation model with 1T+ parameters and 1M context length, deeply optimized for agentic scenarios.

## MiniMax M2.7
Provider: MiniMax
ID: minimax/minimax-m2.7
Context: 204,800 tokens
Max output: 131,072 tokens
Pricing: $0.30 input / $1.20 output per 1M tokens (blended: $0.53)
Capabilities: tool use, reasoning, structured output
Scores (0-100): Reasoning: 58.5, Coding: 41.9, Agentic: 87.6
Speed: 41.9 tokens/sec, TTFT 1.385s
Available via: Minimax
Benchmarks (Artificial Analysis): intelligence_index: 49.6, coding_index: 41.9, gpqa: 0.874, hle: 0.281, scicode: 0.47, ifbench: 0.757, terminalbench_hard: 0.394
EQ-Bench v3 Score: 68.75/100
PinchBench: 89.8% best, 83.2% avg (11 runs)
Description: Next-generation LLM designed for autonomous, real-world productivity. Advanced agentic capabilities through multi-agent architecture.

## Qwen3.6 Plus
Provider: Qwen
ID: qwen/qwen3.6-plus:free
Context: 1,000,000 tokens
Max output: 65,536 tokens
Pricing: FREE
Capabilities: tool use, reasoning, vision, structured output
Scores (0-100): Reasoning: 57.0, Coding: 41.3, Agentic: 87.1
Speed: 44.0 tokens/sec, TTFT 1.59s
Available via: Qwen
Benchmarks (Artificial Analysis): intelligence_index: 45.0, coding_index: 41.3, gpqa: 0.893, hle: 0.273, scicode: 0.42, ifbench: 0.788, terminalbench_hard: 0.409, note: Data from Qwen3.5-397B (predecessor)
EQ-Bench v3 Score: 60.45/100
EQ-Bench Elo: 1417.4
PinchBench: 88.6% best, 84.0% avg (5 runs)
Description: Hybrid architecture combining linear attention with sparse MoE routing. Strong scalability and high-performance inference. Free on OpenRouter.

## Step 3.5 Flash
Provider: StepFun
ID: stepfun/step-3.5-flash
Context: 262,144 tokens
Max output: 65,536 tokens
Pricing: $0.10 input / $0.30 output per 1M tokens (blended: $0.15)
Capabilities: tool use, reasoning, structured output
Scores (0-100): Reasoning: 50.0, Coding: 31.6, Agentic: 82.5
Speed: 85.7 tokens/sec, TTFT 1.271s
Available via: DeepInfra, SiliconFlow, StepFun
Benchmarks (Artificial Analysis): intelligence_index: 37.8, coding_index: 31.6, gpqa: 0.831, hle: 0.191, scicode: 0.404, ifbench: 0.646, terminalbench_hard: 0.273
EQ-Bench v3 Score: 69.25/100
PinchBench: 85.3% best, 76.9% avg (18 runs)
Description: StepFun's most capable open-source model. Sparse MoE architecture activating 11B of 196B parameters per token.

---
Source: https://heartcentered.ai/model-benchmarks/
Data: https://heartcentered.ai/model-benchmarks/data/model-data.json
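
The listing gives a "blended" price for each paid model but does not say how it is computed. The figures appear consistent, to within rounding, with a weighted average of three parts input to one part output (for example, $1.00 input and $5.00 output give $2.00). The Python sketch below works under that inferred assumption, which the source does not confirm; the function names `blended_price` and `request_cost` are hypothetical helpers, not part of any published API.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    The 3:1 input:output default is an assumption inferred from the
    listed figures, not a documented formula.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


if __name__ == "__main__":
    # Prices taken from the Claude Haiku 4.5 entry: $1.00 input / $5.00 output.
    print(f"blended: ${blended_price(1.00, 5.00):.2f}")
    # A hypothetical request with 10,000 input and 2,000 output tokens
    # at the Claude Sonnet 4.6 prices ($3.00 / $15.00).
    print(f"request: ${request_cost(10_000, 2_000, 3.00, 15.00):.4f}")
```

For workloads whose input-to-output ratio differs much from 3:1, `request_cost` with actual token counts is likely a better basis for comparison than the blended figure.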