22 changes: 12 additions & 10 deletions model-benchmarks/index.html
@@ -429,7 +429,7 @@ <h1 class="text-3xl md:text-4xl lg:text-5xl font-display font-semibold mb-4">
class="text-sm text-of-muted text-center max-w-2xl mx-auto mb-8"
data-aos="fade-up">
Numbers tell you <em>what</em> a model can do. Traits tell you <em>who</em>
it is. Our editorial reads, grounded in 22-dimension
it is. Our editorial reads are grounded in 22-dimension
<a
href="https://eqbench.com"
target="_blank"
@@ -449,12 +449,12 @@ <h1 class="text-3xl md:text-4xl lg:text-5xl font-display font-semibold mb-4">
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--top">Highest EQ</span>
<span class="insight-cost">$5.62/M</span>
<span class="insight-cost">$5.63/M</span>
</div>
<h3 class="insight-model">GPT-5.4</h3>
<p class="insight-read">
The most emotionally intelligent model tested. Highest correctness and
depth of insight, with exceptionally low sycophancy.
The most emotionally intelligent model tested. Tied for highest depth of
insight (15.8), highest correctness, and exceptionally low sycophancy.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive"
@@ -476,7 +476,7 @@ <h3 class="insight-model">GPT-5.4</h3>
<h3 class="insight-model">Claude Opus 4.6</h3>
<p class="insight-read">
Highest empathy among flagships with deep insight. Leads on demonstrated
empathy and emotional reasoning. Premium price, premium presence.
empathy. Premium price, premium presence.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">Empathy 14.9</span>
@@ -559,7 +559,7 @@ <h3 class="insight-model">MiniMax M2.7</h3>
</div>
<h3 class="insight-model">Step 3.5 Flash</h3>
<p class="insight-read">
Scores 69.25 on EQ — beating models that cost 30-60x more. At fifteen
Scores 69.25 on EQ — beating models that cost 10-40x more. At fifteen

[P2] Correct Step 3.5 Flash cost-multiplier claim

The updated copy still overstates the supported range. With current benchmark data, Step 3.5 Flash ($0.15/M, EQ 69.25) beats models costing at most 30x as much (e.g., Gemini 3.1 Pro at $4.50/M with EQ 68.95 in model-benchmarks/data/model-data.json), so "10-40x more" is inconsistent with the data and can mislead readers about the price/performance spread.
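
The multiplier can be checked with a few lines of Python. This is a minimal sketch, not the repository's tooling: the record layout (`name`, `cost_per_m`, `eq`) and the helper `beaten_cost_multiples` are hypothetical, and only the two price/EQ pairs quoted in this comment are included, since the actual schema of `model-data.json` is not shown here.

```python
# Hypothetical records: field names are assumptions, and only the two
# models quoted in the review comment are listed.
models = [
    {"name": "Step 3.5 Flash", "cost_per_m": 0.15, "eq": 69.25},
    {"name": "Gemini 3.1 Pro", "cost_per_m": 4.50, "eq": 68.95},
]


def beaten_cost_multiples(baseline_name, records):
    """Return {model name: cost multiple} for every pricier model that the
    baseline outscores on EQ."""
    baseline = next(r for r in records if r["name"] == baseline_name)
    multiples = {}
    for record in records:
        if record["name"] == baseline_name:
            continue
        is_pricier = record["cost_per_m"] > baseline["cost_per_m"]
        is_beaten = record["eq"] < baseline["eq"]
        if is_pricier and is_beaten:
            multiples[record["name"]] = record["cost_per_m"] / baseline["cost_per_m"]
    return multiples


multiples = beaten_cost_multiples("Step 3.5 Flash", models)
for name, multiple in multiples.items():
    print(f"{name}: {multiple:.1f}x the cost")
```

Running the same check over the full data file (after mapping its real field names onto these) would give the minimum and maximum multiples that the copy can actually claim.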


cents per million tokens, the best EQ-per-dollar in the field. No
detailed trait breakdown available yet.
</p>
@@ -599,9 +599,9 @@ <h3 class="insight-model">GPT-5.4 Mini</h3>
</div>
<h3 class="insight-model">Grok 4.20</h3>
<p class="insight-read">
Decent v3 score (68.55) but the lowest Elo ranking (856) by far — humans
don't enjoy chatting with it. Strong subtext reading, but something gets
lost in delivery.
Decent v3 score (68.55) but lowest EQ-Bench Elo (856) — struggles with
emotional nuance tests despite strong Arena Elo (1491, rank 4) showing
humans like chatting with it. Strong subtext reading.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">Subtext 15.8</span>
@@ -623,6 +623,7 @@ <h3 class="insight-model">Qwen3.6 Plus</h3>
Free is free. But highest sycophancy (6.2) and lowest EQ score (60.45)
of the set. Most likely to tell you what you want to hear rather than
what you need to hear.
<em>(Benchmark data from Qwen3.5-397B predecessor)</em>
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--negative"
@@ -635,7 +636,8 @@ <h3 class="insight-model">Qwen3.6 Plus</h3>
</div>

<p class="text-xs text-of-muted text-center mt-6">
All trait scores are 0–20 from EQ-Bench v3.
Individual trait scores (warmth, empathy, etc.) are 0–20 from EQ-Bench v3.
EQ scores are 0–100; Elo rankings vary by benchmark.
</p>
</div>
</section>