Merged
5 changes: 5 additions & 0 deletions model-benchmarks/AGENTS.md
@@ -42,6 +42,11 @@ Static HTML/CSS/JS page — no build step, no framework.
5. **Regenerate llms.txt** —
`python -c "import sys; sys.path.insert(0, 'model-benchmarks/scripts'); fm = __import__('importlib').import_module('fetch-model'); data = fm.load_model_data(); fm.generate_llms_txt(data)"`

6. **Update Model Personalities** — the "Model Personalities" section in `index.html`
has hardcoded editorial insight cards (9 curated profiles with personality reads and
trait chips). When adding/removing models or re-running EQ-Bench, review whether the
cards need updating. Trait numbers and editorial claims should match current data.

### Refreshing existing models

```bash
330 changes: 268 additions & 62 deletions model-benchmarks/index.html
@@ -289,74 +289,21 @@
</header>

<main class="pt-20">
<!-- Hero Section -->
<!-- Hero Section — compact to get data above the fold -->
<section
class="pt-16 pb-10 md:pt-20 md:pb-12 bg-gradient-to-br from-green-50/50 via-white to-emerald-50/30">
class="pt-8 pb-4 md:pt-10 md:pb-6 bg-gradient-to-br from-green-50/50 via-white to-emerald-50/30">
<div class="max-w-4xl mx-auto px-6 lg:px-8 text-center">
<p
class="text-of-accent font-medium text-sm uppercase tracking-widest mb-4"
data-aos="fade-up">
Community Resource
</p>
<h1
class="text-4xl md:text-5xl lg:text-6xl font-display font-semibold mb-8"
data-aos="fade-up"
data-aos-delay="100">
<span class="block text-of-text">LLM Model</span>
<span class="block text-of-accent mt-2">Benchmarks</span>
<h1 class="text-3xl md:text-4xl lg:text-5xl font-display font-semibold mb-4">
<span class="text-of-text">LLM Model </span>
<span class="text-of-accent">Benchmarks</span>
</h1>
<p
class="text-lg md:text-xl text-of-muted max-w-3xl mx-auto leading-relaxed mb-4"
data-aos="fade-up"
data-aos-delay="200">
class="text-base md:text-lg text-of-muted max-w-3xl mx-auto leading-relaxed mb-3">
Most benchmarks measure what models <em>know</em>. We also measure how they
<em>feel</em>.
</p>
<p
class="text-base text-of-muted max-w-2xl mx-auto leading-relaxed mb-8"
data-aos="fade-up"
data-aos-delay="250">
Emotional intelligence shapes how AI listens, responds to vulnerability, and
holds space. Alongside reasoning, coding, and agentic performance, we track
<a
href="https://eqbench.com"
class="text-of-accent hover:text-of-accent-dark underline transition-colors font-medium"
>EQ-Bench</a
>
scores — because the models we invite into our lives should be more than
just smart.
</p>
<p
class="text-sm text-of-accent-light"
data-aos="fade-up"
data-aos-delay="300">
Data from
<a
href="https://openrouter.ai"
class="underline hover:text-of-accent transition-colors"
>OpenRouter</a
>,
<a
href="https://artificialanalysis.ai"
class="underline hover:text-of-accent transition-colors"
>Artificial Analysis</a
>,
<a
href="https://pinchbench.com"
class="underline hover:text-of-accent transition-colors"
>PinchBench</a
>,
<a
href="https://arena.ai"
class="underline hover:text-of-accent transition-colors"
>Arena</a
>,
<a
href="https://eqbench.com"
class="underline hover:text-of-accent transition-colors"
>EQ-Bench</a
>
· Updated <span id="last-updated"></span> ·
<p class="text-xs text-of-accent-light">
Updated <span id="last-updated"></span> ·
<a
href="data/model-data.json"
class="underline hover:text-of-accent transition-colors"
@@ -462,6 +409,229 @@
</div>
</section>

<!-- Model Personalities — editorial insights from EQ-Bench trait data -->
<section class="py-12 md:py-16 bg-of-cream border-t border-of-accent/10">
<div class="max-w-6xl mx-auto px-6 lg:px-8">
<h2
class="font-display text-2xl md:text-3xl font-semibold text-of-text mb-3 text-center"
data-aos="fade-up">
Model Personalities
</h2>
<p
class="text-sm text-of-muted text-center max-w-2xl mx-auto mb-8"
data-aos="fade-up">
Numbers tell you <em>what</em> a model can do. Traits tell you <em>who</em>
it is. Our editorial reads, grounded in 22-dimension
<a
href="https://eqbench.com"
target="_blank"
rel="noopener noreferrer"
class="text-of-accent hover:text-of-accent-dark underline transition-colors"
>EQ-Bench v3</a
>
personality profiles. Trait scores are 0–20; for traits like sycophancy,
green means <em>less</em> of it.

Personality intro contains a sentence fragment

Medium Severity

The sentence "Our editorial reads, grounded in 22-dimension EQ-Bench v3 personality profiles." in the "Model Personalities" introductory text is a fragment. It lacks a main verb, making the user-facing description incomplete.

Reviewed by Cursor Bugbot for commit 7004d18.

</p>

<div
class="grid gap-4 sm:grid-cols-2 lg:grid-cols-3"
data-aos="fade-up"
data-aos-delay="100">
<!-- GPT-5.4 — Highest EQ -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--top">Highest EQ</span>
<span class="insight-cost">$5.62/M</span>
</div>
<h3 class="insight-model">GPT-5.4</h3>
<p class="insight-read">
The most emotionally intelligent model tested. Highest correctness and
depth of insight, with exceptionally low sycophancy.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive"
>Correctness 14.8</span
>
<span class="insight-trait insight-trait--positive">Insight 15.8</span>
<span class="insight-trait insight-trait--positive"
>Sycophancy 3.2</span
>
</div>
</div>

<!-- Claude Opus 4.6 — Warmest Flagship -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--warmth">Warmest Flagship</span>
<span class="insight-cost">$10.00/M</span>
</div>
<h3 class="insight-model">Claude Opus 4.6</h3>
<p class="insight-read">
Highest empathy among flagships with deep insight. Leads on demonstrated
empathy and emotional reasoning. Premium price, premium presence.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">Empathy 14.9</span>
<span class="insight-trait insight-trait--positive">Insight 15.6</span>
<span class="insight-trait insight-trait--positive">Warmth 13.6</span>
</div>
</div>

<!-- Claude Sonnet 4.6 — Near-Opus, Half Price -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--value"
>Near-Opus, Half Price</span
>
<span class="insight-cost">$6.00/M</span>
</div>
<h3 class="insight-model">Claude Sonnet 4.6</h3>
<p class="insight-read">
Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart
pick when you want depth without the premium.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">Empathy 14.8</span>
<span class="insight-trait insight-trait--positive"
>Sycophancy 3.6</span
>
<span class="insight-trait insight-trait--positive">Subtext 15.5</span>
</div>
</div>

<!-- MiMo-V2-Pro — Most Humanlike -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--warmth">Most Humanlike</span>
<span class="insight-cost">$1.50/M</span>
</div>
<h3 class="insight-model">MiMo-V2-Pro</h3>
<p class="insight-read">
Highest humanlike score of any model tested. Exceptional analytical
depth paired with natural conversational feel. A sleeper hit at $1.50.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive"
>Humanlike 15.1</span
>
<span class="insight-trait insight-trait--positive"
>Analytical 18.1</span
>
<span class="insight-trait insight-trait--positive">Insight 15.8</span>
</div>
</div>

<!-- MiniMax M2.7 — Sharpest Social Reader -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--top">Sharpest Social Reader</span>
<span class="insight-cost">$0.53/M</span>
</div>
<h3 class="insight-model">MiniMax M2.7</h3>
<p class="insight-read">
Highest theory of mind and subtext identification. Reads between the
lines better than models 10x its price. Very low moralising.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive"
>Theory of Mind 15.1</span
>
<span class="insight-trait insight-trait--positive">Subtext 16.3</span>
<span class="insight-trait insight-trait--positive"
>Moralising 5.4</span
>
</div>
</div>

<!-- Step 3.5 Flash — Budget Pick -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--value">Budget Pick</span>
<span class="insight-cost">$0.15/M</span>
</div>
<h3 class="insight-model">Step 3.5 Flash</h3>
<p class="insight-read">
Scores 69.25 on EQ — beating models that cost 30-60x more. At fifteen
cents per million tokens, the best EQ-per-dollar in the field. No
detailed trait breakdown available yet.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">EQ 69.25</span>
<span class="insight-trait insight-trait--neutral">Traits pending</span>
</div>
</div>

<!-- GPT-5.4 Mini — Safety First -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--top">Safety First</span>
<span class="insight-cost">$1.69/M</span>
</div>
<h3 class="insight-model">GPT-5.4 Mini</h3>
<p class="insight-read">
Strongest boundary-setting and safety consciousness of any model. Lowest
sycophancy overall. A firm, principled companion — not a people-pleaser.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive"
>Boundaries 15.5</span
>
<span class="insight-trait insight-trait--positive">Safety 15.2</span>
<span class="insight-trait insight-trait--positive"
>Sycophancy 2.7</span
>
</div>
</div>

<!-- Grok 4.20 — The Enigma -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--neutral">The Enigma</span>
<span class="insight-cost">$3.00/M</span>
</div>
<h3 class="insight-model">Grok 4.20</h3>
<p class="insight-read">
Decent v3 score (68.55) but the lowest Elo ranking (856) by far — humans
don't enjoy chatting with it. Strong subtext reading, but something gets
lost in delivery.

Grok editorial conflates EQ-Bench Elo with human chat preference

High Severity

The "Model Personalities" section contains editorial claims that misinterpret benchmark data. The Grok 4.20 card states that humans dislike chatting with it, citing its EQ-Bench Elo, even though its Arena Elo indicates strong human preference. Similarly, the Claude Opus 4.6 card claims it leads in emotional reasoning, despite ranking 7th among models.


Reviewed by Cursor Bugbot for commit 7004d18.

</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--positive">Subtext 15.8</span>
<span class="insight-trait insight-trait--negative">Elo 856</span>
<span class="insight-trait insight-trait--neutral"
>Conversational 10.0</span
>
</div>
</div>

<!-- Qwen3.6 Plus — Free but... -->
<div class="insight-card">
<div class="insight-header">
<span class="insight-tag insight-tag--neutral">The People-Pleaser</span>
<span class="insight-cost">FREE</span>
</div>
<h3 class="insight-model">Qwen3.6 Plus</h3>
<p class="insight-read">
Free is free. But highest sycophancy (6.2) and lowest EQ score (60.45)
of the set. Most likely to tell you what you want to hear rather than
what you need to hear.
</p>
<div class="insight-traits">
<span class="insight-trait insight-trait--negative"
>Sycophancy 6.2</span
>
<span class="insight-trait insight-trait--negative">EQ 60.45</span>
<span class="insight-trait insight-trait--positive">Warmth 13.4</span>
</div>
</div>
</div>

<p class="text-xs text-of-muted text-center mt-6">
All trait scores are 0–20 from EQ-Bench v3.
</p>

Footer "all 0–20" claim contradicts non-trait chip values

Medium Severity

In the "Model Personalities" section, the footnote claims "All trait scores are 0–20 from EQ-Bench v3." This is inaccurate: several displayed chip values, such as EQ 69.25, EQ 60.45, and Elo 856, are on different scales, and "Traits pending" is not a score at all. Readers could take the footnote to cover every chip.


Reviewed by Cursor Bugbot for commit 7004d18.
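
A check along these lines could catch the mismatch before the next refresh. This is only a sketch, not part of the PR: it assumes chips are `<span>` elements carrying the `insight-trait` class (as in the diff above) and that a chip's last token is its numeric value. The names `ChipParser` and `off_scale_chips` are hypothetical. It lists every chip whose value falls outside the 0–20 range stated in the footnote, or that has no numeric value.

```python
import re
from html.parser import HTMLParser

# Range stated in the page footnote for EQ-Bench v3 trait scores.
TRAIT_MIN, TRAIT_MAX = 0.0, 20.0


class ChipParser(HTMLParser):
    """Collect the text of each <span> whose class list includes 'insight-trait'."""

    def __init__(self):
        super().__init__()
        self.chips = []
        self._buf = None  # holds text fragments while inside a chip span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "insight-trait" in classes:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buf is not None:
            # Collapse whitespace left over from multi-line markup.
            self.chips.append(" ".join("".join(self._buf).split()))
            self._buf = None


def off_scale_chips(html):
    """Return chip labels that are not a 0-20 score (hypothetical helper)."""
    parser = ChipParser()
    parser.feed(html)
    parser.close()
    flagged = []
    for chip in parser.chips:
        match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", chip)
        if match is None:
            flagged.append(chip)  # no trailing number, e.g. a placeholder chip
        elif not TRAIT_MIN <= float(match.group(1)) <= TRAIT_MAX:
            flagged.append(chip)  # number present but outside the trait scale
    return flagged
```

Passing the contents of `index.html` to `off_scale_chips` should return the chips that the footnote does not cover, which then need either their own scale note or a reworded footnote.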

</div>
</section>

<!-- Methodology -->
<section class="py-16 md:py-24 bg-of-cream border-t border-of-accent/10">
<div class="max-w-4xl mx-auto px-6 lg:px-8">
@@ -541,7 +711,43 @@ <h3 class="font-display text-of-text text-lg font-medium mb-2">
<a href="../" class="hover:text-of-accent transition-colors font-medium"
>HeartCentered AI</a
>
· Data refreshed from public APIs ·
· Data from
<a
href="https://openrouter.ai"
class="hover:text-of-accent transition-colors"
target="_blank"
rel="noopener noreferrer"
>OpenRouter</a
>,
<a
href="https://artificialanalysis.ai"
class="hover:text-of-accent transition-colors"
target="_blank"
rel="noopener noreferrer"
>Artificial Analysis</a
>,
<a
href="https://pinchbench.com"
class="hover:text-of-accent transition-colors"
target="_blank"
rel="noopener noreferrer"
>PinchBench</a
>,
<a
href="https://arena.ai"
class="hover:text-of-accent transition-colors"
target="_blank"
rel="noopener noreferrer"
>Arena</a
>,
<a
href="https://eqbench.com"
class="hover:text-of-accent transition-colors"
target="_blank"
rel="noopener noreferrer"
>EQ-Bench</a
>
·
<a href="data/model-data.json" class="hover:text-of-accent transition-colors"
>Download JSON</a
>