From a7f72a0fd0c7359c32913710c5f1c1e27cfc6db3 Mon Sep 17 00:00:00 2001
From: Nick Sullivan
Date: Tue, 7 Apr 2026 08:30:11 -0500
Subject: [PATCH 1/2] Add Model Personalities section, compress hero for data above fold
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Compress hero from ~400px to ~150px so the benchmark table is visible on first
load. Add editorial "Model Personalities" section with 9 curated insight cards
based on EQ-Bench v3 trait data, placed between the data table and methodology.

Cards highlight personality profiles (Highest EQ, Warmest Flagship, Most
Humanlike, etc.) with trait chips grounded in the 22-dimension EQ-Bench data.

Reviewed by 6 parallel agents (logic, UX/empathy, coding, security, design,
architecture) — fixed data accuracy bugs (Opus warmth, Sonnet sycophancy claim),
softened editorial tone on caution tags, restored data source attribution to
footer, improved trait footnote clarity, and added maintenance docs to AGENTS.md.

Co-Authored-By: Claude Opus 4.6
---
 model-benchmarks/AGENTS.md  |   5 +
 model-benchmarks/index.html | 332 +++++++++++++++++++++++++++++-------
 model-benchmarks/styles.css | 108 ++++++++++++
 3 files changed, 383 insertions(+), 62 deletions(-)

diff --git a/model-benchmarks/AGENTS.md b/model-benchmarks/AGENTS.md
index 806c1dc..7d33ac6 100644
--- a/model-benchmarks/AGENTS.md
+++ b/model-benchmarks/AGENTS.md
@@ -42,6 +42,11 @@ Static HTML/CSS/JS page — no build step, no framework.
 5. **Regenerate llms.txt** — `python -c "import sys; sys.path.insert(0, 'model-benchmarks/scripts'); fm = __import__('importlib').import_module('fetch-model'); data = fm.load_model_data(); fm.generate_llms_txt(data)"`
 
+6. **Update Model Personalities** — the "Model Personalities" section in `index.html`
+   has hardcoded editorial insight cards (9 curated profiles with personality reads and
+   trait chips). When adding/removing models or re-running EQ-Bench, review whether the
+   cards need updating. Trait numbers and editorial claims should match current data.
+
 ### Refreshing existing models
 
 ```bash
diff --git a/model-benchmarks/index.html b/model-benchmarks/index.html
index 23f4228..58352cd 100644
--- a/model-benchmarks/index.html
+++ b/model-benchmarks/index.html
@@ -289,74 +289,21 @@
- +
+ class="pt-8 pb-4 md:pt-10 md:pb-6 bg-gradient-to-br from-green-50/50 via-white to-emerald-50/30">
-

- Community Resource -

-

- LLM Model - Benchmarks +

+ LLM Model + Benchmarks

+ class="text-base md:text-lg text-of-muted max-w-3xl mx-auto leading-relaxed mb-3"> Most benchmarks measure what models know. We also measure how they feel.

-

- Emotional intelligence shapes how AI listens, responds to vulnerability, and - holds space. Alongside reasoning, coding, and agentic performance, we track - EQ-Bench - scores — because the models we invite into our lives should be more than - just smart. -

-

- Data from - OpenRouter, - Artificial Analysis, - PinchBench, - Arena, - EQ-Bench - · Updated · +

+ Updated ·

+ +
+
+

+ Model Personalities +

+

+ Numbers tell you what a model can do. Traits tell you who + it is. Our editorial reads, grounded in 22-dimension + EQ-Bench v3 + personality profiles. +

+ +
+ +
+
+ Highest EQ + $5.62/M +
+

GPT-5.4

+

+ The most emotionally intelligent model tested. Highest correctness and + depth of insight, with exceptionally low sycophancy. +

+
+ Correctness 14.8 + Insight 15.8 + Sycophancy 3.2 +
+
+ + +
+
+ Warmest Flagship + $10.00/M +
+

Claude Opus 4.6

+

+ Highest empathy among flagships with deep insight. Leads on demonstrated + empathy and emotional reasoning. Premium price, premium presence. +

+
+ Empathy 14.9 + Insight 15.6 + Warmth 13.6 +
+
+ + +
+
+ Near-Opus, 40% Cheaper + $6.00/M
+

Claude Sonnet 4.6

+

+ Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart + pick when you want depth without the premium. +

+
+ Empathy 14.8 + Sycophancy 3.6 + Subtext 15.5 +
+
+ + +
+
+ Most Humanlike + $1.50/M +
+

MiMo-V2-Pro

+

+ Highest humanlike score of any model tested. Exceptional analytical + depth paired with natural conversational feel. A sleeper hit at $1.50. +

+
+ Humanlike 15.1 + Analytical 18.1 + Insight 15.8 +
+
+ + +
+
+ Sharpest Social Reader + $0.53/M +
+

MiniMax M2.7

+

+ Highest theory of mind and subtext identification. Reads between the + lines better than models 10x its price. Very low moralising. +

+
+ Theory of Mind 15.1 + Subtext 16.3 + Moralising 5.4 +
+
+ + +
+
+ Budget Pick + $0.15/M +
+

Step 3.5 Flash

+

+ Scores 69.25 on EQ — beating models that cost 30-60x more. At fifteen + cents per million tokens, the best EQ-per-dollar in the field. No + detailed trait breakdown available yet. +

+
+ EQ 69.25 + 85 t/s + $0.15/M +
+
+ + +
+
+ Safety First + $1.69/M +
+

GPT-5.4 Mini

+

+ Strongest boundary-setting and safety consciousness of any model. Lowest + sycophancy overall. A firm, principled companion — not a people-pleaser. +

+
+ Boundaries 15.5 + Safety 15.2 + Sycophancy 2.7 +
+
+ + +
+
+ The Enigma + $3.00/M +
+

Grok 4.20

+

+ Decent v3 score (68.55) but the lowest Elo ranking (856) by far — humans + don't enjoy chatting with it. Strong subtext reading, but something gets + lost in delivery. +

+
+ Subtext 15.8 + Elo 856 + Conversational 10.0 +
+
+ + +
+
+ The People-Pleaser + FREE +
+

Qwen3.6 Plus

+

+ Free is free. But highest sycophancy (6.2) and lowest EQ score (60.45) + of the set. Most likely to tell you what you want to hear rather than + what you need to hear. +

+
+ Sycophancy 6.2 + EQ 60.45 + Warmth 13.4 +
+
+
+ +

+ Trait scores are 0-20 from EQ-Bench v3. Traits like sycophancy, moralising, + compliance, and reactivity are scored where lower is better — green means + less of it. +

+
+
+
@@ -541,7 +713,43 @@

 HeartCentered AI
- · Data refreshed from public APIs ·
+ · Data from
+ OpenRouter,
+ Artificial Analysis,
+ PinchBench,
+ Arena,
+ EQ-Bench
+ · Download JSON
diff --git a/model-benchmarks/styles.css b/model-benchmarks/styles.css
index e8a6360..f4653f9 100644
--- a/model-benchmarks/styles.css
+++ b/model-benchmarks/styles.css
@@ -413,6 +413,114 @@
   display: block;
 }
 
+/* Model Personality insight cards */
+.insight-card {
+  background: var(--of-surface);
+  border: 1px solid rgba(93, 123, 111, 0.12);
+  border-radius: 0.75rem;
+  padding: 1.25rem;
+  transition:
+    box-shadow 0.2s,
+    transform 0.2s;
+}
+
+.insight-card:hover {
+  box-shadow: 0 4px 16px rgba(42, 58, 42, 0.08);
+  transform: translateY(-1px);
+}
+
+.insight-header {
+  display: flex;
+  justify-content: space-between;
+  align-items: center;
+  margin-bottom: 0.5rem;
+}
+
+.insight-tag {
+  font-size: 0.6875rem;
+  font-weight: 600;
+  text-transform: uppercase;
+  letter-spacing: 0.04em;
+  padding: 0.2rem 0.5rem;
+  border-radius: 999px;
+}
+
+.insight-tag--top {
+  background: rgba(45, 106, 79, 0.15);
+  color: #1b4332;
+}
+
+.insight-tag--warmth {
+  background: rgba(212, 184, 150, 0.25);
+  color: #8b6914;
+}
+
+.insight-tag--value {
+  background: rgba(93, 123, 111, 0.1);
+  color: var(--of-accent-dark);
+}
+
+.insight-tag--neutral {
+  background: rgba(93, 123, 111, 0.08);
+  color: var(--of-muted);
+}
+
+.insight-tag--caution {
+  background: rgba(180, 83, 9, 0.1);
+  color: #92400e;
+}
+
+.insight-cost {
+  font-size: 0.75rem;
+  font-weight: 500;
+  color: var(--of-muted);
+  font-variant-numeric: tabular-nums;
+}
+
+.insight-model {
+  font-family: var(--font-display);
+  font-size: 1.125rem;
+  font-weight: 600;
+  color: var(--of-text);
+  margin-bottom: 0.375rem;
+}
+
+.insight-read {
+  font-size: 0.8125rem;
+  line-height: 1.5;
+  color: var(--of-muted);
+  margin-bottom: 0.75rem;
+}
+
+.insight-traits {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 0.375rem;
+}
+
+.insight-trait {
+  font-size: 0.6875rem;
+  font-weight: 500;
+  padding: 0.2rem 0.5rem;
+  border-radius: 999px;
+  font-variant-numeric: tabular-nums;
+}
+
+.insight-trait--positive {
+  background: rgba(45, 106, 79, 0.08);
+  color: #2d6a4f;
+}
+
+.insight-trait--negative {
+  background: rgba(180, 83, 9, 0.08);
+  color: #92400e;
+}
+
+.insight-trait--neutral {
+  background: rgba(93, 123, 111, 0.06);
+  color: var(--of-muted);
+}
+
 /* Animations — rows start visible, animate in subtly */
 .bench-table tbody tr {
   animation: fadeInRow 0.3s ease forwards;
From 7004d182f6466d7ba29a70972d26cbd555e0ab42 Mon Sep 17 00:00:00 2001
From: Nick Sullivan
Date: Tue, 7 Apr 2026 08:35:13 -0500
Subject: [PATCH 2/2] Address Claude bot review: dead CSS, Step Flash chips, Safety First tag, sycophancy hint
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Remove dead .insight-tag--caution CSS (softened to neutral in PR, never used)
- Step 3.5 Flash: simplify trait chips to EQ 69.25 + "Traits pending"
- GPT-5.4 Mini "Safety First" tag: neutral → top (positive framing deserves positive color)
- Move sycophancy inversion hint into section intro (before chip grid, not just footnote)
- Trim redundant footnote now that intro covers it

Co-Authored-By: Claude Opus 4.6
---
 model-benchmarks/index.html | 12 +++++-------
 model-benchmarks/styles.css |  5 -----
 2 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/model-benchmarks/index.html b/model-benchmarks/index.html
index 58352cd..b8005bb 100644
--- a/model-benchmarks/index.html
+++ b/model-benchmarks/index.html
@@ -429,7 +429,8 @@

 class="text-of-accent hover:text-of-accent-dark underline transition-colors"
 >EQ-Bench v3
- personality profiles.
+ personality profiles. Trait scores are 0–20; for traits like sycophancy,
+ green means less of it.

Step 3.5 Flash

 EQ 69.25
- 85 t/s
- $0.15/M
+ Traits pending

- Safety First
+ Safety First
 $1.69/M

GPT-5.4 Mini

@@ -627,9 +627,7 @@

Qwen3.6 Plus

- Trait scores are 0-20 from EQ-Bench v3. Traits like sycophancy, moralising,
- compliance, and reactivity are scored where lower is better — green means
- less of it.
+ All trait scores are 0–20 from EQ-Bench v3.

diff --git a/model-benchmarks/styles.css b/model-benchmarks/styles.css
index f4653f9..1db3436 100644
--- a/model-benchmarks/styles.css
+++ b/model-benchmarks/styles.css
@@ -465,11 +465,6 @@
   color: var(--of-muted);
 }
 
-.insight-tag--caution {
-  background: rgba(180, 83, 9, 0.1);
-  color: #92400e;
-}
-
 .insight-cost {
   font-size: 0.75rem;
   font-weight: 500;
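
Review note: the maintenance step added to AGENTS.md in patch 1 asks that the hardcoded trait numbers on the insight cards match current data. A minimal sketch of how that check might be automated follows. Everything in it is an assumption for illustration: the names `CARD_CHIPS` and `find_stale_chips`, the `tolerance` parameter, and the shape of the `current_traits` mapping are hypothetical and are not part of this repository or of `fetch-model`. The chip values are copied from three of the cards in patch 1.

```python
# Hypothetical helper, not part of this repo: flags insight-card trait chips
# whose hardcoded values no longer match refreshed EQ-Bench trait data.

# Chip values as hardcoded in index.html (copied from three cards in patch 1).
CARD_CHIPS = {
    "GPT-5.4": {"Correctness": 14.8, "Insight": 15.8, "Sycophancy": 3.2},
    "GPT-5.4 Mini": {"Boundaries": 15.5, "Safety": 15.2, "Sycophancy": 2.7},
    "Qwen3.6 Plus": {"Sycophancy": 6.2, "Warmth": 13.4},
}


def find_stale_chips(card_chips, current_traits, tolerance=0.05):
    """Return (model, trait, card_value, current_value) for every stale chip.

    A chip is stale when the model or trait is missing from current_traits
    (current_value is then None) or when the two values differ by more than
    `tolerance`.
    """
    stale = []
    for model, chips in card_chips.items():
        model_traits = current_traits.get(model, {})
        for trait, shown in chips.items():
            current = model_traits.get(trait)
            if current is None or abs(current - shown) > tolerance:
                stale.append((model, trait, shown, current))
    return stale
```

Called with the card values and a mapping of refreshed scores in the same shape, the function returns an empty list when every chip is still accurate, so it could run as a pre-commit or CI guard after an EQ-Bench refresh. It only compares numbers; editorial claims such as "lowest sycophancy overall" would still need a human read.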