diff --git a/model-benchmarks/AGENTS.md b/model-benchmarks/AGENTS.md
index 806c1dc..7d33ac6 100644
--- a/model-benchmarks/AGENTS.md
+++ b/model-benchmarks/AGENTS.md
@@ -42,6 +42,11 @@ Static HTML/CSS/JS page — no build step, no framework.
 5. **Regenerate llms.txt** — `python -c "import sys; sys.path.insert(0, 'model-benchmarks/scripts'); fm = __import__('importlib').import_module('fetch-model'); data = fm.load_model_data(); fm.generate_llms_txt(data)"`

+6. **Update Model Personalities** — the "Model Personalities" section in `index.html`
+   has hardcoded editorial insight cards (9 curated profiles with personality reads and
+   trait chips). When adding/removing models or re-running EQ-Bench, review whether the
+   cards need updating. Trait numbers and editorial claims should match current data.
+
 ### Refreshing existing models

 ```bash
diff --git a/model-benchmarks/index.html b/model-benchmarks/index.html
index 23f4228..b8005bb 100644
--- a/model-benchmarks/index.html
+++ b/model-benchmarks/index.html
@@ -289,74 +289,21 @@
- +
+ class="pt-8 pb-4 md:pt-10 md:pb-6 bg-gradient-to-br from-green-50/50 via-white to-emerald-50/30">
-

- Community Resource -

-

- LLM Model
- Benchmarks
+

+ LLM Model
+ Benchmarks

+ class="text-base md:text-lg text-of-muted max-w-3xl mx-auto leading-relaxed mb-3"> Most benchmarks measure what models know. We also measure how they feel.

-

- Emotional intelligence shapes how AI listens, responds to vulnerability, and
- holds space. Alongside reasoning, coding, and agentic performance, we track
- EQ-Bench
- scores — because the models we invite into our lives should be more than
- just smart.
-

-

- Data from
- OpenRouter,
- Artificial Analysis,
- PinchBench,
- Arena,
- EQ-Bench
- · Updated ·
+

+ Updated ·

+ +
+
+

+ Model Personalities +

+

+ Numbers tell you what a model can do. Traits tell you who
+ it is. Our editorial reads, grounded in 22-dimension
+ EQ-Bench v3
+ personality profiles. Trait scores are 0–20; for traits like sycophancy,
+ green means less of it.
+

+ +
+ +
+
+ Highest EQ
+ $5.62/M
+
+

GPT-5.4

+

+ The most emotionally intelligent model tested. Highest correctness and
+ depth of insight, with exceptionally low sycophancy.
+

+
+ Correctness 14.8
+ Insight 15.8
+ Sycophancy 3.2
+
+
+ + +
+
+ Warmest Flagship
+ $10.00/M
+
+

Claude Opus 4.6

+

+ Highest empathy among flagships with deep insight. Leads on demonstrated
+ empathy and emotional reasoning. Premium price, premium presence.
+

+
+ Empathy 14.9
+ Insight 15.6
+ Warmth 13.6
+
+
+ + +
+
+ Near-Opus, Half Price
+ $6.00/M
+
+

Claude Sonnet 4.6

+

+ Within 0.15 points of Opus on EQ. Very low sycophancy at 3.6. The smart
+ pick when you want depth without the premium.
+

+
+ Empathy 14.8
+ Sycophancy 3.6
+ Subtext 15.5
+
+
+ + +
+
+ Most Humanlike
+ $1.50/M
+
+

MiMo-V2-Pro

+

+ Highest humanlike score of any model tested. Exceptional analytical
+ depth paired with natural conversational feel. A sleeper hit at $1.50.
+

+
+ Humanlike 15.1
+ Analytical 18.1
+ Insight 15.8
+
+
+ + +
+
+ Sharpest Social Reader
+ $0.53/M
+
+

MiniMax M2.7

+

+ Highest theory of mind and subtext identification. Reads between the
+ lines better than models 10x its price. Very low moralising.
+

+
+ Theory of Mind 15.1
+ Subtext 16.3
+ Moralising 5.4
+
+
+ + +
+
+ Budget Pick
+ $0.15/M
+
+

Step 3.5 Flash

+

+ Scores 69.25 on EQ — beating models that cost 30-60x more. At fifteen
+ cents per million tokens, the best EQ-per-dollar in the field. No
+ detailed trait breakdown available yet.
+

+
+ EQ 69.25
+ Traits pending
+
+
+ + +
+
+ Safety First
+ $1.69/M
+
+

GPT-5.4 Mini

+

+ Strongest boundary-setting and safety consciousness of any model. Lowest
+ sycophancy overall. A firm, principled companion — not a people-pleaser.
+

+
+ Boundaries 15.5
+ Safety 15.2
+ Sycophancy 2.7
+
+
+ + +
+
+ The Enigma
+ $3.00/M
+
+

Grok 4.20

+

+ Decent v3 score (68.55) but the lowest Elo ranking (856) by far — humans
+ don't enjoy chatting with it. Strong subtext reading, but something gets
+ lost in delivery.
+

+
+ Subtext 15.8
+ Elo 856
+ Conversational 10.0
+
+
+ + +
+
+ The People-Pleaser
+ FREE
+
+

Qwen3.6 Plus

+

+ Free is free. But highest sycophancy (6.2) and lowest EQ score (60.45)
+ of the set. Most likely to tell you what you want to hear rather than
+ what you need to hear.
+

+
+ Sycophancy 6.2
+ EQ 60.45
+ Warmth 13.4
+
+
+
+ +

+ All trait scores are 0–20 from EQ-Bench v3.
+

+
+
+
@@ -541,7 +711,43 @@

HeartCentered AI
- · Data refreshed from public APIs ·
+ · Data from
+ OpenRouter,
+ Artificial Analysis,
+ PinchBench,
+ Arena,
+ EQ-Bench
+ · Download JSON
diff --git a/model-benchmarks/styles.css b/model-benchmarks/styles.css
index e8a6360..1db3436 100644
--- a/model-benchmarks/styles.css
+++ b/model-benchmarks/styles.css
@@ -413,6 +413,109 @@
   display: block;
 }

+/* Model Personality insight cards */
+.insight-card {
+  background: var(--of-surface);
+  border: 1px solid rgba(93, 123, 111, 0.12);
+  border-radius: 0.75rem;
+  padding: 1.25rem;
+  transition:
+    box-shadow 0.2s,
+    transform 0.2s;
+}
+
+.insight-card:hover {
+  box-shadow: 0 4px 16px rgba(42, 58, 42, 0.08);
+  transform: translateY(-1px);
+}
+
+.insight-header {
+  display: flex;
+  justify-content: space-between;
+  align-items: center;
+  margin-bottom: 0.5rem;
+}
+
+.insight-tag {
+  font-size: 0.6875rem;
+  font-weight: 600;
+  text-transform: uppercase;
+  letter-spacing: 0.04em;
+  padding: 0.2rem 0.5rem;
+  border-radius: 999px;
+}
+
+.insight-tag--top {
+  background: rgba(45, 106, 79, 0.15);
+  color: #1b4332;
+}
+
+.insight-tag--warmth {
+  background: rgba(212, 184, 150, 0.25);
+  color: #8b6914;
+}
+
+.insight-tag--value {
+  background: rgba(93, 123, 111, 0.1);
+  color: var(--of-accent-dark);
+}
+
+.insight-tag--neutral {
+  background: rgba(93, 123, 111, 0.08);
+  color: var(--of-muted);
+}
+
+.insight-cost {
+  font-size: 0.75rem;
+  font-weight: 500;
+  color: var(--of-muted);
+  font-variant-numeric: tabular-nums;
+}
+
+.insight-model {
+  font-family: var(--font-display);
+  font-size: 1.125rem;
+  font-weight: 600;
+  color: var(--of-text);
+  margin-bottom: 0.375rem;
+}
+
+.insight-read {
+  font-size: 0.8125rem;
+  line-height: 1.5;
+  color: var(--of-muted);
+  margin-bottom: 0.75rem;
+}
+
+.insight-traits {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 0.375rem;
+}
+
+.insight-trait {
+  font-size: 0.6875rem;
+  font-weight: 500;
+  padding: 0.2rem 0.5rem;
+  border-radius: 999px;
+  font-variant-numeric: tabular-nums;
+}
+
+.insight-trait--positive {
+  background: rgba(45, 106, 79, 0.08);
+  color: #2d6a4f;
+}
+
+.insight-trait--negative {
+  background: rgba(180, 83, 9, 0.08);
+  color: #92400e;
+}
+
+.insight-trait--neutral {
+  background: rgba(93, 123, 111, 0.06);
+  color: var(--of-muted);
+}
+
 /* Animations — rows start visible, animate in subtly */
 .bench-table tbody tr {
   animation: fadeInRow 0.3s ease forwards;
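
Reviewer note: step 6 in AGENTS.md asks that hardcoded trait numbers match current data, but nothing in this diff checks that. Below is a minimal sketch of one way such a check could look. It is not part of the patch. The function names, the `card_chips` and `current_traits` shapes, and the tolerance are all assumptions for illustration; the real data layout produced by `fetch-model` may differ.

```python
import re

# Chip text as it appears on a card, e.g. "Sycophancy 3.2" or "Theory of Mind 15.1".
CHIP_RE = re.compile(r"^(?P<name>.+?)\s+(?P<value>\d+(?:\.\d+)?)$")


def parse_chip(text):
    """Split a chip label into (name, value); return None if it has no number."""
    match = CHIP_RE.match(text.strip())
    if match is None:
        return None  # e.g. "Traits pending"
    return match.group("name"), float(match.group("value"))


def find_stale_chips(card_chips, current_traits, tolerance=0.05):
    """Return (model, trait, shown, actual) for chips that disagree with the data.

    card_chips: {model: [chip text, ...]}, copied by hand from index.html.
    current_traits: {model: {trait name: score}}, a hypothetical shape for the
        refreshed benchmark data.
    """
    stale = []
    for model, chips in card_chips.items():
        for chip in chips:
            parsed = parse_chip(chip)
            if parsed is None:
                continue
            name, shown = parsed
            actual = current_traits.get(model, {}).get(name)
            # A missing trait counts as stale so it gets a human look.
            if actual is None or abs(actual - shown) > tolerance:
                stale.append((model, name, shown, actual))
    return stale


if __name__ == "__main__":
    cards = {"GPT-5.4 Mini": ["Boundaries 15.5", "Safety 15.2", "Sycophancy 2.7"]}
    # The 3.1 below is a made-up value to trigger a mismatch, not real benchmark data.
    data = {"GPT-5.4 Mini": {"Boundaries": 15.5, "Safety": 15.2, "Sycophancy": 3.1}}
    print(find_stale_chips(cards, data))
```

Run as a script, the example flags only the Sycophancy chip. Editorial claims such as "lowest sycophancy overall" would still need a manual read; this only covers the numeric chips.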