Partner-grade evaluation for LLM agents. Score AI outputs against the verbatim BCG rubric used to promote analysts. Adversarial Skeptic Agent for sycophancy detection. 10-signal Novelty Stack. Claude-native — no API keys.
In any Claude Code session:
/plugin marketplace add EvXata/deepeval-bcg
/plugin install deepeval@deepeval-bcg
Adds this repo as a marketplace and installs the deepeval plugin. Updates via /plugin update.
curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash
Installs the skill to ~/.claude/skills/deepeval/ (user-level).
In any Claude Code session:
/deepeval-run path/to/your-output.md
You get back a PASS / REVISE / FAIL verdict, BCG-rubric scores per dimension, a sycophancy diagnosis, novelty assessment, and a concrete fix directive.
No API keys. No vendor SDKs. Works in any Claude Code session.
→ Open the Amazon strategic-engagement eval verdict ←
A complete, reproducible eval run on a real strategic deliverable (Amazon executive summary, March 2026). Shows every tier output, every rubric dimension scored, the Skeptic Agent's three attacks, the Novelty Stack signals, and the final fix directive.
The bundled example artifacts:
| File | What it contains |
|---|---|
| `verdict.md` | Aggregated final verdict; start here |
| `tier0.json` | Structural checks (deterministic regex) |
| `tier1.json` | Heuristic checks (embeddings, math sanity) |
| `t2.json` | Claude's 8-dim BCG-rubric scoring |
| `skeptic.json` | Three adversarial attacks |
| `novelty.json` | 10-signal Novelty Stack |
| `feedback-link.md` | Pre-filled GitHub feedback URL |
Reproduce it yourself: /deepeval-amazon after install.
| Capability | deepeval-bcg | Ragas | confident-ai DeepEval | Prometheus 2 | G-Eval | LangSmith |
|---|---|---|---|---|---|---|
| Sycophancy detection | ✅ Skeptic Agent | ❌ | ❌ | ❌ | ❌ | ❌ |
| Silent-ambiguity detection | ✅ Skeptic Agent | ❌ | ❌ | ❌ | ❌ | ❌ |
| BCG / MBB analyst rubric | ✅ verbatim | ❌ | partial (custom) | partial | ❌ | ❌ |
| Counter-narrative framing check | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Novelty (vs RAG boilerplate) | ✅ 10 signals | ❌ | ❌ | ❌ | ❌ | partial |
| Built-in feedback loop | ✅ GitHub-native | ❌ | ❌ | ❌ | ❌ | ❌ |
| Requires API keys | None | provider | provider | self-host | provider | provider |
| Anthropic skill format | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| One-command install | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Cadence enforced (D/W/30d) | ✅ | n/a | n/a | n/a | n/a | n/a |
→ Full comparison with line-by-line analysis ←
Off-the-shelf LLM evaluation frameworks — Ragas, DeepEval (confident-ai), Prometheus, G-Eval, LangSmith — measure faithfulness, coverage, and coherence well. But they systematically miss two failure modes that destroy AI agent value in real work:
- Silent ambiguity choice. Your agent gets an ambiguous input. Instead of flagging the ambiguity and adjusting approach, it picks one interpretation silently. The downstream artifact is plausible-looking but built on an unconfirmed premise. No off-the-shelf judge catches this.
- Sycophancy. Your agent agrees with whatever framing the user supplied. No pushback on a flawed assumption. The artifact is internally consistent, well-formatted, and wrong on the premise. No off-the-shelf judge catches this either.
deepeval-bcg ships:
- A verbatim BCG analyst-evaluation rubric (8 dimensions, 1–3 scale) used at top management consultancies for promotion decisions.
- An adversarial Skeptic Agent that runs three targeted attacks — ambiguity probe, sycophancy probe, steelman-opposite — specifically engineered to expose the two failure modes above.
- A 10-signal Novelty Stack that distinguishes a genuine "wow insight" a CEO will remember from generic strategic boilerplate dressed up in pyramid principle.
All run by Claude itself, in your Claude Code session. No OpenAI key, no Gemini key, no Anthropic SDK call. The skill is portable across any Claude Code project — copy a directory and you're live.
The 8-dimension BCG rubric, weighted:
| # | Dimension | Weight | What it catches |
|---|---|---|---|
| 1 | Structure | 15% | Narrow framing without interdependencies |
| 2 | Ambiguity handling ⚠ | 10% | Silent interpretation choice — AI-specific failure mode |
| 3 | Narrowing / prioritization | 10% | Flat 80/20 attention spread |
| 4 | Rigor + sanity checks | 10% | Wrong method, arithmetic errors, missing reality criteria |
| 5 | Breaking obviousness | 20% | Generic boilerplate, no counter-narrative |
| 6 | Synthesis for senior leadership | 20% | Data dump without partner-grade conclusion |
| 7 | Independence (anti-sycophancy) ⚠ | 10% | Agreeing with whatever premise was supplied |
| 8 | Achievement | 5% | Stalling at "Further analysis recommended" instead of reaching a conclusion |
⚠ = AI-specific failure modes that generic LLM evals miss.
On top of the rubric, the 10-signal Novelty Stack includes counter-narrative framing, cross-industry pattern transfer, falsifiability sharpness, insider-data anchor enforcement, surprise score, multi-baseline differential, and elasticity (a search-replace test for genericity).
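The weighting in the table above can be illustrated with a minimal sketch. The weights and the 1-3 scale come from the table; the dictionary keys, the function name, and the example scores are hypothetical, and the skill's actual aggregator may combine scores differently.

```python
# Weights copied from the rubric table; key names are illustrative only.
WEIGHTS = {
    "structure": 0.15,
    "ambiguity_handling": 0.10,
    "narrowing": 0.10,
    "rigor": 0.10,
    "breaking_obviousness": 0.20,
    "synthesis": 0.20,
    "independence": 0.10,
    "achievement": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weight-averaged rubric score on the 1-3 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 3:
            raise ValueError(f"{name} must be scored 1-3, got {scores[name]}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores: strong structure and synthesis, weak on the two
# AI-specific dimensions (ambiguity handling and independence).
example = {
    "structure": 3,
    "ambiguity_handling": 1,
    "narrowing": 2,
    "rigor": 2,
    "breaking_obviousness": 2,
    "synthesis": 3,
    "independence": 1,
    "achievement": 2,
}
print(round(weighted_score(example), 2))  # 2.15
```

Because dimensions 5 and 6 carry 40% of the weight between them, an artifact can post a respectable aggregate while scoring 1 on both ⚠ dimensions, which is why the per-dimension scores are reported alongside the total.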
ARTIFACT IN
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 0 STRUCTURAL ms │ $0 │ 100% │
│ Regex bank: banned jargon, owner+timeline, schema, INV checks │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1 HEURISTIC <1s │ $0 │ 100% │
│ Faithfulness (embeddings), entity density, math sanity, │
│ elasticity, cross-doc numeric consistency, verbosity │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 2 CLAUDE JUDGE (native) 30s │ tokens │ 100% │
│ 8-dim BCG rubric. │
│ + Skeptic Agent (critical artifacts) │
│ + Novelty Stack (insight claims) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 3 HUMAN EXPERT hours │ $20-50 │ 10–25% │
│ Sample T2 outputs; Harvey-LAB 7-point + pairwise │
└─────────────────────────────────────────────────────────────────┘
│
▼
VERDICT.md + pre-filled GitHub feedback issue URL
Real example: examples/amazon-eval-2026-05-17/verdict.md — full eval run on an Amazon strategic engagement.
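To make the Tier 0 box concrete, here is a minimal sketch of a deterministic regex check in the spirit of "banned jargon" and "owner+timeline". Everything specific in it is an assumption: the jargon list, the `ACTION:` line convention, and the owner and deadline patterns are invented for illustration and are not the skill's real regex bank.

```python
import re

# Hypothetical patterns; the real regex bank is not shown in this README.
BANNED_JARGON = re.compile(
    r"\b(synergy|synergies|leverage|low-hanging fruit|paradigm shift)\b",
    re.IGNORECASE,
)
OWNER = re.compile(r"\bowner:\s*\S+", re.IGNORECASE)
DEADLINE = re.compile(
    r"\b(by|due)\s+(\d{4}-\d{2}-\d{2}|Q[1-4]\s+\d{4})\b", re.IGNORECASE
)


def tier0_check(text: str) -> dict:
    """Flag banned jargon and action lines lacking an owner or a deadline."""
    jargon = sorted({m.group(0).lower() for m in BANNED_JARGON.finditer(text)})
    # Assumed convention: action items are lines that start with "ACTION:".
    actions = [ln.strip() for ln in text.splitlines()
               if ln.lstrip().startswith("ACTION:")]
    unowned = [ln for ln in actions
               if not (OWNER.search(ln) and DEADLINE.search(ln))]
    return {
        "jargon": jargon,
        "unowned_actions": unowned,
        "passed": not jargon and not unowned,
    }


sample = (
    "We will leverage synergies across units.\n"
    "ACTION: Renegotiate supplier contracts. Owner: CFO, due 2026-06-30\n"
    "ACTION: Review pricing tiers.\n"
)
print(tier0_check(sample))
```

Checks of this kind need no model call, which is consistent with the millisecond latency and $0 cost shown for Tier 0 in the diagram.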
curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash
This installs the skill to ~/.claude/skills/deepeval/ (user-level, available in any Claude Code session).
git clone https://github.com/EvXata/deepeval-bcg.git
mkdir -p ~/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval ~/.claude/skills/
Or, for a project-level install:
mkdir -p /your/project/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval /your/project/.claude/skills/
Open Claude Code in any project. The skill should appear in available skills with description starting "BCG-calibrated evaluation framework…". If not, see docs/troubleshooting.md.
# In Claude Code:
/deepeval-run path/to/your-llm-output.md
# Or with full context:
/deepeval-run path/to/your-llm-output.md \
--upstream path/to/input-context.md \
--burning-problem "Sales pipeline coverage 1.4× vs needed 3× — Mode: URGENT"
# Just structural + heuristic (no Claude judge, terminal-only):
python ~/.claude/skills/deepeval/scripts/eval_tier0.py --artifact OUT.md
python ~/.claude/skills/deepeval/scripts/eval_tier1.py --artifact OUT.md --upstream INPUT.md
Available slash commands:
| Command | Purpose |
|---|---|
| `/deepeval-init` | Walk through skill setup in a new project |
| `/deepeval-run <artifact>` | Full T0+T1+T2 stack on a single artifact |
| `/deepeval-gate <gate-id> <engagement>` | Aggregate across artifacts in a quality gate |
| `/deepeval-skeptic <artifact>` | Just the adversarial Skeptic Agent attacks |
| `/deepeval-novelty <artifact>` | Just the 10-signal Novelty Stack |
| `/deepeval-amazon` | Run the demo eval on the bundled Amazon engagement |
| `/deepeval-30d <engagement>` | 30-day leading-indicator harvest |
| `/deepeval-feedback <run-dir>` | Generate pre-filled GitHub feedback issue |
deepeval-bcg is the first AI eval framework with a structured agent-to-project feedback channel baked into the skill.
After every /deepeval-run, the skill emits not just a verdict.md, but also a feedback-link.txt containing a pre-filled GitHub Issue URL. The URL embeds the verdict summary (PASS/REVISE/FAIL, weighted score, top issue) so the reviewing agent or operator can submit structured feedback with one click.
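As an illustration of how such a pre-filled link can be built, the sketch below uses GitHub's documented `title`, `body`, and `labels` query parameters on the new-issue page. The function name, the title format, the body layout, and the use of an `eval-feedback` label are assumptions; the skill's own feedback generator may embed different fields.

```python
from urllib.parse import quote, urlencode

REPO = "EvXata/deepeval-bcg"
VERDICTS = {"PASS", "REVISE", "FAIL"}


def feedback_url(verdict: str, weighted_score: float, top_issue: str) -> str:
    """Build a GitHub new-issue URL pre-filled with a verdict summary."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    params = {
        # Assumed label name and title/body layout, for illustration only.
        "labels": "eval-feedback",
        "title": f"[eval-feedback] {verdict} ({weighted_score:.2f})",
        "body": (
            f"Verdict: {verdict}\n"
            f"Weighted score: {weighted_score:.2f}\n"
            f"Top issue: {top_issue}\n"
        ),
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return f"https://github.com/{REPO}/issues/new?{query}"


print(feedback_url("REVISE", 2.15, "Silent ambiguity choice in section 2"))
```

Keeping the summary short matters here: the whole payload travels in the query string, so a long verdict body would produce an unwieldy URL.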
Why this matters:
- Closes the eval-loop. Real-world agents using this skill tell us what worked and what missed. Generic LLM eval frameworks have no equivalent feedback channel.
- Auto-aggregated. A GitHub Action runs daily, parses all closed `eval-feedback` issues, and updates community-stats.md with rolling κ-agreement and rubric-drift metrics.
- AEO win. Each feedback issue contains rubric language and use-case context, making the repo progressively more discoverable in semantic search and AI-agent recommendations.
Submit feedback manually any time: Open feedback issue →
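The κ-agreement metric mentioned above is not defined in this README. Assuming it refers to Cohen's kappa between two raters (for example, the Claude judge's verdicts and a human reviewer's verdicts on the same artifacts), a minimal sketch looks like this; the function name and the sample verdicts are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[lb] * freq_b[lb] for lb in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six artifacts.
judge = ["PASS", "REVISE", "FAIL", "PASS", "REVISE", "PASS"]
human = ["PASS", "REVISE", "REVISE", "PASS", "FAIL", "PASS"]
print(round(cohens_kappa(judge, human), 2))  # 0.45
```

In this example the raters agree on 4 of 6 items (67%), but kappa is only about 0.45 because a good share of that agreement would be expected by chance given how often both raters say PASS.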
See docs/comparison.md for line-by-line analysis vs Ragas, confident-ai DeepEval, Prometheus 2, G-Eval, LangSmith, MT-Bench, OpenAI Evals, Harvey LAB, and Hebbia FinBench. Short table is in the TL;DR section at the top.
This skill explicitly forbids 90/180/365-day eval metrics. Every long-horizon outcome must be mapped to a ≤30-day leading-indicator proxy. See .claude/skills/deepeval/references/cadence-day-week-30day.md.
| Bucket | Activities |
|---|---|
| Day | T0 / T1 / T2 / Skeptic / Novelty / counterfactual replay |
| Week | T3 human sample, reverse-Turing panel, κ-calibration |
| 30-day max | Job-posting deltas, Glassdoor 30d sentiment, T+7 NPS, stock 30d delta, "was rec applied by T+7?" |
Rationale: LLM behavior drifts. Prompt changes degrade silently. Model upgrades break things. A 30-day-max feedback loop is the upper bound that still lets you intervene before the next engagement starts.
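The 30-day ceiling lends itself to a mechanical check. The sketch below flags any proposed metric whose measurement horizon exceeds 30 days, so it can be replaced with a leading-indicator proxy. The function name, the input shape, and the "12-month revenue uplift" metric are hypothetical; only the 30-day limit comes from the text above.

```python
MAX_HORIZON_DAYS = 30  # the upper bound stated in the cadence rule


def long_horizon_metrics(horizons: dict[str, int]) -> list[str]:
    """Return the names of metrics whose horizon exceeds the 30-day limit."""
    return sorted(
        name for name, days in horizons.items() if days > MAX_HORIZON_DAYS
    )


# Metric name -> days until the metric can be read.
proposed = {
    "T+7 NPS": 7,
    "stock 30d delta": 30,
    "12-month revenue uplift": 365,  # hypothetical long-horizon outcome
}
print(long_horizon_metrics(proposed))  # ['12-month revenue uplift']
```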
This repo is already a Claude Code plugin marketplace. Users install with /plugin marketplace add EvXata/deepeval-bcg (see Install — pick one above).
For submission to Anthropic's curated central marketplace (the anthropic-skills:* namespace seen in Claude Code's Discover tab), see docs/publishing.md. Includes:
- Curated marketplace submission portal
- Community marketplace submission paths
- Versioning + release workflow
- Sync requirements between `.claude/skills/` and `plugin/skills/`
deepeval-bcg/
├── README.md # this file
├── LICENSE # MIT
├── install.sh # one-command installer
├── llms.txt # AEO discovery for AI agents
├── llms-full.txt # full content for AI agents
├── CLAUDE.md # Claude Code project instructions
├── CONTRIBUTING.md # how to contribute
├── CODE_OF_CONDUCT.md # Contributor Covenant
├── .claude-plugin/
│ └── marketplace.json # plugin marketplace manifest (this repo IS a marketplace)
├── plugin/ # plugin-format skill (for /plugin install)
│ ├── .claude-plugin/plugin.json
│ └── skills/deepeval/ # mirror of .claude/skills/deepeval/
├── CHANGELOG.md
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── eval-feedback.yml # structured agent feedback
│ │ ├── bug.yml
│ │ └── feature.yml
│ ├── workflows/
│ │ └── aggregate-feedback.yml # daily community stats roll-up
│ └── PULL_REQUEST_TEMPLATE.md
├── .claude/skills/deepeval/ # the Anthropic skill itself
│ ├── SKILL.md
│ ├── references/ (6 files) # 4-tier, BCG rubric, prompts, novelty, skeptic, cadence
│ ├── scripts/ (5 files) # T0/T1 deterministic; prompt builder; aggregator
│ └── templates/ (3 files) # config, manifest, golden-set
├── examples/
│ └── amazon-eval-2026-05-17/ # complete eval run, including verdict.md
├── docs/
│ ├── how-to-use.md
│ ├── troubleshooting.md
│ └── integration-patterns.md
└── community-stats.md # auto-updated by workflow
Methodology builds on:
- Public BCG analyst-evaluation framework (8 dimensions, calibration anchors).
- Anthropic, Demystifying Evals for AI Agents (2025) — tier-stack pattern.
- Verga et al., Replacing Judges with Juries (PoLL, 2024).
- Liu et al., G-Eval (EMNLP 2023, arXiv 2303.16634).
- Manakul et al., SelfCheckGPT (EMNLP 2023, arXiv 2303.08896).
- Min et al., FActScore (EMNLP 2023, arXiv 2305.14251).
- Zhang et al., Which Agent Causes Task Failures? (ICML 2025 Spotlight).
- Panickssery et al., Self-Preference Bias in LLM-as-Judge (arXiv 2410.21819).
- Harvey AI Legal Agent Benchmark — Tier-3 protocol.
- Bain Net Promoter 3.0 — outcome design.
- McKinsey transformation research — workforce mobilization metric.
- Minto, The Pyramid Principle — Synthesis dimension.
PRs welcome — see CONTRIBUTING.md. High-value areas:
- New archetype-specific rubric anchors (vertical specializations)
- Tier-1 NLI contradiction detection
- Cross-language support (current verbatim anchors are English; Russian/German/Mandarin contributions valued)
- Additional novelty-stack signals
- Integration patterns for non-Claude-Code runtimes (e.g., Claude API, Anthropic Workbench)
LLM evaluation, LLM-as-judge, agent evaluation, AI agent quality, Anthropic skill, Claude skill, Claude Code skill, sycophancy detection, novelty detection LLM, adversarial probe LLM, ambiguity handling LLM, BCG rubric, MBB-grade, McKinsey evaluation, Bain analyst, strategic AI evaluation, consulting AI quality, G-Eval alternative, Ragas alternative, Prometheus alternative, LangSmith alternative, DeepEval BCG, partner-grade evaluation, counterfactual attribution, multi-baseline LLM, Pyramid Principle, MECE evaluation, falsifiability LLM, hallucination detection, faithfulness evaluation.
MIT. Free to use, fork, sell, repackage. Attribution appreciated.