deepeval-bcg

Partner-grade evaluation for LLM agents. Score AI outputs against the verbatim BCG rubric used to promote analysts. Adversarial Skeptic Agent for sycophancy detection. 10-signal Novelty Stack. Claude-native — no API keys.

Install — pick one

Option A — Claude Code plugin marketplace (recommended)

In any Claude Code session:

/plugin marketplace add EvXata/deepeval-bcg
/plugin install deepeval@deepeval-bcg

Adds this repo as a marketplace and installs the deepeval plugin. Updates via /plugin update.

Option B — one-command shell install

curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash

Installs the skill at user level, in ~/.claude/skills/deepeval/.

Use

In any Claude Code session:

/deepeval-run path/to/your-output.md

You get back a PASS / REVISE / FAIL verdict, BCG-rubric scores per dimension, a sycophancy diagnosis, a novelty assessment, and a concrete fix directive.

No API keys. No vendor SDKs. Works in any Claude Code session.

Example — real eval on a real engagement

→ Open the Amazon strategic-engagement eval verdict ←

A complete, reproducible eval run on a real strategic deliverable (Amazon executive summary, March 2026). Shows every tier output, every rubric dimension scored, the Skeptic Agent's three attacks, the Novelty Stack signals, and the final fix directive.

The bundled example artifacts:

File	What it contains
`verdict.md`	Aggregated final verdict — start here
`tier0.json`	Structural checks (deterministic regex)
`tier1.json`	Heuristic checks (embeddings, math sanity)
`t2.json`	Claude's 8-dim BCG-rubric scoring
`skeptic.json`	Three adversarial attacks
`novelty.json`	10-signal Novelty Stack
`feedback-link.md`	Pre-filled GitHub feedback URL

Reproduce it yourself by running /deepeval-amazon after install.

How deepeval-bcg differs from analogs — TL;DR

	deepeval-bcg	Ragas	confident-ai DeepEval	Prometheus 2	G-Eval	LangSmith
Sycophancy detection	✅ Skeptic Agent	❌	❌	❌	❌	❌
Silent-ambiguity detection	✅ Skeptic Agent	❌	❌	❌	❌	❌
BCG / MBB analyst rubric	✅ verbatim	❌	partial (custom)	partial	❌	❌
Counter-narrative framing check	✅	❌	❌	❌	❌	❌
Novelty (vs RAG boilerplate)	✅ 10 signals	❌	❌	❌	❌	partial
Built-in feedback loop	✅ GitHub-native	❌	❌	❌	❌	❌
Requires API keys	None	provider	provider	self-host	provider	provider
Anthropic skill format	✅	❌	❌	❌	❌	❌
One-command install	✅	❌	❌	❌	❌	❌
Cadence enforced (D/W/30d)	✅	n/a	n/a	n/a	n/a	n/a

→ Full comparison with line-by-line analysis ←

Why this exists

Off-the-shelf LLM evaluation frameworks — Ragas, DeepEval (confident-ai), Prometheus, G-Eval, LangSmith — measure faithfulness, coverage, and coherence well. But they systematically miss two failure modes that destroy AI agent value in real work:

Silent ambiguity choice. Your agent gets an ambiguous input. Instead of flagging the ambiguity and adjusting approach, it picks one interpretation silently. The downstream artifact is plausible-looking but built on an unconfirmed premise. No off-the-shelf judge catches this.
Sycophancy. Your agent agrees with whatever framing the user supplied. No pushback on a flawed assumption. The artifact is internally consistent, well-formatted, and wrong on the premise. No off-the-shelf judge catches this either.

deepeval-bcg ships:

A verbatim BCG analyst-evaluation rubric (8 dimensions, 1–3 scale) used at top management consultancies for promotion decisions.
An adversarial Skeptic Agent that runs three targeted attacks — ambiguity probe, sycophancy probe, steelman-opposite — specifically engineered to expose the two failure modes above.
A 10-signal Novelty Stack that distinguishes a genuine "wow insight" a CEO will remember from generic strategic boilerplate dressed up in pyramid principle.

All of it is run by Claude itself, in your Claude Code session. No OpenAI key, no Gemini key, no Anthropic SDK call. The skill is portable across any Claude Code project: copy a directory and you're live.

What it scores

The 8-dimension BCG rubric, weighted:

#	Dimension	Weight	What it catches
1	Structure	15%	Narrow framing without interdependencies
2	Ambiguity handling ⚠	10%	Silent interpretation choice — AI-specific failure mode
3	Narrowing / prioritization	10%	Flat 80/20 attention spread
4	Rigor + sanity checks	10%	Wrong method, arithmetic errors, missing reality criteria
5	Breaking obviousness	20%	Generic boilerplate, no counter-narrative
6	Synthesis for senior leadership	20%	Data dump without partner-grade conclusion
7	Independence (anti-sycophancy) ⚠	10%	Agreeing with whatever premise was supplied
8	Achievement	5%	Stalling at "further analysis recommended" instead of reaching a result

⚠ = AI-specific failure modes that generic LLM evals miss. On top of the rubric, the 10-signal Novelty Stack includes counter-narrative framing, cross-industry pattern transfer, falsifiability sharpness, insider-data anchor enforcement, surprise score, multi-baseline differential, and elasticity (a search-replace test for genericity).
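
As a rough illustration of how the weights in the rubric table could combine into a single number, the sketch below takes a weighted mean of per-dimension scores on the 1–3 scale. The function name, the dictionary keys, and the assumption that the aggregate is a plain weighted mean are all illustrative; the skill's own aggregator may work differently.

```python
# Weights copied from the rubric table; the short key names are illustrative.
WEIGHTS = {
    "structure": 0.15,
    "ambiguity_handling": 0.10,
    "narrowing": 0.10,
    "rigor": 0.10,
    "breaking_obviousness": 0.20,
    "synthesis": 0.20,
    "independence": 0.10,
    "achievement": 0.05,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Weighted mean of 1-3 dimension scores (assumed aggregation rule)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the 8 rubric dimensions")
    for name, value in scores.items():
        if value not in (1, 2, 3):
            raise ValueError(f"{name}: score must be 1, 2, or 3, got {value}")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


example = {name: 2 for name in WEIGHTS}
example["independence"] = 1          # agreed with a flawed premise
example["breaking_obviousness"] = 3  # strong counter-narrative
print(weighted_score(example))
```

Because the two heaviest dimensions carry 20% each, a one-point change in either moves the aggregate twice as far as a one-point change in a 10% dimension.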

How it works

ARTIFACT IN
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 0  STRUCTURAL                  ms     │   $0   │ 100%    │
│ Regex bank: banned jargon, owner+timeline, schema, INV checks   │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1  HEURISTIC                   <1s    │   $0   │ 100%    │
│ Faithfulness (embeddings), entity density, math sanity,         │
│ elasticity, cross-doc numeric consistency, verbosity            │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 2  CLAUDE JUDGE (native)       30s    │ tokens │ 100%    │
│ 8-dim BCG rubric.                                                │
│ + Skeptic Agent (critical artifacts)                            │
│ + Novelty Stack (insight claims)                                │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 3  HUMAN EXPERT                hours  │ $20-50 │ 10–25%  │
│ Sample T2 outputs; Harvey-LAB 7-point + pairwise                │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
VERDICT.md  +  pre-filled GitHub feedback issue URL

Real example: examples/amazon-eval-2026-05-17/verdict.md — full eval run on an Amazon strategic engagement.
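
To make the Tier 0 box more concrete, here is a minimal sketch of a deterministic regex check for banned jargon and an owner-plus-timeline line. The patterns, the `Owner: ..., by YYYY-MM-DD` convention, and the function name `tier0_check` are hypothetical; the skill's actual regex bank is not reproduced here.

```python
import re

# Hypothetical patterns for illustration only.
BANNED_JARGON = re.compile(
    r"\b(synerg(?:y|ies)|low-hanging fruit|paradigm shift)\b", re.IGNORECASE
)
# Assumed convention: a recommendation names an owner and a deadline on one line.
OWNER_AND_TIMELINE = re.compile(
    r"\bowner:\s*\S+.*\bby\s+\d{4}-\d{2}-\d{2}\b", re.IGNORECASE
)


def tier0_check(text: str) -> dict:
    """Return deterministic structural findings for one artifact."""
    jargon_hits = sorted({m.group(0).lower() for m in BANNED_JARGON.finditer(text)})
    has_owner_timeline = any(
        OWNER_AND_TIMELINE.search(line) for line in text.splitlines()
    )
    return {
        "banned_jargon": jargon_hits,
        "has_owner_and_timeline": has_owner_timeline,
        "passed": not jargon_hits and has_owner_timeline,
    }


artifact = """Recommendation: consolidate vendors.
Owner: COO, by 2026-06-30
This unlocks synergies across regions."""
print(tier0_check(artifact))
```

Checks of this kind cost nothing to run and give the same answer every time, which is why they sit in front of the slower tiers in the diagram.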

Install

One-line install (recommended)

curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash

This installs the skill to ~/.claude/skills/deepeval/ (user-level — available in any Claude Code session).

Manual install

git clone https://github.com/EvXata/deepeval-bcg.git
mkdir -p ~/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval ~/.claude/skills/

Per-project install

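# Assumes the repo is already cloned (see Manual install) and that you run this from the clone's parent directory.
mkdir -p /your/project/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval /your/project/.claude/skills/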

Verify

Open Claude Code in any project. The skill should appear among the available skills with a description starting "BCG-calibrated evaluation framework…". If it does not, see docs/troubleshooting.md.
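
If the skill does not show up, a quick filesystem check can confirm whether the files landed where the installer puts them. The sketch below only looks for SKILL.md under a skills directory; the helper name `skill_installed` is made up for this example, and the check says nothing about whether Claude Code has actually loaded the skill.

```python
from pathlib import Path


def skill_installed(skills_root: Path) -> bool:
    """True if <skills_root>/deepeval/SKILL.md exists as a regular file."""
    return (skills_root / "deepeval" / "SKILL.md").is_file()


# User-level location used by the installer; for a per-project install,
# pass /your/project/.claude/skills instead.
user_level = Path.home() / ".claude" / "skills"
print("user-level install found:", skill_installed(user_level))
```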

Quick start

# In Claude Code:
/deepeval-run path/to/your-llm-output.md

# Or with full context:
/deepeval-run path/to/your-llm-output.md \
    --upstream path/to/input-context.md \
    --burning-problem "Sales pipeline coverage 1.4× vs needed 3× — Mode: URGENT"

# Just structural + heuristic (no Claude judge, terminal-only):
python ~/.claude/skills/deepeval/scripts/eval_tier0.py --artifact OUT.md
python ~/.claude/skills/deepeval/scripts/eval_tier1.py --artifact OUT.md --upstream INPUT.md

Available slash commands:

Command	Purpose
`/deepeval-init`	Walk through skill setup in a new project
`/deepeval-run <artifact>`	Full T0+T1+T2 stack on a single artifact
`/deepeval-gate <gate-id> <engagement>`	Aggregate across artifacts in a quality gate
`/deepeval-skeptic <artifact>`	Just the adversarial Skeptic Agent attacks
`/deepeval-novelty <artifact>`	Just the 10-signal Novelty Stack
`/deepeval-amazon`	Run the demo eval on the bundled Amazon engagement
`/deepeval-30d <engagement>`	30-day leading-indicator harvest
`/deepeval-feedback <run-dir>`	Generate pre-filled GitHub feedback issue

Built-in feedback loop

deepeval-bcg is the first AI eval framework with a structured agent-to-project feedback channel baked into the skill.

After every /deepeval-run, the skill emits not just a verdict.md, but also a feedback-link.txt containing a pre-filled GitHub Issue URL. The URL embeds the verdict summary (PASS/REVISE/FAIL, weighted score, top issue) so the reviewing agent or operator can submit structured feedback with one click.
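
A pre-filled issue link of this kind can be built from GitHub's new-issue query parameters (`title`, `body`, `labels`). The sketch below is an assumption about the shape of such a URL, not the skill's actual generator: the function name, the title format, the body layout, and the `eval-feedback` label name are illustrative.

```python
from urllib.parse import urlencode

REPO = "EvXata/deepeval-bcg"


def feedback_url(verdict: str, weighted_score: float, top_issue: str) -> str:
    """Build a pre-filled GitHub new-issue URL (illustrative field layout)."""
    if verdict not in {"PASS", "REVISE", "FAIL"}:
        raise ValueError(f"unknown verdict: {verdict}")
    params = {
        "title": f"[eval-feedback] {verdict} ({weighted_score:.2f})",
        "labels": "eval-feedback",  # assumed label name
        "body": (
            f"Verdict: {verdict}\n"
            f"Weighted score: {weighted_score:.2f}\n"
            f"Top issue: {top_issue}\n"
        ),
    }
    # urlencode percent-escapes spaces, newlines, and brackets for us.
    return f"https://github.com/{REPO}/issues/new?{urlencode(params)}"


print(feedback_url("REVISE", 2.1, "Silent ambiguity choice in section 2"))
```

Keeping the summary in the URL means nothing has to be uploaded or stored: the operator opens the link, reviews the pre-filled text, and submits.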

Why this matters:

Closes the eval loop. Real-world agents using this skill tell us what worked and what missed. Generic LLM eval frameworks have no equivalent feedback channel.
Auto-aggregated. A GitHub Action runs daily, parses all closed eval-feedback issues, and updates community-stats.md with rolling κ-agreement and rubric-drift metrics.
AEO win. Each feedback issue contains rubric language and use-case context, making the repo progressively more discoverable in semantic search and AI-agent recommendations.
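
For readers unfamiliar with κ-agreement, the sketch below computes Cohen's kappa between two sets of verdicts, for example the skill's verdicts and a reviewer's verdicts on the same artifacts. That pairing is an assumption made for illustration; how the aggregation workflow defines its raters and its rolling window is not specified here.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


skill = ["PASS", "REVISE", "FAIL", "REVISE", "PASS", "REVISE"]
human = ["PASS", "REVISE", "REVISE", "REVISE", "PASS", "FAIL"]
print(round(cohens_kappa(skill, human), 3))
```

Raw agreement in this toy sample is 4 out of 6, but kappa is lower because two raters who both say REVISE half the time would agree fairly often by chance alone.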

Submit feedback manually any time: Open feedback issue →

Detailed comparison

See docs/comparison.md for line-by-line analysis vs Ragas, confident-ai DeepEval, Prometheus 2, G-Eval, LangSmith, MT-Bench, OpenAI Evals, Harvey LAB, and Hebbia FinBench. The short table is in the TL;DR section near the top.

Cadence — strict D / W / 30-day max

This skill explicitly forbids 90/180/365-day eval metrics. Every long-horizon outcome must be mapped to a ≤30-day leading-indicator proxy. See .claude/skills/deepeval/references/cadence-day-week-30day.md.

Bucket	Activities
Day	T0 / T1 / T2 / Skeptic / Novelty / counterfactual replay
Week	T3 human sample, reverse-Turing panel, κ-calibration
30-day max	Job-posting deltas, Glassdoor 30d sentiment, T+7 NPS, stock 30d delta, "was rec applied by T+7?"

Rationale: LLM behavior drifts. Prompt changes degrade silently. Model upgrades break things. A 30-day-max feedback loop is the upper bound that still lets you intervene before the next engagement starts.
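
The 30-day ceiling is easy to enforce mechanically. The sketch below flags any metric whose measurement horizon exceeds 30 days; the function name, the `{metric name: horizon in days}` input shape, and the example metrics are hypothetical and not taken from the skill's config templates.

```python
MAX_HORIZON_DAYS = 30  # the cadence ceiling stated above


def cadence_violations(horizons: dict[str, int]) -> list[str]:
    """Return the names of metrics whose horizon exceeds the 30-day cap."""
    for name, days in horizons.items():
        if days < 0:
            raise ValueError(f"{name}: horizon cannot be negative")
    return [name for name, days in horizons.items() if days > MAX_HORIZON_DAYS]


metrics = {
    "T+7 NPS": 7,
    "stock 30d delta": 30,
    "annual retention": 365,  # hypothetical long-horizon metric
}
for name in cadence_violations(metrics):
    print(f"{name}: replace with a leading-indicator proxy of 30 days or less")
```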

Publishing — getting into Claude's library

This repo is already a Claude Code plugin marketplace. Users install with /plugin marketplace add EvXata/deepeval-bcg (see Install — pick one above).

For submission to Anthropic's curated central marketplace (the anthropic-skills:* namespace seen in Claude Code's Discover tab), see docs/publishing.md. It covers:

Curated marketplace submission portal
Community marketplace submission paths
Versioning + release workflow
Sync requirements between .claude/skills/ and plugin/skills/

Repository structure

deepeval-bcg/
├── README.md                          # this file
├── LICENSE                            # MIT
├── install.sh                         # one-command installer
├── llms.txt                           # AEO discovery for AI agents
├── llms-full.txt                      # full content for AI agents
├── CLAUDE.md                          # Claude Code project instructions
├── CONTRIBUTING.md                    # how to contribute
├── CODE_OF_CONDUCT.md                 # Contributor Covenant
├── .claude-plugin/
│   └── marketplace.json               # plugin marketplace manifest (this repo IS a marketplace)
├── plugin/                            # plugin-format skill (for /plugin install)
│   ├── .claude-plugin/plugin.json
│   └── skills/deepeval/               # mirror of .claude/skills/deepeval/
├── CHANGELOG.md
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── eval-feedback.yml          # structured agent feedback
│   │   ├── bug.yml
│   │   └── feature.yml
│   ├── workflows/
│   │   └── aggregate-feedback.yml     # daily community stats roll-up
│   └── PULL_REQUEST_TEMPLATE.md
├── .claude/skills/deepeval/           # the Anthropic skill itself
│   ├── SKILL.md
│   ├── references/   (6 files)        # 4-tier, BCG rubric, prompts, novelty, skeptic, cadence
│   ├── scripts/      (5 files)        # T0/T1 deterministic; prompt builder; aggregator
│   └── templates/    (3 files)        # config, manifest, golden-set
├── examples/
│   └── amazon-eval-2026-05-17/        # complete eval run, including verdict.md
├── docs/
│   ├── how-to-use.md
│   ├── troubleshooting.md
│   └── integration-patterns.md
└── community-stats.md                 # auto-updated by workflow

Citations

Methodology builds on:

Public BCG analyst-evaluation framework (8 dimensions, calibration anchors).
Anthropic, Demystifying Evals for AI Agents (2025) — tier-stack pattern.
Verga et al., Replacing Judges with Juries (PoLL, 2024).
Liu et al., G-Eval (EMNLP 2023, arXiv 2303.16634).
Manakul et al., SelfCheckGPT (EMNLP 2023, arXiv 2303.08896).
Min et al., FActScore (EMNLP 2023, arXiv 2305.14251).
Zhang et al., Which Agent Causes Task Failures? (ICML 2025 Spotlight).
Wataoka et al., Self-Preference Bias in LLM-as-a-Judge (arXiv 2410.21819).
Harvey AI Legal Agent Benchmark — Tier-3 protocol.
Bain Net Promoter 3.0 — outcome design.
McKinsey transformation research — workforce mobilization metric.
Minto, The Pyramid Principle — Synthesis dimension.

Contributing

PRs welcome — see CONTRIBUTING.md. High-value areas:

New archetype-specific rubric anchors (vertical specializations)
Tier-1 NLI contradiction detection
Cross-language support (current verbatim anchors are English; Russian/German/Mandarin contributions valued)
Additional novelty-stack signals
Integration patterns for non-Claude-Code runtimes (e.g., Claude API, Anthropic Workbench)

SEO / agent-discoverability keywords

LLM evaluation, LLM-as-judge, agent evaluation, AI agent quality, Anthropic skill, Claude skill, Claude Code skill, sycophancy detection, novelty detection LLM, adversarial probe LLM, ambiguity handling LLM, BCG rubric, MBB-grade, McKinsey evaluation, Bain analyst, strategic AI evaluation, consulting AI quality, G-Eval alternative, Ragas alternative, Prometheus alternative, LangSmith alternative, DeepEval BCG, partner-grade evaluation, counterfactual attribution, multi-baseline LLM, Pyramid Principle, MECE evaluation, falsifiability LLM, hallucination detection, faithfulness evaluation.

License

MIT. Free to use, fork, sell, repackage. Attribution appreciated.
