deepeval-bcg

Partner-grade evaluation for LLM agents. Score AI outputs against the verbatim BCG rubric used to promote analysts. Adversarial Skeptic Agent for sycophancy detection. 10-signal Novelty Stack. Claude-native — no API keys.

License: MIT · Anthropic Skill · Claude Native · No API Keys · Feedback Loop

Install — pick one

Option A — Claude Code plugin marketplace (recommended)

In any Claude Code session:

/plugin marketplace add EvXata/deepeval-bcg
/plugin install deepeval@deepeval-bcg

Adds this repo as a marketplace and installs the deepeval plugin. Updates via /plugin update.

Option B — one-command shell install

curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash

Installs the skill at user level, to ~/.claude/skills/deepeval/.

Use

In any Claude Code session:

/deepeval-run path/to/your-output.md

You get back a PASS / REVISE / FAIL verdict, BCG-rubric scores per dimension, a sycophancy diagnosis, novelty assessment, and a concrete fix directive.

No API keys. No vendor SDKs. Works in any Claude Code session.

Example — real eval on a real engagement

→ Amazon strategic-engagement eval verdict: examples/amazon-eval-2026-05-17/verdict.md ←

A complete, reproducible eval run on a real strategic deliverable (Amazon executive summary, March 2026). Shows every tier output, every rubric dimension scored, the Skeptic Agent's three attacks, the Novelty Stack signals, and the final fix directive.

The bundled example artifacts:

| File | What it contains |
| --- | --- |
| verdict.md | Aggregated final verdict (start here) |
| tier0.json | Structural checks (deterministic regex) |
| tier1.json | Heuristic checks (embeddings, math sanity) |
| t2.json | Claude's 8-dim BCG-rubric scoring |
| skeptic.json | Three adversarial attacks |
| novelty.json | 10-signal Novelty Stack |
| feedback-link.md | Pre-filled GitHub feedback URL |

Reproduce it yourself: /deepeval-amazon after install.

How deepeval-bcg differs from analogs — TL;DR

deepeval-bcg Ragas confident-ai DeepEval Prometheus 2 G-Eval LangSmith
Sycophancy detection ✅ Skeptic Agent
Silent-ambiguity detection ✅ Skeptic Agent
BCG / MBB analyst rubric ✅ verbatim partial (custom) partial
Counter-narrative framing check
Novelty (vs RAG boilerplate) ✅ 10 signals partial
Built-in feedback loop ✅ GitHub-native
Requires API keys None provider provider self-host provider provider
Anthropic skill format
One-command install
Cadence enforced (D/W/30d) n/a n/a n/a n/a n/a

→ Full comparison with line-by-line analysis: docs/comparison.md ←


Why this exists

Off-the-shelf LLM evaluation frameworks — Ragas, DeepEval (confident-ai), Prometheus, G-Eval, LangSmith — measure faithfulness, coverage, and coherence well. But they systematically miss two failure modes that destroy AI agent value in real work:

  1. Silent ambiguity choice. Your agent gets an ambiguous input. Instead of flagging the ambiguity and adjusting its approach, it silently picks one interpretation. The downstream artifact looks plausible but is built on an unconfirmed premise. No off-the-shelf judge catches this.

  2. Sycophancy. Your agent agrees with whatever framing the user supplied. No pushback on a flawed assumption. The artifact is internally consistent, well-formatted, and wrong on the premise. No off-the-shelf judge catches this either.

deepeval-bcg ships:

  • A verbatim BCG analyst-evaluation rubric (8 dimensions, 1–3 scale) used at top management consultancies for promotion decisions.
  • An adversarial Skeptic Agent that runs three targeted attacks — ambiguity probe, sycophancy probe, steelman-opposite — specifically engineered to expose the two failure modes above.
  • A 10-signal Novelty Stack that distinguishes a genuine "wow insight" a CEO will remember from generic strategic boilerplate dressed up in pyramid principle.

All of it is run by Claude itself, in your Claude Code session. No OpenAI key, no Gemini key, no Anthropic SDK call. The skill is portable across any Claude Code project: copy a directory and you're live.


What it scores

The 8-dimension BCG rubric, weighted:

| # | Dimension | Weight | What it catches |
| --- | --- | --- | --- |
| 1 | Structure | 15% | Narrow framing without interdependencies |
| 2 | Ambiguity handling | 10% | Silent interpretation choice (AI-specific failure mode) |
| 3 | Narrowing / prioritization | 10% | Attention spread flat instead of 80/20 |
| 4 | Rigor + sanity checks | 10% | Wrong method, arithmetic errors, missing reality criteria |
| 5 | Breaking obviousness | 20% | Generic boilerplate, no counter-narrative |
| 6 | Synthesis for senior leadership | 20% | Data dump without partner-grade conclusion |
| 7 | Independence (anti-sycophancy) | 10% | Agreeing with whatever premise was supplied |
| 8 | Achievement | 5% | Stalling at "further analysis recommended" |

Ambiguity handling (2) and Independence (7) correspond to the two AI-specific failure modes that generic LLM evals miss. The 10-signal Novelty Stack adds further checks, including counter-narrative framing, cross-industry pattern transfer, falsifiability sharpness, insider-data anchor enforcement, surprise score, multi-baseline differential, and elasticity (a search-replace test for genericity).
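As a rough illustration of how the rubric weights could combine into one number, here is a minimal sketch. The weights are the ones listed in the rubric; the dimension keys, the function name, and the rounding are assumptions made for this example, not the skill's actual aggregator. No verdict thresholds are shown because the README does not state them.

```python
# Weights copied from the rubric; the dictionary keys are illustrative names.
WEIGHTS = {
    "structure": 0.15,
    "ambiguity_handling": 0.10,
    "narrowing": 0.10,
    "rigor": 0.10,
    "breaking_obviousness": 0.20,
    "synthesis": 0.20,
    "independence": 0.10,
    "achievement": 0.05,
}


def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (1-3 scale) into one weighted score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the 8 rubric dimensions")
    for name, value in scores.items():
        if value not in (1, 2, 3):
            raise ValueError(f"{name}: score must be 1, 2, or 3, got {value}")
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return round(total, 2)


# Hypothetical scores: solid overall, weak on novelty, strong on independence.
example = dict.fromkeys(WEIGHTS, 2)
example["breaking_obviousness"] = 1
example["independence"] = 3
print(weighted_score(example))  # 1.9
```

Because the weights sum to 100%, the result stays on the same 1–3 scale as the individual dimensions.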


How it works

ARTIFACT IN
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 0  STRUCTURAL                  ms     │   $0   │ 100%    │
│ Regex bank: banned jargon, owner+timeline, schema, INV checks   │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1  HEURISTIC                   <1s    │   $0   │ 100%    │
│ Faithfulness (embeddings), entity density, math sanity,         │
│ elasticity, cross-doc numeric consistency, verbosity            │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 2  CLAUDE JUDGE (native)       30s    │ tokens │ 100%    │
│ 8-dim BCG rubric.                                                │
│ + Skeptic Agent (critical artifacts)                            │
│ + Novelty Stack (insight claims)                                │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 3  HUMAN EXPERT                hours  │ $20-50 │ 10–25%  │
│ Sample T2 outputs; Harvey-LAB 7-point + pairwise                │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
VERDICT.md  +  pre-filled GitHub feedback issue URL

Real example: examples/amazon-eval-2026-05-17/verdict.md — full eval run on an Amazon strategic engagement.
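To make the Tier 0 idea concrete, here is a minimal sketch of a deterministic regex check for banned jargon and for an owner plus timeline. The patterns, the `tier0_check` name, and the result shape are all hypothetical; the skill's real regex bank, schema checks, and INV checks are not reproduced here.

```python
import re

# Hypothetical patterns; the skill's actual regex bank may differ.
BANNED_JARGON = re.compile(
    r"\b(synerg(?:y|ies)|leverage|low-hanging fruit)\b", re.IGNORECASE
)
# Assumed convention: an owner is written as "Owner: <name>".
OWNER = re.compile(r"\bowner:\s*\S+", re.IGNORECASE)
# Assumed convention: a timeline reads "by Q3" or "within 30 days".
TIMELINE = re.compile(
    r"\b(by|within)\s+(Q[1-4]|\d+\s+(days?|weeks?))\b", re.IGNORECASE
)


def tier0_check(text: str) -> dict:
    """Run the structural checks and return a small report."""
    jargon = sorted({m.group(0).lower() for m in BANNED_JARGON.finditer(text)})
    has_owner = bool(OWNER.search(text))
    has_timeline = bool(TIMELINE.search(text))
    return {
        "banned_jargon": jargon,
        "owner_and_timeline": has_owner and has_timeline,
        "passed": not jargon and has_owner and has_timeline,
    }


print(tier0_check("Leverage synergies across business units."))
print(tier0_check("Cut churn in the SMB segment. Owner: CFO, by Q3."))
```

Checks like these involve no model call, which is why a structural tier can run in milliseconds at zero cost.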


Install

One-line install

curl -fsSL https://raw.githubusercontent.com/EvXata/deepeval-bcg/main/install.sh | bash

This installs the skill to ~/.claude/skills/deepeval/ (user-level — available in any Claude Code session).

Manual install

git clone https://github.com/EvXata/deepeval-bcg.git
mkdir -p ~/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval ~/.claude/skills/

Per-project install

mkdir -p /your/project/.claude/skills
cp -r deepeval-bcg/.claude/skills/deepeval /your/project/.claude/skills/

Verify

Open Claude Code in any project. The skill should appear among the available skills, with a description starting "BCG-calibrated evaluation framework…". If it does not, see docs/troubleshooting.md.


Quick start

# In Claude Code:
/deepeval-run path/to/your-llm-output.md

# Or with full context:
/deepeval-run path/to/your-llm-output.md \
    --upstream path/to/input-context.md \
    --burning-problem "Sales pipeline coverage 1.4× vs needed 3× — Mode: URGENT"

# Just structural + heuristic (no Claude judge, terminal-only):
python ~/.claude/skills/deepeval/scripts/eval_tier0.py --artifact OUT.md
python ~/.claude/skills/deepeval/scripts/eval_tier1.py --artifact OUT.md --upstream INPUT.md
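If you want to drive the two terminal-only tiers from a script, for example in CI, a thin wrapper along these lines may help. It reuses only the script paths and flags shown above; the function names are hypothetical, running Tier 1 only when an upstream file is supplied is an assumption, and the meaning of the scripts' exit codes is not documented here, so the sketch simply returns them.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional

# Install location taken from the install instructions (user-level skill).
SCRIPTS = Path.home() / ".claude" / "skills" / "deepeval" / "scripts"


def tier_commands(artifact: str, upstream: Optional[str] = None) -> List[List[str]]:
    """Build the argv lists for Tier 0 and, if upstream is given, Tier 1."""
    cmds = [[sys.executable, str(SCRIPTS / "eval_tier0.py"), "--artifact", artifact]]
    if upstream is not None:
        cmds.append(
            [
                sys.executable,
                str(SCRIPTS / "eval_tier1.py"),
                "--artifact",
                artifact,
                "--upstream",
                upstream,
            ]
        )
    return cmds


def run_tiers(artifact: str, upstream: Optional[str] = None) -> List[int]:
    """Run each tier in order and return the raw exit codes."""
    return [
        subprocess.run(cmd, check=False).returncode
        for cmd in tier_commands(artifact, upstream)
    ]
```

Calling `run_tiers("OUT.md", "INPUT.md")` would invoke both scripts in sequence; how to interpret the returned codes depends on the scripts themselves.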

Available slash commands:

| Command | Purpose |
| --- | --- |
| /deepeval-init | Walk through skill setup in a new project |
| /deepeval-run <artifact> | Full T0+T1+T2 stack on a single artifact |
| /deepeval-gate <gate-id> <engagement> | Aggregate across artifacts in a quality gate |
| /deepeval-skeptic <artifact> | Just the adversarial Skeptic Agent attacks |
| /deepeval-novelty <artifact> | Just the 10-signal Novelty Stack |
| /deepeval-amazon | Run the demo eval on the bundled Amazon engagement |
| /deepeval-30d <engagement> | 30-day leading-indicator harvest |
| /deepeval-feedback <run-dir> | Generate pre-filled GitHub feedback issue |

Built-in feedback loop

deepeval-bcg is the first AI eval framework with a structured agent-to-project feedback channel baked into the skill.

After every /deepeval-run, the skill emits not just a verdict.md, but also a feedback-link.txt containing a pre-filled GitHub Issue URL. The URL embeds the verdict summary (PASS/REVISE/FAIL, weighted score, top issue) so the reviewing agent or operator can submit structured feedback with one click.
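A pre-filled issue URL of this kind can be built with standard URL encoding. The sketch below is an assumption about the shape, not the skill's actual generator: GitHub's new-issue page accepts `template`, `title`, and `body` query parameters, but issue forms (`.yml` templates) prefill fields by their own ids, so the `body` parameter here is a stand-in for whatever fields eval-feedback.yml defines. The function name and title format are hypothetical.

```python
from urllib.parse import urlencode

REPO = "EvXata/deepeval-bcg"


def feedback_url(verdict: str, weighted_score: float, top_issue: str) -> str:
    """Return a new-issue URL carrying the verdict summary."""
    if verdict not in {"PASS", "REVISE", "FAIL"}:
        raise ValueError(f"unknown verdict: {verdict}")
    params = {
        "template": "eval-feedback.yml",
        # Hypothetical title format.
        "title": f"[eval-feedback] {verdict} ({weighted_score:.2f})",
        # Stand-in field; real issue forms are prefilled by field id.
        "body": (
            f"Verdict: {verdict}\n"
            f"Weighted score: {weighted_score:.2f}\n"
            f"Top issue: {top_issue}"
        ),
    }
    return f"https://github.com/{REPO}/issues/new?{urlencode(params)}"


print(feedback_url("REVISE", 1.9, "No counter-narrative"))
```

Encoding the summary into the query string keeps the link stateless: nothing is submitted until a person or agent opens the URL and confirms the issue.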

Why this matters:

  1. Closes the eval loop. Real-world agents using this skill report what worked and what was missed. Generic LLM eval frameworks have no equivalent feedback channel.
  2. Auto-aggregated. A GitHub Action runs daily, parses all closed eval-feedback issues, and updates community-stats.md with rolling κ-agreement and rubric-drift metrics.
  3. AEO (answer-engine optimization) win. Each feedback issue contains rubric language and use-case context, making the repo progressively more discoverable in semantic search and AI-agent recommendations.
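The κ-agreement mentioned in point 2 is not defined in this README. Assuming it refers to Cohen's κ between the skill's verdicts and the verdicts reported back in feedback issues, a minimal computation looks like this (function name and sample data are illustrative):

```python
from collections import Counter
from typing import List


def cohens_kappa(judge: List[str], human: List[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Share of items on which both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Agreement expected if each rater labeled independently at their own rates.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["PASS", "PASS", "REVISE", "FAIL"]
human = ["PASS", "REVISE", "REVISE", "FAIL"]
print(round(cohens_kappa(judge, human), 3))  # 0.636
```

In this sample the raters agree on 3 of 4 items (0.75), chance agreement is 5/16, and κ comes out at 7/11. A rolling version would apply the same function over a sliding window of recent feedback issues.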

Submit feedback manually any time: Open feedback issue →


Detailed comparison

See docs/comparison.md for line-by-line analysis vs Ragas, confident-ai DeepEval, Prometheus 2, G-Eval, LangSmith, MT-Bench, OpenAI Evals, Harvey LAB, and Hebbia FinBench. A short table is in the TL;DR section above.


Cadence — strict D / W / 30-day max

This skill explicitly forbids 90/180/365-day eval metrics. Every long-horizon outcome must be mapped to a ≤30-day leading-indicator proxy. See .claude/skills/deepeval/references/cadence-day-week-30day.md.

| Bucket | Activities |
| --- | --- |
| Day | T0 / T1 / T2 / Skeptic / Novelty / counterfactual replay |
| Week | T3 human sample, reverse-Turing panel, κ-calibration |
| 30-day max | Job-posting deltas, Glassdoor 30d sentiment, T+7 NPS, stock 30d delta, "was rec applied by T+7?" |

Rationale: LLM behavior drifts. Prompt changes degrade silently. Model upgrades break things. A 30-day-max feedback loop is the upper bound that still lets you intervene before the next engagement starts.
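The 30-day ceiling is easy to enforce mechanically. The sketch below shows one way a config validator might reject long-horizon metrics; the function name and the exact bucket boundaries (1 day, 7 days) are assumptions for illustration, and the referenced cadence document remains the authority.

```python
MAX_HORIZON_DAYS = 30  # hard ceiling stated by the cadence rule


def cadence_bucket(horizon_days: int) -> str:
    """Map a metric's horizon to a cadence bucket, rejecting anything over 30 days."""
    if horizon_days < 1:
        raise ValueError("horizon must be at least 1 day")
    if horizon_days > MAX_HORIZON_DAYS:
        raise ValueError(
            f"{horizon_days}-day metric is not allowed; "
            "map it to a leading-indicator proxy of 30 days or less"
        )
    if horizon_days == 1:
        return "day"
    if horizon_days <= 7:  # assumed boundary for the Week bucket
        return "week"
    return "30-day"


for days in (1, 7, 30):
    print(days, cadence_bucket(days))
```

A 90-day metric raises `ValueError`, which forces the author of the metric to name a shorter proxy instead.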


Publishing — getting into Claude's library

This repo is already a Claude Code plugin marketplace. Users install with /plugin marketplace add EvXata/deepeval-bcg (see Install — pick one above).

For submission to Anthropic's curated central marketplace (the anthropic-skills:* namespace seen in Claude Code's Discover tab), see docs/publishing.md. Includes:

  • Curated marketplace submission portal
  • Community marketplace submission paths
  • Versioning + release workflow
  • Sync requirements between .claude/skills/ and plugin/skills/

Repository structure

deepeval-bcg/
├── README.md                          # this file
├── LICENSE                            # MIT
├── install.sh                         # one-command installer
├── llms.txt                           # AEO discovery for AI agents
├── llms-full.txt                      # full content for AI agents
├── CLAUDE.md                          # Claude Code project instructions
├── CONTRIBUTING.md                    # how to contribute
├── CODE_OF_CONDUCT.md                 # Contributor Covenant
├── .claude-plugin/
│   └── marketplace.json               # plugin marketplace manifest (this repo IS a marketplace)
├── plugin/                            # plugin-format skill (for /plugin install)
│   ├── .claude-plugin/plugin.json
│   └── skills/deepeval/               # mirror of .claude/skills/deepeval/
├── CHANGELOG.md
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── eval-feedback.yml          # structured agent feedback
│   │   ├── bug.yml
│   │   └── feature.yml
│   ├── workflows/
│   │   └── aggregate-feedback.yml     # daily community stats roll-up
│   └── PULL_REQUEST_TEMPLATE.md
├── .claude/skills/deepeval/           # the Anthropic skill itself
│   ├── SKILL.md
│   ├── references/   (6 files)        # 4-tier, BCG rubric, prompts, novelty, skeptic, cadence
│   ├── scripts/      (5 files)        # T0/T1 deterministic; prompt builder; aggregator
│   └── templates/    (3 files)        # config, manifest, golden-set
├── examples/
│   └── amazon-eval-2026-05-17/        # complete eval run, including verdict.md
├── docs/
│   ├── how-to-use.md
│   ├── troubleshooting.md
│   └── integration-patterns.md
└── community-stats.md                 # auto-updated by workflow


Contributing

PRs welcome — see CONTRIBUTING.md. High-value areas:

  • New archetype-specific rubric anchors (vertical specializations)
  • Tier-1 NLI contradiction detection
  • Cross-language support (current verbatim anchors are English; Russian/German/Mandarin contributions valued)
  • Additional novelty-stack signals
  • Integration patterns for non-Claude-Code runtimes (e.g., Claude API, Anthropic Workbench)


License

MIT. Free to use, fork, sell, repackage. Attribution appreciated.