Measure whether an agent skill actually improves a model's task performance.
Most published SKILL.md files have never been tested. You can't tell whether they
help your model or are just decoration. Skillcheck answers that with a controlled
experiment instead of a vibe check.
Point it at any Markdown skill file and it runs an A/B test: it generates fresh tasks for the skill's declared domain, has the model solve every task with and without the skill injected, grades both arms blind, and reports the measured effect with a bootstrap confidence interval and a 0–100 quality score.
```text
$ skillcheck
✓ Evaluation tasks ready · 12s
✓ Trials complete (30/30) · 1m 38s
✓ Grading complete (30/30) · 41s
✓ Analysis complete · 2m 33s
╭────────────────────────────────────────────────────────╮
│  SKILLCHECK RESULT                                     │
├────────────────────────────────────────────────────────┤
│  Skill           API Documentation                     │
│  Run size        5 tasks × 3 trials                    │
│                                                        │
│  Verdict         HELPS                                 │
│  The skill HELPED — model passed 80% of tasks with it  │
│  vs 55% without.                                       │
│                                                        │
│  With skill      80.0% of tasks passed                 │
│  Without skill   55.0% of tasks passed                 │
│  Skill effect    +25.0 pp change in pass rate          │
│  Confidence      +8.0 pp to +42.0 pp (95% range)       │
│  Token cost      +480 tokens to include the skill      │
├────────────────────────────────────────────────────────┤
│  Satisfaction    ██████████████░░░░ 75.0/100 GOOD      │
╰────────────────────────────────────────────────────────╯
```
- Install
- Quick start
- How it works
- Architecture
- Commands
- Reading the result
- Effort levels
- Terminal experience
- Configuration
- Model choice
- Self-hosting
- Development
- License
## Install

```bash
npm install -g @sx4im/skillcheck
# or run it without installing:
npx @sx4im/skillcheck
```

Requires Node.js 20+. Works on Linux, macOS, and Windows.

Skillcheck checks npm for a newer version about once a day and offers to update
(like the Codex and Gemini CLIs). Disable with `SKILLCHECK_NO_UPDATE_CHECK=1`.
## Quick start

```bash
skillcheck
```

On first run it asks for your Skillcheck API key (grab a free one from the dashboard;
the URL is built in, and the free tier includes 10 checks). Key entry is masked and the
key is verified before it's saved. A full-screen file picker then opens: navigate
folders with the arrow keys, pick any `.md` file, choose an effort level from the
arrow-key menu, and watch the live progress tracker until the result card lands.
The picker runs on the terminal's alternate screen, so quitting hands your
scrollback right back.
Point it straight at a file or folder to skip the picker:
```bash
skillcheck check ./SKILL.md
skillcheck ./my-skill-folder                      # a folder containing a .md
skillcheck check ./SKILL.md --json                # machine-readable output
skillcheck check ./SKILL.md --output result.json
```

For fully headless runs (CI, scripts), set the key via an environment variable:

```bash
export SKILLCHECK_TOKEN=chk_live_...
skillcheck check ./SKILL.md --tasks 5 --trials 3 --json
```

## How it works

Skillcheck treats a skill like a drug trial treats a drug:
- Normalize: the skill file is parsed; its declared domain is read from front matter (`domain:`/`description:`) or the first heading.
- Generate: a task generator sees only the domain, never the skill body, so the tasks can't leak the skill's instructions. It produces twice as many candidate tasks as needed; a seeded shuffle picks the final set.
- Run: every task runs `K` trials in two arms: with the skill injected as a system prompt, and without it. Same model, same temperature.
- Grade: a blind grader scores each output against the task's pass/fail criterion. It never knows which arm produced the output (outputs are shuffled), and grades at temperature 0 in JSON mode.
- Score: pass rates are compared pairwise and a 1000-iteration paired bootstrap produces the effect size, a 95% confidence interval, and the verdict: `HELPS` (CI fully above zero), `HARMS` (fully below), or `PLACEBO` (overlaps zero).
Every run is fresh: tasks and outputs are generated anew each time and `check`
stores nothing locally, so a repeated check is an independent measurement. The full
methodology is in `METHODOLOGY.md`.
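The Score step can be sketched as follows. This is a minimal illustration of a paired percentile bootstrap, not Skillcheck's actual implementation: the names `TaskResult`, `pairedBootstrap`, and `mulberry32` are hypothetical, and the percentile method for the interval is an assumption (the README only says "paired bootstrap").

```typescript
// Sketch only: illustrative names, not Skillcheck's real API.
type Verdict = "HELPS" | "HARMS" | "PLACEBO";

interface TaskResult {
  withSkill: number;    // pass rate over K trials with the skill, 0..1
  withoutSkill: number; // pass rate over K trials without it, 0..1
}

// Small seeded PRNG (mulberry32) so the resampling is repeatable.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrap(results: TaskResult[], iterations = 1000, seed = 1) {
  if (results.length === 0) throw new Error("need at least one task");
  // Pairing: each task contributes one difference between its two arms.
  const diffs = results.map((r) => r.withSkill - r.withoutSkill);
  const effect = diffs.reduce((sum, d) => sum + d, 0) / diffs.length;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    // Resample tasks with replacement and record the mean difference.
    let sum = 0;
    for (let j = 0; j < diffs.length; j++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);

  // 95% percentile interval over the resampled means.
  const lo = means[Math.floor(0.025 * (iterations - 1))];
  const hi = means[Math.ceil(0.975 * (iterations - 1))];
  const verdict: Verdict = lo > 0 ? "HELPS" : hi < 0 ? "HARMS" : "PLACEBO";
  return { effect, lo, hi, verdict };
}
```

Because the verdict depends only on whether the interval clears zero, a run with few tasks tends to land on `PLACEBO` simply because the resampled means spread widely.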
## Architecture

How a check flows from your terminal to the result card:

```mermaid
flowchart TD
    subgraph CLI["skillcheck CLI (local)"]
        A["User input<br/>skillcheck check ./SKILL.md"] --> B{API key<br/>configured?}
        B -- no --> C["Interactive setup<br/>paste key → verified via /key/verify → saved to ~/.config/skillcheck"]
        B -- yes --> D
        C --> D["Normalize skill<br/>name · domain · instructions"]
        D --> E["Generate tasks<br/>domain only — never the skill body"]
        E --> F["Run trials<br/>each task × K trials × 2 arms"]
        F --> G["Blind grading<br/>shuffled outputs · temp 0 · JSON verdict"]
        G --> H["Paired bootstrap<br/>1000 resamples → effect, 95% CI, verdict"]
        H --> I{Output mode}
        I -->|terminal| J["Result card<br/>+ animated satisfaction bar"]
        I -->|"--json or --output"| K["JSON result<br/>reproducible: task suite + transcript hashes"]
    end
    subgraph Cloud["Skillcheck Cloud (dashboard/, Vercel)"]
        P["Metered proxy /api/chat/completions<br/>authenticates chk_live key · counts 1 run per check<br/>pins model · caps max_tokens"]
        V["/api/key/verify"]
        P2["NVIDIA key stays server-side"]
    end
    subgraph NIM["NVIDIA NIM"]
        M["openai/gpt-oss-120b<br/>default for all three roles"]
    end
    C -.-> V
    E -->|model calls| P
    F -->|model calls| P
    G -->|model calls| P
    P --> M
    E -.->|"direct mode (NVIDIA_API_KEY)"| M
    style CLI fill:#0b2942,stroke:#2d7dd2,color:#e8f0fe
    style Cloud fill:#102a12,stroke:#3fa34d,color:#e8f5e9
    style NIM fill:#2a2210,stroke:#d2a52d,color:#fdf6e3
```
Key properties:
- One metered run per check: every model call in a check shares a run id, so the hosted proxy counts the whole check as a single run.
- No provider key on your machine (hosted mode): the CLI talks to the proxy; the NVIDIA key lives only on the server.
- Direct mode: set `NVIDIA_API_KEY` to bypass the proxy entirely and call NVIDIA NIM with your own key.
## Commands

```bash
skillcheck                                   # interactive: pick a file, pick effort, run
skillcheck check <path> [--tasks N] [--trials K] [--output file.json] [--json]
skillcheck setup                             # connect / change your API key
skillcheck logout                            # remove your saved API key
skillcheck eval <path> [--tasks N] [--trials K] [--output file.json]   # raw JSON evaluator
skillcheck verify <result.json> [--sample n] # independently re-measure a published result
skillcheck corpus run --corpus corpus.yaml [--results dir]   # batch-evaluate many skills
skillcheck rot [--results dir] [--output report.json]        # detect skills that stopped helping
skillcheck --version
```

Accepted inputs: any Markdown (`.md`) file (`SKILL.md`, `AGENTS.md`, `CLAUDE.md`,
or any other `.md`) or a folder containing one. `--tasks` is capped at 50 and
`--trials` at 10; mistyped options are rejected rather than silently ignored.
## Reading the result

- Verdict: `HELPS` / `PLACEBO` / `HARMS`, decided by whether the 95% confidence interval clears zero. `PLACEBO` means no measurable difference, not necessarily a bad skill.
- Skill effect: the change in pass rate, in percentage points (pp).
- Confidence: the 95% range for the true effect. A wide range means the run was inconclusive; re-run at a higher effort for a clearer signal.
- Token cost: the prompt-token overhead of including the skill.
- Satisfaction: a 0–100 quality score where 50 = no effect:

| Score | Band | Score | Band |
|---|---|---|---|
| ≤10 | Very bad | 51–60 | Decent |
| 11–30 | Bad | 61–80 | Good |
| 31–50 | Normal | 81–100 | Excellent |
Each run is an independent experiment — tasks and model outputs are generated fresh every time, so results vary run to run. That variance is what the confidence interval quantifies.
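The band table above can be read as a simple lookup. The sketch below is illustrative only: `satisfactionBand` is a hypothetical helper, and treating fractional scores by their upper bound (so 10.5 counts as "Bad") is an assumption, since the table lists whole-number ranges.

```typescript
// Sketch only: maps a 0-100 satisfaction score to the band names in the table.
type Band = "Very bad" | "Bad" | "Normal" | "Decent" | "Good" | "Excellent";

function satisfactionBand(score: number): Band {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError("score must be between 0 and 100");
  }
  if (score <= 10) return "Very bad";
  if (score <= 30) return "Bad";
  if (score <= 50) return "Normal"; // 50 means "no effect"
  if (score <= 60) return "Decent";
  if (score <= 80) return "Good";
  return "Excellent";
}
```

For the sample card at the top of this README, `satisfactionBand(75)` gives `"Good"`, matching the `75.0/100 GOOD` line.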
## Effort levels

The interactive run asks how thorough to be. More tasks and trials mean a tighter confidence interval but a longer run:
| Level | Tasks × trials | Typical time |
|---|---|---|
| Quick | 2 × 1 | ~30 sec |
| Standard | 3 × 2 | ~1–2 min |
| Thorough | 5 × 3 | ~3–4 min |
For scripted runs, set it explicitly: `skillcheck check ./SKILL.md --tasks 5 --trials 3`.
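To see why larger runs take longer, here is a rough call-count estimate. It is a sketch under stated assumptions: one task-generation call per check and one grading call per trial output. Neither is confirmed by this README, though both agree with the `30/30` trial and grading counters shown for a 5 × 3 run and with the approximate call counts (~25 and ~60) quoted in the model-choice notes. The names `RunSize`, `trialCount`, and `estimateModelCalls` are hypothetical.

```typescript
// Sketch only: back-of-the-envelope sizing, not Skillcheck's real accounting.
interface RunSize {
  tasks: number;
  trials: number;
}

// Every task runs `trials` times in each of the two arms (with / without skill).
function trialCount({ tasks, trials }: RunSize): number {
  return tasks * trials * 2;
}

// Assumption: one generation call, plus one runner call and one grader call per trial.
function estimateModelCalls(size: RunSize, generationCalls = 1): number {
  const runs = trialCount(size);
  return generationCalls + runs + runs;
}

const levels: Record<string, RunSize> = {
  Quick: { tasks: 2, trials: 1 },
  Standard: { tasks: 3, trials: 2 },
  Thorough: { tasks: 5, trials: 3 },
};

for (const [name, size] of Object.entries(levels)) {
  console.log(`${name}: ${trialCount(size)} trials, ~${estimateModelCalls(size)} model calls`);
}
```

Under these assumptions the three levels come to roughly 9, 25, and 61 model calls.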
## Terminal experience

The CLI is built to feel like a first-class developer tool:

- Live step tracker: each phase persists as a receipt line (`✓ Trials complete (30/30) · 1m 38s`) while the active phase shows a spinner, a progress bar, and elapsed time. Progress renders on stderr, so piping stdout still gives you a clean result; piped stderr gets plain log lines instead of spinner frames.
- Animated result card: the satisfaction bar sweeps to its score on interactive terminals; non-TTY output is the same card, static.
- Adaptive colour: truecolor gradients where supported, 256/16-colour fallbacks elsewhere. `NO_COLOR` (any non-empty value) disables colour entirely; `FORCE_COLOR=1|2|3` forces it on for piped output.
- Quiet cancellation: backing out of a menu with `q`/`Ctrl+C` exits with code `130` and a one-line note, not an error dump. Run failures print a concise `✗` block on stderr.
- Masked secrets: API-key entry never echoes; keys are stored at `~/.config/skillcheck/config.json` with `0600` permissions.
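The colour rules can be sketched as a small decision function. This is an illustration, not the CLI's code: `shouldUseColor` and `ColorEnv` are hypothetical names, and letting `NO_COLOR` win when both variables are set is an assumption, because the README does not say which takes priority.

```typescript
// Sketch only: decides whether to emit colour for a given output stream.
interface ColorEnv {
  NO_COLOR?: string;
  FORCE_COLOR?: string;
}

function shouldUseColor(env: ColorEnv, isTTY: boolean): boolean {
  // Any non-empty NO_COLOR disables colour (assumed to beat FORCE_COLOR).
  if (env.NO_COLOR !== undefined && env.NO_COLOR !== "") return false;
  // FORCE_COLOR=1|2|3 turns colour on even when output is piped.
  if (env.FORCE_COLOR === "1" || env.FORCE_COLOR === "2" || env.FORCE_COLOR === "3") {
    return true;
  }
  // Otherwise colour only makes sense on an interactive terminal.
  return isTTY;
}
```

In a Node.js program this could be called as `shouldUseColor(process.env, Boolean(process.stderr.isTTY))`, since progress is rendered on stderr.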
## Configuration

Credential precedence (highest first; the first one that is set wins):

| Setting | Mode | Effect |
|---|---|---|
| `NVIDIA_API_KEY` | direct | Call NVIDIA NIM with your own key, bypassing the proxy |
| `SKILLCHECK_TOKEN` | hosted | Use a Skillcheck Cloud key without saving anything |
| `skillcheck setup` | hosted | Verifies and saves your key to `~/.config/skillcheck/config.json` |
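The precedence order can be expressed as a first-match lookup. This is a minimal sketch with hypothetical names (`resolveCredential`, `Credential`, `CredentialEnv`); it assumes the saved key has already been read from the config file and is passed in as `savedKey`.

```typescript
// Sketch only: picks the credential and mode following the precedence table.
interface CredentialEnv {
  NVIDIA_API_KEY?: string;
  SKILLCHECK_TOKEN?: string;
}

interface Credential {
  mode: "direct" | "hosted";
  key: string;
  source: "NVIDIA_API_KEY" | "SKILLCHECK_TOKEN" | "saved config";
}

function resolveCredential(env: CredentialEnv, savedKey?: string): Credential | null {
  const direct = env.NVIDIA_API_KEY;
  if (direct) return { mode: "direct", key: direct, source: "NVIDIA_API_KEY" };

  const token = env.SKILLCHECK_TOKEN;
  if (token) return { mode: "hosted", key: token, source: "SKILLCHECK_TOKEN" };

  if (savedKey) return { mode: "hosted", key: savedKey, source: "saved config" };

  // Nothing configured: the interactive setup would have to run first.
  return null;
}
```

Empty strings are treated as unset here, which is a design choice of the sketch rather than documented behaviour.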
Optional environment variables:
| Variable | Default | Purpose |
|---|---|---|
| `SKILLCHECK_API_URL` | hosted cloud URL | Point at a self-hosted proxy deployment |
| `SKILLCHECK_MODEL` | `openai/gpt-oss-120b` | Override the model for all three roles |
| `NVIDIA_GENERATOR_MODEL` / `NVIDIA_RUNNER_MODEL` / `NVIDIA_GRADER_MODEL` | unset | Per-role model overrides (direct mode) |
| `NVIDIA_BASE_URL` | `https://integrate.api.nvidia.com/v1` | Direct-mode endpoint |
| `NVIDIA_TIMEOUT_MS` | `120000` | Per-request timeout |
| `NVIDIA_REQUEST_DELAY_MS` | `750` | Minimum delay between requests (rate-limit safety) |
| `NVIDIA_MAX_ATTEMPTS` | `8` | Retry budget for retryable failures (429/5xx) |
| `NVIDIA_MAX_RETRY_DELAY_MS` | `60000` | Backoff cap |
| `SKILLCHECK_NO_UPDATE_CHECK` | unset | `1` disables the daily update check |
| `SKILLCHECK_DEBUG` | unset | `1` enables verbose per-call logging |
| `NO_COLOR` | unset | Any non-empty value disables colour (NO_COLOR spec) |
| `FORCE_COLOR` | unset | `1`/`2`/`3` forces colour on, even when piped |
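To show how the retry settings interact, here is a sketch of a capped backoff schedule. Only the defaults for `NVIDIA_MAX_ATTEMPTS` (8) and `NVIDIA_MAX_RETRY_DELAY_MS` (60000) and the 429/5xx rule come from the table; the exponential doubling and the 1000 ms base delay are assumptions, and all function names are hypothetical.

```typescript
// Sketch only: capped exponential backoff under assumed doubling and base delay.
interface RetryConfig {
  maxAttempts: number;     // NVIDIA_MAX_ATTEMPTS
  maxRetryDelayMs: number; // NVIDIA_MAX_RETRY_DELAY_MS
  baseDelayMs: number;     // assumption: not a documented setting
}

const retryDefaults: RetryConfig = {
  maxAttempts: 8,
  maxRetryDelayMs: 60000,
  baseDelayMs: 1000,
};

// Retryable failures per the table: HTTP 429 and any 5xx status.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before the retry that follows the `attempt`-th failure (1-based).
function backoffDelayMs(attempt: number, cfg: RetryConfig = retryDefaults): number {
  return Math.min(cfg.baseDelayMs * Math.pow(2, attempt - 1), cfg.maxRetryDelayMs);
}

// With N total attempts there are at most N - 1 waits.
function retrySchedule(cfg: RetryConfig = retryDefaults): number[] {
  return Array.from({ length: cfg.maxAttempts - 1 }, (_, i) => backoffDelayMs(i + 1, cfg));
}
```

With these assumed values the waits double from 1 s until the seventh, which would be 64 s and is clamped to the 60 s cap.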
## Model choice

All three roles (task generator, runner, blind grader) default to
`openai/gpt-oss-120b`, selected by live benchmarking of the NIM catalog
(MiniMax M2.7, DeepSeek V4 Flash, Qwen3-Next/3.5, Llama 3.3 70B, Nemotron Nano):

- It's the only large model in the fast lane. A Standard check (3 × 2) makes ~25 sequential model calls and a Thorough check (5 × 3) ~60, so per-call latency dominates UX. gpt-oss-120b answers in ~1–5 s on NIM. The other large models (MiniMax M2.7, DeepSeek V4, Qwen3-Next, Llama 3.3) queue for 60–110+ seconds per call on the shared endpoint; at that rate a Thorough check would take an hour or more.
- Grading and generation need capability. The verdict is only as good as the blind grader's judgment and the generator's task quality. The 120B model grades and generates noticeably more reliably than the sub-second `nvidia/nemotron-3-nano` alternative, which under-delivers task batches often enough that the CLI needs its retry path.
- JSON mode must be dependable. Generator and grader run with `response_format: json_object`; gpt-oss-120b returns clean JSON consistently (its built-in reasoning stays in `reasoning_content`, never the answer). Qwen 3.5's NIM endpoint rejects `response_format` outright, ruling it out.
Prefer raw speed over judgment quality? Pin the nano model:
`SKILLCHECK_MODEL=nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`. You can also
measure your production model in direct mode:
`NVIDIA_RUNNER_MODEL=<model> skillcheck eval ./SKILL.md` keeps the capable
grader while running trials on the model you actually ship with.
## Self-hosting

Skillcheck's hosted tier runs behind a metered proxy so end users never need a
provider key. The `dashboard/` folder is a deployable Vercel app
(Clerk sign-in, free-tier metering, optional Stripe upgrade) that issues
`chk_live_…` keys and forwards completion requests to NVIDIA NIM using your
server-side NVIDIA key. See `dashboard/SETUP.md` for the walkthrough.

To skip the proxy entirely, set `NVIDIA_API_KEY` (see `.env.example`).
## Development

```bash
npm ci
npm run build       # compile to dist/
npm test            # vitest (73 tests)
npm run typecheck   # strict TS, src + tests
```

The CLI lives in `packages/cli` (`bin/skillcheck.ts` → `src/cli.ts`).
`packages/site` is the static leaderboard site; `dashboard/` is the hosted cloud.
A scheduled rot workflow re-runs the corpus weekly and
opens a PR when a skill's verdict regresses.