A single, portable, cross-agent skill that audits any git repository for logical bugs, typos, formula/calculation mistakes, inconsistencies, runtime errors, and compile-time/type/syntax errors — and writes a categorized
BUG_REPORT.md. Works in Claude Code, OpenCode, Codex CLI, and any other agent that follows the open Agent Skills standard.
- The problem
- What this repo is
- How it works
- What it finds
- Repository layout
- Quick start
- The 8-phase audit pipeline
- Design choices worth knowing about
- Customization
- Limitations
- Relationship between SPEC, BUILD_GUIDE, and PROMPT
- Running it on your own repo
- Acceptance checklist
- License
Every modern CLI coding agent — Claude Code, OpenCode, Codex CLI, Cursor, Gemini CLI, Copilot — already has a notion of a "skill", "slash command", or "agent". And every senior engineer has, at some point, asked an agent "find bugs in the codebase" and gotten back a shallow, hand-wavy answer that misses the categories that actually matter: typos that silently fail comparisons, formulas that compute the wrong number, near-duplicate functions that have drifted apart, code that compiles but blows up at runtime.
The problem is that bug-hunting quality scales directly with how carefully the model reasons, and most agents default to "good enough" reasoning unless you explicitly ask for the deepest mode. Worse, a less capable model will silently drop findings it can't hold in context, and return a report that looks complete but isn't.
This repo is the build/audit harness for a single skill — find-bugs —
that fixes both problems:
- It forces maximum reasoning depth via three independent levers (frontmatter `effort: max` + the trigger word `ultrathink` + explicit in-body instructions), so a plain request like "find bugs in the codebase" gets the same depth as if you'd typed `ultrathink` yourself.
- It externalizes the audit's working state to a scratch directory (`.bugaudit/`) using a fixed JSONL schema, so a weaker model can audit a 200-file repo without losing track of earlier findings.
The result: a reproducible, file-line-precise BUG_REPORT.md for any
repository, regardless of which CLI agent is running the audit.
This repo is not a typical code project. It is the build harness
for the find-bugs skill itself. After BUILD_GUIDE.md runs, the
deliverable is the find-bugs skill, placed under .agents/skills/,
.claude/skills/, .opencode/skill/, and .opencode/skills/, plus a
short pointer in AGENTS.md.
┌──────────────────────────────────────────────────────────────┐
│ THIS REPO (codeaudit) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ SPEC.md │ │ BUILD_GUIDE │ │ PROMPT_FOR_AGENT │ │
│ │ (the why) │ │ (the what) │ │ (the prompt) │ │
│ └──────┬──────┘ └──────┬───────┘ └────────┬─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ After running BUILD_GUIDE.md Steps 1-7: │ │
│ │ .agents/skills/find-bugs/SKILL.md (canonical) │ │
│ │ .agents/skills/find-bugs/references/bug-taxonomy │ │
│ │ .agents/skills/find-bugs/assets/report_template │ │
│ │ .agents/skills/find-bugs/scripts/run_static_* │ │
│ │ + 3 byte-identical mirrors │ │
│ │ + AGENTS.md pointer block │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ The skill then audits any repo it's pointed at. │
└──────────────────────────────────────────────────────────────┘
If you only want to use the skill, copy the four files in
.agents/skills/find-bugs/ (plus the three mirrors and the AGENTS.md
block) to your repo. If you want to build/modify the skill, edit
SPEC.md and BUILD_GUIDE.md together — they form the design and
implementation contract.
You (any CLI coding agent) codeaudit skill
┌────────────┐ ┌────────────────────┐
│ user: │ "find bugs in the │ SKILL.md (loaded) │
│ "find bugs │ ── codebase" ────────▶ │ + references/ │
│ in the │ │ + assets/ │
│ codebase" │ │ + scripts/ │
└────────────┘ └─────────┬──────────┘
│
▼
┌─────────────────────────────────┐
│ Phase 0 Setup │ .bugaudit/
│ Phase 1 Inventory & triage │ ├── inventory.md
│ Phase 2 Static analysis │ ├── static-analysis.md
│ Phase 3 Per-file review │ ├── findings.jsonl
│ Phase 4 Cross-file sweep │ └── notes.md
│ Phase 5 Formula verification │
│ Phase 6 Triage & dedup │
│ Phase 7 Write BUG_REPORT.md │ ┌───────────────┐
│ Phase 8 Chat summary │ │ BUG_REPORT.md │
└─────────────────────────────────┘ └───────────────┘
When the skill triggers, it:
- Creates a `.bugaudit/` scratch directory in the target repo (the one being audited, not this one) and never touches any source file there.
- Enumerates the target repo's files and classifies them into Tier 1/2/3 (entry points + calculation-shaped files get deep-reviewed; tests and generated code are skimmed).
- Runs read-only static analysis: the bundled `run_static_checks.sh` script detects the toolchain (Node, Python, Go, Rust, Java, .NET, PHP, Ruby) and runs whatever checkers are installed.
- Walks every Tier 1/2 file against the six bug categories, appending one JSON line per finding to `findings.jsonl`.
- Performs cross-file consistency checks (drifted duplicates, mismatched API contracts, stale docs).
- Verifies any arithmetic with business/scientific meaning term-by-term against the intended formula.
- Triages and deduplicates the findings, assigns severities, then writes `BUG_REPORT.md` using `assets/report_template.md` as the skeleton.
- Gives you a 1-paragraph chat summary (counts by severity + the single most important issue).
The whole thing is read-only against the audited repo by design.
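The tier classification in the second step can be pictured as a small path-matching function. This is only a sketch: the glob patterns below are hypothetical placeholders (the real lists live in SKILL.md Phase 1), and treating "everything else" as Tier 2 is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration; not the skill's actual lists.
TIER3_PATTERNS = ["tests/*", "*_test.py", "*.min.js", "*/generated/*"]
TIER1_PATTERNS = ["main.py", "*/main.py", "*calc*", "*pricing*"]


def classify(path: str) -> int:
    """Return 1 (deep review), 2 (normal review), or 3 (skim) for a repo-relative path."""
    # Tests and generated code are checked first so they are skimmed
    # even when their names also look calculation-shaped.
    if any(fnmatchcase(path, pattern) for pattern in TIER3_PATTERNS):
        return 3
    # Entry points and calculation-shaped files get the deepest review.
    if any(fnmatchcase(path, pattern) for pattern in TIER1_PATTERNS):
        return 1
    # Assumption: anything not matched above falls into Tier 2.
    return 2
```

Checking the skim patterns before the deep-review patterns keeps a file such as `tests/test_calc.py` in Tier 3 instead of promoting it because of the word "calc".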
The taxonomy has six categories, mapped 1:1 to the findings.jsonl
schema and to the six section headers in BUG_REPORT.md:
| # | Category | Definition (short) |
|---|---|---|
| 1 | Logical | Control flow / boolean logic doesn't match the intent implied by naming, comments, or surrounding code. |
| 2 | Typos | A misspelled literal, key, identifier, or pattern that should byte-for-byte match another one but doesn't, causing a silent mismatch. |
| 3 | Formula / Calculation | Arithmetic, statistical, or unit-conversion logic that runs fine but computes the wrong number. |
| 4 | Inconsistencies | Two or more places that should agree (behavior, format, naming, validation, docs) but have drifted apart. |
| 5 | Runtime | Code that parses/compiles fine but can throw, crash, hang, or behave unsafely on realistic inputs. |
| 6 | Compile-time / Syntax / Type | The code would fail to compile/parse or be rejected by a type checker. |
Full checklists and worked examples for each category are in
.agents/skills/find-bugs/references/bug-taxonomy.md.
| Severity | Meaning |
|---|---|
| Critical | Crashes, data loss/corruption, security issue, or completely wrong output on a common/primary path. |
| High | Incorrect results or failures on common inputs/paths — materially wrong behavior, not necessarily a crash. |
| Medium | Wrong only on edge cases/uncommon inputs, or a real inconsistency likely to cause a bug later. |
| Low | Cosmetic — typos in comments/logs/UI copy, minor style inconsistencies, very-low-probability edge cases. |
confidence (high/medium/low) is tracked separately from severity —
a finding can be severe but uncertain (goes to the "Possible Issues"
appendix) or minor but certain (goes in the main tables as Low).
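The split between severity and confidence can be sketched as a routing step. The field names `severity` and `confidence` are assumed here, and so is the rule that only `low` confidence sends a finding to the appendix; the actual schema is defined in SKILL.md.

```python
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


def route(findings):
    """Split findings into main tables (keyed by severity) and the Possible Issues appendix."""
    main_tables = {severity: [] for severity in SEVERITY_ORDER}
    possible_issues = []
    for finding in findings:
        # Assumption: low confidence always goes to the appendix,
        # no matter how severe the finding would be if it were real.
        if finding["confidence"] == "low":
            possible_issues.append(finding)
        else:
            main_tables[finding["severity"]].append(finding)
    return main_tables, possible_issues
```

With this rule, a Critical finding at low confidence lands in the appendix, while a Low finding at high confidence stays in the main tables.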
codeaudit/
├── README.md ← you are here
├── AGENTS.md ← short pointer block (always-loaded context)
├── SPEC.md ← design rationale (the "why")
├── BUILD_GUIDE.md ← runnable shell that builds the skill
├── PROMPT_FOR_AGENT.md ← prompt to hand to a coding agent
├── BUG_REPORT.md ← self-audit report of this repo
│
├── .agents/skills/find-bugs/ ← canonical skill (4 files)
│ ├── SKILL.md ← main instructions, YAML frontmatter
│ ├── references/
│ │ └── bug-taxonomy.md ← full category checklist
│ ├── assets/
│ │ └── report_template.md ← BUG_REPORT.md skeleton
│ └── scripts/
│ └── run_static_checks.sh ← read-only multi-toolchain runner
│
├── .claude/skills/find-bugs/ ← mirror (Claude Code discovers here)
├── .opencode/skill/find-bugs/ ← mirror (OpenCode singular form)
└── .opencode/skills/find-bugs/ ← mirror (OpenCode plural form)
The three mirror directories under .claude/ and .opencode/ are
byte-identical to the canonical .agents/skills/find-bugs/ (verified
by diff -r after every build). They are present because the open Agent
Skills standard does not yet mandate a single scan path — different
agents and different OpenCode releases have used both singular and
plural forms. Mirroring the same folder into all of them is the safest
cross-tool contract.
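For environments where `diff -r` is not available, the same byte-identity check can be approximated in Python. This is a sketch, not part of the skill; the function name `trees_identical` is made up for illustration.

```python
import filecmp
from pathlib import Path


def trees_identical(a: Path, b: Path) -> bool:
    """Return True when both directory trees hold the same files with the same bytes."""
    a_files = sorted(p.relative_to(a) for p in a.rglob("*") if p.is_file())
    b_files = sorted(p.relative_to(b) for p in b.rglob("*") if p.is_file())
    # A file that is missing or extra on either side is a mismatch.
    if a_files != b_files:
        return False
    # shallow=False forces a content comparison instead of trusting size/mtime.
    return all(filecmp.cmp(a / rel, b / rel, shallow=False) for rel in a_files)
```

Calling `trees_identical(Path(".agents/skills/find-bugs"), Path(".claude/skills/find-bugs"))` from the repo root would then be expected to return `True` after a clean build.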
- Copy this whole directory (or at minimum the four
.agents/skills/find-bugs/files) to the root of your target repo. - Open your CLI coding agent in that repo.
- Say "find bugs in the codebase" — or invoke explicitly with
/find-bugs(Claude Code, OpenCode) or$find-bugs(Codex CLI). - A
BUG_REPORT.mdwill be written to your repo root when the audit finishes.
- Copy
SPEC.md,BUILD_GUIDE.md, andPROMPT_FOR_AGENT.mdto the target repo. - Open your CLI coding agent in that repo.
- Paste the contents of
PROMPT_FOR_AGENT.md(everything below the---------------divider) as your prompt. - The agent will execute
BUILD_GUIDE.mdSteps 1-7 in order, then run Step 8's verification commands and report PASS/FAIL.
- Edit
SPEC.mdfirst (it's the rationale). Update the section that describes the change. - Edit
BUILD_GUIDE.mdto match (every shell code block in Steps 1-7 is the literal final content to write to disk). - Re-run
BUILD_GUIDE.mdSteps 1-7 to regenerate the skill. - Re-run Step 6 to re-mirror the canonical copy to
.claude/,.opencode/skill/, and.opencode/skills/.
┌────────────┐
│ Phase 0 │ Setup — create .bugaudit/ scratch dir (and gitignore it)
└─────┬──────┘
▼
┌────────────┐
│ Phase 1 │ Inventory & triage — Tier 1/2/3 classification
└─────┬──────┘
▼
┌────────────┐
│ Phase 2 │ Static analysis — run read-only checkers (tsc, ruff, etc.)
└─────┬──────┘
▼
┌────────────┐
│ Phase 3 │ Per-file deep review — every Tier 1/2 file × 6 categories
└─────┬──────┘
▼
┌────────────┐
│ Phase 4 │ Cross-file consistency sweep — signature mismatches,
└─────┬──────┘ drifted duplicates, stale docs, API-contract gaps
▼
┌────────────┐
│ Phase 5 │ Formula & calculation verification — derive the intended
└─────┬──────┘ formula symbolically, compare term-by-term
▼
┌────────────┐
│ Phase 6 │ Triage, dedup, severity — merge same-root-cause findings,
└─────┬──────┘ confirm severities, drop false positives
▼
┌────────────┐
│ Phase 7 │ Write BUG_REPORT.md — fill assets/report_template.md
└─────┬──────┘
▼
┌────────────┐
│ Phase 8 │ Chat summary — counts by severity + top issue, no re-paste
└────────────┘
The full details of each phase live in
.agents/skills/find-bugs/SKILL.md.
The full checklists for each category are in
.agents/skills/find-bugs/references/bug-taxonomy.md.
Anthropic's Agent Skills docs, OpenAI's Codex Skills docs, and
third-party compatibility surveys all converge on the same minimum
viable SKILL.md: a YAML frontmatter block with name and
description, followed by Markdown instructions. Anything beyond
that (allowed-tools, context: fork, hooks, effort) is
tool-specific but safely ignored by agents that don't support it.
So the skill writes one canonical SKILL.md using only the universal
fields plus effort: max (Claude Code's adaptive-effort system) and
mirrors it everywhere.
SKILL.md itself is kept well under the ~5,000-word guidance so the
model isn't overwhelmed when the skill loads. The 6,000-word bug
checklist lives in references/bug-taxonomy.md (loaded in Phase 3).
The report skeleton lives in assets/report_template.md. The
static-analysis runner lives in scripts/run_static_checks.sh.
SKILL.md is the orchestration layer.
A weaker model auditing a 200-file repo will lose track of earlier
findings if it has to hold them all in context. The skill writes
everything to .bugaudit/ as it goes:
.bugaudit/
├── inventory.md # Phase 1: file tiers + counts
├── static-analysis.md # Phase 2: tool output
├── findings.jsonl # Phases 3-5: one JSON line per finding
└── notes.md # running progress log
Phase 6 (triage/dedup) becomes a mechanical "read all lines, group by root cause" operation instead of relying on recall.
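A minimal sketch of that append-then-group pattern follows. The field name `root_cause` and the helper names are hypothetical; the real `findings.jsonl` schema is defined in SKILL.md and may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def append_finding(path: Path, finding: dict) -> None:
    """Append one finding as a single JSON line, creating the scratch dir if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(finding) + "\n")


def group_by_root_cause(path: Path) -> dict:
    """Read every line back and bucket findings by their (assumed) root_cause field."""
    groups = defaultdict(list)
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                finding = json.loads(line)
                groups[finding["root_cause"]].append(finding)
    return dict(groups)
```

Because every finding is on disk as soon as it is found, the grouping step only needs the file, not the model's memory of earlier phases.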
| Layer | Mechanism | Applies to |
|---|---|---|
| 1 | `effort: max` in SKILL.md frontmatter | Claude Code's adaptive-effort system |
| 2 | Literal word `ultrathink` in §0 | Claude Code's trigger-word preprocessing |
| 3 | Explicit in-body instructions | Any model, including DeepSeek-class and OpenCode |
Why three? The skill is supposed to auto-trigger on a plain "find bugs in the codebase" — which won't contain any thinking keyword. The frontmatter and in-body instructions make the skill self-sufficient regardless of how it was invoked.
A bug-finding skill that also starts editing files is a different (riskier) product. This skill explicitly never modifies source files. "Fix the bugs" is a deliberate follow-up task, not part of this skill.
| Knob | Where to change it |
|---|---|
| Output filename / location | SKILL.md §1 (and assets/report_template.md if you want it reflected in the skeleton) |
| Large-repo threshold (~150 files) | SKILL.md Phase 1 step 5 — raise for faster/shallower, lower to force explicit "Methodology & Limitations" disclosure |
| Tier 1 patterns (entry points, calc-shaped files) | SKILL.md Phase 1 step 3 — add project-specific globs (e.g. monorepo package names) |
| Static-analysis coverage | scripts/run_static_checks.sh — add another if [ -f <manifest> ]; then ... fi block following the existing pattern |
| `findings.jsonl` schema | SKILL.md §3.3: add fields like `tags: []` and update assets/report_template.md if they should surface in the report |
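The manifest-based detection that `run_static_checks.sh` performs can be illustrated in Python. This is a sketch of the idea only: the manifest-to-toolchain map uses conventional file names, and which manifests and checkers the bundled script actually looks for is an assumption here (`tsc` and `ruff` are the two checkers this README names).

```python
import shutil
from pathlib import Path

# Conventional manifest files; the bundled script's real list may differ.
MANIFESTS = {
    "package.json": "Node",
    "pyproject.toml": "Python",
    "go.mod": "Go",
    "Cargo.toml": "Rust",
    "pom.xml": "Java",
    "composer.json": "PHP",
    "Gemfile": "Ruby",
}

# Assumed example checkers per toolchain.
CHECKERS = {"Node": "tsc", "Python": "ruff"}


def detect_toolchains(repo: Path) -> list[str]:
    """Return the toolchains whose manifest file exists at the repo root."""
    return sorted({tc for manifest, tc in MANIFESTS.items() if (repo / manifest).is_file()})


def installed_checkers(toolchains: list[str]) -> list[str]:
    """Keep only checkers that are actually on PATH, mirroring 'run whatever is installed'."""
    candidates = (CHECKERS[tc] for tc in toolchains if tc in CHECKERS)
    return [tool for tool in candidates if shutil.which(tool) is not None]
```

Adding coverage for another toolchain then means one more manifest entry, which is the same shape as adding another `if [ -f <manifest> ]` block to the shell script.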
- Very large monorepos (thousands of files) will hit the Tier 1/2 budget quickly. The skill is designed to disclose this in "Methodology & Limitations" rather than solve it. For true monorepos, scope invocations to one package at a time (the skill supports a user-specified path).
- Formula verification (Phase 5) depends on domain inference. For highly specialized math (actuarial, cryptographic, ML-numerical) the model may not know the "correct" formula to compare against; such findings should naturally land at `confidence: low`.
- Static-analysis script coverage is best-effort and assumes common CLI tool names/flags. Some projects use wrapper scripts (`make lint`, `npm run check`); the skill falls back to ad-hoc commands if the bundled script doesn't cover the project's toolchain.
- No fix mode, by design. A natural follow-up skill (`fix-bugs` or `apply-bug-fixes`) could read `BUG_REPORT.md` and apply fixes one at a time with user confirmation. Out of scope here.
The three top-level files form a three-layer contract:
┌─────────────────────────────────────────────────────────────┐
│ SPEC.md (the why) │
│ ───────── │
│ Design rationale, compatibility matrix, severity rubric, │
│ acceptance checklist. Read once when building or │
│ modifying the skill. │
└──────────────────────────┬──────────────────────────────────┘
│ informs
▼
┌─────────────────────────────────────────────────────────────┐
│ BUILD_GUIDE.md (the what) │
│ ──────────────── │
│ Literal, runnable shell blocks that write the skill to │
│ disk exactly as given. Source of truth for file contents │
│ and paths. Safe to re-run (idempotent: mkdir -p, cat > │
│ overwrite, rm -rf before cp -r). │
└──────────────────────────┬──────────────────────────────────┘
│ executed by
▼
┌─────────────────────────────────────────────────────────────┐
│ PROMPT_FOR_AGENT.md (the prompt) │
│ ────────────────────── │
│ Short instruction to hand to a coding agent (e.g. DeepSeek │
│ V4 Flash) so it executes BUILD_GUIDE.md mechanically, │
│ without reinterpreting or "improving" it. │
└─────────────────────────────────────────────────────────────┘
None of these three files are part of the skill itself — once built,
the skill is just .agents/skills/find-bugs/ (+ mirrors + the
AGENTS.md pointer). They can be deleted from the target repo after
a successful build, or kept under e.g. docs/dev/ for future
maintenance.
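# 1. Copy the skill into your repo (cp -r does not create missing parent
#    directories, so make them first)
mkdir -p YOUR_REPO/.agents/skills YOUR_REPO/.claude/skills YOUR_REPO/.opencode/skill YOUR_REPO/.opencode/skills
cp -r .agents/skills/find-bugs YOUR_REPO/.agents/skills/find-bugs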
cp -r .claude/skills/find-bugs YOUR_REPO/.claude/skills/find-bugs
cp -r .opencode/skill/find-bugs YOUR_REPO/.opencode/skill/find-bugs
cp -r .opencode/skills/find-bugs YOUR_REPO/.opencode/skills/find-bugs
# 2. Drop the AGENTS.md pointer into YOUR_REPO/AGENTS.md
cat AGENTS.md >> YOUR_REPO/AGENTS.md
# 3. Open YOUR_REPO in your CLI agent and say:
#    "find bugs in the codebase"

When the audit finishes, you'll get:

- `YOUR_REPO/BUG_REPORT.md`: the categorized report
- `YOUR_REPO/.bugaudit/`: the scratch directory (gitignore it)
After running BUILD_GUIDE.md, confirm:

- `.agents/skills/find-bugs/SKILL.md` exists and begins with valid YAML frontmatter containing `name: find-bugs`, a non-empty `description`, and `effort: max`.
- `references/bug-taxonomy.md` and `assets/report_template.md` exist and are non-empty.
- `scripts/run_static_checks.sh` exists, is executable (`chmod +x` applied), and runs cleanly against an empty directory.
- `.claude/skills/find-bugs/`, `.opencode/skill/find-bugs/`, and `.opencode/skills/find-bugs/` each contain the same four files; `diff -r .agents/skills/find-bugs .claude/skills/find-bugs` should print nothing.
- `AGENTS.md` ends with the "Bug audit skill" block.
- Smoke test: run the skill on a real or small test repo. Confirm `BUG_REPORT.md` is created with all six category headers present, a filled-in summary table, and a "Methodology & Limitations" section. Confirm `.bugaudit/` was created and contains `findings.jsonl`.
- No side effects: compare `git status` before and after. The only new paths should be `BUG_REPORT.md`, `.bugaudit/` (if not gitignored yet), and whatever Step 6/7 added during the build.
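The first checklist item can be spot-checked with a few lines of Python. This is a deliberately simple sketch, not a YAML parser: it only reads top-level `key: value` lines between the `---` delimiters, and both helper names are hypothetical.

```python
def frontmatter_fields(text: str) -> dict:
    """Collect top-level 'key: value' pairs from a leading '---' ... '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Skip indented continuation lines and comments.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter, so the frontmatter is not valid


def skill_frontmatter_ok(text: str) -> bool:
    """Check the three fields the acceptance checklist asks for."""
    fields = frontmatter_fields(text)
    return (
        fields.get("name") == "find-bugs"
        and bool(fields.get("description"))
        and fields.get("effort") == "max"
    )
```

A strict YAML parser remains the better final check, since this sketch accepts some inputs that real YAML would reject.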
If you want concrete proof each category is detected, create a throwaway file with one deliberate bug per category and re-run the skill against it. Examples:

- Logical: `if (count > 10)` where the comment says "trigger when 10 or more" (should be `>=`).
- Typo: compare `status == "complete"` against a constant defined as `"completed"`.
- Formula: `total = price - price * 20` where `20` should be `0.20` (20% discount).
- Inconsistency: two near-identical helper functions where only one handles a `None`/`null` name.
- Runtime: `items[0].name` on a list that can be empty.
- Compile/type: call a function with one fewer argument than its definition requires.
A correct run produces one finding per category referencing the right line.
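To make two of those seeded bugs concrete, the sketch below places the buggy and intended versions of the formula and logical examples side by side. The function names are hypothetical; only the expressions come from the examples above.

```python
def discounted_total_buggy(price: float) -> float:
    # Seeded formula bug: 20 is used as a multiplier, so 2000% is subtracted.
    return price - price * 20


def discounted_total(price: float, rate: float = 0.20) -> float:
    # Intended formula: subtract a 20% discount from the price.
    return price - price * rate


def should_trigger_buggy(count: int) -> bool:
    # Seeded logical bug: "trigger when 10 or more" but 10 itself is excluded.
    return count > 10


def should_trigger(count: int) -> bool:
    # Intended behavior: 10 is included.
    return count >= 10


# Both buggy versions run without error; only the values are wrong.
print(discounted_total_buggy(100.0), discounted_total(100.0))  # -1900.0 80.0
print(should_trigger_buggy(10), should_trigger(10))  # False True
```

Neither bug raises an exception, which is why these categories need term-by-term and boundary-value checks, not just a test that the code runs.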
This repo ran the find-bugs skill against itself. The result is
BUG_REPORT.md — 1 critical, 3 high, 3 medium, 1
low finding (plus 3 low-confidence possible issues), all of which
were fixed before pushing. See the report for the full list, including
a real critical bug in the skill's own YAML frontmatter that a strict
parser would have rejected.
MIT. See LICENSE (add one if missing).