feat: add codebase-documentor plugin #138

Open
XinyuQu wants to merge 1 commit into awslabs:main from XinyuQu:feat/codebase-documentor

XinyuQu (Contributor) commented Apr 17, 2026

RFC: #79

Summary

Add codebase-documentor plugin — deep codebase analysis that produces a single CODEBASE_ANALYSIS.md with source-of-truth citations.

This plugin addresses two growing problems identified in the RFC: tribal knowledge loss when engineers leave teams, and the documentation gap created by AI-assisted coding where thousands of lines are generated faster than teams can document them. Engineers inherit codebases where original authors are unavailable, design decisions exist only in someone's head, and AI-generated code works but nobody documented why it's structured that way. The gap between code production speed and documentation speed is widening.

The plugin produces structured, verifiable documentation rather than one-time chat responses. Every finding links back to the specific file and line it was derived from, so readers can verify claims and identify stale documentation when code changes. It uses an iterative deepening approach (scan → question → search → write) rather than a single-pass skim, and is designed to run for an extended time to produce deep analysis. The output goes significantly beyond what a naive "explain this code" prompt produces: it traces end-to-end request flows, detects discrepancies between documentation and actual code, documents failure modes with recovery commands for oncall engineers, and flags implicit knowledge (hardcoded values, magic numbers, undocumented assumptions) that would otherwise disappear when teams rotate.

While the plugin works with any codebase, it is optimized for AWS-deployed services. It parses CDK constructs, CloudFormation resources, and Terraform blocks as first-class application code — recognizing that in CDK, the infrastructure IS the application logic. It consults awsknowledge and awsiac MCP servers for AWS service enrichment and IaC validation, and integrates with the aws-architecture-diagram skill (deploy-on-aws plugin) to produce validated draw.io diagrams with official AWS4 icons. Failure modes include AWS-specific detection methods and recovery commands. The plugin is tool-agnostic and works on Claude Code, Cursor, Codex, and other coding assistants.

What's included

Plugin infrastructure:

  • Plugin manifest (.claude-plugin/plugin.json) and MCP server config (.mcp.json)
  • Codex marketplace entry and Codex plugin manifest (.codex-plugin/plugin.json)
  • CODEOWNERS entry and root README listing

Skill — document-service:

  • Outline-driven pipeline: file tree → outline → iterative 3-pass analysis → assembly
  • Clickable citations: every finding links to source code via markdown [file:line](./file#Lline) links
  • Discrepancy detection: cross-references README/metadata claims vs actual code
  • Actionable failure modes: detection methods + recovery commands for oncall engineers
  • Architecture diagrams: delegates to aws-architecture-diagram skill (deploy-on-aws plugin) for draw.io output; Mermaid fallback for flow diagrams and architecture overview
  • Large codebase support: tracked sequential analysis with resumable progress file; optional parallel workers when environment supports them
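
The "tracked sequential analysis with resumable progress file" item can be pictured roughly as follows. This is a minimal sketch, not the plugin's actual implementation: the JSON layout, the `run_resumable` helper, and the `analyze` callback are all hypothetical names introduced for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_resumable(
    files: Iterable[str],
    analyze: Callable[[str], str],
    progress_path: Path,
) -> dict[str, str]:
    """Analyze files one at a time, persisting results after each file."""
    # Load earlier results if a previous run was interrupted.
    done: dict[str, str] = {}
    if progress_path.exists():
        done = json.loads(progress_path.read_text(encoding="utf-8"))

    for name in files:
        if name in done:
            continue  # already analyzed in an earlier run
        done[name] = analyze(name)
        # Write after every file so an interruption loses at most one result.
        progress_path.write_text(json.dumps(done, indent=2), encoding="utf-8")
    return done
```

Calling `run_resumable` a second time with the same progress file skips every file already recorded, which is the property that makes a long sequential run safe to interrupt.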

Output sections: Architecture Overview, Code Analysis, Request Lifecycle, Domain Logic Deep-Dive, Startup & Initialization, Components, API Contracts, Data Models, Deployment, Configuration, Monitoring & Observability, Security, Local Development, Discrepancies, Failure Modes, Timeout/Dependency Chain, Runbook Hints, Business Context.

MCP servers:

  • awsknowledge (HTTP) — AWS service descriptions, architecture guidance
  • awsiac (stdio) — CDK/CloudFormation resource schema validation

Changes

  • Plugin manifest (.claude-plugin/plugin.json): metadata, keywords, Apache-2.0 license
  • MCP config (.mcp.json): awsknowledge (HTTP) + awsiac (stdio/uvx)
  • Skill (skills/document-service/SKILL.md): 6-step autonomous workflow with iterative deepening
  • 8 reference files: progressive disclosure for citation format, project detection, code extraction patterns, exclusion patterns, templates, error scenarios, and large codebase strategy
  • Marketplace entries in .claude-plugin/marketplace.json and .agents/plugins/marketplace.json
  • Codex manifest in .codex-plugin/plugin.json and .agents/plugins/marketplace.json
  • CODEOWNERS entry for plugins/codebase-documentor
  • README.md table entry, install command, and detailed plugin section

Evaluation

Tested blind against aws-samples/sample-deepseek-ocr-selfhost — a CDK TypeScript + Python project with 6 CDK stacks, ECS GPU inference, Lambda processing, and API Gateway. The README was removed before analysis to simulate a legacy handoff.

The plugin produced a 571-line CODEBASE_ANALYSIS.md with a draw.io architecture diagram that:

  • Found 15 discrepancies between CLAUDE.md/package.json claims and actual code (including phantom A2I/StepFunctions/DynamoDB dependencies that were declared but never implemented)
  • Traced 2 end-to-end request lifecycles with Mermaid sequence diagrams
  • Generated a draw.io architecture diagram with 11 AWS services using official AWS4 icons
  • Documented 11 failure modes with AWS-specific detection and recovery commands
  • Identified a critical timeout mismatch (29s API Gateway vs multi-minute OCR inference)
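
The timeout mismatch in the last bullet is the kind of finding a timeout/dependency-chain check surfaces: a caller that gives up before its callee can finish. A minimal sketch of such a check, assuming each hop is described by a name and a time budget in seconds. Only the 29 s API Gateway figure comes from the report; the 300 s inference budget and the `Hop` type are illustrative placeholders.

```python
from typing import NamedTuple


class Hop(NamedTuple):
    name: str
    timeout_s: float  # how long this hop waits (or may run) before giving up


def find_timeout_mismatches(chain: list[Hop]) -> list[str]:
    """Return a message for each caller whose timeout is shorter than its callee's.

    `chain` is ordered from the outermost caller to the innermost dependency.
    """
    problems = []
    for caller, callee in zip(chain, chain[1:]):
        if caller.timeout_s < callee.timeout_s:
            problems.append(
                f"{caller.name} ({caller.timeout_s:g}s) times out before "
                f"{callee.name} ({callee.timeout_s:g}s) can finish"
            )
    return problems


# 29 s is the API Gateway figure from the report; 300 s is a made-up
# stand-in for "multi-minute" inference.
chain = [Hop("API Gateway", 29), Hop("OCR inference", 300)]
print(find_timeout_mismatches(chain))
```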

Sample output (analysis report + draw.io diagram + SVG render): https://gist.github.com/XinyuQu/2001dff63cc5c5ab12c2f0eb1ea2a78a

Test plan

  • Trigger skill by asking to "analyze this codebase" — produces CODEBASE_ANALYSIS.md
  • Verify clickable citations in [file:line](./file#Lline) format
  • Verify Mermaid flow diagrams present (architecture + sequence diagrams)
  • Verify draw.io architecture diagram generated with AWS4 icons
  • Verify all required sections present
  • mise run lint:manifests — all 5 schemas valid
  • mise run lint:cross-refs — 0 errors, 0 warnings
  • gitleaks — no leaks found
  • bandit — 0 findings
  • semgrep — 0 findings (with repo exclusions)
  • checkov — clean
  • dprint check — clean
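
The citation and section checks in the test plan can be partly automated. A minimal sketch, assuming citations follow the `[file:line](./file#Lline)` pattern exactly and that each output section appears as a Markdown heading. The regular expression and both helper names are illustrative assumptions, not part of the plugin.

```python
import re

# Matches [path:LINE](./path#LLINE) and captures both paths and both line numbers.
CITATION = re.compile(r"\[([^\]:]+):(\d+)\]\(\./([^)#]+)#L(\d+)\)")


def inconsistent_citations(markdown: str) -> list[str]:
    """Return citations whose visible label and link target disagree."""
    bad = []
    for m in CITATION.finditer(markdown):
        label_path, label_line, link_path, link_line = m.groups()
        if label_path != link_path or label_line != link_line:
            bad.append(m.group(0))
    return bad


def missing_sections(markdown: str, required: list[str]) -> list[str]:
    """Return required section titles that never appear as a heading."""
    headings = {
        line.lstrip("#").strip()
        for line in markdown.splitlines()
        if line.startswith("#")
    }
    return [title for title in required if title not in headings]
```

This only checks that citations are internally consistent; confirming that the cited file and line actually exist would need access to the analyzed repository.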

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

Add a documentation plugin that analyzes codebases to produce a single
CODEBASE_ANALYSIS.md with source-of-truth citations. Designed for legacy
and AI-generated codebases where engineers need deep understanding to
operate, debug, and extend the system.

Key capabilities:
- Outline-driven pipeline: file tree → outline → iterative analysis → assembly
- Clickable citations: every finding links to source code via markdown links
- Discrepancy detection: cross-references README/metadata vs actual code
- Actionable failure modes: detection methods + recovery commands for oncall
- Architecture diagrams: delegates to aws-architecture-diagram skill
  (deploy-on-aws plugin) for draw.io output; Mermaid fallback for flow
  diagrams and architecture overview when skill unavailable
- Deep analysis: iterative deepening (scan → question → search → write)
- Tool-agnostic: works on Claude Code, Cursor, Codex, and other tools
- Large codebase support: tracked sequential analysis with resumable
  progress file; optional parallel workers when environment supports them

Output sections: Architecture Overview, Code Analysis, Request Lifecycle,
Domain Logic Deep-Dive, Startup & Initialization, Components, API
Contracts, Data Models, Deployment, Configuration, Monitoring &
Observability, Security, Local Development, Discrepancies, Failure Modes,
Timeout/Dependency Chain, Runbook Hints, Business Context.

Plugin structure:
- One skill: document-service (auto-triggers on documentation requests)
- Two MCP servers: awsknowledge (HTTP) and awsiac (stdio/uvx)
- 8 reference files for progressive disclosure
- Codex and Claude Code marketplace support

krokoko (Contributor) commented Apr 26, 2026

Thanks! Automated review first pass:

Critical Issues (4 found)

  1. bin directory exclusion contradicts CDK/Ruby entry point discovery — exclusion-patterns.md excludes bin/ as "Compiled binaries", but discovery-patterns.md lists bin/*.ts (CDK) and bin/rails
    (Ruby) as entry points. Since exclusions run before discovery (Step 2), CDK and Rails entry points would be silently filtered out.
    - Fix: Remove bin from exclusions or qualify it (skip only when it contains compiled outputs, not source files).
  2. .proto files incorrectly excluded as "compiled" — exclusion-patterns.md excludes *.pb, *.proto (compiled), but .proto files are human-readable service contract definitions — high-value for
    documentation. Additionally, discovery-patterns.md explicitly lists "Protobuf/Avro definitions" as something to extract.
    - Fix: Exclude only *.pb (actual compiled output); remove *.proto from exclusions.
  3. packages directory exclusion is self-contradictory — Listed as excluded with the note "scan each individually, not the container", but the "Applying Exclusions" section says "Remove all paths
    matching excluded directories." Following this literally removes the entire monorepo source tree.
    - Fix: Remove packages from exclusions; document the monorepo scanning strategy separately.
  4. Potentially incorrect drawio CLI flag — Step 5 uses drawio -x -f png -e -b 10 -o ... but -e is not a documented flag in the draw.io desktop CLI.
    - Fix: Remove the -e flag from the command.
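
Fixes 1 and 2 amount to qualifying the exclusion rules instead of matching whole directories or extensions. A minimal sketch of that idea, assuming hypothetical suffix lists; the actual patterns in exclusion-patterns.md may differ, and `is_excluded` is a name invented for this example.

```python
from pathlib import PurePosixPath

# Hypothetical suffix lists for illustration only.
COMPILED_SUFFIXES = {".pb", ".exe", ".o", ".so", ".class"}
SOURCE_SUFFIXES_IN_BIN = {".ts", ".js", ".py", ".rb", ".sh"}


def is_excluded(path: str) -> bool:
    """Decide whether a repo-relative path should be dropped before discovery."""
    p = PurePosixPath(path)
    if p.suffix in COMPILED_SUFFIXES:
        return True  # actual compiled output, e.g. *.pb
    if "bin" in p.parts[:-1]:
        # Keep source-like entry points (bin/app.ts) and extensionless
        # scripts (bin/rails); drop everything else under bin/.
        return not (p.suffix in SOURCE_SUFFIXES_IN_BIN or p.suffix == "")
    return False
```

With these rules `bin/app.ts`, `bin/rails`, and `api/service.proto` survive, while `gen/service.pb` is dropped. One limitation: extensionless compiled binaries under `bin/` are also kept, so a real implementation would likely need a content check as well.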

Important Issues (3 found)

  1. README install command breaks alphabetical ordering — The /plugin install codebase-documentor@... block is placed after sagemaker-ai instead of between aws-serverless and databases-on-aws
    (where the table entry correctly appears).
    - Fix: Move the install block to the correct alphabetical position.
  2. CODEOWNERS missing plugin-specific team — Every other plugin has a third team (e.g., @awslabs/agent-plugins-dsql). This entry only has admins + maintainers.
    - Fix: Add @awslabs/agent-plugins-codebase-documentor or explain the omission in the PR.
  3. SKILL.md missing license frontmatter field — Other skills (e.g., dsql) include license: Apache-2.0 in YAML frontmatter for consistency, even though the schema marks it optional.
    - Fix: Add license: Apache-2.0 to the SKILL.md frontmatter.
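
The ordering problem in item 1 is easy to check mechanically. A minimal sketch, assuming the README install blocks can be reduced to an ordered list of plugin names; the list below contains only the names mentioned in this review, not the full README, and `first_out_of_order` is a hypothetical helper.

```python
from typing import Optional


def first_out_of_order(names: list[str]) -> Optional[str]:
    """Return the first name that sorts before its predecessor, or None."""
    for prev, cur in zip(names, names[1:]):
        if cur < prev:
            return cur
    return None


# Order described in the review: codebase-documentor placed after sagemaker-ai.
install_blocks = [
    "aws-serverless",
    "databases-on-aws",
    "sagemaker-ai",
    "codebase-documentor",
]
print(first_out_of_order(install_blocks))
```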

Suggestions (8 found)

  1. SKILL.md H1 heading mismatch — Frontmatter says document-service, plugin is codebase-documentor, but H1 reads "Codebase Analyzer" — a third name.
  2. business-context.md template uses H1 headings — But SKILL.md says it's a section within CODEBASE_ANALYSIS.md. Should use H2 to match technical-doc-template.md.
  3. Outline section names diverge from template — Ampersand vs "and" (Deployment & IaC vs Deployment), slash vs "and" (Timeout/Dependency Chain vs Timeout and Dependency Chain), and missing
    sections (Components, Runbook Hints).
  4. Framework coverage gap — discovery-patterns.md detects 13+ frameworks but framework-patterns.md only has extraction patterns for 6. Flask, Next.js, Rust, Serverless Framework, and others
    have no extraction guidance.
  5. awsiac Terraform support inconsistency — SKILL.md claims Terraform support for awsiac, but README tables omit it. Should verify actual aws-iac-mcp-server capability.
  6. Elixir missing from Entry Points table — Detectable via mix.exs but no entry point guidance provided.
  7. discovery-patterns.md subtitle is misleading — Says "Framework-specific patterns for extracting information" but the file covers broader project type detection.
  8. recursive-analysis.md references "Step 3" without cross-link — Readers accessing this file directly won't know what Step 3 refers to.

Strengths

  • Directory structure follows project conventions exactly
  • Plugin manifests are complete and consistent across all three files (Claude, Codex, marketplace)
  • SKILL.md description is well-crafted for auto-triggering with good positive/negative examples
  • MCP server config correctly matches deploy-on-aws patterns
  • Progressive disclosure is well-executed — 8 reference files add detail without duplicating SKILL.md
  • Citation format is internally consistent and well-documented
  • Cross-references from SKILL.md to all 8 reference files are accurate
  • Error scenarios are thorough and correctly reference workflow steps
  • Category casing correctly follows each marketplace's convention (lowercase for Claude, Title Case for Codex)
