feat: add codebase-documentor plugin by XinyuQu · Pull Request #138 · awslabs/agent-plugins

XinyuQu · 2026-04-17T15:47:25Z

RFC: #79

Summary

Add codebase-documentor plugin — deep codebase analysis that produces a single CODEBASE_ANALYSIS.md with source-of-truth citations.

This plugin addresses two growing problems identified in the RFC: tribal knowledge loss when engineers leave teams, and the documentation gap created by AI-assisted coding where thousands of lines are generated faster than teams can document them. Engineers inherit codebases where original authors are unavailable, design decisions exist only in someone's head, and AI-generated code works but nobody documented why it's structured that way. The gap between code production speed and documentation speed is widening.

The plugin produces structured, verifiable documentation rather than one-time chat responses. Every finding links back to the specific file and line it was derived from, so readers can verify claims and identify stale documentation when code changes. It uses an iterative deepening approach (scan → question → search → write) rather than a single-pass skim, and is designed to run for an extended period to produce deep analysis. The output goes significantly beyond what a naive "explain this code" prompt produces: it traces end-to-end request flows, detects discrepancies between documentation and actual code, documents failure modes with recovery commands for oncall engineers, and flags implicit knowledge (hardcoded values, magic numbers, undocumented assumptions) that would otherwise disappear when teams rotate.

While the plugin works with any codebase, it is optimized for AWS-deployed services. It parses CDK constructs, CloudFormation resources, and Terraform blocks as first-class application code — recognizing that in CDK, the infrastructure IS the application logic. It consults awsknowledge and awsiac MCP servers for AWS service enrichment and IaC validation, and integrates with the aws-architecture-diagram skill (deploy-on-aws plugin) to produce validated draw.io diagrams with official AWS4 icons. Failure modes include AWS-specific detection methods and recovery commands. The plugin is tool-agnostic and works on Claude Code, Cursor, Codex, and other coding assistants.

What's included

Plugin infrastructure:

Plugin manifest (.claude-plugin/plugin.json) and MCP server config (.mcp.json)
Codex marketplace entry and Codex plugin manifest (.codex-plugin/plugin.json)
CODEOWNERS entry and root README listing

Skill — document-service:

Outline-driven pipeline: file tree → outline → iterative 3-pass analysis → assembly
Clickable citations: every finding links to source code via markdown [file:line](./file#Lline) links
Discrepancy detection: cross-references README/metadata claims vs actual code
Actionable failure modes: detection methods + recovery commands for oncall engineers
Architecture diagrams: delegates to aws-architecture-diagram skill (deploy-on-aws plugin) for draw.io output; Mermaid fallback for flow diagrams and architecture overview
Large codebase support: tracked sequential analysis with resumable progress file; optional parallel workers when environment supports them
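
For reviewers who want a concrete picture of the citation format listed above, here is a minimal sketch of how a `[file:line](./file#Lline)` link could be built. The helper name `cite` is hypothetical and is not taken from the plugin's reference files; it only illustrates the link shape described in this PR.

```python
from pathlib import PurePosixPath


def cite(path: str, line: int) -> str:
    """Build a markdown citation of the form [file:line](./file#Lline).

    `cite` is a hypothetical helper used only to illustrate the format.
    """
    if line < 1:
        raise ValueError("line numbers are 1-based")
    p = PurePosixPath(path)
    if p.is_absolute():
        raise ValueError("citations should use repository-relative paths")
    rel = p.as_posix()  # a leading "./" is normalized away by PurePosixPath
    return f"[{rel}:{line}](./{rel}#L{line})"


print(cite("./lib/api-stack.ts", 42))
```

Keeping the path repository-relative means the link resolves when CODEBASE_ANALYSIS.md sits at the repository root, which is an assumption of this sketch.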

Output sections: Architecture Overview, Code Analysis, Request Lifecycle, Domain Logic Deep-Dive, Startup & Initialization, Components, API Contracts, Data Models, Deployment, Configuration, Monitoring & Observability, Security, Local Development, Discrepancies, Failure Modes, Timeout/Dependency Chain, Runbook Hints, Business Context.

MCP servers:

awsknowledge (HTTP) — AWS service descriptions, architecture guidance
awsiac (stdio) — CDK/CloudFormation resource schema validation

Changes

Plugin manifest (.claude-plugin/plugin.json): metadata, keywords, Apache-2.0 license
MCP config (.mcp.json): awsknowledge (HTTP) + awsiac (stdio/uvx)
Skill (skills/document-service/SKILL.md): 6-step autonomous workflow with iterative deepening
8 reference files: progressive disclosure for citation format, project detection, code extraction patterns, exclusion patterns, templates, error scenarios, and large codebase strategy
Marketplace entries in .claude-plugin/marketplace.json and .agents/plugins/marketplace.json
Codex manifest in .codex-plugin/plugin.json and .agents/plugins/marketplace.json
CODEOWNERS entry for plugins/codebase-documentor
README.md table entry, install command, and detailed plugin section

Evaluation

Tested blind against aws-samples/sample-deepseek-ocr-selfhost — a CDK TypeScript + Python project with 6 CDK stacks, ECS GPU inference, Lambda processing, and API Gateway. The README was removed before analysis to simulate a legacy handoff.

The plugin produced a 571-line CODEBASE_ANALYSIS.md with a draw.io architecture diagram that:

Found 15 discrepancies between CLAUDE.md/package.json claims and actual code (including phantom A2I/StepFunctions/DynamoDB dependencies that were declared but never implemented)
Traced 2 end-to-end request lifecycles with Mermaid sequence diagrams
Generated a draw.io architecture diagram with 11 AWS services using official AWS4 icons
Documented 11 failure modes with AWS-specific detection and recovery commands
Identified a critical timeout mismatch (29s API Gateway vs multi-minute OCR inference)
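
The timeout mismatch above is the kind of finding a simple chain check can surface. The sketch below is illustrative only: the function name, the chain representation, and the 300-second inference figure are assumptions (the PR states only "29s API Gateway vs multi-minute OCR inference"), and it does not describe how the plugin actually detects the mismatch.

```python
from typing import List, Tuple

# (component name, timeout in seconds), ordered from caller to callee.
Chain = List[Tuple[str, float]]


def find_timeout_mismatches(chain: Chain) -> List[Tuple[str, str, float, float]]:
    """Flag adjacent hops where the caller gives up before the callee would.

    Returns (caller, callee, caller_timeout, callee_timeout) for each pair
    in which the caller's timeout is shorter than the callee's.
    """
    mismatches = []
    for (caller, caller_t), (callee, callee_t) in zip(chain, chain[1:]):
        if caller_t < callee_t:
            mismatches.append((caller, callee, caller_t, callee_t))
    return mismatches


# 29 s comes from the PR text; 300 s is a placeholder for "multi-minute".
chain = [("API Gateway", 29), ("OCR inference", 300)]
for caller, callee, ct, dt in find_timeout_mismatches(chain):
    print(f"{caller} ({ct}s) times out before {callee} ({dt}s) can finish")
```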

Sample output (analysis report + draw.io diagram + SVG render): https://gist.github.com/XinyuQu/2001dff63cc5c5ab12c2f0eb1ea2a78a

Test plan

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

Add a documentation plugin that analyzes codebases to produce a single CODEBASE_ANALYSIS.md with source-of-truth citations. Designed for legacy and AI-generated codebases where engineers need deep understanding to operate, debug, and extend the system.

Key capabilities:

- Outline-driven pipeline: file tree → outline → iterative analysis → assembly
- Clickable citations: every finding links to source code via markdown links
- Discrepancy detection: cross-references README/metadata vs actual code
- Actionable failure modes: detection methods + recovery commands for oncall
- Architecture diagrams: delegates to aws-architecture-diagram skill (deploy-on-aws plugin) for draw.io output; Mermaid fallback for flow diagrams and architecture overview when skill unavailable
- Deep analysis: iterative deepening (scan → question → search → write)
- Tool-agnostic: works on Claude Code, Cursor, Codex, and other tools
- Large codebase support: tracked sequential analysis with resumable progress file; optional parallel workers when environment supports them

Output sections: Architecture Overview, Code Analysis, Request Lifecycle, Domain Logic Deep-Dive, Startup & Initialization, Components, API Contracts, Data Models, Deployment, Configuration, Monitoring & Observability, Security, Local Development, Discrepancies, Failure Modes, Timeout/Dependency Chain, Runbook Hints, Business Context.

Plugin structure:

- One skill: document-service (auto-triggers on documentation requests)
- Two MCP servers: awsknowledge (HTTP) and awsiac (stdio/uvx)
- 8 reference files for progressive disclosure
- Codex and Claude Code marketplace support

krokoko · 2026-04-26T17:52:02Z

Thanks! Automated review, first pass:

Critical Issues (4 found)

bin directory exclusion contradicts CDK/Ruby entry point discovery: exclusion-patterns.md excludes bin/ as "Compiled binaries", but discovery-patterns.md lists bin/*.ts (CDK) and bin/rails (Ruby) as entry points. Since exclusions run before discovery (Step 2), CDK and Rails entry points would be silently filtered out.
- Fix: Remove bin from exclusions or qualify it (skip only when it contains compiled outputs, not source files).
.proto files incorrectly excluded as "compiled": exclusion-patterns.md excludes *.pb, *.proto (compiled), but .proto files are human-readable service contract definitions and are high-value for documentation. Additionally, discovery-patterns.md explicitly lists "Protobuf/Avro definitions" as something to extract.
- Fix: Exclude only *.pb (actual compiled output); remove *.proto from exclusions.
packages directory exclusion is self-contradictory: it is listed as excluded with the note "scan each individually, not the container", but the "Applying Exclusions" section says "Remove all paths matching excluded directories." Following this literally removes the entire monorepo source tree.
- Fix: Remove packages from exclusions; document the monorepo scanning strategy separately.
Potentially incorrect drawio CLI flag: Step 5 uses drawio -x -f png -e -b 10 -o ... but -e is not a documented flag in the draw.io desktop CLI.
- Fix: Remove the -e flag from the command.
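
To make the first three fixes concrete, here is a rough sketch of a qualified exclusion check. It is not the skill's actual logic (the skill describes exclusions in prose in exclusion-patterns.md); the rule sets and the `should_exclude` name are hypothetical, and treating extensionless files under bin/ as scripts is a heuristic assumption.

```python
from pathlib import PurePosixPath

# Hypothetical rule sets; note that "bin" and "packages" are not listed here.
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "build"}
EXCLUDED_SUFFIXES = {".pb", ".pyc", ".class", ".o"}  # *.proto intentionally absent
SOURCE_SUFFIXES = {".ts", ".js", ".py", ".rb", ".sh"}


def should_exclude(path: str) -> bool:
    """Return True if a repository-relative path should be skipped."""
    p = PurePosixPath(path)
    parents = p.parts[:-1]
    if any(part in EXCLUDED_DIRS for part in parents):
        return True
    if p.suffix in EXCLUDED_SUFFIXES:
        return True
    if "bin" in parents:
        # Keep source files (bin/app.ts) and extensionless scripts (bin/rails);
        # skip anything else under bin/ as likely compiled output.
        return not (p.suffix in SOURCE_SUFFIXES or p.suffix == "")
    return False


for candidate in ["bin/app.ts", "bin/rails", "proto/service.proto",
                  "gen/service.pb", "packages/api/src/index.ts"]:
    print(candidate, "->", "skip" if should_exclude(candidate) else "keep")
```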

Important Issues (3 found)

README install command breaks alphabetical ordering: the /plugin install codebase-documentor@... block is placed after sagemaker-ai instead of between aws-serverless and databases-on-aws (where the table entry correctly appears).
- Fix: Move the install block to the correct alphabetical position.
CODEOWNERS missing plugin-specific team: every other plugin has a third team (e.g., @awslabs/agent-plugins-dsql). This entry only has admins + maintainers.
- Fix: Add @awslabs/agent-plugins-codebase-documentor or explain the omission in the PR.
SKILL.md missing license frontmatter field: other skills (e.g., dsql) include license: Apache-2.0 in YAML frontmatter for consistency, even though the schema marks it optional.
- Fix: Add license: Apache-2.0 to the SKILL.md frontmatter.

Suggestions (8 found)

SKILL.md H1 heading mismatch: frontmatter says document-service, plugin is codebase-documentor, but H1 reads "Codebase Analyzer", a third name.
business-context.md template uses H1 headings: SKILL.md says it's a section within CODEBASE_ANALYSIS.md, so it should use H2 to match technical-doc-template.md.
Outline section names diverge from template: ampersand vs "and" (Deployment & IaC vs Deployment), slash vs "and" (Timeout/Dependency Chain vs Timeout and Dependency Chain), and missing sections (Components, Runbook Hints).
Framework coverage gap: discovery-patterns.md detects 13+ frameworks but framework-patterns.md only has extraction patterns for 6. Flask, Next.js, Rust, Serverless Framework, and others have no extraction guidance.
awsiac Terraform support inconsistency: SKILL.md claims Terraform support for awsiac, but README tables omit it. Should verify actual aws-iac-mcp-server capability.
Elixir missing from Entry Points table: detectable via mix.exs but no entry point guidance provided.
discovery-patterns.md subtitle is misleading: it says "Framework-specific patterns for extracting information" but the file covers broader project type detection.
recursive-analysis.md references "Step 3" without cross-link: readers accessing this file directly won't know what Step 3 refers to.

Strengths

Directory structure follows project conventions exactly
Plugin manifests are complete and consistent across all three files (Claude, Codex, marketplace)
SKILL.md description is well-crafted for auto-triggering with good positive/negative examples
MCP server config correctly matches deploy-on-aws patterns
Progressive disclosure is well-executed — 8 reference files add detail without duplicating SKILL.md
Citation format is internally consistent and well-documented
Cross-references from SKILL.md to all 8 reference files are accurate
Error scenarios are thorough and correctly reference workflow steps
Category casing correctly follows each marketplace's convention (lowercase for Claude, Title Case for Codex)

XinyuQu requested review from a team, krokoko, scottschreckengaust and theagenticguy April 17, 2026 15:47

XinyuQu added the new plugin label Apr 17, 2026

XinyuQu wants to merge 1 commit into awslabs:main from XinyuQu:feat/codebase-documentor
