codectx processes repositories through a structured analysis pipeline that ranks code by importance, compresses it intelligently, and emits a structured markdown document optimized for AI systems.
The pipeline consists of six stages: file discovery, parsing, graph construction, ranking, compression, and formatting.
Stage 1: File Discovery (Walker)

Purpose: Discover repository files while respecting ignore rules.
The Walker recursively traverses the filesystem from the repository root and applies ignore rules in order:
- `ALWAYS_IGNORE` — built-in patterns (`.git`, `__pycache__`, `.venv`, etc.)
- `.gitignore` — Git standard ignore rules
- `.ctxignore` — codectx-specific ignore rules
The tool uses pathspec with gitwildmatch semantics to ensure exact behavioral parity with Git's ignore processing.
Output: List[Path] of files to analyze.
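A minimal sketch of the layered ignore check, using stdlib `fnmatch` as a simplified stand-in for pathspec's gitwildmatch matching (the names `ALWAYS_IGNORE`, `is_ignored`, and `walk` are illustrative, not the tool's actual API):

```python
import fnmatch
from pathlib import Path

# Built-in patterns, per the list above; .gitignore / .ctxignore rules
# would be loaded from disk and appended as extra_patterns.
ALWAYS_IGNORE = [".git", "__pycache__", ".venv"]

def is_ignored(path, extra_patterns=()):
    """Return True if any path component matches an ignore pattern."""
    patterns = list(ALWAYS_IGNORE) + list(extra_patterns)
    return any(
        fnmatch.fnmatch(part, pat)
        for part in Path(path).parts
        for pat in patterns
    )

def walk(root, extra_patterns=()):
    """Yield files under root that survive the ignore rules."""
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and not is_ignored(p.relative_to(root), extra_patterns):
            yield p
```

Note that `fnmatch` only approximates Git's matching; the real tool delegates to pathspec precisely to avoid such behavioral drift.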
Stage 2: Parsing (Parser)

Purpose: Extract imports, symbols, and metadata from source files.
The Parser processes files in parallel using ProcessPoolExecutor (CPU-bound) and ThreadPoolExecutor (I/O-bound). For each file:
- Detect language from file extension
- Parse AST using tree-sitter
- Extract:
  - Import statements (list of import strings)
  - Top-level symbols (functions, classes, methods)
  - Docstrings per symbol
  - Code structure metadata
Tree-sitter provides a unified interface across nine languages: Python, TypeScript, JavaScript, Go, Rust, Java, C, C++, and Ruby.
Output: Dict[Path, ParseResult] where each ParseResult contains imports, symbols, and source text.
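As a rough illustration of the extraction step, here is a Python-only sketch using the stdlib `ast` module in place of tree-sitter; the `ParseResult` dataclass simply mirrors the fields listed above and is not the tool's actual type:

```python
import ast
from dataclasses import dataclass

@dataclass
class ParseResult:
    # Mirrors the fields described above; the real tool extracts these
    # via tree-sitter for all supported languages, not just Python.
    imports: list
    symbols: list
    docstrings: dict

def parse_python(source: str) -> ParseResult:
    tree = ast.parse(source)
    imports, symbols, docstrings = [], [], {}
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.Import):
            imports += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append(node.name)
            doc = ast.get_docstring(node)
            if doc:
                docstrings[node.name] = doc
    return ParseResult(imports, symbols, docstrings)
```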
Stage 3: Graph Construction (Graph Builder)

Purpose: Build a directed graph representing module relationships.
The Graph Builder processes parse results to construct a rustworkx.DiGraph:
- For each import statement, resolve the import string to a file path using per-language import resolvers
- Create nodes for files and edges for import relationships
- Compute graph metrics:
  - Fan-in — in-degree per node (how many files import this module)
  - Fan-out — out-degree per node (how many modules this file imports)
  - Strongly connected components — detect cyclic dependencies
The graph enables ranking algorithms to identify important modules based on structural position.
Output: rustworkx.DiGraph with computed metrics.
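A simplified sketch of the edge-building and metric computation, using plain dictionaries instead of `rustworkx` (the function names and the `resolve` callable are illustrative):

```python
from collections import defaultdict

def build_graph(parsed_imports, resolve):
    """Build a directed edge list from parse results.

    parsed_imports: {file: [import strings]}; resolve maps an import
    string to a file path (per-language resolvers in the real tool).
    The real implementation stores this in a rustworkx DiGraph.
    """
    edges = []
    for src, imports in parsed_imports.items():
        for imp in imports:
            dst = resolve(imp)
            if dst is not None:  # None = external / unresolvable import
                edges.append((src, dst))
    return edges

def graph_metrics(edges):
    """Fan-in (in-degree) and fan-out (out-degree) per node."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1
    return dict(fan_in), dict(fan_out)
```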
Stage 4: Ranking (Ranker)

Purpose: Score files by importance using multiple signals.
The Ranker computes a composite importance score for each file:
    score = (0.35 × git_frequency)
          + (0.35 × fan_in)
          + (0.20 × recency)
          + (0.10 × entry_proximity)
Git Frequency (0.35): Commit count touching the file. Frequently-modified files are typically more important.
Fan-in (0.35): Normalized in-degree. Files imported by many other modules are critical interfaces.
Recency (0.20): Inverse of days since last modification, so recently active files are prioritized.
Entry Proximity (0.10): Graph distance from identified entry points. Files close to main execution paths rank higher.
Scores are normalized to [0.0, 1.0] range for uniform compression tier assignment.
Output: Dict[Path, float] mapping file paths to scores.
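The scoring formula above can be sketched directly; this assumes each input signal has already been normalized to [0, 1] with higher meaning more important, and the function name is illustrative:

```python
def rank(signals):
    """Composite score per file.

    signals: {path: (git_frequency, fan_in, recency, entry_proximity)},
    each signal already normalized to [0, 1].
    """
    scores = {
        path: 0.35 * git + 0.35 * fan_in + 0.20 * recency + 0.10 * entry
        for path, (git, fan_in, recency, entry) in signals.items()
    }
    # Re-normalize to [0.0, 1.0] for uniform tier assignment.
    top = max(scores.values(), default=1.0) or 1.0
    return {path: s / top for path, s in scores.items()}
```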
Stage 5: Compression (Compressor)

Purpose: Fit code content within a token budget.
The Compressor assigns content tiers based on scores:
- Tier 1 (score ≥ 0.7) — Full source code
- Tier 2 (0.3 ≤ score < 0.7) — Function signatures and docstrings only
- Tier 3 (score < 0.3) — One-line summary
Files are emitted in order: Tier 1 by score descending, then Tier 2, then Tier 3.
If total token count exceeds the budget:
- Drop all Tier 3 files
- Truncate Tier 2 content (keep only signatures, remove docstrings)
- Truncate Tier 1 content (reduce line count progressively)
- If still over budget, drop lowest-scored Tier 1 files
This is a hard constraint. The tool does not emit context that exceeds the token limit.
Output: Dict[Path, CompressedContent] and usage statistics.
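A minimal sketch of tier assignment and emission ordering, using the thresholds from the tier list above (function names are illustrative):

```python
def assign_tier(score):
    """Map a normalized score to a compression tier."""
    if score >= 0.7:
        return 1   # full source
    if score >= 0.3:
        return 2   # signatures + docstrings
    return 3       # one-line summary

def emission_order(scores):
    """Files ordered tier-ascending, then score-descending within a tier."""
    return sorted(scores, key=lambda p: (assign_tier(scores[p]), -scores[p]))
```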
Stage 6: Formatting (Formatter)

Purpose: Emit structured markdown optimized for AI agents.
The Formatter writes sections in fixed order:
- ARCHITECTURE — High-level project structure
- DEPENDENCY_GRAPH — Mermaid diagram of module relationships
- ENTRY_POINTS — Main files and public interfaces with full source
- CORE_MODULES — High-scoring modules with full source
- SUPPORTING_MODULES — Mid-scoring modules with signatures and docstrings
- PERIPHERY — Low-scoring files with one-line summaries
- RECENT_CHANGES — Optional diff section (emitted when the `--since` flag is provided)
Each section is preceded by a Markdown heading and terminated with metadata (token count, file count).
Output: Markdown string suitable for writing to disk as CONTEXT.md.
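A rough sketch of the fixed-order section emission; the `##` heading style and the skipping of empty sections are assumptions here, not necessarily the tool's exact output (which also appends token and file counts per section):

```python
SECTION_ORDER = [
    "ARCHITECTURE", "DEPENDENCY_GRAPH", "ENTRY_POINTS",
    "CORE_MODULES", "SUPPORTING_MODULES", "PERIPHERY", "RECENT_CHANGES",
]

def format_context(sections):
    """Render the given section bodies in the fixed order above."""
    parts = []
    for name in SECTION_ORDER:
        body = sections.get(name)
        if body:
            parts.append(f"## {name}\n\n{body}\n")
    return "\n".join(parts)
```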
    File System
      │
      ├─→ [Walker]
      │     ├ Respects .gitignore
      │     ├ Respects .ctxignore
      │     └ Output: List[Path]
      │
      ├─→ [Parser] (Parallel)
      │     ├ Per-language extraction
      │     ├ tree-sitter AST processing
      │     └ Output: Dict[Path, ParseResult]
      │
      ├─→ [Graph Builder]
      │     ├ Resolve imports
      │     ├ Construct DiGraph
      │     └ Output: rustworkx.DiGraph
      │
      ├─→ [Git Metadata] (Parallel)
      │     ├ Commit frequency per file
      │     ├ Recency (last modification)
      │     └ Output: Dict[Path, GitMeta]
      │
      ├─→ [Ranker]
      │     ├ Composite scoring
      │     ├ Normalize to [0.0, 1.0]
      │     └ Output: Dict[Path, float]
      │
      ├─→ [Compressor]
      │     ├ Tier assignment
      │     ├ Token budget enforcement
      │     └ Output: Dict[Path, CompressedContent]
      │
      └─→ [Formatter]
            ├ Section organization
            ├ Markdown generation
            └ Output: CONTEXT.md
The tool caches expensive computations:
Cache key: (file_path, file_hash, git_commit_sha)
Cached items:
- Parsed AST and extracted symbols per file
- Git metadata (frequency, recency)
Cache location: .codectx_cache/ at repository root (gitignored)
Invalidation: Cache entries are invalidated when file content changes or HEAD commit changes.
This enables fast incremental updates in watch mode.
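The cache-key scheme above can be sketched as follows (hashing the content with SHA-256 is an assumption; the tool may use a different hash):

```python
import hashlib

def cache_key(file_path, file_bytes, git_commit_sha):
    """Derive a cache key from (path, content hash, HEAD sha).

    Entries naturally invalidate when either the file content or the
    HEAD commit changes, matching the invalidation rule above.
    """
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    return f"{file_path}:{file_hash}:{git_commit_sha}"
```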
When running codectx watch ., the tool:
- Monitors the filesystem with `watchfiles`
- On file change:
  - Reparse only affected files
  - Rebuild graph for changed nodes and dependents
  - Re-rank affected subgraph
  - Recompress to budget
  - Re-emit output
This is significantly faster than full analysis on every change.
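A sketch of how the affected subgraph might be computed: the changed files plus every transitive importer, walked over reversed dependency edges (the function name is illustrative, not the tool's API):

```python
def affected_files(changed, edges):
    """Changed files plus everything that (transitively) imports them.

    edges: (importer, imported) pairs from the dependency graph; only
    this subset needs re-ranking after a change.
    """
    reverse = {}
    for importer, imported in edges:
        reverse.setdefault(imported, set()).add(importer)
    affected, stack = set(changed), list(changed)
    while stack:  # depth-first walk over reverse edges
        node = stack.pop()
        for dependent in reverse.get(node, ()):
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected
```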
Token counting uses tiktoken, which matches OpenAI model tokenization exactly and serves as a close approximation for Anthropic models.
Budget enforcement is hard: the tool does not emit context exceeding the specified limit.
Consumption order:
- Fixed overhead (section headers, metadata) — typically 500–1000 tokens
- Tier 1 files by score descending (full source)
- Tier 2 files by score descending (signatures only)
- Tier 3 files by score descending (one-line summaries)
Files omitted due to budget are logged with a note in the output.
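A simplified sketch of this consumption order: fill the budget tier by tier, score-descending within each tier, and record what gets omitted. The real tool truncates content progressively (and counts tokens with tiktoken) rather than dropping whole files against a plain character count, as done here:

```python
def fit_to_budget(files, budget, count_tokens=len):
    """Greedy budget fill in tier order.

    files: [(path, tier, score, content)]; count_tokens defaults to a
    crude character count for illustration only.
    Returns (kept, omitted) path lists.
    """
    ordered = sorted(files, key=lambda f: (f[1], -f[2]))  # tier asc, score desc
    kept, omitted, used = [], [], 0
    for path, tier, score, content in ordered:
        cost = count_tokens(content)
        if used + cost <= budget:
            kept.append(path)
            used += cost
        else:
            omitted.append(path)  # logged with a note in the real output
    return kept, omitted
```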
The Parser uses tree-sitter for universal AST extraction. Each language requires:
- tree-sitter grammar — provided by the `tree-sitter-LANGUAGE` package
- Import resolver — per-language logic to resolve import strings to file paths
Currently supported:
- Python — `import X`, `from X import Y`
- TypeScript/JavaScript — `import * from "X"`, `require("X")`
- Go — `import "X"`
- Rust — `use X::{Y, Z}`
- Java — `import X.Y;`
Adding a language requires implementing a resolver in src/codectx/graph/resolver.py and adding the grammar dependency to pyproject.toml.
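A hypothetical resolver sketch for Python-style imports: the candidate-path strategy here (module file, then package `__init__.py`) is an assumption about how such a resolver might work, not the actual code in resolver.py:

```python
from pathlib import PurePosixPath

def resolve_python_import(import_string, known_files):
    """Map a dotted import like 'pkg.module' to a repo-relative path.

    known_files: set of repo-relative paths from the Walker stage.
    Returns None for external or unresolvable imports.
    """
    base = PurePosixPath(*import_string.split("."))
    for candidate in (f"{base}.py", f"{base}/__init__.py"):
        if candidate in known_files:
            return candidate
    return None  # external dependency, excluded from the graph
```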
Configuration is applied in this precedence order:
- CLI flags (highest priority)
- `.contextcraft.toml` in repository root
- Built-in defaults (lowest priority)
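This layering can be sketched with `collections.ChainMap`, which searches its maps left to right; the default values shown are illustrative, not the tool's actual defaults:

```python
from collections import ChainMap

# Illustrative defaults only; the real built-in defaults may differ.
DEFAULTS = {"token_budget": 100_000, "output": "CONTEXT.md"}

def effective_config(cli_flags, toml_config):
    """Layered lookup: CLI flags over .contextcraft.toml over defaults."""
    return ChainMap(cli_flags, toml_config, DEFAULTS)
```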
Example `.contextcraft.toml`:

    [codectx]
    token_budget = 120000
    output = "CONTEXT.md"
    include_patterns = ["src/**", "lib/**"]
    exclude_patterns = ["tests/**", "*.test.py"]

CPU-bound tasks (Parser): ProcessPoolExecutor — parsing and AST extraction leverage the tree-sitter C extension.
I/O-bound tasks (Git metadata, file I/O): ThreadPoolExecutor — reading git history and source files is I/O-bound.
Sync tasks: Graph construction, ranking, and compression are single-threaded because they are fast and maintain simple state.
This mixed-executor approach balances CPU and I/O contention.
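A minimal sketch of the I/O-bound side, assuming a `fetch_meta` callable that would shell out to git for one file (the function name and worker count are illustrative; the CPU-bound parser uses ProcessPoolExecutor instead):

```python
from concurrent.futures import ThreadPoolExecutor

def gather_git_metadata(paths, fetch_meta):
    """Fan out git lookups across threads, preserving input order.

    ThreadPoolExecutor fits here because each lookup blocks on
    subprocess / file I/O rather than burning CPU.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(paths, pool.map(fetch_meta, paths)))
```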
On a typical 10k-file repository:
- Walker: ~500ms (filesystem traversal)
- Parser: ~2-5s (parallel tree-sitter parsing)
- Graph Builder: ~100ms (import resolution)
- Ranker: ~200ms (scoring and normalization)
- Compressor: ~50ms (tier assignment)
- Formatter: ~100ms (markdown generation)
Total: ~3-6 seconds for full analysis.
Incremental mode (watch) is typically 5-10x faster because it processes only changed files.