A pre-commit hook that enforces token count limits on markdown files to prevent unbounded growth and context window bloat in AI-assisted development workflows.
Markdown documentation files (like CLAUDE.md, memory files, and project docs) grow without bound over time, consuming valuable AI context window space. When documentation exceeds LLM context limits, it becomes fragmented, outdated, or, worse, silently truncated without your knowledge.
Manual monitoring is tedious and error-prone. You need automated guardrails that fail fast when documentation becomes too large.
mdtoken provides automated token counting checks during your git workflow with clear, actionable feedback when limits are exceeded. Think of it as a linter for your AI context consumption.
- ✅ Accurate token counting using tiktoken library
- ✅ Configurable tokenizers - Support for GPT-4, GPT-4o, Claude, Codex, and more
- ✅ Flexible configuration with per-file limits and glob patterns
- ✅ Fast execution (< 1 second for typical projects, 158 tests in < 2s)
- ✅ Clear error messages with actionable suggestions for remediation
- ✅ Pre-commit integration - Blocks commits that violate token limits
- ✅ Directory-level limits - Different limits for commands, skills, agents
- ✅ Total token budgets - Enforce aggregate limits across all files
- ✅ Dry-run mode - Preview violations without failing
- ✅ Minimal dependencies - Requires only Python 3.8+ and the tiktoken library
```bash
pip install mdtoken
```

Or install from source:

```bash
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken
pip install -e .
```

Create `.mdtokenrc.yaml` in your project root:
```yaml
# .mdtokenrc.yaml
default_limit: 4000
model: "gpt-4"  # Use GPT-4 tokenizer

limits:
  "CLAUDE.md": 8000
  "README.md": 6000
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000

exclude:
  - "node_modules/**"
  - "**/archive/**"
  - "venv/**"

total_limit: 50000
fail_on_exceed: true
```

```bash
# Check all markdown files
mdtoken check

# Check specific files
mdtoken check README.md CLAUDE.md

# Dry run (don't fail on violations)
mdtoken check --dry-run

# Verbose output with suggestions
mdtoken check --verbose
```

Add to `.pre-commit-config.yaml`:
```yaml
repos:
  - repo: https://github.com/applied-artificial-intelligence/mdtoken
    rev: v1.0.0  # Use latest release tag
    hooks:
      - id: markdown-token-limit
        args: ['--config=.mdtokenrc.yaml']
```

Then install the hook:

```bash
pre-commit install
```

Now every commit will check markdown files against your token limits!
```yaml
# Default token limit for all markdown files
default_limit: 4000

# Whether to fail (exit 1) when limits are exceeded
# Set to false for warning-only mode
fail_on_exceed: true

# Optional: Total token limit across all files
total_limit: 50000
```

Choose your tokenizer based on the LLM you're using:

```yaml
# Option 1: User-friendly model name (recommended)
model: "gpt-4"

# Option 2: Direct tiktoken encoding name
encoding: "cl100k_base"
```

Supported Models:
- `gpt-4` → `cl100k_base` (~100K-token vocabulary; default)
- `gpt-4o` → `o200k_base` (~200K-token vocabulary)
- `gpt-3.5-turbo` → `cl100k_base`
- `claude-3` → `cl100k_base` (Claude uses similar tokenization)
- `claude-3.5` → `cl100k_base`
- `codex` → `p50k_base`
- `text-davinci-003` → `p50k_base`
- `text-davinci-002` → `p50k_base`
- `gpt-3` → `r50k_base`
Note: If both model and encoding are specified, encoding takes precedence.
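The model-to-encoding resolution described above can be sketched as a small lookup. `MODEL_ENCODING_MAP` is the name the troubleshooting section mentions; the `resolve_encoding` helper and its exact behavior are illustrative, not mdtoken's actual code:

```python
from typing import Optional

# Mirrors the "Supported Models" list above (illustrative sketch).
MODEL_ENCODING_MAP = {
    "gpt-4": "cl100k_base",
    "gpt-4o": "o200k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "claude-3": "cl100k_base",
    "claude-3.5": "cl100k_base",
    "codex": "p50k_base",
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "gpt-3": "r50k_base",
}

def resolve_encoding(model: Optional[str], encoding: Optional[str]) -> str:
    """An explicit `encoding` takes precedence over `model`, as noted above."""
    if encoding:
        return encoding
    if model in MODEL_ENCODING_MAP:
        return MODEL_ENCODING_MAP[model]
    raise ValueError(f"Unknown model: {model!r}")
```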
```yaml
limits:
  # Exact file match
  "CLAUDE.md": 8000

  # Pattern matching (substring)
  "README.md": 6000  # Matches any path ending with README.md

  # Directory patterns
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000
  ".claude/agents/**": 4000
  "docs/*.md": 5000
```

Pattern Matching Rules:
- Exact path match has highest priority
- Substring match (e.g., "README.md" matches "docs/README.md")
- Falls back to `default_limit` if no pattern matches
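The resolution order above can be sketched as follows (a hypothetical `limit_for` helper using `fnmatch` for glob patterns; this illustrates the priority rules, not mdtoken's actual matcher):

```python
from fnmatch import fnmatch

def limit_for(path: str, limits: dict, default_limit: int) -> int:
    # 1. Exact path match has highest priority
    if path in limits:
        return limits[path]
    # 2. Glob or substring match (e.g. "docs/README.md" matches
    #    a plain "README.md" entry)
    for pattern, limit in limits.items():
        if fnmatch(path, pattern) or path.endswith("/" + pattern):
            return limit
    # 3. Fallback when no pattern matches
    return default_limit
```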
```yaml
exclude:
  # Default exclusions (automatically included)
  - ".git/**"
  - "node_modules/**"
  - "venv/**"
  - ".venv/**"
  - "build/**"
  - "dist/**"
  - "__pycache__/**"

  # Custom exclusions
  - "**/archive/**"
  - "**/old_docs/**"
  - "README.md"  # Exclude specific files entirely
```

For a Claude Code project with commands, skills, and agents:
```yaml
# .mdtokenrc.yaml
model: "claude-3.5"
default_limit: 4000
limits:
  ".claude/CLAUDE.md": 10000   # Main instruction file
  ".claude/commands/**": 2000  # Keep commands concise
  ".claude/skills/**": 3000
  ".claude/agents/**": 5000
  "README.md": 8000
exclude:
  - ".claude/memory/archived/**"
total_limit: 100000  # Aggregate limit
```

For a documentation-heavy project:

```yaml
# .mdtokenrc.yaml
model: "gpt-4"
default_limit: 5000
limits:
  "README.md": 8000
  "docs/api.md": 10000
  "docs/tutorials/**": 6000
  "docs/reference/**": 8000
exclude:
  - "docs/archive/**"
  - "docs/drafts/**"
fail_on_exceed: true
```

If you're optimizing for different models:
```yaml
# For GPT-4o with larger context window
model: "gpt-4o"
default_limit: 8000  # Can use larger limits
limits:
  "CLAUDE.md": 15000
  "docs/**": 10000
```

When all files pass:

```
✓ All files within token limits
3 files checked, 8,245 tokens total
```

When limits are exceeded:

```
✗ Token limit violations found:

docs/README.md: 5,234 tokens (limit: 4,000, over by 1,234)
  Suggestions:
  - Consider splitting this file into multiple smaller files
  - Target: reduce by ~1,234 tokens to get under the limit
  - Move detailed documentation to separate docs/ files

.claude/CLAUDE.md: 9,876 tokens (limit: 8,000, over by 1,876)
  Suggestions:
  - Review content and remove unnecessary sections
  - Consider moving older content to an archived directory

2/3 files over limit, 15,110 tokens total
```
For programmatic usage in Python scripts, see the API Documentation for detailed information on:
- `TokenCounter` - Count tokens in text and files
- `Config` - Load and manage configuration
- `LimitEnforcer` - Check files against token limits
- `FileMatcher` - Discover and filter markdown files
- `Reporter` - Format and display results
Quick example:
```python
from mdtoken.config import Config
from mdtoken.enforcer import LimitEnforcer
from mdtoken.reporter import Reporter

config = Config.from_file()
enforcer = LimitEnforcer(config)
result = enforcer.check_files()

reporter = Reporter(enforcer)
reporter.report(result, verbose=True)
```

See docs/api.md for the complete API reference with examples.
- API Reference - Complete API documentation for programmatic usage
- Usage Examples - Project-specific configurations and workflows
- Troubleshooting - Common issues and solutions
- FAQ - Frequently asked questions
Problem: mdtoken looks for .mdtokenrc.yaml in the current directory.
Solution: Either:
- Create `.mdtokenrc.yaml` in your project root
- Specify a config path: `mdtoken check --config path/to/config.yaml`
- Use defaults (no config file needed)
Problem: Model name not recognized in MODEL_ENCODING_MAP.
Solution: Either:
- Use a supported model name (see "Supported Models" above)
- Use a direct encoding: `encoding: "cl100k_base"`
Problem: Hook doesn't execute on commit.
Solution:
- Verify `.pre-commit-config.yaml` exists
- Run `pre-commit install`
- Test with `pre-commit run --all-files`
Problem: Slow execution on large repositories.
Current Status: mdtoken is highly optimized:
- 158 tests run in < 2 seconds
- Token counting: 6.15ms for 8K tokens (16x faster than the project's performance target)
If you're experiencing slowness:
- Use exclusion patterns to skip unnecessary directories
- Limit to specific file patterns
- Consider running only on changed files (pre-commit does this automatically)
```bash
# Clone repository
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
make lint

# Run type checking
make typecheck

# Format code
make format
```

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/mdtoken --cov-report=term

# Run specific test file
pytest tests/test_config.py

# Run verbose
pytest -xvs
```

```
mdtoken/
├── src/mdtoken/
│   ├── __init__.py
│   ├── __version__.py
│   ├── cli.py                    # Command-line interface
│   ├── config.py                 # Configuration loading/validation
│   ├── counter.py                # Token counting with tiktoken
│   ├── enforcer.py               # Limit enforcement logic
│   ├── matcher.py                # File pattern matching
│   ├── reporter.py               # Output formatting
│   └── commands/
│       └── check.py              # Check command implementation
├── tests/
│   ├── test_config.py            # Config tests (64 tests)
│   ├── test_counter.py           # Token counting tests
│   ├── test_enforcer.py          # Enforcement logic tests (72 tests)
│   ├── test_matcher.py           # File matching tests
│   ├── test_edge_cases.py        # Edge case handling (27 tests)
│   └── integration/
│       └── test_git_workflow.py  # End-to-end tests (12 tests)
├── .pre-commit-hooks.yaml        # Pre-commit hook definition
├── pyproject.toml                # Package configuration
└── README.md
```
Current Coverage: 84% (176 tests passing)
| Module | Coverage | Notes |
|---|---|---|
| config.py | 97% | Model/encoding resolution fully tested |
| counter.py | 90% | Token counting core |
| enforcer.py | 96% | Limit enforcement logic |
| matcher.py | 97% | File pattern matching |
| reporter.py | 90% | Output formatting |
| cli.py | 0% | Integration tests cover CLI |
| commands/check.py | 0% | Integration tests cover commands |
- ✅ Core token counting functionality
- ✅ Configurable tokenizers (encoding/model)
- ✅ YAML configuration support
- ✅ Pre-commit hook integration
- ✅ Per-file and pattern-based limits
- ✅ Total token budgets
- ✅ Clear error messages with suggestions
- ✅ Comprehensive test suite (176 tests, 84% coverage)
- ✅ CI/CD with GitHub Actions
- 🚧 PyPI distribution (pending)
- 🚧 Comprehensive documentation (in progress)
- Auto-fix/splitting suggestions with AI
- Token count caching for performance
- Parallel processing for large repos
- GitHub Action for PR checks
- IDE integrations (VS Code extension)
- Watch mode for live feedback
- HTML/JSON output formats
- Integration with documentation generators
Q: Why another markdown linter?
A: mdtoken is specifically designed for AI-assisted workflows where token counts matter. Traditional linters check syntax and style; mdtoken checks token consumption.
Q: Does it work with Claude, GPT-4o, or other models?
A: Yes! Use the model config parameter to select the appropriate tokenizer. While exact tokenization may vary slightly between models, tiktoken provides excellent approximations.
Q: Can I use this without pre-commit?
A: Absolutely! Run mdtoken check manually as part of your CI/CD pipeline, as a git hook, or integrate it into your build process.
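For CI, a minimal GitHub Actions job might look like the sketch below. The workflow file name and steps are illustrative, and `pip install mdtoken` assumes the PyPI package (listed as pending on the roadmap); install from source otherwise, as shown in the installation section.

```yaml
# .github/workflows/mdtoken.yml (illustrative sketch)
name: Token limits
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mdtoken
      - run: mdtoken check
```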
Q: What if I want different limits for different branches?
A: Create multiple config files (e.g., .mdtokenrc.production.yaml, .mdtokenrc.dev.yaml) and use --config flag to specify which one to use.
Q: How accurate is the token counting?
A: mdtoken uses tiktoken, the same library used by OpenAI's models. Accuracy is within 1-2% of actual model tokenization.
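That tiktoken-based counting boils down to encoding the text and counting the resulting tokens. The sketch below uses a hypothetical helper name (not mdtoken's actual API), with a crude characters-per-token fallback only so the snippet stays runnable where tiktoken is unavailable:

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken; fall back to a rough estimate."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # Rough heuristic (~4 characters per token for English text);
        # only used when tiktoken can't be loaded.
        return max(1, len(text) // 4)
```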
Q: Does it support languages other than markdown?
A: Currently markdown-only. Future versions may support additional file types.
Contributions are welcome! This project follows standard open-source practices:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with tests
- Run the test suite (`pytest`)
- Ensure code quality (`make lint`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
See CONTRIBUTING.md (coming soon) for detailed guidelines.
MIT License - See LICENSE file for details.
Stefan Rummer (@stefanrmmr)
- Built with Claude Code as part of an AI-assisted development workflow
- Uses OpenAI's tiktoken library for accurate token counting
- Inspired by real-world challenges managing context windows in LLM-assisted development
- Thanks to the pre-commit framework team for excellent hook infrastructure
Status: Ready for v1.0.0 release. Star ⭐ and watch 👀 for updates!