Skip to content

applied-artificial-intelligence/mdtoken

Repository files navigation

mdtoken - Markdown Token Limit Pre-commit Hook

Tests codecov License: MIT Python 3.8+

A pre-commit hook that enforces token count limits on markdown files to prevent unbounded growth and context window bloat in AI-assisted development workflows.

Problem

Markdown documentation files (like CLAUDE.md, memory files, project docs) grow unbounded over time, consuming valuable AI context windows. When your documentation files exceed LLM context limits, they become fragmented, outdated, or worse - silently truncated without your knowledge.

Manual monitoring is tedious and error-prone. You need automated guardrails that fail fast when documentation becomes too large.

Solution

mdtoken provides automated token counting checks during your git workflow with clear, actionable feedback when limits are exceeded. Think of it as a linter for your AI context consumption.

Features

  • Accurate token counting using tiktoken library
  • Configurable tokenizers - Support for GPT-4, GPT-4o, Claude, Codex, and more
  • Flexible configuration with per-file limits and glob patterns
  • Fast execution (< 1 second for typical projects, 158 tests in < 2s)
  • Clear error messages with actionable suggestions for remediation
  • Pre-commit integration - Blocks commits that violate token limits
  • Directory-level limits - Different limits for commands, skills, agents
  • Total token budgets - Enforce aggregate limits across all files
  • Dry-run mode - Preview violations without failing
  • Zero dependencies - Only requires Python 3.8+ and standard libraries

Installation

From PyPI (Coming Soon)

pip install mdtoken

From Source (Development)

git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken
pip install -e .

Quick Start

1. Create Configuration File

Create .mdtokenrc.yaml in your project root:

# .mdtokenrc.yaml
default_limit: 4000
model: "gpt-4"  # Use GPT-4 tokenizer

limits:
  "CLAUDE.md": 8000
  "README.md": 6000
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000

exclude:
  - "node_modules/**"
  - "**/archive/**"
  - "venv/**"

total_limit: 50000
fail_on_exceed: true

2. Run Manually

# Check all markdown files
mdtoken check

# Check specific files
mdtoken check README.md CLAUDE.md

# Dry run (don't fail on violations)
mdtoken check --dry-run

# Verbose output with suggestions
mdtoken check --verbose

3. Integrate with Pre-commit

Add to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/applied-artificial-intelligence/mdtoken
    rev: v1.0.0  # Use latest release tag
    hooks:
      - id: markdown-token-limit
        args: ['--config=.mdtokenrc.yaml']

Then install the hook:

pre-commit install

Now every commit will check markdown files against your token limits!

Configuration Guide

Basic Options

# Default token limit for all markdown files
default_limit: 4000

# Whether to fail (exit 1) when limits are exceeded
# Set to false for warning-only mode
fail_on_exceed: true

# Optional: Total token limit across all files
total_limit: 50000

Tokenizer Configuration

Choose your tokenizer based on the LLM you're using:

# Option 1: User-friendly model name (recommended)
model: "gpt-4"

# Option 2: Direct tiktoken encoding name
encoding: "cl100k_base"

Supported Models:

  • gpt-4 → cl100k_base (100K token context, default)
  • gpt-4o → o200k_base (200K token context)
  • gpt-3.5-turbo → cl100k_base
  • claude-3 → cl100k_base (Claude uses similar tokenization)
  • claude-3.5 → cl100k_base
  • codex → p50k_base
  • text-davinci-003 → p50k_base
  • text-davinci-002 → p50k_base
  • gpt-3 → r50k_base

Note: If both model and encoding are specified, encoding takes precedence.

Per-File and Pattern-Based Limits

limits:
  # Exact file match
  "CLAUDE.md": 8000

  # Pattern matching (substring)
  "README.md": 6000  # Matches any path ending with README.md

  # Directory patterns
  ".claude/commands/**": 2000
  ".claude/skills/**": 3000
  ".claude/agents/**": 4000
  "docs/*.md": 5000

Pattern Matching Rules:

  • Exact path match has highest priority
  • Substring match (e.g., "README.md" matches "docs/README.md")
  • Falls back to default_limit if no pattern matches

Exclusion Patterns

exclude:
  # Default exclusions (automatically included)
  - ".git/**"
  - "node_modules/**"
  - "venv/**"
  - ".venv/**"
  - "build/**"
  - "dist/**"
  - "__pycache__/**"

  # Custom exclusions
  - "**/archive/**"
  - "**/old_docs/**"
  - "README.md"  # Exclude specific files entirely

Usage Examples

Example 1: Claude Code Project

For a Claude Code project with commands, skills, and agents:

# .mdtokenrc.yaml
model: "claude-3.5"
default_limit: 4000

limits:
  ".claude/CLAUDE.md": 10000  # Main instruction file
  ".claude/commands/**": 2000  # Keep commands concise
  ".claude/skills/**": 3000
  ".claude/agents/**": 5000
  "README.md": 8000

exclude:
  - ".claude/memory/archived/**"

total_limit: 100000  # Aggregate limit

Example 2: Documentation-Heavy Project

# .mdtokenrc.yaml
model: "gpt-4"
default_limit: 5000

limits:
  "README.md": 8000
  "docs/api.md": 10000
  "docs/tutorials/**": 6000
  "docs/reference/**": 8000

exclude:
  - "docs/archive/**"
  - "docs/drafts/**"

fail_on_exceed: true

Example 3: Multi-Model Support

If you're optimizing for different models:

# For GPT-4o with larger context window
model: "gpt-4o"
default_limit: 8000  # Can use larger limits

limits:
  "CLAUDE.md": 15000
  "docs/**": 10000

Output Examples

Passing Check

✓ All files within token limits
3 files checked, 8,245 tokens total

Failing Check

✗ Token limit violations found:

docs/README.md: 5,234 tokens (limit: 4,000, over by 1,234)
  Suggestions:
  - Consider splitting this file into multiple smaller files
  - Target: reduce by ~1,234 tokens to get under the limit
  - Move detailed documentation to separate docs/ files

.claude/CLAUDE.md: 9,876 tokens (limit: 8,000, over by 1,876)
  Suggestions:
  - Review content and remove unnecessary sections
  - Consider moving older content to an archived directory

2/3 files over limit, 15,110 tokens total

Programmatic Usage

For programmatic usage in Python scripts, see the API Documentation for detailed information on:

  • TokenCounter - Count tokens in text and files
  • Config - Load and manage configuration
  • LimitEnforcer - Check files against token limits
  • FileMatcher - Discover and filter markdown files
  • Reporter - Format and display results

Quick example:

from mdtoken.config import Config
from mdtoken.enforcer import LimitEnforcer
from mdtoken.reporter import Reporter

config = Config.from_file()
enforcer = LimitEnforcer(config)
result = enforcer.check_files()

reporter = Reporter(enforcer)
reporter.report(result, verbose=True)

See docs/api.md for complete API reference with examples.

Documentation

Troubleshooting

"Config file not found" Warning

Problem: mdtoken looks for .mdtokenrc.yaml in the current directory.

Solution: Either:

  1. Create .mdtokenrc.yaml in your project root
  2. Specify config path: mdtoken check --config path/to/config.yaml
  3. Use defaults (no config file needed)

"Unknown model" Error

Problem: Model name not recognized in MODEL_ENCODING_MAP.

Solution: Either:

  • Use a supported model name (see "Supported Models" above)
  • Use direct encoding: encoding: "cl100k_base"

Pre-commit Hook Not Running

Problem: Hook doesn't execute on commit.

Solution:

  1. Verify .pre-commit-config.yaml exists
  2. Run pre-commit install
  3. Test with pre-commit run --all-files

Performance Issues

Problem: Slow execution on large repositories.

Current Status: mdtoken is highly optimized:

  • 158 tests run in < 2 seconds
  • Token counting: 6.15ms for 8K tokens (16x faster than requirement)

If you're experiencing slowness:

  1. Use exclusion patterns to skip unnecessary directories
  2. Limit to specific file patterns
  3. Consider running only on changed files (pre-commit does this automatically)

Development

Setup Development Environment

# Clone repository
git clone https://github.com/applied-artificial-intelligence/mdtoken.git
cd mdtoken

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
make lint

# Run type checking
make typecheck

# Format code
make format

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=src/mdtoken --cov-report=term

# Run specific test file
pytest tests/test_config.py

# Run verbose
pytest -xvs

Project Structure

mdtoken/
├── src/mdtoken/
│   ├── __init__.py
│   ├── __version__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration loading/validation
│   ├── counter.py          # Token counting with tiktoken
│   ├── enforcer.py         # Limit enforcement logic
│   ├── matcher.py          # File pattern matching
│   ├── reporter.py         # Output formatting
│   └── commands/
│       └── check.py        # Check command implementation
├── tests/
│   ├── test_config.py      # Config tests (64 tests)
│   ├── test_counter.py     # Token counting tests
│   ├── test_enforcer.py    # Enforcement logic tests (72 tests)
│   ├── test_matcher.py     # File matching tests
│   ├── test_edge_cases.py  # Edge case handling (27 tests)
│   └── integration/
│       └── test_git_workflow.py  # End-to-end tests (12 tests)
├── .pre-commit-hooks.yaml  # Pre-commit hook definition
├── pyproject.toml          # Package configuration
└── README.md

Test Coverage

Current Coverage: 84% (176 tests passing)

Module Coverage Notes
config.py 97% Model/encoding resolution fully tested
counter.py 90% Token counting core
enforcer.py 96% Limit enforcement logic
matcher.py 97% File pattern matching
reporter.py 90% Output formatting
cli.py 0% Integration tests cover CLI
commands/check.py 0% Integration tests cover commands

Roadmap

v1.0.0 (Current - Ready for Release)

  • ✅ Core token counting functionality
  • ✅ Configurable tokenizers (encoding/model)
  • ✅ YAML configuration support
  • ✅ Pre-commit hook integration
  • ✅ Per-file and pattern-based limits
  • ✅ Total token budgets
  • ✅ Clear error messages with suggestions
  • ✅ Comprehensive test suite (176 tests, 84% coverage)
  • ✅ CI/CD with GitHub Actions
  • 🚧 PyPI distribution (pending)
  • 🚧 Comprehensive documentation (in progress)

v1.1+ (Future Enhancements)

  • Auto-fix/splitting suggestions with AI
  • Token count caching for performance
  • Parallel processing for large repos
  • GitHub Action for PR checks
  • IDE integrations (VS Code extension)
  • Watch mode for live feedback
  • HTML/JSON output formats
  • Integration with documentation generators

FAQ

Q: Why another markdown linter? A: mdtoken is specifically designed for AI-assisted workflows where token counts matter. Traditional linters check syntax/style; mdtoken checks token consumption.

Q: Does it work with Claude, GPT-4o, or other models? A: Yes! Use the model config parameter to select the appropriate tokenizer. While exact tokenization may vary slightly between models, tiktoken provides excellent approximations.

Q: Can I use this without pre-commit? A: Absolutely! Run mdtoken check manually as part of your CI/CD pipeline, as a git hook, or integrate it into your build process.

Q: What if I want different limits for different branches? A: Create multiple config files (e.g., .mdtokenrc.production.yaml, .mdtokenrc.dev.yaml) and use --config flag to specify which one to use.

Q: How accurate is the token counting? A: mdtoken uses tiktoken, the same library used by OpenAI's models. Accuracy is within 1-2% of actual model tokenization.

Q: Does it support languages other than markdown? A: Currently markdown-only. Future versions may support additional file types.

Contributing

Contributions are welcome! This project follows standard open-source practices:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run the test suite (pytest)
  5. Ensure code quality (make lint)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

See CONTRIBUTING.md (coming soon) for detailed guidelines.

License

MIT License - See LICENSE file for details.

Author

Stefan Rummer (@stefanrmmr)

Acknowledgments

  • Built with Claude Code as part of an AI-assisted development workflow
  • Uses OpenAI's tiktoken library for accurate token counting
  • Inspired by real-world challenges managing context windows in LLM-assisted development
  • Thanks to the pre-commit framework team for excellent hook infrastructure

Status: Ready for v1.0.0 release. Star ⭐ and watch 👀 for updates!

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •