Command-line tool that converts regular PDF documents into PDF/A files using OCRmyPDF with built-in OCR.
- Wraps OCRmyPDF to generate PDF/A-2 compliant files with OCR enforced.
- Accepts input/output paths along with configurable OCR language and PDF/A level.
- Ships with tests plus black and ruff configurations for streamlined development.
- Python 3.11+
- OCRmyPDF runtime dependencies (Tesseract, Ghostscript, etc.) installed on your system. Refer to the OCRmyPDF installation guide.
Install the system dependencies with APT before setting up the virtual environment:
sudo apt update
sudo apt install python3-venv python3-pip tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu ghostscript qpdf

Add extra tesseract-ocr-<lang> packages if you need OCR support for additional languages.
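
To confirm that the system dependencies are visible before creating the virtual environment, a small check like the following may help. It is only a sketch: the executable names (`tesseract`, `gs`, `qpdf`) are assumed from the APT packages listed above, and `missing_tools` is a hypothetical helper that is not part of this project.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed executable names for the APT packages above:
# tesseract-ocr -> tesseract, ghostscript -> gs, qpdf -> qpdf.
REQUIRED_TOOLS = ("tesseract", "gs", "qpdf")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.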
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help

Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.
pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3

This command converts input.pdf into a PDF/A file written to output.pdf, enforcing OCR with the specified Tesseract languages.
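
For readers curious how such arguments could map onto OCRmyPDF, here is a rough sketch. It is not the actual contents of `cli.py`: the function names `build_parser` and `to_ocr_options` are hypothetical, and the defaults shown (`eng` for the language, level `2`) are assumptions. The resulting options use the `language`, `output_type`, and `force_ocr` keyword arguments of `ocrmypdf.ocr`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the command shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdfa-cli",
        description="Convert a PDF into a PDF/A file with OCR.",
    )
    parser.add_argument("input_pdf", help="path to the source PDF")
    parser.add_argument("output_pdf", help="path for the PDF/A result")
    # Default language is an assumption, not taken from the project.
    parser.add_argument("--language", default="eng",
                        help="Tesseract languages joined with '+', e.g. deu+eng")
    # Default level 2 is an assumption based on the PDF/A-2 feature note.
    parser.add_argument("--pdfa-level", choices=["1", "2", "3"], default="2",
                        help="PDF/A conformance level")
    return parser


def to_ocr_options(args: argparse.Namespace) -> dict:
    """Translate parsed arguments into keyword options for ocrmypdf.ocr."""
    return {
        "language": args.language.split("+"),      # "deu+eng" -> ["deu", "eng"]
        "output_type": f"pdfa-{args.pdfa_level}",  # e.g. "pdfa-3"
        "force_ocr": True,                         # OCR is always enforced
    }


if __name__ == "__main__":
    args = build_parser().parse_args()
    options = to_ocr_options(args)
    print(options)
    # With OCRmyPDF installed, the conversion itself could then be run as:
    #   import ocrmypdf
    #   ocrmypdf.ocr(args.input_pdf, args.output_pdf, **options)
```

Keeping the argument-to-option translation in its own function makes it testable without OCRmyPDF or Tesseract being installed.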
pytest

.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       └── cli.py
└── tests
    ├── __init__.py
    └── test_cli.py