kutsenko/pdfa-cli

pdfa

Command-line tool that converts regular PDF documents into PDF/A files using OCRmyPDF with built-in OCR.

Features

  • Wraps OCRmyPDF to generate PDF/A-2 compliant files with OCR enforced.
  • Accepts input/output paths along with configurable OCR language and PDF/A level.
  • Ships with a test suite plus Black and Ruff configurations for streamlined development.

Requirements

  • Python 3.11+
  • OCRmyPDF runtime dependencies (Tesseract, Ghostscript, etc.) installed on your system. Refer to the OCRmyPDF installation guide.

Ubuntu 24.04

Install the system dependencies with APT before setting up the virtual environment:

sudo apt update
sudo apt install python3-venv python3-pip tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu ghostscript qpdf

Add extra tesseract-ocr-<lang> packages if you need OCR support for additional languages.
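Before creating the virtual environment, it can be useful to confirm that the system tools are actually visible on your PATH. The following is a minimal sketch and not part of pdfa-cli: the helper name `missing_tools` is hypothetical, and the executable names (`tesseract`, `gs`, `qpdf`) are assumptions based on the packages installed above.

```python
import shutil


def missing_tools(tools=("tesseract", "gs", "qpdf")):
    """Return the names in `tools` that cannot be found on PATH.

    The default tuple is an assumption derived from the APT packages
    listed above; adjust it to match your platform.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All checked executables were found on PATH.")
```

If anything is reported as missing, re-run the `apt install` command above or consult the OCRmyPDF installation guide for your platform.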

Getting Started

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help

Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.

Usage

pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3

This command converts input.pdf into a PDF/A-3 file written to output.pdf, enforcing OCR with the German and English Tesseract language data (deu+eng).
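To illustrate how these flags could map onto OCRmyPDF's Python API, here is a hedged sketch. It is not the project's actual `cli.py`: the parser layout, the default values, and the helper names `build_parser` and `to_ocrmypdf_kwargs` are assumptions made for illustration, and "OCR enforced" is assumed to correspond to OCRmyPDF's `force_ocr` option. The `ocrmypdf` import is deferred so the argument handling can be tried without OCRmyPDF installed.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the usage example."""
    parser = argparse.ArgumentParser(prog="pdfa-cli")
    parser.add_argument("input_pdf")
    parser.add_argument("output_pdf")
    # Defaults below are assumptions, not taken from the project.
    parser.add_argument("--language", default="eng")
    parser.add_argument("--pdfa-level", choices=["1", "2", "3"], default="2")
    return parser


def to_ocrmypdf_kwargs(args):
    """Translate parsed arguments into keyword arguments for ocrmypdf.ocr()."""
    return {
        # "deu+eng" becomes ["deu", "eng"]
        "language": args.language.split("+"),
        # OCRmyPDF output types include "pdfa-1", "pdfa-2", and "pdfa-3"
        "output_type": f"pdfa-{args.pdfa_level}",
        # Assumed meaning of "OCR enforced": rasterize and OCR every page
        "force_ocr": True,
    }


def main(argv=None):
    args = build_parser().parse_args(argv)
    import ocrmypdf  # deferred import; requires OCRmyPDF to be installed

    return ocrmypdf.ocr(args.input_pdf, args.output_pdf, **to_ocrmypdf_kwargs(args))


# Show the mapping for the usage example without running OCR.
example = build_parser().parse_args(
    ["input.pdf", "output.pdf", "--language", "deu+eng", "--pdfa-level", "3"]
)
print(to_ocrmypdf_kwargs(example))
# {'language': ['deu', 'eng'], 'output_type': 'pdfa-3', 'force_ocr': True}
```

Keeping the argument-to-keyword translation in its own function makes it testable without invoking Tesseract or Ghostscript.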

Testing

pytest

Project Layout

.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       └── cli.py
└── tests
    ├── __init__.py
    └── test_cli.py
