pdfsmith

PDF to Markdown conversion with multiple backend support

A unified interface to 19+ PDF parsing libraries including frontier LLMs. Pick the right tool for the job, or let pdfsmith choose for you.

Why pdfsmith?

One API, many backends - Switch between parsers without changing your code
Auto-selection - Automatically uses the best available parser
Lightweight core - Install only the backends you need
Battle-tested - Wrappers refined through extensive benchmarking

Installation

# Core package (no backends)
pip install pdfsmith

# With lightweight backends
pip install pdfsmith[light]

# Recommended stack (good balance of quality and speed)
pip install pdfsmith[recommended]

# All open-source backends
pip install pdfsmith[all]

# Frontier LLMs (GPT, Claude, Gemini)
pip install pdfsmith[frontier]

# Commercial cloud APIs
pip install pdfsmith[commercial]

# Specific backend
pip install pdfsmith[docling]

Quick Start

from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use a specific backend
markdown = parse("document.pdf", backend="docling")

# Check available backends
from pdfsmith import available_backends
for backend in available_backends():
    print(f"{backend.name}: {backend.description}")

CLI Usage

# Parse PDF to stdout
pdfsmith parse document.pdf

# Parse to file
pdfsmith parse document.pdf -o output.md

# Use specific backend
pdfsmith parse document.pdf -b docling

# List available backends
pdfsmith backends

Available Backends

Open Source

Backend	Weight	Best For
`docling`	heavy	Highest quality, complex documents
`marker`	heavy	Academic papers, LaTeX content
`pymupdf4llm`	medium	Good balance of speed and quality
`kreuzberg`	medium	Fast extraction with OCR
`unstructured`	medium	Versatile document processing
`pdfplumber`	light	Tables and structured data
`pymupdf`	light	Fast general-purpose extraction
`pypdf`	light	Lightweight, pure Python
`pdfminer`	light	Mature, handles encodings well
`pypdfium2`	light	Chrome's PDF engine
`extractous`	medium	Rust-based extraction

Commercial Cloud APIs

Backend	Provider	Cost	Best For
`aws_textract`	AWS	$1.50/1k pages	High-accuracy OCR
`azure_document_intelligence`	Azure	$1.50/1k pages	Enterprise documents
`google_document_ai`	Google Cloud	$1.50/1k pages	Multi-language support
`databricks`	Databricks	~$3/1k pages	SQL-based workflows
`llamaparse`	LlamaIndex	$0.003/page	Cost-effective API

Frontier LLMs

Backend	Model	Cost	Best For
`anthropic`	Claude Sonnet 4.5	~$0.04/page	High accuracy
`openai`	GPT-4o	~$0.02/page	General purpose
`gemini`	Gemini 2.0 Flash	~$0.001/page	Budget LLM option

Note: Frontier LLM backends require API keys set via environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY).

Choosing a Backend

Best quality: anthropic or openai - Frontier LLM accuracy (highest cost)
Best value: llamaparse - Near-LLM quality at 10x lower cost
Structure preservation: docling - Deep learning, GPU recommended
Academic papers: marker - Optimized for LaTeX/equations
Tables: pdfplumber - Excellent table detection
Speed: pymupdf or kreuzberg - Fast extraction
Minimal dependencies: pypdf - Pure Python, no binaries
Budget LLM: gemini with gemini-2.0-flash - Very low cost LLM option

System Dependencies

Some backends require system packages for OCR functionality:

Tesseract OCR (for kreuzberg and unstructured with OCR):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki

Without tesseract, these backends will still work for text-based PDFs but cannot extract text from scanned/image PDFs.

Async Support

from pdfsmith import parse_async

# Async parsing (uses backend's native async if available)
markdown = await parse_async("document.pdf")

Benchmarks

pdfsmith's backend wrappers were developed and refined through the pdf-bench benchmarking project, which evaluates parser performance across diverse document types.

License

MIT

Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
src/pdfsmith		src/pdfsmith
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
py.typed		py.typed
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pdfsmith

Why pdfsmith?

Installation

Quick Start

CLI Usage

Available Backends

Open Source

Commercial Cloud APIs

Frontier LLMs

Choosing a Backend

System Dependencies

Async Support

Benchmarks

License

Contributing

About

Uh oh!

Releases 1

Packages

Contributors 2

Uh oh!

Languages

applied-artificial-intelligence/pdfsmith

Folders and files

Latest commit

History

Repository files navigation

pdfsmith

Why pdfsmith?

Installation

Quick Start

CLI Usage

Available Backends

Open Source

Commercial Cloud APIs

Frontier LLMs

Choosing a Backend

System Dependencies

Async Support

Benchmarks

License

Contributing

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Uh oh!

Languages

Packages