Introducing doc2md: High-Fidelity Document-to-Markdown Pipeline #1

neoncapy (Maintainer) · 2026-02-22T20:49:40Z

The Problem

If you use Claude Code (or any LLM) to work with PDFs, you have probably hit this wall: Anthropic's copyright filter blocks most PDF content. Your agent tries to read a PDF, gets a partial or empty result, and your workflow breaks.

Even when PDFs do load, raw text extraction is lossy: tables collapse, images are dropped, headings lose their hierarchy, and scientific notation gets mangled.

What doc2md Does

doc2md is a 15,000-line Python pipeline that converts PDF, DOCX, and PPTX files into high-fidelity Markdown with full image extraction and multi-stage quality control.

Architecture: 2-Tier Design

Python tier (zero LLM tokens): Extracts text, tables, and images using multiple extractors. Cross-validates output. Catches problems before any AI touches it.
Claude vision tier: 8 expert personas analyze extracted images with document-aware context. A statistician sees a forest plot differently than a visualization critic.
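To make the cross-validation idea in the Python tier concrete, here is a minimal sketch of comparing the text produced by two extractors for the same page. It is not doc2md's actual code: the function names, the word-level comparison, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import difflib
import re
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase and split into word tokens so layout differences are ignored."""
    return re.findall(r"\w+", text.lower())


def agreement(primary: str, secondary: str) -> float:
    """Return a 0..1 similarity ratio between two extractors' word sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalize(primary), normalize(secondary), autojunk=False
    )
    return matcher.ratio()


def cross_validate(primary: str, secondary: str, threshold: float = 0.9) -> Dict:
    """Flag a page for review when the extractors disagree beyond the threshold.

    The threshold is a hypothetical default, not a value taken from doc2md.
    """
    score = agreement(primary, secondary)
    return {"score": score, "needs_review": score < threshold}
```

In practice `primary` would be the default extractor's output for a page and `secondary` the cross-validation extractor's output. A low score is a cheap signal that something like a collapsed table or dropped paragraph deserves a closer look, and it costs no LLM tokens.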

Key Features

Multi-extractor support: pymupdf4llm (default), pdfplumber (cross-validation), MinerU (complex layouts with tables/figures). Automatic fallback when one extractor struggles.
Per-image classification: 8 heuristics classify every image as substantive or decorative, including file size, pixel variance, aspect ratio, color count, vector content detection, journal branding patterns, and near-black detection.
Multi-stage QC pipeline: Structural QC catches table collapse, missing headings, and dropped content. Content fidelity checks compare the Markdown against the source document. The pipeline repeats until no issues remain.
Claude Code integration: A PreToolUse hook intercepts PDF/DOCX/PPTX reads and redirects to converted Markdown. A conversion registry tracks every file by SHA-256 hash. A skill file provides the full pipeline as a slash command.
Office format support: DOCX and PPTX get the same treatment, including image deduplication, blank detection (WMF conversion failures), TIFF-to-PNG conversion, and chart/SmartArt rendering via LibreOffice fallback.
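The conversion registry mentioned above can be pictured with a small sketch. This is an illustration under stated assumptions, not doc2md's implementation: the JSON file layout and the names `record_conversion` and `lookup` are hypothetical. The only part taken from the description is that files are tracked by SHA-256 hash.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large documents are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_registry(registry_path: Path) -> Dict[str, Dict[str, str]]:
    """Read the registry JSON, or return an empty registry if none exists yet."""
    if registry_path.exists():
        return json.loads(registry_path.read_text(encoding="utf-8"))
    return {}


def record_conversion(registry_path: Path, source: Path, markdown: Path) -> None:
    """Store the Markdown path under the source file's content hash."""
    registry = load_registry(registry_path)
    registry[sha256_of(source)] = {"source": str(source), "markdown": str(markdown)}
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")


def lookup(registry_path: Path, source: Path) -> Optional[str]:
    """Return the converted Markdown path, or None if this exact content is unknown."""
    entry = load_registry(registry_path).get(sha256_of(source))
    return entry["markdown"] if entry else None
```

Keying on the content hash rather than the file name means a renamed file still resolves to its existing conversion, while an edited file misses the lookup and is converted again.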

Battle-Tested

This pipeline has processed hundreds of documents across academic papers, pharmaceutical submissions, regulatory documents, and corporate presentations. Every bug surfaced in production became a fix in the codebase.

Who Is This For?

Claude Code users who need to work with PDFs, DOCX, or PPTX files
Researchers converting academic papers to Markdown for LLM processing
Anyone building document processing pipelines who wants extraction + QC out of the box

Getting Started

See the README for installation and usage.

Feedback Welcome

This project grew out of real production needs. I am curious what use cases others have, what document types cause problems, and what features would be most valuable. Open an issue or reply here.
