Introducing doc2md: High-Fidelity Document-to-Markdown Pipeline #1
neoncapy started this conversation in Show and tell
The Problem
If you use Claude Code (or any LLM) to work with PDFs, you have probably hit this wall: Anthropic's copyright filter blocks most PDF content. Your agent tries to read a PDF, gets a partial or empty result, and your workflow breaks.
Even when PDFs do load, raw text extraction is lossy: tables collapse, images are ignored, headings lose their hierarchy, and scientific notation gets mangled.
What doc2md Does
doc2md is a 15,000-line Python pipeline that converts PDF, DOCX, and PPTX files into high-fidelity Markdown with full image extraction and multi-stage quality control.
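To make the "one pipeline, three input formats" idea concrete, here is a minimal sketch of how a converter might dispatch on file extension. It is not doc2md's actual code: the names `convert_pdf`, `convert_docx`, `convert_pptx`, `CONVERTERS`, and `to_markdown` are hypothetical, and the converter bodies are stubs standing in for real extraction and quality-control stages.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical stubs: the real doc2md entry points are not shown in this post.
def convert_pdf(path: Path) -> str:
    return f"# {path.stem}\n\n(converted from PDF)"


def convert_docx(path: Path) -> str:
    return f"# {path.stem}\n\n(converted from DOCX)"


def convert_pptx(path: Path) -> str:
    return f"# {path.stem}\n\n(converted from PPTX)"


# Map each supported extension to its converter (assumed layout).
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".pdf": convert_pdf,
    ".docx": convert_docx,
    ".pptx": convert_pptx,
}


def to_markdown(path: Path) -> str:
    """Pick a converter by file extension, case-insensitively."""
    suffix = path.suffix.lower()
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return converter(path)


if __name__ == "__main__":
    print(to_markdown(Path("Quarterly Report.PDF")))
```

Keeping the dispatch table separate from the converters means a new format would only need one new function and one new table entry; for the actual interface, see the README.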
Architecture: 2-Tier Design
Key Features
Battle-Tested
This pipeline has processed hundreds of documents across academic papers, pharmaceutical submissions, regulatory documents, and corporate presentations. Every bug surfaced in production became a fix in the codebase.
Who Is This For?
Getting Started
See the README for installation and usage.
Feedback Welcome
This project grew out of real production needs. I am curious what use cases others have, what document types cause problems, and what features would be most valuable. Open an issue or reply here.