Implement two-tier Docling parsing for faster ingestion by ArnavAgrawal03 · Pull Request #388 · morphik-org/morphik-core

ArnavAgrawal03 · 2026-02-19T01:09:28Z

Summary

Implement a two-tier document parsing strategy to resolve the significant ingestion slowdown caused by always-on OCR. The parser now tries fast text-layer extraction first (no OCR, no table detection), and only falls back to the full OCR pipeline if no text is found, which is exactly when it's needed (scanned/image PDFs).

Impact

For text-based PDFs (the majority), parsing time drops from minutes back to seconds, matching the old unstructured library behavior. Scanned documents still get OCR'd automatically. This should reduce typical ingestion times from 2+ hours back to minutes for batch uploads.

Changes

Split Docling converter into two cached instances: fast (no OCR) and full (OCR+tables)
_parse_document_local() now tries fast first, returns immediately if text is found
Falls back to full OCR only when needed, with clear logging of the strategy used
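
The changes above can be sketched in a few lines. This is a minimal illustration rather than the code in this PR: `FakePdf`, `StubConverter`, `get_converter`, and `parse_two_tier` are hypothetical names, and the stub only mimics a converter so the control flow can run without the parsing library. In Docling, the two configurations would presumably be expressed through PDF pipeline options such as `do_ocr` and `do_table_structure`; that mapping is an assumption and is not taken from the diff.

```python
import logging
from dataclasses import dataclass
from functools import lru_cache

logger = logging.getLogger("two_tier_sketch")


@dataclass(frozen=True)
class FakePdf:
    """Hypothetical stand-in for a PDF: an embedded text layer plus
    text that is only visible in the page images."""
    text_layer: str
    image_text: str


class StubConverter:
    """Hypothetical converter; a real one would wrap the parsing library."""

    created = 0  # counts constructions so the caching can be observed

    def __init__(self, ocr: bool) -> None:
        StubConverter.created += 1
        self.ocr = ocr

    def convert(self, pdf: FakePdf) -> str:
        # Without OCR only the embedded text layer is visible.
        if not self.ocr:
            return pdf.text_layer
        # With OCR, text rendered in the page images is recovered as well.
        parts = (pdf.text_layer, pdf.image_text)
        return "\n".join(part for part in parts if part)


@lru_cache(maxsize=None)
def get_converter(ocr: bool) -> StubConverter:
    """Build each converter once and reuse it on later calls."""
    return StubConverter(ocr=ocr)


def parse_two_tier(pdf: FakePdf) -> tuple[str, str]:
    """Return (text, strategy), trying the cheap path first."""
    text = get_converter(False).convert(pdf)
    if text.strip():
        logger.info("fast path: text layer found, skipping OCR")
        return text, "fast"
    logger.info("fast path found no text, falling back to full OCR")
    return get_converter(True).convert(pdf), "full"
```

Because the full converter is only built on the first fallback, a batch of purely text-based PDFs never pays its setup cost in this sketch. The "no text" check here is a simple whitespace test; a real implementation may want a stricter threshold.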

🤖 Generated with Claude Code

…tion

Replace always-on OCR with a fast-first approach: try text-layer extraction without OCR/table detection first (seconds), then fall back to full OCR only if no text is found (scanned/image PDFs). This restores the performance of the old unstructured parser while maintaining quality for documents that need it.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

ArnavAgrawal03 wants to merge 1 commit into main from arnav/investigate-ingestion-slowdown
