
Implement two-tier Docling parsing for faster ingestion #388

Open

ArnavAgrawal03 wants to merge 1 commit into main from arnav/investigate-ingestion-slowdown

Conversation

@ArnavAgrawal03
Collaborator

Summary

Implement a two-tier document parsing strategy to resolve the significant ingestion slowdown caused by always-on OCR. The parser now tries fast text-layer extraction first (no OCR, no table detection), and only falls back to the full OCR pipeline if no text is found, which is exactly when it's needed (scanned/image PDFs).
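For illustration, a minimal sketch of what the two cached converter instances could look like with Docling's `PdfPipelineOptions`; helper names like `get_fast_converter` are placeholders, not the actual code in this PR:

```python
from functools import lru_cache

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


@lru_cache(maxsize=1)
def get_fast_converter() -> DocumentConverter:
    # Tier 1: text-layer extraction only -- no OCR, no table-structure model.
    opts = PdfPipelineOptions(do_ocr=False, do_table_structure=False)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )


@lru_cache(maxsize=1)
def get_full_converter() -> DocumentConverter:
    # Tier 2: full pipeline with OCR and table detection, for scanned/image PDFs.
    opts = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
```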

Impact

For text-based PDFs (the majority), parsing time drops from minutes back to seconds, matching the behavior of the previous `unstructured`-based parser. Scanned documents are still OCR'd automatically. For batch uploads, this should reduce typical ingestion times from 2+ hours back to minutes.

Changes

  • Split Docling converter into two cached instances: fast (no OCR) and full (OCR+tables)
  • _parse_document_local() now tries fast first, returns immediately if text is found
  • Falls back to full OCR only when needed, with clear logging of the strategy used (see the sketch after this list)
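
A hedged sketch of the fallback logic, reusing the hypothetical converter helpers above; the actual `_parse_document_local()` in the codebase likely differs in signature and return type:

```python
import logging

logger = logging.getLogger(__name__)


def _parse_document_local(path: str) -> str:
    # Tier 1: fast text-layer extraction (no OCR, no table detection).
    result = get_fast_converter().convert(path)
    text = result.document.export_to_markdown()
    if text.strip():
        logger.info("Parsed %s via fast text-layer extraction", path)
        return text

    # Tier 2: no text layer found (likely a scanned/image PDF),
    # so fall back to the full OCR + table pipeline.
    logger.info("No text layer in %s; falling back to full OCR", path)
    result = get_full_converter().convert(path)
    return result.document.export_to_markdown()
```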

🤖 Generated with Claude Code

Implement two-tier Docling parsing for faster ingestion

Replace always-on OCR with a fast-first approach: try text-layer extraction
without OCR/table detection first (seconds), then fall back to full OCR only
if no text is found (scanned/image PDFs). This restores the performance of
the old unstructured parser while maintaining quality for documents that need it.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
