Add Kreuzberg document converter integration #2926

@v-tan

Summary and motivation

Kreuzberg is a document intelligence framework that extracts text from 75+ file formats, including PDFs, Office documents, and images. All processing is performed locally with no external API calls.

A Haystack converter based on Kreuzberg would give users a single component that covers 75+ file formats without requiring cloud services or API keys. This addresses several recurring needs:

  • Unified multi-format extraction — Instead of wiring together separate converters for PDF, DOCX, HTML, and images, a single KreuzbergConverter handles them all.
  • Fully offline — Unlike cloud-based converters (Azure Document Intelligence, AWS Textract), Kreuzberg runs entirely locally. This is critical for air-gapped environments, sensitive data, and cost-conscious pipelines.
  • Rich extraction features — Beyond raw text, Kreuzberg provides table extraction, image metadata, PDF annotations, quality scores, language detection, keyword extraction, and configurable OCR backends (Tesseract, EasyOCR).
  • Batch processing — Kreuzberg's Rust-based rayon thread pool enables parallel extraction across multiple files, which is important for ingestion pipelines processing large document collections.

Kreuzberg is comparable in scope to Microsoft's MarkItDown (#1248) but offers additional capabilities such as per-page splitting, built-in chunking, multiple output formats (plain text, Markdown, HTML), token reduction for LLM consumption, and OCR configuration.

Detailed design

Component: KreuzbergConverter

A Haystack @component that converts files into Document objects using Kreuzberg's extraction APIs.

Input sockets:

  • sources: list[str | Path | ByteStream] — File paths, directory paths, or ByteStream objects.
  • meta: dict | list[dict] | None — Optional metadata to attach to output Documents.
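
How the two input sockets might be normalized before extraction can be sketched with the standard library alone. This is an illustration, not the converter's actual code: the helper names expand_sources and normalize_meta are hypothetical, ByteStream handling is left out, and the choices to walk directories recursively in sorted order and to copy a single meta dict to every source are assumptions.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def expand_sources(sources: list[str | Path]) -> list[Path]:
    """Replace each directory with the files beneath it; keep file paths as-is."""
    expanded: list[Path] = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # Assumption: recurse into subdirectories and sort for a stable order.
            expanded.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        else:
            expanded.append(path)
    return expanded


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None, count: int
) -> list[dict[str, Any]]:
    """Return exactly one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(count)]
    if isinstance(meta, dict):
        # Copy the single dict so each document owns its own metadata.
        return [dict(meta) for _ in range(count)]
    if len(meta) != count:
        raise ValueError("meta list length must match the number of sources")
    return [dict(m) for m in meta]
```

One open design question this surfaces: once a directory expands into several files, a list-valued meta no longer lines up one-to-one with sources, so the real component would need to define which count the list is validated against.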

Output sockets:

  • documents: list[Document] — Converted documents with text content and metadata.
  • raw_extraction: list[dict] — Serialized ExtractionResult objects for debugging and advanced use cases.

Constructor parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | ExtractionConfig \| None | None | Kreuzberg extraction configuration object. Controls output format, OCR, chunking, keyword extraction, and more. |
| config_path | str \| Path \| None | None | Path to a kreuzberg config file (TOML/YAML/JSON). When both config and config_path are given, config takes precedence. |
| store_full_path | bool | False | If True, store full file paths in metadata. If False, store only the file name. |
| batch | bool | True | Use kreuzberg's batch APIs for parallel extraction via Rust rayon thread pool. |
| easyocr_kwargs | dict \| None | None | Extra keyword arguments for EasyOCR (GPU, beam width, model storage, etc.). |
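
Two of the rules in the table, config taking precedence over config_path and store_full_path controlling what is recorded in metadata, can be sketched as below. The function names are hypothetical, load_file is an injected placeholder for whatever loader kreuzberg provides (no kreuzberg API is called here), and the assumption that the stored value is the source path as a string is mine.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def resolve_config(
    config: Any | None,
    config_path: str | Path | None,
    load_file: Callable[[Path], Any],
) -> Any | None:
    """Pick the effective configuration: an explicit config object wins."""
    if config is not None:
        return config
    if config_path is not None:
        # Only consult the file when no config object was supplied.
        return load_file(Path(config_path))
    return None


def file_path_meta(source: str | Path, store_full_path: bool) -> str:
    """Return the full path, or only the file name when store_full_path is False."""
    return str(source) if store_full_path else Path(source).name
```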

Extraction modes

The converter supports three extraction modes depending on configuration:

  1. Default: One Document per source file, containing the full extracted text.
  2. Per-page: When ExtractionConfig(pages=PageConfig(extract_pages=True)) is set, one Document per page with page_number in metadata.
  3. Chunking: When ChunkingConfig is provided, kreuzberg performs the chunking itself, in-process like the rest of the extraction, and the converter emits one Document per chunk with chunk_index and total_chunks in metadata.
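
The mapping from one extraction result to output documents in the three modes might look roughly like this. FakeResult and Doc are hypothetical stand-ins for kreuzberg's result object and Haystack's Document. The assumptions that chunking wins when both pages and chunks are present, that page numbers are 1-based, and that chunk indices are 0-based are mine, not confirmed by the proposal.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeResult:
    """Hypothetical stand-in for a single extraction result."""
    content: str
    pages: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def to_documents(result: FakeResult) -> list[Doc]:
    """Emit one Doc per chunk, else one per page, else one for the whole file."""
    if result.chunks:
        total = len(result.chunks)
        return [
            Doc(chunk, {"chunk_index": i, "total_chunks": total})
            for i, chunk in enumerate(result.chunks)
        ]
    if result.pages:
        return [
            Doc(page, {"page_number": n})
            for n, page in enumerate(result.pages, start=1)
        ]
    return [Doc(result.content)]
```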

Metadata

Each Document includes rich metadata extracted by kreuzberg: mime_type, file_extensions, output_format, quality_score, detected_languages, tables, images, annotations, extracted_keywords, and format-specific fields (e.g., PDF title, author, page count).
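
As an illustration of how downstream code could use these fields, the sketch below filters documents on quality_score and detected_languages. Documents are modelled as plain dicts to keep the example self-contained. The function name, the 0.5 threshold, the assumption that a higher score means better quality, and the assumption that detected_languages is a list of language codes are all hypothetical.

```python
from __future__ import annotations

from typing import Any


def filter_documents(
    docs: list[dict[str, Any]],
    min_quality: float = 0.5,
    language: str | None = None,
) -> list[dict[str, Any]]:
    """Keep documents that meet the quality threshold and, optionally, a language."""
    kept = []
    for doc in docs:
        meta = doc.get("meta", {})
        score = meta.get("quality_score")
        # Documents without a score are kept: a missing score is not a low score.
        if score is not None and score < min_quality:
            continue
        if language is not None and language not in (meta.get("detected_languages") or []):
            continue
        kept.append(doc)
    return kept
```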

Serialization

The component is fully serializable via Haystack's to_dict / from_dict protocol. ExtractionConfig is serialized using kreuzberg's config_to_json utility; config_path is stored as a POSIX string for cross-platform compatibility.
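
The round trip described above can be sketched without Haystack or kreuzberg installed. ConverterSketch is a hypothetical class: a plain dict stands in for ExtractionConfig, json.dumps and json.loads stand in for kreuzberg's config serialization, and the {"type", "init_parameters"} layout follows the shape Haystack components conventionally serialize to.

```python
from __future__ import annotations

import json
from pathlib import PurePath
from typing import Any


class ConverterSketch:
    """Hypothetical stand-in showing the serialization shape only."""

    def __init__(
        self,
        config: dict[str, Any] | None = None,
        config_path: str | PurePath | None = None,
        store_full_path: bool = False,
        batch: bool = True,
    ) -> None:
        self.config = config
        self.config_path = config_path
        self.store_full_path = store_full_path
        self.batch = batch

    def to_dict(self) -> dict[str, Any]:
        path = self.config_path
        if path is not None and not isinstance(path, PurePath):
            path = PurePath(path)
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                # The config object is stored as a JSON string.
                "config": json.dumps(self.config) if self.config is not None else None,
                # Forward slashes regardless of the platform that serialized it.
                "config_path": path.as_posix() if path is not None else None,
                "store_full_path": self.store_full_path,
                "batch": self.batch,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ConverterSketch":
        params = dict(data["init_parameters"])
        if params.get("config") is not None:
            params["config"] = json.loads(params["config"])
        return cls(**params)
```

Storing the path with as_posix() means a pipeline serialized on Windows (conf\kreuzberg.toml) is written as conf/kreuzberg.toml, which both Windows and POSIX systems can read back.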

Supported formats

75+ file formats organized by category:

| Category | Formats |
| --- | --- |
| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB |
| Spreadsheets | XLSX, XLS, ODS, CSV, TSV |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPEG, TIFF, BMP, WebP (via OCR) |
| Web | HTML, XHTML, XML, Markdown |
| Email | EML, MSG |
| Code | Plain text, source code files |
| Archives | Extracts text from the documents they contain |

Usage example

```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter.documents", "cleaner")
pipeline.connect("cleaner", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "notes.docx"]}})
```

Implementation PR: #2927

Status: In Progress
