Summary and motivation
Kreuzberg is a document intelligence framework that extracts text from PDFs, Office documents, images, and 75+ other file formats. All processing is performed locally with no external API calls.
A Haystack converter based on Kreuzberg would give users a single component that handles the widest range of document formats available today, without requiring cloud services or API keys. This addresses several recurring needs:
- Unified multi-format extraction: Instead of wiring together separate converters for PDF, DOCX, HTML, and images, a single `KreuzbergConverter` handles them all.
- Fully offline: Unlike cloud-based converters (Azure Document Intelligence, AWS Textract), Kreuzberg runs entirely locally. This is critical for air-gapped environments, sensitive data, and cost-conscious pipelines.
- Rich extraction features: Beyond raw text, Kreuzberg provides table extraction, image metadata, PDF annotations, quality scores, language detection, keyword extraction, and configurable OCR backends (Tesseract, EasyOCR).
- Batch processing: Kreuzberg's Rust-based rayon thread pool enables parallel extraction across multiple files, which is important for ingestion pipelines processing large document collections.
Kreuzberg is comparable in scope to Microsoft's Markitdown (#1248) but offers additional capabilities such as per-page splitting, built-in chunking, multiple output formats (plain text, Markdown, HTML), token reduction for LLM consumption, and OCR configuration.
Detailed design
Component: KreuzbergConverter
A Haystack @component that converts files into Document objects using Kreuzberg's extraction APIs.
Input sockets:
- `sources: list[str | Path | ByteStream]`: File paths, directory paths, or ByteStream objects.
- `meta: dict | list[dict] | None`: Optional metadata to attach to output Documents.
Output sockets:
- `documents: list[Document]`: Converted documents with text content and metadata.
- `raw_extraction: list[dict]`: Serialized `ExtractionResult` objects for debugging and advanced use cases.
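The `meta` socket accepts nothing, a single dictionary, or one dictionary per source. A minimal sketch of how such input might be normalized to one dictionary per source is shown here. The helper name `normalize_meta` is hypothetical, and the rule that a single dictionary applies to every source is an assumption based on how Haystack converters usually behave, not something this proposal states.

```python
from typing import Any, Optional, Union

Meta = dict[str, Any]


def normalize_meta(meta: Optional[Union[Meta, list[Meta]]], sources_count: int) -> list[Meta]:
    """Return one metadata dict per source (hypothetical helper, not the converter's API)."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Assumption: a single dict is copied onto every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(m) for m in meta]


per_source = normalize_meta({"project": "demo"}, sources_count=2)
print(per_source)
```

Copying each dictionary keeps later per-document additions (file name, page number, and so on) from leaking between documents.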
Constructor parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | `ExtractionConfig \| None` | `None` | Kreuzberg extraction configuration object. Controls output format, OCR, chunking, keyword extraction, and more. |
| `config_path` | `str \| Path \| None` | `None` | Path to a kreuzberg config file (TOML/YAML/JSON). When both `config` and `config_path` are given, `config` takes precedence. |
| `store_full_path` | `bool` | `False` | If `True`, store full file paths in metadata. If `False`, store only the file name. |
| `batch` | `bool` | `True` | Use kreuzberg's batch APIs for parallel extraction via Rust rayon thread pool. |
| `easyocr_kwargs` | `dict \| None` | `None` | Extra keyword arguments for EasyOCR (GPU, beam width, model storage, etc.). |
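Two of the constructor rules above are easy to get wrong: `config` wins over `config_path`, and `store_full_path` decides whether metadata keeps the whole path or only the file name. The following is a rough sketch of both rules, not the component's code. The function names, the `load_file` callback, and the `file_path` metadata key are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Optional, Union


def resolve_config(
    config: Optional[Any],
    config_path: Optional[Union[str, Path]],
    load_file: Callable[[Path], Any],
) -> Optional[Any]:
    """Pick the effective configuration; an explicit `config` wins over `config_path`."""
    if config is not None:
        return config
    if config_path is not None:
        # `load_file` stands in for whatever loader reads the TOML/YAML/JSON file.
        return load_file(Path(config_path))
    return None


def file_path_meta(source: Union[str, Path], store_full_path: bool) -> dict[str, str]:
    """Build a path metadata entry (the `file_path` key name is an assumption)."""
    path = Path(source)
    return {"file_path": str(path) if store_full_path else path.name}


print(resolve_config("explicit config", "kreuzberg.toml", lambda p: "config from file"))
print(file_path_meta("docs/report.pdf", store_full_path=False))
```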
Extraction modes
The converter supports three extraction modes depending on configuration:
- Default: One `Document` per source file, containing the full extracted text.
- Per-page: When `ExtractionConfig(pages=PageConfig(extract_pages=True))` is set, one `Document` per page with `page_number` in metadata.
- Chunking: When `ChunkingConfig` is provided, kreuzberg performs the chunking internally and emits one `Document` per chunk with `chunk_index` and `total_chunks` in metadata.
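To make the per-page and chunking modes concrete, here is a small sketch of the metadata each mode could attach. It uses plain dictionaries instead of Haystack `Document` objects so that it runs without any dependencies. The helper names are hypothetical, and the numbering bases (zero-based `chunk_index`, one-based `page_number`) are assumptions, since the proposal does not specify them.

```python
from typing import Any


def chunk_documents(chunks: list[str], base_meta: dict[str, Any]) -> list[dict[str, Any]]:
    """Emit one document-like dict per chunk with its position in the sequence."""
    total = len(chunks)
    return [
        {"content": text, "meta": {**base_meta, "chunk_index": index, "total_chunks": total}}
        for index, text in enumerate(chunks)  # assumption: zero-based chunk_index
    ]


def page_documents(pages: list[str], base_meta: dict[str, Any]) -> list[dict[str, Any]]:
    """Emit one document-like dict per page, numbering pages from 1 (an assumption)."""
    return [
        {"content": text, "meta": {**base_meta, "page_number": number}}
        for number, text in enumerate(pages, start=1)
    ]


docs = chunk_documents(["first chunk", "second chunk"], {"file_path": "report.pdf"})
print(docs[1]["meta"])
```

Carrying `total_chunks` on every chunk lets downstream components detect incomplete ingestion without looking at sibling documents.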
Metadata
Each Document includes rich metadata extracted by kreuzberg: mime_type, file_extensions, output_format, quality_score, detected_languages, tables, images, annotations, extracted_keywords, and format-specific fields (e.g., PDF title, author, page count).
Serialization
The component is fully serializable via Haystack's to_dict / from_dict protocol. ExtractionConfig is serialized using kreuzberg's config_to_json utility; config_path is stored as a POSIX string for cross-platform compatibility.
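The POSIX-string rule for `config_path` can be illustrated with a dependency-free sketch. The class `ConverterSketch` is a hypothetical stand-in that only mimics the `type` / `init_parameters` dictionary shape Haystack uses. Serialization of `ExtractionConfig` is left out because the signature of `config_to_json` is not shown in this proposal.

```python
from pathlib import PurePath, PureWindowsPath
from typing import Any, Optional, Union


class ConverterSketch:
    """Hypothetical stand-in for the converter; only the serialization shape is shown."""

    def __init__(
        self,
        config_path: Optional[Union[str, PurePath]] = None,
        store_full_path: bool = False,
        batch: bool = True,
    ) -> None:
        self.config_path = config_path
        self.store_full_path = store_full_path
        self.batch = batch

    def _config_path_as_posix(self) -> Optional[str]:
        if self.config_path is None:
            return None
        path = self.config_path
        # Forward slashes survive a round trip between Windows and POSIX hosts.
        return path.as_posix() if isinstance(path, PurePath) else PurePath(path).as_posix()

    def to_dict(self) -> dict[str, Any]:
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "config_path": self._config_path_as_posix(),
                "store_full_path": self.store_full_path,
                "batch": self.batch,
            },
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ConverterSketch":
        return cls(**data["init_parameters"])


original = ConverterSketch(config_path=PureWindowsPath("configs\\kreuzberg.toml"))
data = original.to_dict()
restored = ConverterSketch.from_dict(data)
print(data["init_parameters"]["config_path"])
```

A Windows-style path serialized this way can be loaded on Linux or macOS without any separator handling in `from_dict`.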
Supported formats
75+ file formats organized by category:
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB |
| Spreadsheets | XLSX, XLS, ODS, CSV, TSV |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPEG, TIFF, BMP, WebP (via OCR) |
| Web | HTML, XHTML, XML, Markdown |
| Email | EML, MSG |
| Code | Plain text, source code files |
| Archives | Extracts from contained documents |
Usage example
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter.documents", "cleaner")
pipeline.connect("cleaner", "writer")
pipeline.run({"converter": {"sources": ["report.pdf", "notes.docx"]}})

Implementation PR: #2927