Add Kreuzberg document converter integration #2926

## Summary and motivation

[Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) is a document intelligence framework that extracts text from PDFs, Office documents, images, and 75+ other file formats. All processing is performed locally with no external API calls.

A Haystack converter based on Kreuzberg would give users a single component that covers a very broad range of document formats without requiring cloud services or API keys. This addresses several recurring needs:

- **Unified multi-format extraction** — Instead of wiring together separate converters for PDF, DOCX, HTML, and images, a single `KreuzbergConverter` handles them all.
- **Fully offline** — Unlike cloud-based converters (Azure Document Intelligence, AWS Textract), Kreuzberg runs entirely locally. This is critical for air-gapped environments, sensitive data, and cost-conscious pipelines.
- **Rich extraction features** — Beyond raw text, Kreuzberg provides table extraction, image metadata, PDF annotations, quality scores, language detection, keyword extraction, and configurable OCR backends (Tesseract, EasyOCR).
- **Batch processing** — Kreuzberg's Rust-based rayon thread pool enables parallel extraction across multiple files, which is important for ingestion pipelines processing large document collections.

Kreuzberg is comparable in scope to Microsoft's MarkItDown (#1248) but offers additional capabilities such as per-page splitting, built-in chunking, multiple output formats (plain text, Markdown, HTML), token reduction for LLM consumption, and OCR configuration.

## Detailed design

### Component: `KreuzbergConverter`

A Haystack `@component` that converts files into `Document` objects using Kreuzberg's extraction APIs.

**Input sockets:**
- `sources: list[str | Path | ByteStream]` — File paths, directory paths, or ByteStream objects.
- `meta: dict | list[dict] | None` — Optional metadata to attach to output Documents.

**Output sockets:**
- `documents: list[Document]` — Converted documents with text content and metadata.
- `raw_extraction: list[dict]` — Serialized `ExtractionResult` objects for debugging and advanced use cases.
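
The `meta` socket accepts either one dict or a list of dicts. A minimal sketch of how that input could be normalized to one dict per source is shown here. The helper name `normalize_meta` and the length check are assumptions for illustration, not the integration's actual code.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None, n_sources: int
) -> list[dict[str, Any]]:
    """Return one metadata dict per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dict is copied so that each source gets an independent dict.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        raise ValueError("meta list must have the same length as sources")
    return [dict(m) for m in meta]


per_source = normalize_meta({"project": "demo"}, n_sources=2)
print(per_source)
```

Copying the dict per source avoids one document's metadata update leaking into another's.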

**Constructor parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | `ExtractionConfig \| None` | `None` | Kreuzberg extraction configuration object. Controls output format, OCR, chunking, keyword extraction, and more. |
| `config_path` | `str \| Path \| None` | `None` | Path to a Kreuzberg config file (TOML/YAML/JSON). When both `config` and `config_path` are given, `config` takes precedence. |
| `store_full_path` | `bool` | `False` | If `True`, store full file paths in metadata. If `False`, store only the file name. |
| `batch` | `bool` | `True` | Use Kreuzberg's batch APIs for parallel extraction via its Rust-based rayon thread pool. |
| `easyocr_kwargs` | `dict \| None` | `None` | Extra keyword arguments for EasyOCR (GPU, beam width, model storage, etc.). |
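
Two rules from the table can be illustrated with a small, dependency-free sketch: `config` takes precedence over `config_path`, and `store_full_path` decides whether the full path or only the file name is kept. The helper names `pick_config_source` and `stored_path` are hypothetical, and the real component may structure this differently.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import Any


def pick_config_source(config: Any | None, config_path: str | None) -> str:
    """Say which configuration source would be used (hypothetical helper)."""
    if config is not None:
        return "config"  # an explicit object wins over a file path
    if config_path is not None:
        return "config_path"
    return "defaults"


def stored_path(source: str, store_full_path: bool) -> str:
    """Return the path string that would end up in metadata (hypothetical helper)."""
    path = PurePosixPath(source)
    return str(path) if store_full_path else path.name


print(pick_config_source(config=object(), config_path="kreuzberg.toml"))
print(stored_path("/data/reports/q3.pdf", store_full_path=False))
```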

### Extraction modes

The converter supports three extraction modes depending on configuration:

1. **Default** — One `Document` per source file, containing the full extracted text.
2. **Per-page** — When `ExtractionConfig(pages=PageConfig(extract_pages=True))` is set, one `Document` per page with `page_number` in metadata.
3. **Chunking** — When `ChunkingConfig` is provided, Kreuzberg performs the chunking itself during local extraction and the converter emits one `Document` per chunk with `chunk_index` and `total_chunks` in metadata.
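
The three modes above can be sketched as a single mapping from an extraction result to documents. This is only an illustration under stated assumptions: the result is modeled as a plain dict with `content`, `pages`, and `chunks` keys, `Doc` stands in for Haystack's `Document`, chunk indices are taken to be zero-based, page numbers one-based, and chunks are given priority over pages. None of these details is confirmed by the proposal.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Stand-in for haystack.Document so the sketch has no dependencies."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def to_documents(result: dict[str, Any], base_meta: dict[str, Any]) -> list[Doc]:
    """Map one extraction result to documents (hypothetical result shape)."""
    chunks = result.get("chunks") or []
    pages = result.get("pages") or []
    if chunks:
        # Chunking mode: one document per chunk (assumed zero-based index).
        total = len(chunks)
        return [
            Doc(text, {**base_meta, "chunk_index": i, "total_chunks": total})
            for i, text in enumerate(chunks)
        ]
    if pages:
        # Per-page mode: one document per page (assumed one-based numbering).
        return [
            Doc(text, {**base_meta, "page_number": n})
            for n, text in enumerate(pages, start=1)
        ]
    # Default mode: a single document holding the full text.
    return [Doc(result["content"], dict(base_meta))]


docs = to_documents({"content": "p1 p2", "pages": ["p1", "p2"]}, {"source": "report.pdf"})
print([d.meta for d in docs])
```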

### Metadata

Each `Document` includes rich metadata extracted by Kreuzberg: `mime_type`, `file_extensions`, `output_format`, `quality_score`, `detected_languages`, `tables`, `images`, `annotations`, `extracted_keywords`, and format-specific fields (e.g., PDF title, author, page count).
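
One use of these fields is filtering documents downstream before indexing. The sketch below works on plain metadata dicts. The language code `"eng"`, the threshold `0.5`, and the assumption that `quality_score` is a float where higher means better are all illustrative and not taken from Kreuzberg's documentation.

```python
from __future__ import annotations

from typing import Any


def keep_document(meta: dict[str, Any], language: str, min_quality: float) -> bool:
    """Decide whether a document passes a language and quality filter (hypothetical helper)."""
    languages = meta.get("detected_languages") or []
    score = meta.get("quality_score")
    if language not in languages:
        return False
    # Documents without a score are kept rather than silently dropped.
    return score is None or score >= min_quality


metas = [
    {"detected_languages": ["eng"], "quality_score": 0.9},
    {"detected_languages": ["deu"], "quality_score": 0.9},
    {"detected_languages": ["eng"], "quality_score": 0.2},
    {"detected_languages": ["eng"]},
]
kept = [m for m in metas if keep_document(m, language="eng", min_quality=0.5)]
print(len(kept))
```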

### Serialization

The component is fully serializable via Haystack's `to_dict` / `from_dict` protocol. `ExtractionConfig` is serialized using Kreuzberg's `config_to_json` utility; `config_path` is stored as a POSIX string for cross-platform compatibility.
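
To illustrate the `config_path` handling, here is a minimal stand-in class that round-trips through a `type` / `init_parameters` dictionary of the shape Haystack uses for component serialization. `ConverterSketch` is hypothetical. It leaves out the `config` object, since the exact signature of `config_to_json` is not given here.

```python
from __future__ import annotations

import json
from pathlib import PurePath, PureWindowsPath
from typing import Any


class ConverterSketch:
    """Toy stand-in for the converter, covering only path serialization."""

    def __init__(self, config_path: str | PurePath | None = None, batch: bool = True):
        self.config_path = config_path
        self.batch = batch

    def to_dict(self) -> dict[str, Any]:
        path = self.config_path
        if path is not None:
            # Store forward slashes so the result does not depend on the OS.
            pure = path if isinstance(path, PurePath) else PurePath(path)
            path = pure.as_posix()
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"config_path": path, "batch": self.batch},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ConverterSketch":
        return cls(**data["init_parameters"])


original = ConverterSketch(config_path=PureWindowsPath(r"configs\kreuzberg.toml"))
data = original.to_dict()
restored = ConverterSketch.from_dict(json.loads(json.dumps(data)))
print(restored.config_path)
```

Because the path is stored as a plain string, the dictionary can be written to YAML or JSON without custom encoders.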

### Supported formats

75+ file formats organized by category:

| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF, EPUB |
| Spreadsheets | XLSX, XLS, ODS, CSV, TSV |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPEG, TIFF, BMP, WebP (via OCR) |
| Web | HTML, XHTML, XML, Markdown |
| Email | EML, MSG |
| Code | Plain text, source code files |
| Archives | Documents contained in archive files are extracted |

### Usage example

```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter.documents", "cleaner")
pipeline.connect("cleaner", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "notes.docx"]}})
```

---

**Implementation PR:** #2927
