feat: add JSON table serializer for embedding-friendly output#637

Open
slothfrog wants to merge 6 commits into docling-project:main from slothfrog:feature/json_table_serialization

Conversation

@slothfrog

Add JSON Table Serializer for Embedding-Friendly Output

Issue #3082

Summary

Implements JsonTableSerializer, which converts tables into structured JSON so that they are better suited to embedding models and RAG systems. Only tables are serialized as JSON; all other document elements remain in Markdown/text format.

Key Features

  • Structured JSON Format: Automatically detects table structure (top-only vs. top+left headers)
  • Table Validation: Filters low-quality tables based on configurable thresholds
  • Metadata Detection: Detects table titles and descriptions using simple pattern matching
  • Natural Language Conversion: Optional text format for embeddings
  • Chunker Integration: Compatible with HierarchicalChunker and HybridChunker
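
To make the validation feature more concrete, here is a minimal sketch of threshold-based filtering. The names `TableValidationConfig`, `min_rows`, `min_cols`, and `max_empty_ratio` are hypothetical and chosen for illustration; the actual option names and defaults in this PR may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

Row = List[Optional[str]]


@dataclass
class TableValidationConfig:
    # Hypothetical thresholds; names and defaults are illustrative only.
    min_rows: int = 2
    min_cols: int = 2
    max_empty_ratio: float = 0.5


def is_empty(cell: Optional[str]) -> bool:
    """A cell counts as empty if it is None or whitespace-only."""
    return cell is None or cell.strip() == ""


def is_valid_table(rows: List[Row], cfg: TableValidationConfig) -> bool:
    """Return True if the table meets all configured thresholds."""
    if len(rows) < cfg.min_rows:
        return False
    n_cols = max((len(r) for r in rows), default=0)
    if n_cols < cfg.min_cols:
        return False
    total = len(rows) * n_cols
    if total == 0:
        return False
    # Cells missing from short rows are treated as empty.
    filled = sum(1 for r in rows for c in r if not is_empty(c))
    empty_ratio = 1 - filled / total
    return empty_ratio <= cfg.max_empty_ratio
```

A table that fails the check would be skipped or handed to a fallback serializer; which of the two happens is a design decision this sketch leaves open.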

Usage

from docling_core.transforms.serializer.json_table import JsonTableSerializer
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

# Use with MarkdownDocSerializer (doc is an existing DoclingDocument)
serializer = MarkdownDocSerializer(doc=doc)
serializer.table_serializer = JsonTableSerializer()

result = serializer.serialize()
# result.text: Markdown text with tables rendered as JSON

Implementation Notes

Differences from kalle07's Workaround

While inspired by @kalle07's workaround code, this implementation differs in several ways:

  1. Architecture: Proper serializer class hierarchy following docling-core patterns, not a standalone script
  2. Metadata Detection: Uses a simple pattern-matching approach (first/last row with exactly one non-empty cell) rather than attempting full merged-cell reconstruction
    • This is intentional: export_to_dataframe() does not preserve merged-cell semantics
    • Documented as pattern-based detection, not guaranteed metadata extraction
  3. Header Detection: Enhanced logic with configurable single-character detection (empty, None, or single chars like /, x)
  4. JSON Structure: Added table_index and header_type fields for better context
  5. Additional Features: Natural language conversion, simple JSON format, comprehensive validation options
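
As an illustration of item 3, the following sketch classifies a table by inspecting its top-left cell. It assumes that a placeholder corner cell (empty, None, or a single character) signals a table with both top and left headers. That rule, the helper names, and the `"top_and_left"` label are assumptions made for this example; only `"top_only"` appears in the example output of this PR.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_placeholder(cell: Optional[str], single_char_placeholders: bool = True) -> bool:
    """Treat None, empty strings, and optionally single characters as placeholders."""
    if cell is None:
        return True
    text = str(cell).strip()
    if text == "":
        return True
    return single_char_placeholders and len(text) == 1


def detect_header_type(rows: List[Row], single_char_placeholders: bool = True) -> str:
    """Classify a table as 'top_only' or (hypothetical label) 'top_and_left'."""
    if not rows or not rows[0]:
        return "top_only"
    corner = rows[0][0]
    if is_placeholder(corner, single_char_placeholders):
        return "top_and_left"
    return "top_only"
```

Making the single-character check configurable matters because a legitimate one-character header (for example a column named "N") would otherwise be misread as a placeholder.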

Why Pattern-Based Metadata Detection?

The metadata detection (_detect_table_metadata()) uses simple pattern matching (a rule-based approach) because:

  • TableItem.export_to_dataframe() outputs cell text without preserving merged-cell semantics
  • Merged cells exist in other representations (OTSL, HTML, Azure) but not in DataFrame export
  • For reliable title/description extraction, one would need to work directly with TableCell span/grid structures

This approach balances practicality with the limitations of the DataFrame representation.
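
The pattern described above (a first or last row with exactly one non-empty cell) can be sketched as follows. This is a simplified illustration that works on plain lists of rows rather than a DataFrame, and the function name `detect_title_and_description` is hypothetical; it is not the `_detect_table_metadata()` implementation from this PR.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def non_empty_cells(row: Row) -> List[str]:
    """Return the stripped text of every cell that is neither None nor blank."""
    return [str(c).strip() for c in row if c is not None and str(c).strip()]


def detect_title_and_description(
    rows: List[Row],
) -> Tuple[Optional[str], Optional[str], List[Row]]:
    """Peel off a title row and a description row if they match the pattern.

    Returns (title, description, remaining_rows).
    """
    title: Optional[str] = None
    description: Optional[str] = None
    body = list(rows)

    # First row with exactly one non-empty cell: treat it as a title.
    if len(body) > 1 and len(non_empty_cells(body[0])) == 1:
        title = non_empty_cells(body[0])[0]
        body = body[1:]

    # Last row with exactly one non-empty cell: treat it as a description.
    if len(body) > 1 and len(non_empty_cells(body[-1])) == 1:
        description = non_empty_cells(body[-1])[0]
        body = body[:-1]

    return title, description, body
```

Because this is a heuristic, a genuine data row that happens to contain a single value would also be classified as metadata, which is why the detection is described as pattern-based rather than guaranteed.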

Example Output

Input: Document with text and tables

Output:

# Q1 2024 Sales Report

This report summarizes the sales performance.

{
  "table_index": 0,
  "header_type": "top_only",
  "title": "Product Sales",
  "first_row_headers": ["Product", "Q1 Sales", "Growth"],
  "data_rows_count": 2,
  "data": [
    {"Product": "Apples", "Q1 Sales": "$150,000", "Growth": "+15%"},
    {"Product": "Oranges", "Q1 Sales": "$120,000", "Growth": "+8%"}
  ],
  "description": "Source: Finance Department"
}

Overall, Q1 exceeded expectations with 12% growth.
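
For reference, a record shaped like the JSON above can be assembled from a header row and data rows as in the sketch below. The helpers `build_table_record` and `record_to_text` are hypothetical and only mirror the fields shown in the example; the sentence layout produced by `record_to_text` is one possible natural-language rendering, not necessarily the one this PR emits.

```python
import json
from typing import Any, Dict, List, Optional


def build_table_record(
    table_index: int,
    headers: List[str],
    data_rows: List[List[str]],
    title: Optional[str] = None,
    description: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a 'top_only' record with the same fields as the example output."""
    record: Dict[str, Any] = {"table_index": table_index, "header_type": "top_only"}
    if title:
        record["title"] = title
    record["first_row_headers"] = list(headers)
    record["data_rows_count"] = len(data_rows)
    # Pair each cell with its column header.
    record["data"] = [dict(zip(headers, row)) for row in data_rows]
    if description:
        record["description"] = description
    return record


def record_to_text(record: Dict[str, Any]) -> str:
    """Render the record as one 'header: value' sentence per data row."""
    lines: List[str] = []
    if record.get("title"):
        lines.append(f"Table: {record['title']}.")
    for row in record["data"]:
        lines.append(", ".join(f"{k}: {v}" for k, v in row.items()) + ".")
    if record.get("description"):
        lines.append(record["description"])
    return "\n".join(lines)


record = build_table_record(
    0,
    ["Product", "Q1 Sales", "Growth"],
    [["Apples", "$150,000", "+15%"], ["Oranges", "$120,000", "+8%"]],
    title="Product Sales",
    description="Source: Finance Department",
)
print(json.dumps(record, indent=2))
print(record_to_text(record))
```

Keeping each data row as a header-to-value mapping means a chunk that contains only part of the table still carries its column names, which is the main benefit for embedding use.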

Testing

pytest test/test_json_table_serialization.py -v
python examples/json_table_serialization.py

@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

DCO Check Passed

Thanks @slothfrog, all your commits are properly signed off. 🎉

I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 53be320
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 1fe7495
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 459103a
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 493719e
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: a173d94

Signed-off-by: Connie Wang <connieniew@gmail.com>
@mergify

mergify Bot commented Jun 11, 2026

Contributor

Merge Protections

🔴 2 of 2 protections blocking · waiting on 👀 reviews and 🙋 you

Protection | Waiting on
🔴 Enforce conventional commit | 🙋 you
🔴 Require two reviewer for test updates | 👀 reviews

🔴 Enforce conventional commit

Waiting for

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

@slothfrog changed the title from "Feature/json table serialization" to "feat: add JSON table serializer for embedding-friendly output" Jun 11, 2026