feat: add JSON table serializer for embedding-friendly output #637
Open
slothfrog wants to merge 6 commits into
✅ DCO Check Passed
Thanks @slothfrog, all your commits are properly signed off. 🎉
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 53be320
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 1fe7495
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 459103a
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: 493719e
I, Connie Wang <connieniew@gmail.com>, hereby add my Signed-off-by to this commit: a173d94
Signed-off-by: Connie Wang <connieniew@gmail.com>
Merge Protections
🔴 2 of 2 protections blocking · waiting on 👀 reviews and 🙋 you
🔴 Enforce conventional commit
This rule is failing. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.
Add JSON Table Serializer for Embedding-Friendly Output
Issue #3082
Summary
Implements JsonTableSerializer to convert tables into structured JSON format, making them more suitable for embedding models and RAG systems. Tables are serialized as JSON while other document elements remain in markdown/text format.
Key Features
HierarchicalChunker and HybridChunker
Usage
Implementation Notes
Differences from kalle07's Workaround
While inspired by @kalle07's workaround code, this implementation differs in several ways:
export_to_dataframe() does not preserve merged-cell semantics
/,x)
table_index and header_type fields for better context
Why Pattern-Based Metadata Detection?
The metadata detection (_detect_table_metadata()) uses simple pattern matching (a rule-based approach) because:
TableItem.export_to_dataframe() outputs cell text without preserving merged-cell semantics
TableCell span/grid structures
This approach balances practicality with the limitations of the DataFrame representation.
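To make the rule-based idea concrete, here is a minimal, standard-library-only sketch of pattern-based header detection followed by JSON serialization. It is not the code from this PR: the function names (looks_like_header, detect_header_type, table_to_json), the numeric pattern, the "column"/"none" labels, and the "rows" key are all hypothetical, and the table is assumed to be a plain list of rows of cell text (roughly what a DataFrame of strings would give you). Only the table_index and header_type field names come from the description above.

```python
import json
import re

# Hypothetical rule: a cell "looks numeric" if it is a number with an optional
# sign, thousands separators, a decimal part, or a trailing percent sign.
_NUMERIC = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def looks_like_header(cells):
    """Return True if every cell is non-empty and none of them looks numeric."""
    return all(c.strip() and not _NUMERIC.match(c.strip()) for c in cells)


def detect_header_type(rows):
    """Classify a table as 'column' (first row is a header) or 'none'.

    The labels are illustrative; the values used in the PR are not shown
    in the description.
    """
    if len(rows) < 2:
        return "none"
    first_row_textual = looks_like_header(rows[0])
    body_has_numbers = any(
        _NUMERIC.match(c.strip()) for row in rows[1:] for c in row
    )
    return "column" if first_row_textual and body_has_numbers else "none"


def table_to_json(rows, table_index):
    """Serialize a table (list of rows of strings) into a JSON string."""
    header_type = detect_header_type(rows)
    if header_type == "column":
        # Key each body row by the detected column headers.
        header, body = rows[0], rows[1:]
        records = [dict(zip(header, row)) for row in body]
    else:
        # No reliable header found: keep the rows as plain lists.
        records = [list(row) for row in rows]
    payload = {
        "table_index": table_index,
        "header_type": header_type,
        "rows": records,
    }
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    demo = [["Region", "Revenue"], ["North", "1,200"], ["South", "950"]]
    print(table_to_json(demo, table_index=0))
```

The heuristic only looks at cell text, so it shares the limitation noted above: merged or spanning header cells cannot be recovered once the table has been flattened.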
Example Output
Input: Document with text and tables
Output:
Testing