fix: restore visited set when TripletTableSerializer produces empty output #597
Open
OfekDanny wants to merge 2 commits into
…utput

Layout tables in DOCX are represented as TableItems with RichTableCell entries pointing to TextItem children. When TripletTableSerializer serializes such a table it calls _export_to_dataframe_with_options, which internally calls doc_serializer.serialize() for each RichTableCell reference, marking every child item in the shared visited set.

If the table then serializes to empty text (e.g. all rows are column headers, so there are no data rows), all child TextItems are permanently marked as visited and skipped by the chunker. The result is zero chunks for the document even though export_to_text() returns substantial text.

Fix: snapshot visited before the dataframe export; if the table produces no output, restore visited to its pre-call state so the child items remain available for chunking by subsequent iterations.

Adds a regression test that builds a minimal all-header layout table and asserts that HierarchicalChunker yields chunks for the child TextItems.

Fixes #3335
✅ DCO Check Passed. Thanks @OfekDanny, all your commits are properly signed off. 🎉
Merge Protections
🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updates: this rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, Ofek Danny <ofek@kaps.co.il>, hereby add my Signed-off-by to this commit: 1c27817 Signed-off-by: Ofek Danny <ofek@kaps.co.il>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Problem
Closes #3335 (reported against `docling`; the root cause is in `docling-core`).

DOCX files that use layout tables for formatting produce zero chunks from both `HierarchicalChunker` and `HybridChunker`, even though `doc.export_to_text()` returns substantial content.

Root cause
Layout tables are represented as `TableItem` objects whose cells are `RichTableCell` entries pointing to `TextItem` children of the table. During chunking:

1. `HierarchicalChunker` calls `doc_serializer.serialize(item=table, visited=visited)`.
2. `DocSerializer.serialize()` marks the `TableItem` itself in `visited`, then calls `TripletTableSerializer.serialize()`.
3. `TripletTableSerializer` calls `_export_to_dataframe_with_options()`, which internally calls `doc_serializer.serialize()` for every `RichTableCell` reference, marking all child items in the shared `visited` set.
4. If the table then serializes to empty text (e.g. all rows are column headers, so there are no data rows), every child `TextItem` is permanently visited and is skipped by the chunker.

The workaround in the issue (`labels.discard(DocItemLabel.TABLE)`) bypasses the table entirely, avoiding visited-set poisoning, but is not a correct general fix.

Fix
In `TripletTableSerializer.serialize()`, snapshot `visited` before calling `_export_to_dataframe_with_options`. If the table produces no output text, restore `visited` to the snapshot so child items remain available for chunking.

This is safe: tables that produce text continue to work exactly as before. Only tables that produce no output have their visited side-effects undone.
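The poisoning and the snapshot/restore fix can be illustrated with a small toy model. This is only a sketch and not the docling-core code: `Text`, `Table`, `render_cells`, `serialize_table`, and the `restore_on_empty` flag are hypothetical stand-ins invented for illustration, and the reference strings are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Text:
    """Hypothetical stand-in for a TextItem: a reference plus its text."""
    ref: str
    text: str


@dataclass
class Table:
    """Hypothetical stand-in for a layout table whose cells point at Text children."""
    ref: str
    children: list[Text] = field(default_factory=list)
    all_headers: bool = True  # header-only tables render to empty text


def render_cells(table: Table, visited: set[str]) -> str:
    """Mimics the export step: marks every child as visited, may return ''."""
    for child in table.children:
        visited.add(child.ref)  # side effect on the shared set
    if table.all_headers:
        return ""  # no data rows, so nothing to emit
    return ". ".join(child.text for child in table.children)


def serialize_table(table: Table, visited: set[str], restore_on_empty: bool) -> str:
    snapshot = set(visited)  # copy taken before the export mutates the set
    text = render_cells(table, visited)
    if restore_on_empty and not text:
        # Roll back in place so every holder of the shared set sees the undo.
        visited.clear()
        visited.update(snapshot)
    return text


table = Table("#/tables/0", [Text("#/texts/0", "Name"), Text("#/texts/1", "Address")])

# Without the fix: the children stay marked although nothing was emitted.
poisoned = {table.ref}
serialize_table(table, poisoned, restore_on_empty=False)

# With the fix: only the table itself remains marked.
restored = {table.ref}
serialize_table(table, restored, restore_on_empty=True)
```

In this sketch the rollback mutates the existing set rather than rebinding the name, on the assumption that the caller holds a reference to the same set object and must observe the restored state.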
Test
Added `test_layout_table_produces_chunks` to `test/test_hierarchical_chunker.py`. It builds a minimal two-row, one-column table where both rows are column headers (so `TripletTableSerializer` produces empty text), and asserts that `HierarchicalChunker` still yields one chunk per `TextItem` child.

All existing chunker tests continue to pass.
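The shape of such a regression test can be sketched against a toy chunker. This is not the test added in this PR and it does not use the docling-core API: `serialize_table`, `toy_chunks`, and `test_layout_table_produces_chunks_toy` are hypothetical names, and the header-only table is simulated by always rendering empty text.

```python
def serialize_table(children, visited):
    """Toy table serializer: visits the children, emits '' for a header-only
    table, and rolls `visited` back when nothing was emitted."""
    snapshot = set(visited)
    visited.update(ref for ref, _ in children)
    text = ""  # simulated header-only table: no data rows to render
    if not text:
        visited.clear()
        visited.update(snapshot)
    return text


def toy_chunks(table_ref, children):
    """Toy chunker: handles the table first, then each child in document
    order, skipping anything already marked as visited."""
    visited = set()
    chunks = []
    visited.add(table_ref)
    table_text = serialize_table(children, visited)
    if table_text:
        chunks.append(table_text)
    for ref, text in children:
        if ref in visited:
            continue  # already consumed by an earlier item
        visited.add(ref)
        chunks.append(text)
    return chunks


def test_layout_table_produces_chunks_toy():
    children = [("#/texts/0", "Header A"), ("#/texts/1", "Header B")]
    # One chunk per child, even though the table itself rendered nothing.
    assert toy_chunks("#/tables/0", children) == ["Header A", "Header B"]
```

Removing the rollback in `serialize_table` makes the assertion fail with an empty chunk list, which is the behaviour reported in the issue.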