fix: restore visited set when TripletTableSerializer produces empty output #597

Open
OfekDanny wants to merge 2 commits into docling-project:main from OfekDanny:fix/layout-table-zero-chunks

@OfekDanny

Problem

Closes #3335 (reported against docling; the root cause is in docling-core).

DOCX files that use layout tables for formatting produce zero chunks from both HierarchicalChunker and HybridChunker, even though doc.export_to_text() returns substantial content.

Root cause

Layout tables are represented as TableItem objects whose cells are RichTableCell entries pointing to TextItem children of the table. During chunking:

  1. HierarchicalChunker calls doc_serializer.serialize(item=table, visited=visited).
  2. DocSerializer.serialize() marks the TableItem itself in visited, then calls TripletTableSerializer.serialize().
  3. TripletTableSerializer calls _export_to_dataframe_with_options(), which internally calls doc_serializer.serialize() for every RichTableCell reference, marking all child items in the shared visited set.
  4. If the layout table serializes to empty text (e.g. all rows are column headers, so the triplet format has no data rows to emit), every child TextItem stays marked as visited even though none of its text was emitted.
  5. The chunker's main loop skips all visited items → 0 chunks.
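
The sequence above can be reproduced with a toy model. This is a minimal sketch, not docling-core's code: `serialize_table_triplets` and `chunk` are hypothetical stand-ins for TripletTableSerializer and the chunker's main loop, and document items are reduced to plain strings.

```python
# Toy model of the visited-set poisoning; none of these names are docling-core APIs.


def serialize_table_triplets(cells: list[str], data_rows: int, visited: set[str]) -> str:
    """Hypothetical stand-in for the table serializer."""
    # Serializing each cell marks it as visited, as a side effect.
    for ref in cells:
        visited.add(ref)
    # With no data rows there is nothing to emit in a triplet format.
    if data_rows == 0:
        return ""
    return ", ".join(cells)


def chunk(items: list[str], table_cells: list[str], data_rows: int) -> list[str]:
    """Hypothetical stand-in for the chunker's main loop."""
    visited: set[str] = set()
    chunks: list[str] = []
    for item in items:
        if item in visited:
            continue  # already consumed by an earlier serializer call
        visited.add(item)
        if item == "table":
            text = serialize_table_triplets(table_cells, data_rows, visited)
        else:
            text = item
        if text:
            chunks.append(text)
    return chunks


# An all-header layout table: two cells, zero data rows.
poisoned = chunk(["table", "text-1", "text-2"], ["text-1", "text-2"], data_rows=0)
print(poisoned)  # []
```

The table contributes no text, yet both of its children are skipped afterwards, so the toy document yields no chunks at all.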

The workaround in the issue (labels.discard(DocItemLabel.TABLE)) bypasses the table entirely, avoiding visited-set poisoning, but is not a correct general fix.

Fix

In TripletTableSerializer.serialize(), snapshot visited before calling _export_to_dataframe_with_options. If the table produces no output text, restore visited to the snapshot so child items remain available for chunking.

visited_before: set[str] = set(visited) if visited is not None else set()

# ... export & render the table ...

if not text_res and visited is not None:
    visited.clear()
    visited.update(visited_before)

This is safe: tables that produce text continue to work exactly as before. Only tables that produce no output have their visited side-effects undone.
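
The restore has to mutate visited in place, because the chunker holds a reference to the same set object; rebinding the local name would leave the caller's set poisoned. A minimal sketch of the difference, using hypothetical helper names that are not part of docling-core:

```python
# Illustrative helpers only; they do not exist in docling-core.


def restore_by_rebinding(visited: set[str], snapshot: set[str]) -> None:
    # Rebinds only the local name; the caller's set is unchanged.
    visited = set(snapshot)


def restore_in_place(visited: set[str], snapshot: set[str]) -> None:
    # Mutates the shared object, so the caller sees the restored state.
    visited.clear()
    visited.update(snapshot)


shared = {"table"}
snapshot = set(shared)               # copy taken before the export
shared.update({"text-1", "text-2"})  # side effects of an empty-output table

restore_by_rebinding(shared, snapshot)
after_rebinding = set(shared)

restore_in_place(shared, snapshot)
after_in_place = set(shared)

print(sorted(after_rebinding))  # ['table', 'text-1', 'text-2']
print(sorted(after_in_place))   # ['table']
```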

Test

Added test_layout_table_produces_chunks to test/test_hierarchical_chunker.py. It builds a minimal two-row, one-column table where both rows are column headers (so TripletTableSerializer produces empty text), and asserts that HierarchicalChunker still yields one chunk per TextItem child.

All existing chunker tests continue to pass.

…utput

Layout tables in DOCX are represented as TableItems with RichTableCell
entries pointing to TextItem children. When TripletTableSerializer
serializes such a table it calls _export_to_dataframe_with_options,
which internally calls doc_serializer.serialize() for each RichTableCell
reference, marking every child item in the shared visited set.

If the table then serializes to empty text (e.g. all rows are column
headers so there are no data rows), all child TextItems are permanently
marked as visited and skipped by the chunker, resulting in zero chunks
for the document even though export_to_text() returns substantial text.

Fix: snapshot visited before the dataframe export; if the table produces
no output, restore visited to its pre-call state so the child items
remain available for chunking by subsequent iterations.

Adds a regression test that builds a minimal all-header layout table and
asserts that HierarchicalChunker yields chunks for the child TextItems.

Fixes #3335
@github-actions

github-actions Bot commented Apr 27, 2026

DCO Check Passed

Thanks @OfekDanny, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Apr 27, 2026

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

| Protection | Waiting on |
| --- | --- |
| 🔴 Require two reviewer for test updates | 👀 reviews |
| 🟢 Enforce conventional commit | |

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
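
To see which titles this rule accepts, the pattern can be tried locally. The sketch below copies the pattern from the rule and assumes that the `~=` operator behaves like Python's `re.search`; `is_conventional` is a hypothetical helper, not part of Mergify.

```python
import re

# Pattern copied from the rule above; treating `title ~=` as re.search is an assumption.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix: restore visited set when TripletTableSerializer produces empty output"))  # True
print(is_conventional("fix(chunker)!: drop legacy option"))  # True
print(is_conventional("Restore visited set"))  # False
```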

I, Ofek Danny <ofek@kaps.co.il>, hereby add my Signed-off-by to this commit: 1c27817

Signed-off-by: Ofek Danny <ofek@kaps.co.il>
@codecov

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
