
Fix/doclang suppress content filtered shells #656

Open: nassarofficial wants to merge 5 commits into main from fix/doclang-suppress-content-filtered-shells


@nassarofficial


Problem: When we serialize with narrow content_types (e.g. picture-only prompts), the serializer was still emitting head-only shells: elements carrying metadata (layer, thread, location, classification label) but no actual supervised body content. Those shells leaked spurious tags into WDS task-filtered training targets.

Fix (opt-in only, default behavior unchanged). All new logic is gated behind suppress_empty_elements=True (the default remains False):

• Text / table / picture: if an element’s body type is filtered out and there’s no visible content, drop the element entirely instead of emitting an empty tag or metadata-only shell
• Layout exception: if add_location=True and the item has provenance, keep the element so layout supervision still gets boxes
• Picture labels: under suppression, the classification label is only emitted when picture/chart/chemistry content is actually requested (layout-only prompts get boxes, no classification label)
• New param: emit_picture_layer (default True) lets callers omit the layer tag on pictures (we set this to False in granite for picture-classification prompts)
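
The rules in the list above can be sketched as a small decision function. This is only an illustration of the described behavior, not the serializer code from this PR: `SketchParams`, `keep_element`, and `picture_metadata` are hypothetical names, "no visible content" is simplified to "the element's body kind is not in content_types", picture/chart/chemistry are collapsed into a single "picture" kind, and the metadata names are placeholders. Only the flag names (suppress_empty_elements, add_location, emit_picture_layer, content_types) come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchParams:
    """Hypothetical stand-in for the serializer params discussed in this PR."""
    content_types: frozenset = frozenset({"text", "table", "picture"})
    suppress_empty_elements: bool = False  # opt-in; default behavior unchanged
    add_location: bool = False
    emit_picture_layer: bool = True


def keep_element(kind: str, has_provenance: bool, p: SketchParams) -> bool:
    """Decide whether a text/table/picture element is emitted at all."""
    if not p.suppress_empty_elements:
        return True  # legacy behavior: shells are still emitted
    if kind in p.content_types:
        return True  # body content was requested, so nothing is empty
    # Layout exception: keep the element so layout supervision gets boxes.
    return p.add_location and has_provenance


def picture_metadata(has_provenance: bool, p: SketchParams) -> list:
    """List the (placeholder) metadata parts a picture element would carry."""
    if not keep_element("picture", has_provenance, p):
        return []
    parts = []
    if p.emit_picture_layer:
        parts.append("layer")
    if p.add_location and has_provenance:
        parts.append("location")
    # Under suppression the label needs picture content to be requested.
    if not p.suppress_empty_elements or "picture" in p.content_types:
        parts.append("classification_label")
    return parts


# Layout-only prompt: text is the only requested body, boxes are wanted.
layout_only = SketchParams(
    content_types=frozenset({"text"}),
    suppress_empty_elements=True,
    add_location=True,
)
print(picture_metadata(True, layout_only))  # ['layer', 'location']
```

With emit_picture_layer=False on the same params, the sketch would return only the location part, which matches the picture-classification use case mentioned for granite.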

Ahmed Nassar AHN@zurich.ibm.com added 3 commits June 23, 2026 11:40
When suppress_empty_elements is enabled and content_types narrows the
serialized body (task-filtered training), omit text/table/picture elements
that would otherwise emit metadata-only shells (layer, thread, location,
classification label without allowed body). Adds emit_picture_layer and
layout-only picture box behavior with regression tests.
Mirrors granite-4-docling OCR-only training path so the serializer fix
does not rely on callers downgrading label_mode to AUTO.
@nassarofficial nassarofficial requested a review from vagenas June 24, 2026 12:54

github-actions Bot commented Jun 24, 2026


DCO Check Passed

Thanks @nassarofficial, all your commits are properly signed off. 🎉


mergify Bot commented Jun 24, 2026


Merge Protections

🔴 2 of 2 protections blocking · waiting on 👀 reviews and 🙋 you

Protection | Waiting on
🔴 Enforce conventional commit | 🙋 you
🔴 Require two reviewer for test updates | 👀 reviews

🔴 Enforce conventional commit

Waiting for

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
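
To see why the rule is failing, the title can be checked locally against the same pattern. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search with Python `re` semantics; the helper name `is_conventional` and the second title are hypothetical and only illustrate a title that would match.

```python
import re

# Pattern copied verbatim from the Mergify rule quoted above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# Current PR title: capitalised "Fix" and "/" instead of ":" do not match.
print(is_conventional("Fix/doclang suppress content filtered shells"))  # False

# Hypothetical retitle with a lowercase type, optional scope, and a colon.
print(is_conventional("fix(doclang): suppress content-filtered shells"))  # True
```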


🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers


Ahmed Nassar AHN@zurich.ibm.com added 2 commits June 24, 2026 12:59
Update content-filtered table/picture tests to reflect that
suppress_empty_elements is opt-in and that add_location keeps layout
boxes (without classification labels) for filtered elements.

Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
…h.ibm.com>

I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 323dd1d
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d12853a
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: e2bfb21

Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>