Fix/doclang suppress content filtered shells #656
Open
nassarofficial wants to merge 5 commits into
added 3 commits
June 23, 2026 11:40
When suppress_empty_elements is enabled and content_types narrows the serialized body (task-filtered training), omit text/table/picture elements that would otherwise emit metadata-only shells (layer, thread, location, classification label without allowed body). Adds emit_picture_layer and layout-only picture box behavior with regression tests.
Mirrors granite-4-docling OCR-only training path so the serializer fix does not rely on callers downgrading label_mode to AUTO.
Contributor
✅ DCO Check Passed: Thanks @nassarofficial, all your commits are properly signed off. 🎉
Contributor
Merge Protections
🔴 2 of 2 protections blocking · waiting on 👀 reviews and 🙋 you
🔴 Enforce conventional commit
This rule is failing. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.
added 2 commits
June 24, 2026 12:59
Update content-filtered table/picture tests to reflect that suppress_empty_elements is opt-in and that add_location keeps layout boxes (without classification labels) for filtered elements. Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
…h.ibm.com>
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 323dd1d
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: d12853a
I, Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>, hereby add my Signed-off-by to this commit: e2bfb21
Signed-off-by: Ahmed Nassar AHN@zurich.ibm.com <AHN@zurich.ibm.com>
Problem
When we serialize with narrow content_types (e.g. picture-only prompts), the serializer still emitted head-only shells: elements with metadata (layer, thread, location, classification label) but no actual supervised body content. That leaked spurious tags into WDS task-filtered training targets.
Fix (opt-in only; default behavior unchanged)
All new logic is gated behind suppress_empty_elements=True (the default remains False):
• Text / table / picture: if an element’s body type is filtered out and there’s no visible content, drop the element entirely instead of emitting an empty tag or metadata-only shell
• Layout exception: if add_location=True and the item has provenance, keep the element so layout supervision still gets boxes
• Picture labels: under suppression, the classification label is only emitted when picture/chart/chemistry content is actually requested (layout-only prompts get boxes, no classification label)
• New param: emit_picture_layer (default True), which lets callers omit the layer metadata on pictures (we set this to False in granite for picture-classification prompts)
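For reviewers, the rules above can be read as a small decision function. The following is only a minimal sketch of that logic, not the serializer code in this PR: the `Item` and `Params` classes, the helper names, and the content-type strings in `PICTURE_CONTENT` are hypothetical stand-ins; only the parameter names (suppress_empty_elements, content_types, add_location, emit_picture_layer) come from the description. The behavior without suppression, and with content_types unset, is assumed to be "emit everything".

```python
from dataclasses import dataclass
from typing import Optional, Set

# Assumed names for the content types that count as "picture content requested".
PICTURE_CONTENT = {"picture", "chart", "chemistry"}


@dataclass
class Item:
    """Hypothetical stand-in for a document element (not the real item type)."""
    kind: str                          # e.g. "text", "table", "picture"
    has_visible_content: bool = False  # anything still rendered after filtering
    has_provenance: bool = False       # item carries a layout box


@dataclass
class Params:
    """Hypothetical parameter bundle; only the field names come from the PR."""
    content_types: Optional[Set[str]] = None  # None is assumed to mean "no filtering"
    suppress_empty_elements: bool = False     # opt-in; default path is unchanged
    add_location: bool = False
    emit_picture_layer: bool = True


def should_emit(item: Item, params: Params) -> bool:
    """Decide whether the element is serialized at all."""
    if not params.suppress_empty_elements:
        return True  # default behavior: nothing is dropped
    filtered_out = (
        params.content_types is not None
        and item.kind not in params.content_types
    )
    if not (filtered_out and not item.has_visible_content):
        return True  # body is allowed, or something visible remains
    # Layout exception: keep the element so layout supervision still gets boxes.
    return params.add_location and item.has_provenance


def emit_picture_label(params: Params) -> bool:
    """Decide whether a picture's classification label is written."""
    if not params.suppress_empty_elements or params.content_types is None:
        return True
    return bool(PICTURE_CONTENT & params.content_types)


def emit_layer(item: Item, params: Params) -> bool:
    """Pictures honor emit_picture_layer; other kinds are assumed unaffected."""
    return item.kind != "picture" or params.emit_picture_layer
```

In this sketch, a table serialized under a picture-only prompt is dropped once suppress_empty_elements=True, but is kept as a box-only element when add_location=True and the item has provenance.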