fix: (cont.) Remove Extra Space Before and After Group Items using Inline Boundaries#605
Conversation
Merge Protections🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updatesWaiting for
This rule is failing.When test data is updated, we require two reviewers
Show 1 satisfied protection🟢 Enforce conventional commitMake sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
✅ DCO Check Passed Thanks @wanadzhar913, all your commits are properly signed off. 🎉 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Hi everyone, thanks so much for reviewing my code. Let me know if there's anything else I can change, as I recognize this approach is much more verbose (and has more code to maintain). @vagenas |
ceberam
left a comment
There was a problem hiding this comment.
@wanadzhar913 could you please check the conflict with test/test_serialization.py?
I have created a draft PR on docling project that pins the latest commit of this PR and regenerates the ground truth files using the changes of your PR: docling-project/docling#3527
(Note that it is just a draft PR and the CI/CD checks complain, since it is using docling-core as a Git dependency source)
Please, review the output, since your changes will definitely have an impact on docling repo. At a first glance, many issues discussed on this PR are now resolved, but I see others that may need attention (e.g., in some cases, necessary blank spaces are now removed).
HI @ceberam, I'll look over the integration over the weekend.
|
78392e8 to
0a9ba2f
Compare
|
Whoops, sorry! Requested a review too early. Will lyk once I've fixed the main docling package integration. |
0a9ba2f to
b678d1c
Compare
|
@wanadzhar913 how is the PR progressing? Were you able to check the implications on docling repo? Let me know if you want me to update the PR docling-project/docling#3527 to reflect any changes on this PR. |
|
hi @wanadzhar913 it would be great to close this task soon. Is this PR ready to be reviewed? |
3ea9560 to
6e8828a
Compare
|
Hi @ceberam, apologies for the delay in getting back to you. Some of the changes I found that needed to be applied were quite sizeable (apologies for the +1300 line diff!). Can I trouble you to rerun the tests on docling-project/docling#3527 and see if they're satisfactory? To make it easier, I've also added a Mermaid diagram (in the PR description) for all possible control flow. This will be quite cumbersome to maintain but happy for suggestions on how to cut this down. Tests I ran in docling-project/docling: DOCLING_GEN_TEST_DATA=1 uv run pytest \
-k "not test_gen_test_data_flag" \
tests/test_e2e_conversion.py \
tests/test_e2e_ocr_conversion.py \
tests/test_backend_webp.py \
tests/test_interfaces.py \
tests/test_backend_csv.py \
tests/test_backend_msword.py \
tests/test_backend_msexcel.py \
tests/test_backend_pptx.py \
tests/test_backend_html.py \
tests/test_backend_jats.py \
tests/test_backend_vtt.py \
tests/test_backend_xbrl.py \
tests/test_backend_patent_uspto.py \
tests/test_backend_markdown.py \
tests/test_latex/test_basic.py |
|
In my previous approach, provenance tended to enforce the wrong boundary:
This broke OCR-like cases where a word is split with a false source gap, and also cases where separate words have contiguous charspans. Hence, the reason for the very large diff is mainly to account for the examples below (for when provenance and Examples:
|
|
@wanadzhar913 no worries , we all knew this is a challenging issue.
|
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
…ries/styled text handling by letting text heuristics override provenance only when it is safe Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
6e8828a to
f5298f8
Compare
|
Hi @ceberam, I see and noted! I'll leave my OCR attempts at that, and have reran everything with |
Thanks @wanadzhar913 , I have rebased docling's docling-project/docling#3527 pinning your latest commit. Can you check if the serializations are as you expected? I had a quick look and it seems pretty good! |
|
Yeahp, they're in line. Though upon closer inspection, I think it still missed:
Besides that, I think it looks pretty great! Lmk how you want to proceed @ceberam. |
Details
This is a continuation of the work in Pull Request: #458 which removes extra space before and after group items to resolve the issue raised in #2745
Resolves #371
Resolves docling-project/docling#2745
Approach
Refactors inline spacing in
docling_core/transforms/serializer/common.pyinto a clearer decision flow centered on_classify_inline_boundary()instead the old approach in #458 where we just remove the space (" ") when joining all parts without separators.Control Flow
_join_inline_parts()is the entry point. It walks adjacent inline chunks, calls_classify_inline_boundary()for each boundary condition, and inserts a space only when that classifier returnsInlineBoundary.SPACE._classify_inline_boundary()handles boundaries in a fixed order:Control Flow for
_classify_inline_boundary()flowchart TD A["_classify_inline_boundary"] --> B["Read rendered boundary chars<br/>prev_tail = prev_text[-1]<br/>curr_head = text[0]"] B --> C{"Already whitespace<br/>on either side?"} C -- "Yes" --> J1["JOIN<br/>avoid duplicate spacing"] C -- "No" --> D{"Missing item metadata?<br/>prev_item is None or item is None"} D -- "Yes" --> K["_classify_character_boundary(prev_tail, curr_head)"] D -- "No" --> P["_classify_provenance_boundary(prev_item, item)"] P --> T{"Both normal TextItem?<br/>not semantic inline atoms"} T -- "Yes" --> TB["_classify_text_boundary(prev_item, item)"] TB --> O{"Text boundary should<br/>override provenance?"} O -- "Yes" --> R1["Return text boundary"] O -- "No" --> PV T -- "No" --> PV{"Provenance boundary known?"} PV -- "SPACE" --> R2["SPACE"] PV -- "JOIN but cannot safely override spacing" --> R3["SPACE"] PV -- "JOIN allowed or UNKNOWN" --> RAW["Choose raw chars from item.text<br/>for TextItem, CodeItem, FormulaItem"] RAW --> S1{"Regular TextItem before<br/>semantic inline atom?"} S1 -- "Alnum or : ; , & before atom" --> R4["SPACE"] S1 -- "Otherwise" --> RC1["_classify_rendered_character_boundary"] RAW --> S2{"Semantic inline atom before<br/>TextItem?"} S2 -- "Text begins alnum, (, &, [, or quote" --> R5["SPACE"] S2 -- "Otherwise" --> RC2["_classify_rendered_character_boundary<br/>using rendered text chars"] RAW --> S3["Other cases"] S3 --> RC3["_classify_rendered_character_boundary"] RC1 --> K RC2 --> K RC3 --> K K --> OUT{"Result"} OUT -- "SPACE" --> RS["Insert a space"] OUT -- "JOIN or UNKNOWN" --> RJ["Append directly"]NOTE: Inline serialization was previously making spacing decisions from text or provenance, but the real inputs have competing signals: rendered markdown, raw text, source orig, and provenance charspans can disagree.
_classify_inline_boundary()now lets text heuristics override provenance only in controlled cases.Helper Roles
Control Flow for
_classify_character_boundary()flowchart TD A["_classify_character_boundary(prev_tail, curr_head)"] --> B{"Missing char?"} B -- "Yes" --> U["UNKNOWN"] B -- "No" --> C{"Word punctuation boundary?"} C -- "comma/semicolon/colon + alnum" --> S["SPACE"] C -- "period + alnum or '['" --> S C -- "')' + '['" --> S C -- "alnum + '&'" --> S C -- "No" --> D{"Both alnum?"} D -- "Yes" --> S D -- "No" --> E{"Word join char involved?<br/>- or /"} E -- "Yes" --> J["JOIN"] E -- "No" --> F{"Current char is right-attaching?<br/>)]},;:.!?%"} F -- "Yes" --> J F -- "No" --> G{"Previous char is bracket opener?<br/>( [ {"} G -- "Yes" --> J G -- "No" --> H{"Quote next to quote?"} H -- "Yes" --> J H -- "No" --> UTests
# when new datasets are needed DOCLING_GEN_TEST_DATA=1 uv run pytest -q \ test/test_serialization.py \ test/test_plain_text_serialization.py \ test/test_docling_doc.py uv run pytest -q \ test/test_serialization.py \ test/test_plain_text_serialization.py \ test/test_docling_doc.py \ --cov=docling_core.transforms.serializer.common \ --cov-report term-missing