fix: (cont.) Remove Extra Space Before and After Group Items using Inline Boundaries #605

Open
wanadzhar913 wants to merge 3 commits into docling-project:main from wanadzhar913:bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems

wanadzhar913 commented May 7, 2026

Contributor

Details

This continues the work in pull request #458, which removes the extra space before and after group items to resolve the issue raised in #2745.

Resolves #371
Resolves docling-project/docling#2745

Approach

Refactors inline spacing in docling_core/transforms/serializer/common.py into a clearer decision flow centered on _classify_inline_boundary(), instead of the old approach in #458, where we simply removed the space (" ") when joining all parts without separators.

Control Flow

_join_inline_parts() is the entry point. It walks adjacent inline chunks, calls _classify_inline_boundary() for each boundary condition, and inserts a space only when that classifier returns InlineBoundary.SPACE.
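
As a rough illustration of that entry point, here is a minimal sketch. It is not the code in this PR: `join_inline_parts_sketch`, `toy_classifier`, and the enum's string values are hypothetical, and the real function also receives item metadata rather than bare strings.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class InlineBoundary(Enum):
    # Member names follow the PR description; the values are assumptions.
    SPACE = "space"
    JOIN = "join"
    UNKNOWN = "unknown"


def join_inline_parts_sketch(
    parts: Sequence[str],
    classify: Callable[[str, str], InlineBoundary],
) -> str:
    """Join rendered inline chunks, adding a space only on SPACE boundaries."""
    out = ""
    prev: Optional[str] = None
    for text in parts:
        if not text:
            continue  # an empty chunk creates no boundary
        if prev is not None and classify(prev, text) is InlineBoundary.SPACE:
            out += " "
        out += text
        prev = text
    return out


def toy_classifier(prev_text: str, text: str) -> InlineBoundary:
    # Hypothetical stand-in: space only between two alphanumeric characters.
    if prev_text[-1].isalnum() and text[0].isalnum():
        return InlineBoundary.SPACE
    return InlineBoundary.JOIN
```

With this toy classifier, `join_inline_parts_sketch(["plain", "text", "."], toy_classifier)` yields `plain text.`: JOIN and UNKNOWN both append directly, so only an explicit SPACE verdict adds whitespace.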

_classify_inline_boundary() handles boundaries in a fixed order:

Control Flow for _classify_inline_boundary()
flowchart TD
    A["_classify_inline_boundary"] --> B["Read rendered boundary chars<br/>prev_tail = prev_text[-1]<br/>curr_head = text[0]"]

    B --> C{"Already whitespace<br/>on either side?"}
    C -- "Yes" --> J1["JOIN<br/>avoid duplicate spacing"]

    C -- "No" --> D{"Missing item metadata?<br/>prev_item is None or item is None"}
    D -- "Yes" --> K["_classify_character_boundary(prev_tail, curr_head)"]

    D -- "No" --> P["_classify_provenance_boundary(prev_item, item)"]
    P --> T{"Both normal TextItem?<br/>not semantic inline atoms"}

    T -- "Yes" --> TB["_classify_text_boundary(prev_item, item)"]
    TB --> O{"Text boundary should<br/>override provenance?"}
    O -- "Yes" --> R1["Return text boundary"]
    O -- "No" --> PV

    T -- "No" --> PV{"Provenance boundary known?"}

    PV -- "SPACE" --> R2["SPACE"]
    PV -- "JOIN but cannot safely override spacing" --> R3["SPACE"]
    PV -- "JOIN allowed or UNKNOWN" --> RAW["Choose raw chars from item.text<br/>for TextItem, CodeItem, FormulaItem"]

    RAW --> S1{"Regular TextItem before<br/>semantic inline atom?"}
    S1 -- "Alnum or : ; , & before atom" --> R4["SPACE"]
    S1 -- "Otherwise" --> RC1["_classify_rendered_character_boundary"]

    RAW --> S2{"Semantic inline atom before<br/>TextItem?"}
    S2 -- "Text begins alnum, (, &, [, or quote" --> R5["SPACE"]
    S2 -- "Otherwise" --> RC2["_classify_rendered_character_boundary<br/>using rendered text chars"]

    RAW --> S3["Other cases"]
    S3 --> RC3["_classify_rendered_character_boundary"]

    RC1 --> K
    RC2 --> K
    RC3 --> K

    K --> OUT{"Result"}
    OUT -- "SPACE" --> RS["Insert a space"]
    OUT -- "JOIN or UNKNOWN" --> RJ["Append directly"]

NOTE: Inline serialization was previously making spacing decisions from text or provenance, but the real inputs have competing signals: rendered markdown, raw text, source orig, and provenance charspans can disagree. _classify_inline_boundary() now lets text heuristics override provenance only in controlled cases.
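
To make the precedence concrete, here is a simplified sketch of the order in which the signals are consulted. It assumes every signal has already been computed and bundled into a hypothetical `BoundarySignals` record (not a type in this PR), and it leaves out the branches for semantic inline atoms.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InlineBoundary(Enum):
    SPACE = "space"
    JOIN = "join"
    UNKNOWN = "unknown"


@dataclass
class BoundarySignals:
    """Hypothetical bundle of precomputed signals for one boundary."""

    prev_text: str
    text: str
    has_item_metadata: bool = True
    provenance: InlineBoundary = InlineBoundary.UNKNOWN
    # None when the two items are not both plain TextItems.
    text_boundary: Optional[InlineBoundary] = None
    text_overrides_provenance: bool = False
    join_is_safe: bool = True
    # Result of the character-level fallback classifier.
    character_boundary: InlineBoundary = InlineBoundary.UNKNOWN


def resolve_boundary(s: BoundarySignals) -> InlineBoundary:
    # 1. Existing whitespace on either side: never add a second space.
    if s.prev_text[-1:].isspace() or s.text[:1].isspace():
        return InlineBoundary.JOIN
    # 2. Without item metadata, only the characters can be inspected.
    if not s.has_item_metadata:
        return s.character_boundary
    # 3. Text heuristics win only when explicitly allowed to override.
    if s.text_boundary is not None and s.text_overrides_provenance:
        return s.text_boundary
    # 4. Provenance says SPACE, or says JOIN where joining is not safe.
    if s.provenance is InlineBoundary.SPACE:
        return InlineBoundary.SPACE
    if s.provenance is InlineBoundary.JOIN and not s.join_is_safe:
        return InlineBoundary.SPACE
    # 5. Otherwise fall back to the character-level decision.
    return s.character_boundary
```

The point of the ordering is that provenance stays the default authority, and a text-based verdict replaces it only through the explicit `text_overrides_provenance` gate.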

Helper Roles

Control Flow for _classify_character_boundary()
flowchart TD
    A["_classify_character_boundary(prev_tail, curr_head)"] --> B{"Missing char?"}
    B -- "Yes" --> U["UNKNOWN"]

    B -- "No" --> C{"Word punctuation boundary?"}
    C -- "comma/semicolon/colon + alnum" --> S["SPACE"]
    C -- "period + alnum or '['" --> S
    C -- "')' + '['" --> S
    C -- "alnum + '&'" --> S

    C -- "No" --> D{"Both alnum?"}
    D -- "Yes" --> S

    D -- "No" --> E{"Word join char involved?<br/>- or /"}
    E -- "Yes" --> J["JOIN"]

    E -- "No" --> F{"Current char is right-attaching?<br/>)]},;:.!?%"}
    F -- "Yes" --> J

    F -- "No" --> G{"Previous char is bracket opener?<br/>( [ {"}
    G -- "Yes" --> J

    G -- "No" --> H{"Quote next to quote?"}
    H -- "Yes" --> J
    H -- "No" --> U

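
The flowchart for _classify_character_boundary() can be transcribed roughly as follows. This is a sketch read off the diagram, not the implementation in common.py: the function name carries a `_sketch` suffix, and the set of quote characters is an assumption because the diagram does not list them.

```python
from enum import Enum
from typing import Optional


class InlineBoundary(Enum):
    SPACE = "space"
    JOIN = "join"
    UNKNOWN = "unknown"


WORD_JOIN_CHARS = "-/"
RIGHT_ATTACHING = ")]},;:.!?%"
BRACKET_OPENERS = "([{"
# Assumption: ASCII and curly quotes count as quote characters.
QUOTES = "\"'\u201c\u201d\u2018\u2019"


def classify_character_boundary_sketch(
    prev_tail: Optional[str], curr_head: Optional[str]
) -> InlineBoundary:
    """Decide spacing from the two single characters at a boundary."""
    if not prev_tail or not curr_head:
        return InlineBoundary.UNKNOWN

    # Word/punctuation boundaries that read better with a space.
    if prev_tail in ",;:" and curr_head.isalnum():
        return InlineBoundary.SPACE
    if prev_tail == "." and (curr_head.isalnum() or curr_head == "["):
        return InlineBoundary.SPACE
    if prev_tail == ")" and curr_head == "[":
        return InlineBoundary.SPACE
    if prev_tail.isalnum() and curr_head == "&":
        return InlineBoundary.SPACE

    # Two alphanumeric characters are treated as separate words.
    if prev_tail.isalnum() and curr_head.isalnum():
        return InlineBoundary.SPACE

    # Everything below attaches without a space.
    if prev_tail in WORD_JOIN_CHARS or curr_head in WORD_JOIN_CHARS:
        return InlineBoundary.JOIN
    if curr_head in RIGHT_ATTACHING:
        return InlineBoundary.JOIN
    if prev_tail in BRACKET_OPENERS:
        return InlineBoundary.JOIN
    if prev_tail in QUOTES and curr_head in QUOTES:
        return InlineBoundary.JOIN

    return InlineBoundary.UNKNOWN
```

Because the SPACE rules run first, `.` followed by `[` gets a space even though `.` is also right-attaching when it appears as the current character.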

Tests

# when new datasets are needed
DOCLING_GEN_TEST_DATA=1 uv run pytest -q \
  test/test_serialization.py \
  test/test_plain_text_serialization.py \
  test/test_docling_doc.py

uv run pytest -q \
  test/test_serialization.py \
  test/test_plain_text_serialization.py \
  test/test_docling_doc.py \
  --cov=docling_core.transforms.serializer.common \
  --cov-report term-missing

mergify Bot commented May 7, 2026

Contributor

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

Protection Waiting on
🔴 Require two reviewer for test updates 👀 reviews
🟢 Enforce conventional commit

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

wanadzhar913 commented May 7, 2026

Contributor Author

Hi @ceberam, do review when you can. Thanks so much! Happy to just reuse the old approach in #458 where we just remove the space (" ") when joining all parts without separators. Let me know what you prefer.

github-actions Bot commented May 7, 2026

Contributor

DCO Check Passed

Thanks @wanadzhar913, all your commits are properly signed off. 🎉

PeterStaar-IBM requested a review from vagenas May 18, 2026 05:32

PeterStaar-IBM left a comment

Member

lgtm!

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 97.93388% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/transforms/serializer/common.py 97.91% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@wanadzhar913

Contributor Author

Hi everyone, thanks so much for reviewing my code. Let me know if there's anything else I can change, as I recognize this approach is much more verbose (and has more code to maintain). @vagenas

PeterStaar-IBM previously approved these changes May 28, 2026

ceberam left a comment

Member

@wanadzhar913 could you please check the conflict with test/test_serialization.py?

I have created a draft PR in the docling project that pins the latest commit of this PR and regenerates the ground truth files using your changes: docling-project/docling#3527
(Note that it is just a draft PR and the CI/CD checks complain, since it uses docling-core as a Git dependency source.)

Please review the output, since your changes will definitely have an impact on the docling repo. At first glance, many issues discussed in this PR are now resolved, but I see others that may need attention (e.g., in some cases, necessary blank spaces are now removed).

wanadzhar913 commented Jun 4, 2026

Contributor Author

> @wanadzhar913 could you please check the conflict with test/test_serialization.py?

Hi @ceberam, I'll look over the integration over the weekend.

However, what do you mean by the above? Can't seem to find any conflicts in test/test_serialization.py? Are you referring to the CodeCov report coverage percentage, or are there merge conflicts due to my branch being stale? Thanks! Nvm, I see it now! Will resolve soon!

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 78392e8 to 0a9ba2f June 4, 2026 13:50
wanadzhar913 requested a review from ceberam June 4, 2026 13:51
@wanadzhar913

Contributor Author

Whoops, sorry! Requested a review too early. Will lyk once I've fixed the main docling package integration.

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 0a9ba2f to b678d1c June 8, 2026 17:51

ceberam commented Jun 10, 2026

Member

@wanadzhar913 how is the PR progressing? Were you able to check the implications on docling repo? Let me know if you want me to update the PR docling-project/docling#3527 to reflect any changes on this PR.

ceberam commented Jun 22, 2026

Member

hi @wanadzhar913 it would be great to close this task soon. Is this PR ready to be reviewed?
Did you check the potential impact on docling's groundtruth files?
Let me know if you need any help and I can take it from there.

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 3ea9560 to 6e8828a June 22, 2026 17:13

wanadzhar913 commented Jun 22, 2026

Contributor Author

Hi @ceberam, apologies for the delay in getting back to you. Some of the changes I found that needed to be applied were quite sizeable (apologies for the +1300 line diff!). Can I trouble you to rerun the tests on docling-project/docling#3527 and see if they're satisfactory?

To make it easier, I've also added a Mermaid diagram (in the PR description) covering the full control flow. It will be quite cumbersome to maintain, but I'm happy to take suggestions on how to cut it down.

Tests I ran in docling-project/docling:

DOCLING_GEN_TEST_DATA=1 uv run pytest \
  -k "not test_gen_test_data_flag" \
  tests/test_e2e_conversion.py \
  tests/test_e2e_ocr_conversion.py \
  tests/test_backend_webp.py \
  tests/test_interfaces.py \
  tests/test_backend_csv.py \
  tests/test_backend_msword.py \
  tests/test_backend_msexcel.py \
  tests/test_backend_pptx.py \
  tests/test_backend_html.py \
  tests/test_backend_jats.py \
  tests/test_backend_vtt.py \
  tests/test_backend_xbrl.py \
  tests/test_backend_patent_uspto.py \
  tests/test_backend_markdown.py \
  tests/test_latex/test_basic.py

wanadzhar913 commented Jun 22, 2026

Contributor Author

In my previous approach, provenance tended to enforce the wrong boundary:

  • if charspans were contiguous, we joined;
  • if there was a gap, we spaced.

This broke OCR-like cases where a word is split with a false source gap, and also cases where separate words have contiguous charspans.

Hence, the very large diff mainly accounts for the examples below, where relying on provenance and TextItem alone isn't enough.

Examples:

  • OCR false gap, should join: Pars + ing with a provenance gap should serialize as Parsing.
  • OCR false join, should space: plain + text with contiguous provenance should serialize as plain text.
  • Styled single-letter prefix: bold D + ocling should serialize as Docling, but bold D + and should serialize as D and.
  • Markdown punctuation cleanup: **bold (b)** + . should serialize as **bold (b)**. without an extra space.
  • Citation/glossary spacing: hen. + [[ 3 ]] should become hen. [[ 3 ]]; *dūce* + 'diver' should become *dūce* 'diver'.
  • Adjacent links: [[ 3 ]](#cite_note-3) + [[ 4 ]](#cite_note-4) should have a space between links.
  • Sub/superscript style boundaries: H + subscript 2 + O needs spacing decisions that plain alphanumeric checks do not handle well.
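
For reference, the provenance-only rule described at the top of this comment can be sketched as below. The `(text, (start, end))` chunk shape and the charspan numbers are made up for illustration and are not taken from the test data.

```python
from typing import List, Optional, Tuple

# Hypothetical chunk shape: (text, (start, end)) with end-exclusive charspans.
Chunk = Tuple[str, Tuple[int, int]]


def join_by_provenance(chunks: List[Chunk]) -> str:
    """Naive rule: join contiguous charspans, insert a space across a gap."""
    out = ""
    prev_end: Optional[int] = None
    for text, (start, end) in chunks:
        if prev_end is not None and start > prev_end:
            out += " "  # a gap in the source offsets becomes a space
        out += text
        prev_end = end
    return out


# OCR false gap: the source offsets skip a position inside one word.
false_gap = [("Pars", (0, 4)), ("ing", (5, 8))]
# OCR false join: two separate words with touching offsets.
false_join = [("plain", (0, 5)), ("text", (5, 9))]
```

Here `join_by_provenance(false_gap)` gives `Pars ing` and `join_by_provenance(false_join)` gives `plaintext`, which are the two failure modes from the first two examples.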

ceberam commented Jun 23, 2026

Member

@wanadzhar913 no worries, we all knew this is a challenging issue.
Please fix the formatting issues, and once you (force) push I'll update the docling-project/docling#3527 PR and do my review.
A couple of tips:

  • You can avoid these CI/CD check errors by ensuring they pass locally before you push a commit. Just install pre-commit in your local repository with uv run pre-commit install
  • Don't fall down the rabbit hole trying to fix OCR issues. The objective of this PR should be a clear rule for how inline groups are serialized, so that one has full control of the spacing in DoclingDocument and can achieve human-readable output in Markdown or HTML. OCR issues will always be there, and we already have something in mind to deal with them in a post-processing step, but the DoclingDocument will always be the reference and the target of what needs to be fixed before any eventual serialization.

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
…ries/styled text handling by letting text heuristics override provenance only when it is safe

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 6e8828a to f5298f8 June 23, 2026 11:07

@wanadzhar913

Contributor Author

Hi @ceberam, I see and noted! I'll leave my OCR attempts at that, and have rerun everything with uv run pre-commit run --all-files. Looking forward to your review!

PeterStaar-IBM left a comment

Member

lgtm!

ceberam commented Jun 24, 2026

Member

> Hi @ceberam, I see and noted! I'll leave my OCR attempts at that, and have rerun everything with uv run pre-commit run --all-files. Looking forward to your review!

Thanks @wanadzhar913, I have rebased docling's docling-project/docling#3527, pinning your latest commit. Can you check if the serializations are as you expected? I had a quick look and it seems pretty good!

@wanadzhar913

Contributor Author

Yeahp, they're in line. Though upon closer inspection, I think it still missed:

  • line 508 for tests/data/groundtruth/docling_v2/wiki_duck.html.md; and
  • tests/data/groundtruth/docling_v2/docx_rich_tables_01.docx.md.

Besides that, I think it looks pretty great! Lmk how you want to proceed @ceberam.

