fix: (cont.) Remove Extra Space Before and After Group Items using Inline Boundaries by wanadzhar913 · Pull Request #605 · docling-project/docling-core

wanadzhar913 · 2026-05-07T15:12:53Z

Details

This continues the work in pull request #458, which removes the extra space before and after group items to resolve the issue raised in docling-project/docling#2745.

Resolves #371
Resolves docling-project/docling#2745

Approach

Refactors inline spacing in docling_core/transforms/serializer/common.py into a clearer decision flow centered on _classify_inline_boundary(), instead of the old approach in #458, which simply removed the space (" ") when joining all parts without separators.

Control Flow

_join_inline_parts() is the entry point. It walks adjacent inline chunks, calls _classify_inline_boundary() for each boundary condition, and inserts a space only when that classifier returns InlineBoundary.SPACE.
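The loop described above can be sketched as follows. This is a minimal illustration, not the code in common.py: the function signatures, the enum values, the skipping of empty chunks, and `toy_classifier` are assumptions made for the example. Only the member names SPACE, JOIN, and UNKNOWN are taken from this description.

```python
from enum import Enum
from typing import Callable, List


class InlineBoundary(Enum):
    # Member names follow the PR description; the values are assumptions.
    SPACE = "space"
    JOIN = "join"
    UNKNOWN = "unknown"


def join_inline_parts(
    parts: List[str],
    classify: Callable[[str, str], InlineBoundary],
) -> str:
    """Join rendered chunks, adding a space only where classify() returns SPACE."""
    out = ""
    for part in parts:
        if not part:
            continue  # assumption: empty chunks contribute no boundary
        if out and classify(out, part) is InlineBoundary.SPACE:
            out += " "
        out += part
    return out


def toy_classifier(prev_text: str, text: str) -> InlineBoundary:
    # Hypothetical stand-in for the real classifier:
    # space only between two alphanumeric characters.
    if prev_text[-1].isalnum() and text[0].isalnum():
        return InlineBoundary.SPACE
    return InlineBoundary.JOIN


result = join_inline_parts(["Hello", "world", "."], toy_classifier)
```

JOIN and UNKNOWN are treated the same way here (append directly), which matches the "Result" node at the end of the flowchart in this description.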

_classify_inline_boundary() handles boundaries in a fixed order:

Control Flow for _classify_inline_boundary()

flowchart TD
    A["_classify_inline_boundary"] --> B["Read rendered boundary chars<br/>prev_tail = prev_text[-1]<br/>curr_head = text[0]"]

    B --> C{"Already whitespace<br/>on either side?"}
    C -- "Yes" --> J1["JOIN<br/>avoid duplicate spacing"]

    C -- "No" --> D{"Missing item metadata?<br/>prev_item is None or item is None"}
    D -- "Yes" --> K["_classify_character_boundary(prev_tail, curr_head)"]

    D -- "No" --> P["_classify_provenance_boundary(prev_item, item)"]
    P --> T{"Both normal TextItem?<br/>not semantic inline atoms"}

    T -- "Yes" --> TB["_classify_text_boundary(prev_item, item)"]
    TB --> O{"Text boundary should<br/>override provenance?"}
    O -- "Yes" --> R1["Return text boundary"]
    O -- "No" --> PV

    T -- "No" --> PV{"Provenance boundary known?"}

    PV -- "SPACE" --> R2["SPACE"]
    PV -- "JOIN but cannot safely override spacing" --> R3["SPACE"]
    PV -- "JOIN allowed or UNKNOWN" --> RAW["Choose raw chars from item.text<br/>for TextItem, CodeItem, FormulaItem"]

    RAW --> S1{"Regular TextItem before<br/>semantic inline atom?"}
    S1 -- "Alnum or : ; , & before atom" --> R4["SPACE"]
    S1 -- "Otherwise" --> RC1["_classify_rendered_character_boundary"]

    RAW --> S2{"Semantic inline atom before<br/>TextItem?"}
    S2 -- "Text begins alnum, (, &, [, or quote" --> R5["SPACE"]
    S2 -- "Otherwise" --> RC2["_classify_rendered_character_boundary<br/>using rendered text chars"]

    RAW --> S3["Other cases"]
    S3 --> RC3["_classify_rendered_character_boundary"]

    RC1 --> K
    RC2 --> K
    RC3 --> K

    K --> OUT{"Result"}
    OUT -- "SPACE" --> RS["Insert a space"]
    OUT -- "JOIN or UNKNOWN" --> RJ["Append directly"]

NOTE: Inline serialization was previously making spacing decisions from text or provenance, but the real inputs have competing signals: rendered markdown, raw text, source orig, and provenance charspans can disagree. _classify_inline_boundary() now lets text heuristics override provenance only in controlled cases.

Helper Roles

Control Flow for _classify_character_boundary()

flowchart TD
    A["_classify_character_boundary(prev_tail, curr_head)"] --> B{"Missing char?"}
    B -- "Yes" --> U["UNKNOWN"]

    B -- "No" --> C{"Word punctuation boundary?"}
    C -- "comma/semicolon/colon + alnum" --> S["SPACE"]
    C -- "period + alnum or '['" --> S
    C -- "')' + '['" --> S
    C -- "alnum + '&'" --> S

    C -- "No" --> D{"Both alnum?"}
    D -- "Yes" --> S

    D -- "No" --> E{"Word join char involved?<br/>- or /"}
    E -- "Yes" --> J["JOIN"]

    E -- "No" --> F{"Current char is right-attaching?<br/>)]},;:.!?%"}
    F -- "Yes" --> J

    F -- "No" --> G{"Previous char is bracket opener?<br/>( [ {"}
    G -- "Yes" --> J

    G -- "No" --> H{"Quote next to quote?"}
    H -- "Yes" --> J
    H -- "No" --> U
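For readers who prefer code to diagrams, the character-level flowchart above can be transcribed roughly as below. This is a sketch written from the diagram, not copied from common.py: the constant names, the exact set of quote characters, and the enum values are assumptions.

```python
from enum import Enum
from typing import Optional


class InlineBoundary(Enum):
    SPACE = "space"
    JOIN = "join"
    UNKNOWN = "unknown"


WORD_JOINERS = set("-/")
RIGHT_ATTACHING = set(")]},;:.!?%")
BRACKET_OPENERS = set("([{")
QUOTES = set("\"'")  # assumption: the diagram does not list the quote characters


def classify_character_boundary(
    prev_tail: Optional[str], curr_head: Optional[str]
) -> InlineBoundary:
    """Decide spacing from the two characters that meet at a boundary."""
    if not prev_tail or not curr_head:
        return InlineBoundary.UNKNOWN

    # Word/punctuation boundaries that need a space.
    if prev_tail in ",;:" and curr_head.isalnum():
        return InlineBoundary.SPACE
    if prev_tail == "." and (curr_head.isalnum() or curr_head == "["):
        return InlineBoundary.SPACE
    if prev_tail == ")" and curr_head == "[":
        return InlineBoundary.SPACE
    if prev_tail.isalnum() and curr_head == "&":
        return InlineBoundary.SPACE

    # Two word characters meeting are treated as separate words.
    if prev_tail.isalnum() and curr_head.isalnum():
        return InlineBoundary.SPACE

    # Cases that attach without a space.
    if prev_tail in WORD_JOINERS or curr_head in WORD_JOINERS:
        return InlineBoundary.JOIN
    if curr_head in RIGHT_ATTACHING:
        return InlineBoundary.JOIN
    if prev_tail in BRACKET_OPENERS:
        return InlineBoundary.JOIN
    if prev_tail in QUOTES and curr_head in QUOTES:
        return InlineBoundary.JOIN

    return InlineBoundary.UNKNOWN
```

The order of the checks matters: the ")" + "[" rule has to run before the right-attaching rule, and the "both alnum" rule has to run before the word-joiner rule, which is why the sketch follows the diagram top to bottom.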

Tests

# when new datasets are needed
DOCLING_GEN_TEST_DATA=1 uv run pytest -q \
  test/test_serialization.py \
  test/test_plain_text_serialization.py \
  test/test_docling_doc.py

uv run pytest -q \
  test/test_serialization.py \
  test/test_plain_text_serialization.py \
  test/test_docling_doc.py \
  --cov=docling_core.transforms.serializer.common \
  --cov-report term-missing

mergify · 2026-05-07T15:13:30Z

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

	Protection	Waiting on
🔴	Require two reviewer for test updates	👀 reviews
🟢	Enforce conventional commit	—

🔴 Require two reviewer for test updates

Waiting for

#approved-reviews-by >= 2

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
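To check a title against this rule locally before opening a PR, something like the following should work. It is only a sketch: it assumes the rule's regular expression behaves the same under Python's `re` module, and the helper name `follows_conventional_commit` is made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commit(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.match(title) is not None


this_pr = (
    "fix: (cont.) Remove Extra Space Before and After Group Items "
    "using Inline Boundaries"
)
ok_this_pr = follows_conventional_commit(this_pr)
ok_scoped = follows_conventional_commit("feat(serializer)!: change inline spacing")
ok_plain = follows_conventional_commit("Remove extra space before group items")
```

The rule only constrains the prefix (type, optional scope in parentheses, optional "!", then a colon), so the "(cont.)" after the colon in this PR's title does not affect the match.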

wanadzhar913 · 2026-05-07T15:14:05Z

Hi @ceberam, do review when you can. Thanks so much! Happy to just reuse the old approach in #458 where we just remove the space (" ") when joining all parts without separators. Let me know what you prefer.

github-actions · 2026-05-07T15:14:11Z

✅ DCO Check Passed

Thanks @wanadzhar913, all your commits are properly signed off. 🎉

PeterStaar-IBM

lgtm!

codecov · 2026-05-18T05:36:10Z

Codecov Report

❌ Patch coverage is 97.93388% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling_core/transforms/serializer/common.py	97.91%	5 Missing ⚠️

wanadzhar913 · 2026-05-25T02:59:48Z

Hi everyone, thanks so much for reviewing my code. Let me know if there's anything else I can change, as I recognize this approach is much more verbose (and has more code to maintain). @vagenas

ceberam

@wanadzhar913 could you please check the conflict with test/test_serialization.py?

I have created a draft PR on docling project that pins the latest commit of this PR and regenerates the ground truth files using the changes of your PR: docling-project/docling#3527
(Note that it is just a draft PR and the CI/CD checks complain, since it is using docling-core as a Git dependency source)

Please review the output, since your changes will definitely have an impact on the docling repo. At first glance, many of the issues discussed in this PR are now resolved, but I see others that may need attention (e.g., in some cases, necessary blank spaces are now removed).

wanadzhar913 · 2026-06-04T01:44:51Z

> @wanadzhar913 could you please check the conflict with test/test_serialization.py?

Hi @ceberam, I'll look over the integration over the weekend.

However, what do you mean by the above? Can't seem to find any conflicts in test/test_serialization.py? Are you referring to the CodeCov report coverage percentage, or are there merge conflicts due to my branch being stale? Thanks! Nvm, I see it now! Will resolve soon!

wanadzhar913 · 2026-06-04T14:02:39Z

Whoops, sorry! Requested a review too early. Will lyk once I've fixed the main docling package integration.

ceberam · 2026-06-10T07:41:53Z

@wanadzhar913 how is the PR progressing? Were you able to check the implications on docling repo? Let me know if you want me to update the PR docling-project/docling#3527 to reflect any changes on this PR.

ceberam · 2026-06-22T14:51:11Z

hi @wanadzhar913 it would be great to close this task soon. Is this PR ready to be reviewed?
Did you check the potential impact on docling's groundtruth files?
Let me know if you need any help and I can take it from there.

wanadzhar913 · 2026-06-22T18:20:45Z

Hi @ceberam, apologies for the delay in getting back to you. Some of the changes I found that needed to be applied were quite sizeable (apologies for the +1300 line diff!). Can I trouble you to rerun the tests on docling-project/docling#3527 and see if they're satisfactory?

To make it easier, I've also added a Mermaid diagram (in the PR description) covering the full control flow. This will be quite cumbersome to maintain, but I'm happy to take suggestions on how to cut it down.

Tests I ran in docling-project/docling:

DOCLING_GEN_TEST_DATA=1 uv run pytest \
  -k "not test_gen_test_data_flag" \
  tests/test_e2e_conversion.py \
  tests/test_e2e_ocr_conversion.py \
  tests/test_backend_webp.py \
  tests/test_interfaces.py \
  tests/test_backend_csv.py \
  tests/test_backend_msword.py \
  tests/test_backend_msexcel.py \
  tests/test_backend_pptx.py \
  tests/test_backend_html.py \
  tests/test_backend_jats.py \
  tests/test_backend_vtt.py \
  tests/test_backend_xbrl.py \
  tests/test_backend_patent_uspto.py \
  tests/test_backend_markdown.py \
  tests/test_latex/test_basic.py

wanadzhar913 · 2026-06-22T18:27:50Z

In my previous approach, provenance tended to enforce the wrong boundary:

if charspans were contiguous, we joined;
if there was a gap, we spaced.

This broke OCR-like cases where a word is split with a false source gap, and also cases where separate words have contiguous charspans.

Hence the very large diff: it mainly accounts for the examples below, where relying on provenance and TextItem text alone isn't enough.

Examples:

OCR false gap, should join: Pars + ing with a provenance gap should serialize as Parsing.
OCR false join, should space: plain + text with contiguous provenance should serialize as plain text.
Styled single-letter prefix: bold D + ocling should serialize as Docling, but bold D + and should serialize as D and.
Markdown punctuation cleanup: **bold (b)** + . should serialize as **bold (b)**. without an extra space.
Citation/glossary spacing: hen. + [[ 3 ]] should become hen. [[ 3 ]]; *dūce* + 'diver' should become *dūce* 'diver'.
Adjacent links: [[ 3 ]](#cite_note-3) + [[ 4 ]](#cite_note-4) should have a space between links.
Sub/superscript style boundaries: H + subscript 2 + O needs spacing decisions that plain alphanumeric checks do not handle well.
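The first two examples can be reproduced with a sketch of the earlier provenance-only rule (contiguous charspans join, a gap inserts a space). The charspan offsets, the `(start, end)` tuple layout with an exclusive end, and the helper names are invented for illustration and are not taken from the docling-core code.

```python
from typing import Tuple

# Assumption: a charspan is (start, end) with an exclusive end offset.
Charspan = Tuple[int, int]


def provenance_only_separator(prev_span: Charspan, span: Charspan) -> str:
    """Old rule: join when the spans touch, otherwise insert a space."""
    return "" if prev_span[1] == span[0] else " "


def join_two(a: str, a_span: Charspan, b: str, b_span: Charspan) -> str:
    return a + provenance_only_separator(a_span, b_span) + b


# OCR false gap: one word split in two, with a spurious one-character gap.
false_gap = join_two("Pars", (0, 4), "ing", (5, 8))

# OCR false join: two separate words whose charspans happen to be contiguous.
false_join = join_two("plain", (0, 5), "text", (5, 9))
```

With these made-up offsets the rule yields "Pars ing" and "plaintext", the opposite of the desired "Parsing" and "plain text", which is why provenance alone cannot decide the boundary.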

ceberam · 2026-06-23T08:17:57Z

@wanadzhar913 no worries, we all knew this is a challenging issue.
Please, fix the formatting issues and once you (force) push I'll update the docling-project/docling#3527 PR and do my review.
A couple of tips:

You can avoid these CI/CD check errors by ensuring they pass locally before you push a commit. Just install ~~prek~~ pre-commit in your local repository with ~~uv run prek install~~ uv run pre-commit install
Don't fall down the rabbit hole trying to fix OCR issues. The objective of this PR should be to have a clear rule on how inline groups should be serialized, so that one has full control of the spacing in DoclingDocument to achieve a human-readable output in markdown or HTML. OCR issues will always be there and we already have something in mind to deal with them in a post-processing step, but the DoclingDocument will always be the reference and the goal of what needs to be fixed, before any eventual serialization.

wanadzhar913 · 2026-06-23T11:10:13Z

Hi @ceberam, I see and noted! I'll leave my OCR attempts at that, and have rerun everything with uv run pre-commit run --all-files. Looking forward to your review!

PeterStaar-IBM

lgtm!

ceberam · 2026-06-24T12:02:11Z

> Hi @ceberam, I see and noted! I'll leave my OCR attempts at that, and have rerun everything with uv run pre-commit run --all-files. Looking forward to your review!

Thanks @wanadzhar913, I have rebased docling's docling-project/docling#3527, pinning your latest commit. Can you check if the serializations are as you expected? I had a quick look and it seems pretty good!

wanadzhar913 · 2026-06-24T15:31:10Z

Yeahp, they're in line. Though upon closer inspection, I think it still missed:

line 508 for tests/data/groundtruth/docling_v2/wiki_duck.html.md; and
tests/data/groundtruth/docling_v2/docx_rich_tables_01.docx.md.

Besides that, I think it looks pretty great! Lmk how you want to proceed @ceberam.

PeterStaar-IBM requested a review from vagenas May 18, 2026 05:32

PeterStaar-IBM approved these changes May 18, 2026

PeterStaar-IBM previously approved these changes May 28, 2026

ceberam mentioned this pull request Jun 2, 2026

fix: remove extra space before and after group items docling-project/docling#3527

Draft

3 tasks

ceberam requested changes Jun 3, 2026

wanadzhar913 dismissed PeterStaar-IBM’s stale review via 0a9ba2f June 4, 2026 13:50

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 78392e8 to 0a9ba2f Compare June 4, 2026 13:50

wanadzhar913 requested a review from ceberam June 4, 2026 13:51

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 0a9ba2f to b678d1c Compare June 8, 2026 17:51

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 3ea9560 to 6e8828a Compare June 22, 2026 17:13

wanadzhar913 requested a review from PeterStaar-IBM June 22, 2026 18:48

wanadzhar913 added 3 commits June 23, 2026 19:06

fix: merge conflicts

a23bca0

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

fix: account for noisy OCR layout extraction/markdown specific bounda…

6f0ba5d

…ries/styled text handling by letting text heuristics override provenance only when it is safe Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

fix: formatting issues

f5298f8

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>

wanadzhar913 force-pushed the bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems branch from 6e8828a to f5298f8 Compare June 23, 2026 11:07

PeterStaar-IBM approved these changes Jun 24, 2026
