feat: add HTML pre-heading content layer option #3719

Open
xdong99 wants to merge 2 commits into docling-project:main from xdong99:codex/html-pre-heading-content-layer

@xdong99 xdong99 commented Jun 27, 2026

Summary

  • Add HTMLBackendOptions.pre_heading_content_layer so callers can choose the content layer assigned to HTML content before the first heading.
  • Keep the existing inferred furniture behavior as the default when the option is unset.
  • Add lightweight HTML backend tests covering the default behavior and explicit ContentLayer.BODY override.

Why

For HTML fragments used in RAG pipelines, content before the first heading may be semantically part of the body. The new option lets callers preserve that content as body text without disabling the existing default furniture inference for everyone.
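
The intended semantics can be illustrated with a small standard-library sketch. This is not the docling implementation: the `ContentLayer` enum below is a stand-in with only two values, and `tag_layers` and `_LayerTagger` are hypothetical names. It assumes that, by default, text before the first heading is treated as furniture and that everything from the first heading onward is body.

```python
from enum import Enum
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ContentLayer(str, Enum):
    # Stand-in for docling's ContentLayer; only the two values discussed here.
    BODY = "body"
    FURNITURE = "furniture"


HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _LayerTagger(HTMLParser):
    """Hypothetical helper: tags each text node with a content layer."""

    def __init__(self, pre_heading_content_layer: Optional[ContentLayer] = None):
        super().__init__()
        # Option unset -> keep the assumed default (furniture before first heading).
        self.layer = pre_heading_content_layer or ContentLayer.FURNITURE
        self.items: List[Tuple[str, ContentLayer]] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # From the first heading onward, content is treated as body.
            self.layer = ContentLayer.BODY

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((text, self.layer))


def tag_layers(
    html: str, pre_heading_content_layer: Optional[ContentLayer] = None
) -> List[Tuple[str, ContentLayer]]:
    parser = _LayerTagger(pre_heading_content_layer)
    parser.feed(html)
    parser.close()
    return parser.items


fragment = "<p>Intro text</p><h1>Title</h1><p>Details</p>"
default_items = tag_layers(fragment)                      # option unset
override_items = tag_layers(fragment, ContentLayer.BODY)  # explicit override
```

With the option unset, "Intro text" is tagged as furniture. With the explicit `ContentLayer.BODY` override it is tagged as body, which is the case the RAG use described above relies on. Content after the heading is body in both runs.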

Fixes #2487.

Checks

  • python -m pytest tests/test_backend_html_content_layer.py -q
  • python -m ruff check docling/datamodel/backend_options.py docling/backend/html_backend.py tests/test_backend_html_content_layer.py
  • python -m ruff format --check docling/datamodel/backend_options.py docling/backend/html_backend.py tests/test_backend_html_content_layer.py

github-actions Bot commented Jun 27, 2026

Contributor

DCO Check Passed

Thanks @xdong99, all your commits are properly signed off. 🎉

mergify Bot commented Jun 27, 2026

Contributor

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

I, xdong94 <xdong94@wisc.edu>, hereby add my Signed-off-by to this commit: 0d69847

Signed-off-by: xdong94 <xdong94@wisc.edu>
@xdong99 xdong99 changed the title Add HTML pre-heading content layer option feat: add HTML pre-heading content layer option Jun 27, 2026
@xdong99 xdong99 marked this pull request as ready for review June 27, 2026 06:58

Development

Successfully merging this pull request may close these issues.

Option to set the default ContentLayer in the Body for html_backend

1 participant