feat: split the monolithic document.py into focused modules #664

Open
PeterStaar-IBM wants to merge 1 commit into main from dev/refactor-code-layout

PeterStaar-IBM commented Jun 28, 2026

Member

Summary

docling_core/types/doc/document.py had grown to 8164 lines holding ~60
classes — DoclingDocument plus every *Item/*Group type and all their
supporting models. This PR splits it into a clear module tree so that
document.py now contains only DoclingDocument (5446 lines, almost all of
which is that one class), while each item family lives in its own file.

No behavioural changes — this is a pure structural refactor. The public API and
all import paths are preserved.

Motivation

  • document.py was too large and mixed many unrelated concerns.
  • Goal: one file per *Item/*Group family (e.g. table classes together,
    picture classes together), each co-located with the data models it owns.

New layout

  docling_core/types/doc/
  ├── document.py            # DoclingDocument only
  ├── doctags.py             # DocTagsPage, DocTagsDocument
  ├── common/                # cross-cutting value models
  │   ├── scalars.py         # Uint64, LevelNumber, CharSpan
  │   ├── constants.py       # CURRENT_VERSION, *_EXPORT_LABELS
  │   ├── content_layer.py   # ContentLayer
  │   ├── formatting.py      # Formatting, Script
  │   ├── annotations.py     # BaseAnnotation, Description/MiscAnnotation
  │   ├── reference.py       # RefItem, FineRef, ImageRef, ProvenanceItem
  │   ├── origin.py          # DocumentOrigin, BaseSource, TrackSource, SourceType
  │   ├── meta.py            # BaseMeta + all *MetaField + MetaUtils
  │   └── page_item.py       # PageItem
  └── items/                 # document-tree node types
      ├── node.py            # NodeItem, DocItem, FloatingItem (bases)
      ├── group.py           # GroupItem, ListGroup, OrderedList, InlineGroup
      ├── text.py            # TextItem, Title/SectionHeader/List/Formula
      ├── code.py            # CodeItem
      ├── key_value.py       # GraphCell/Link/Data, KeyValueItem, FormItem
      ├── form.py            # Field{Region,Heading,,Value}Item
      ├── content.py         # ContentItem union
      ├── picture/           # charts, classification, molecule, picture (PictureItem)
      └── table/             # table_data (TableData/cells), table (TableItem)

Design notes

  • Acyclic by construction. The model stores tree children as RefItem
    (not embedded item types) and references DoclingDocument only via
    TYPE_CHECKING string forward-refs in method signatures, so concrete item
    modules sit cleanly below DoclingDocument with no import cycles.
  • TableData lives in items/table/table_data.py as a leaf module (it depends
    only on common/base) because three models in different modules use it as a
    field: TableItem, PictureTabularChartData and TabularChartMetaField.
    Splitting table/ into table_data.py + table.py (mirroring picture/) lets
    meta import TableData without importing TableItem, which avoids a
    meta → table → node → meta cycle.
  • Subpackage __init__.py files are intentionally empty to keep the import graph
    acyclic; re-exports live in document.py and the package __init__.py.
  • All intra-package imports are absolute.
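
The first design note can be illustrated with a minimal sketch. The classes below (`Ref`, `Node`, `FakeDoc`) and the `lookup` method are simplified, hypothetical stand-ins, not the real docling-core models; only the module path in the `TYPE_CHECKING` import is taken from the layout above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers such as mypy. At runtime this branch is
    # skipped, so the item module never imports document.py.
    from docling_core.types.doc.document import DoclingDocument


@dataclass
class Ref:
    """Hypothetical stand-in for a reference to another node."""

    cref: str


@dataclass
class Node:
    """Hypothetical stand-in for a tree node that stores children as references."""

    self_ref: str
    children: list[Ref] = field(default_factory=list)

    def resolve_children(self, doc: "DoclingDocument") -> list[Any]:
        # `doc.lookup` is an assumed method name used for illustration only.
        return [doc.lookup(child.cref) for child in self.children]


class FakeDoc:
    """Tiny stand-in document so the sketch runs without docling-core installed."""

    def __init__(self, nodes: dict[str, Any]) -> None:
        self.nodes = nodes

    def lookup(self, cref: str) -> Any:
        return self.nodes[cref]


root = Node(self_ref="#/body", children=[Ref("#/texts/0")])
resolved = root.resolve_children(FakeDoc({"#/texts/0": "hello"}))
```

Because children are stored as references rather than embedded objects, the node module needs the document type only for annotations, which is why the import can stay behind `TYPE_CHECKING`.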

Backward compatibility

  • from docling_core.types.doc import X — unchanged.
  • from docling_core.types.doc.document import X — still works for every moved
    name via a re-export block in document.py.
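
One way to spot-check the compatibility claim is to confirm that a name resolves to the identical object through both the old and the new import path. The helper below is a hypothetical sketch (`same_object` is not part of docling-core); it is demonstrated on a standard-library re-export so it runs without this branch installed.

```python
import importlib


def same_object(name: str, old_module: str, new_module: str) -> bool:
    """Return True if `name` resolves to the identical object via both import paths."""
    old = getattr(importlib.import_module(old_module), name)
    new = getattr(importlib.import_module(new_module), name)
    return old is new


# Demonstrated on a stdlib re-export so the sketch runs anywhere:
# json/__init__.py re-exports JSONDecodeError from json.decoder.
stdlib_ok = same_object("JSONDecodeError", "json", "json.decoder")

# On this PR's branch the analogous check would be (module path inferred
# from the layout above, not verified here):
# same_object("TableItem",
#             "docling_core.types.doc.document",
#             "docling_core.types.doc.items.table.table")
```

An identity check (`is`) is stricter than comparing names: it would also catch a class that was accidentally duplicated rather than re-exported.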

__init__.py completeness

docling_core/types/doc/__init__.py was regenerated (via an AST scan, not by
hand) to export every public class and type alias in the package — now
160 names across 28 modules (was 108). Verified invariants:

  • every public class/alias defined in any module is exported (no omissions);
  • no previously-exported name was dropped (no regressions).
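
The actual generator script is not shown in this PR. As a rough sketch of how such an AST scan and the two invariants could work, assuming that "public" means a top-level name without a leading underscore and that type aliases are detected heuristically (a `TypeAlias` annotation or a subscripted assignment such as `Union[...]`):

```python
from __future__ import annotations

import ast


def public_names(source: str) -> list[str]:
    """Collect public top-level classes and (heuristically) type aliases."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            candidate = node.name
        elif (
            isinstance(node, ast.AnnAssign)
            and isinstance(node.target, ast.Name)
            and isinstance(node.annotation, ast.Name)
            and node.annotation.id == "TypeAlias"
        ):
            candidate = node.target.id  # e.g. `X: TypeAlias = int`
        elif (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Subscript)
        ):
            candidate = node.targets[0].id  # e.g. `X = Union[A, B]`
        else:
            continue
        if not candidate.startswith("_"):
            names.append(candidate)
    return sorted(names)


def check_exports(
    found: set[str], exported: set[str], previous: set[str]
) -> tuple[set[str], set[str]]:
    """Return (omissions, regressions); both should be empty."""
    return found - exported, previous - exported


SAMPLE = '''
from typing import TypeAlias, Union

class TableItem: ...
class _Helper: ...
ContentItem = Union[TableItem, int]
LevelNumber: TypeAlias = int
CURRENT_VERSION = "1.0.0"
def export(): ...
'''

found = public_names(SAMPLE)
omissions, regressions = check_exports(
    set(found), {"TableItem", "ContentItem"}, {"TableItem", "OldName"}
)
```

In the sample, `LevelNumber` is reported as an omission and `OldName` as a regression, which mirrors the two invariants listed above.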

Verification

  • pytest: 542 passed, 6 skipped (unchanged from baseline)
  • ruff check: clean across the repo
  • mypy: Success: no issues found in 35 source files

Stats

  • document.py: 8164 → 5446 lines
  • 23 new modules; 29 files changed (+3345 / −2859)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
@github-actions

Contributor

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

mergify Bot commented Jun 28, 2026

Contributor

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
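
The title rule can be tried locally with a small sketch. Mergify evaluates the condition with its own engine, so Python's `re` module is used here only as an approximation; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


this_pr = is_conventional(
    "feat: split the monolithic document.py into focused modules"
)
```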
