feat: split the monolithic document.py into focused modules #664
Open
PeterStaar-IBM wants to merge 1 commit into
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
✅ DCO Check Passed. Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉
Merge Protections: 🟢 Merge protection satisfied, ready to merge. 🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Summary
`docling_core/types/doc/document.py` had grown to 8164 lines holding ~60 classes: `DoclingDocument` plus every `*Item`/`*Group` type and all their supporting models. This PR splits it into a clear module tree so that `document.py` now contains only `DoclingDocument` (5446 lines, almost all of which is that one class), while each item family lives in its own file.
No behavioural changes — this is a pure structural refactor. The public API and
all import paths are preserved.
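
The claim that import paths are preserved can be spot-checked by confirming that a name imported from the old module path is the very same object as the one imported from its new home. The sketch below is only an illustration: `demo_pkg`, its submodules, and the helper `reexport_is_intact` are hypothetical stand-ins built in memory, not part of `docling_core`.

```python
import importlib
import sys
import types


def reexport_is_intact(old_module: str, new_module: str, name: str) -> bool:
    """Return True if `name` resolves to the same object via both module paths."""
    missing = object()  # sentinel so that "absent in both" is not a match
    old = getattr(importlib.import_module(old_module), name, missing)
    new = getattr(importlib.import_module(new_module), name, missing)
    return old is not missing and old is new


# Hypothetical in-memory stand-ins so the sketch runs without docling_core.
items = types.ModuleType("demo_pkg.items")
items.TableItem = type("TableItem", (), {"__module__": "demo_pkg.items"})

document = types.ModuleType("demo_pkg.document")
document.TableItem = items.TableItem  # the re-export kept in the old module

package = types.ModuleType("demo_pkg")
package.__path__ = []  # an empty __path__ marks the module as a package

sys.modules.update(
    {"demo_pkg": package, "demo_pkg.items": items, "demo_pkg.document": document}
)

print(reexport_is_intact("demo_pkg.document", "demo_pkg.items", "TableItem"))  # True
```

Against the real package, the same helper could presumably be called with the actual old and new module paths for each moved name; those paths are not spelled out here, so none are assumed.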
Motivation
- `document.py` was too large and mixed many unrelated concerns.
- One module per `*Item`/`*Group` family (e.g. table classes together, picture classes together), each co-located with the data models it owns.
New layout
Design notes
- `RefItem` (not embedded item types) and references `DoclingDocument` only via `TYPE_CHECKING` string forward-refs in method signatures, so concrete item modules sit cleanly below `DoclingDocument` with no import cycles.
- `TableData` lives in `items/table/table_data.py` as a leaf (depends only on `common/base`), because it's a field on `TableItem`, `PictureTabularChartData` and `TabularChartMetaField`. Splitting `table/` into `table_data.py` + `table.py` (mirroring `picture/`) avoids a `meta → table → node → meta` cycle.
- `__init__.py` files are intentionally empty to keep the import graph acyclic; re-exports live in `document.py` and the package `__init__.py`.
Backward compatibility
- `from docling_core.types.doc import X`: unchanged.
- `from docling_core.types.doc.document import X`: still works for every moved name via a re-export block in `document.py`.
`__init__.py` completeness
`docling_core/types/doc/__init__.py` was regenerated (via an AST scan, not by hand) to export every public class and type alias in the package: now 160 names across 28 modules (was 108). Verified invariants:
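
The script used for the AST scan is not included in this description. As a rough sketch of the idea only, the hypothetical helper `public_names` below collects top-level class names and top-level assignment targets from a source string and drops anything starting with an underscore. Treating every top-level assignment as a type alias is a simplifying assumption; the actual scan may filter more carefully and walks files rather than strings.

```python
import ast


def public_names(source: str) -> list[str]:
    """Collect public top-level class names and assignment targets from source."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            names.append(node.name)
        elif isinstance(node, ast.Assign):
            # e.g. `ContentItem = Union[...]`; only plain-name targets count
            names.extend(t.id for t in node.targets if isinstance(t, ast.Name))
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # e.g. `ContentItem: TypeAlias = Union[...]`
            names.append(node.target.id)
    return sorted(n for n in names if not n.startswith("_"))


SAMPLE = '''
from typing import Union

class TableItem: ...
class _Private: ...
ContentItem = Union[TableItem, int]
_cache = {}
'''

print(public_names(SAMPLE))  # ['ContentItem', 'TableItem']
```

Because the scan parses source text instead of importing modules, it cannot trigger an import cycle while it builds the export list.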
Verification
- `pytest`: 542 passed, 6 skipped (unchanged from baseline)
- `ruff check`: clean across the repo
- `mypy`: Success: no issues found in 35 source files
Stats
- `document.py`: 8164 → 5446 lines
- Lines changed: +3345 / −2859
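
For readers less familiar with the `TYPE_CHECKING` string forward-ref pattern named in the design notes, here is a minimal sketch. The module path `hypothetical_pkg.document` and the classes `Document` and `Item` are illustrative assumptions, not the actual docling-core code.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers such as mypy. At runtime TYPE_CHECKING is
    # False, so this import never executes and cannot create an import cycle.
    from hypothetical_pkg.document import Document  # hypothetical module path


class Item:
    """Stand-in for a concrete item class that lives below the document module."""

    def __init__(self, text: str) -> None:
        self.text = text

    def export(self, doc: "Document") -> str:
        # The annotation is a plain string, so it is not evaluated at runtime.
        return f"{doc.name}: {self.text}"


# Any object with a `name` attribute works at runtime in this sketch.
fake_doc = SimpleNamespace(name="report")
print(Item("hello").export(fake_doc))  # report: hello
```

The item module therefore loads without importing the document module, while a type checker still resolves `Document` in the signature.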