Fix/toc part metadata by ghedo44 · Pull Request #737 · dgunning/edgartools

ghedo44 · 2026-03-26T11:04:37Z

Closes #736

Fix MSFT 10-K TOC metadata mapping: propagate PART context from standalone TOC rows and support display-style section names so section.item/section.part are populated consistently.

dgunning

Review

The parse_section_name regex generalization is clean and correct. The _infer_part_from_row_context approach is right in principle — walking backward through <tr> siblings to find standalone PART headers is a solid way to handle this TOC pattern.

However, there's one issue that needs fixing before merge:

`text_content()` on `<tr>` concatenates all cells without spaces

The regex ^\s*PART\s+([IVX]+)\b uses \b (word boundary) after the roman numeral. Real TOC rows often have page numbers in adjacent cells:

<tr><td>PART I</td><td>3</td></tr>

text_content() → "PART I3". The \b between I and 3 won't match because both are \w characters. This would silently fail to detect the part header.

The test uses clean single-cell rows that don't exercise this case. A fix could be to check <td> cell text individually rather than the whole row's text_content(), or adjust the regex.

Minor notes

The backward sibling search in _infer_part_from_row_context is unbounded (walks all previous siblings). Consider adding a limit for consistency with the upward traversal limit of 10 in the same method.
Test anchors in test_toc_analyzer_part_context.py all point to #i1 — would be cleaner with distinct anchors per item.
No integration test with the actual MSFT filing that motivated this. Worth verifying the fix works end-to-end on the real filing.

ghedo44 · 2026-03-30T08:50:02Z

Thanks for the detailed review. All requested items are addressed in 850cf7d.

Fixed PART detection for multi-cell TOC rows: Now checks each td/th cell individually instead of relying on tr text content, correctly handling cases like PART I | 3
Added bounded backward sibling scan: The backward tr walk is now capped with a row limit to prevent unbounded traversal
Removed duplicated part parsing logic: Introduced a shared part-context helper reused across both detection paths
Cleaned test anchors and added missing coverage: Updated test HTML with distinct anchors per item, added a test for PART rows with page-number cells, and added end-to-end verification against the real MSFT 10-K fixture

dgunning

Thanks a lot for your changes. This PR is approved and will be included in the next release.

Dwight

Features: FilingViewer (SEC Interactive Data Viewer), ConceptGraph (XBRL knowledge graph), BDC non-accrual extraction, to_markdown() for LLM drill-down, compare_context() cross-validation, MetaLinks.json parser. Fixes: iXBRL abbreviation spacing (#734), TOC part metadata (#737), 533 ruff code quality issues including LinkBlock f-string bug (#740). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ghedo44 added 2 commits March 26, 2026 11:40

Fix TOC section item/part metadata mapping

8708cb5

refactor: remove debug print statement for part header detection

af904f3

dgunning reviewed Mar 27, 2026

View reviewed changes

Fix TOC part header detection edge cases

850cf7d

dgunning approved these changes Mar 30, 2026

View reviewed changes

dgunning merged commit 4125faf into dgunning:main Mar 30, 2026
5 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/toc part metadata#737

Fix/toc part metadata#737
dgunning merged 3 commits intodgunning:mainfrom
ghedo44:fix/toc-part-metadata

ghedo44 commented Mar 26, 2026

Uh oh!

dgunning left a comment

Uh oh!

ghedo44 commented Mar 30, 2026

Uh oh!

dgunning left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ghedo44 commented Mar 26, 2026

Uh oh!

dgunning left a comment

Choose a reason for hiding this comment

Review

text_content() on <tr> concatenates all cells without spaces

Minor notes

Uh oh!

ghedo44 commented Mar 30, 2026

Uh oh!

dgunning left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`text_content()` on `<tr>` concatenates all cells without spaces