Skip to content

Fix/toc part metadata#737

Merged
dgunning merged 3 commits intodgunning:mainfrom
ghedo44:fix/toc-part-metadata
Mar 30, 2026
Merged

Fix/toc part metadata#737
dgunning merged 3 commits intodgunning:mainfrom
ghedo44:fix/toc-part-metadata

Conversation

@ghedo44
Copy link
Copy Markdown
Contributor

@ghedo44 ghedo44 commented Mar 26, 2026

Closes #736

Fix MSFT 10-K TOC metadata mapping: propagate PART context from standalone TOC rows and support display-style section names so section.item/section.part are populated consistently.

Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The parse_section_name regex generalization is clean and correct. The _infer_part_from_row_context approach is right in principle — walking backward through <tr> siblings to find standalone PART headers is a solid way to handle this TOC pattern.

However, there's one issue that needs fixing before merge:

text_content() on <tr> concatenates all cells without spaces

The regex ^\s*PART\s+([IVX]+)\b uses \b (word boundary) after the roman numeral. Real TOC rows often have page numbers in adjacent cells:

<tr><td>PART I</td><td>3</td></tr>

text_content()"PART I3". The \b between I and 3 won't match because both are \w characters. This would silently fail to detect the part header.

The test uses clean single-cell rows that don't exercise this case. A fix could be to check <td> cell text individually rather than the whole row's text_content(), or adjust the regex.

Minor notes

  • The backward sibling search in _infer_part_from_row_context is unbounded (walks all previous siblings). Consider adding a limit for consistency with the upward traversal limit of 10 in the same method.
  • Test anchors in test_toc_analyzer_part_context.py all point to #i1 — would be cleaner with distinct anchors per item.
  • No integration test with the actual MSFT filing that motivated this. Worth verifying the fix works end-to-end on the real filing.

@ghedo44
Copy link
Copy Markdown
Contributor Author

ghedo44 commented Mar 30, 2026

Thanks for the detailed review. All requested items are addressed in 850cf7d.

  • Fixed PART detection for multi-cell TOC rows: Now checks each td/th cell individually instead of relying on tr text content, correctly handling cases like PART I | 3
  • Added bounded backward sibling scan: The backward tr walk is now capped with a row limit to prevent unbounded traversal
  • Removed duplicated part parsing logic: Introduced a shared part-context helper reused across both detection paths
  • Cleaned test anchors and added missing coverage: Updated test HTML with distinct anchors per item, added a test for PART rows with page-number cells, and added end-to-end verification against the real MSFT 10-K fixture

Copy link
Copy Markdown
Owner

@dgunning dgunning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your changes. This PR is approved and will be included in the next release.

Dwight

@dgunning dgunning merged commit 4125faf into dgunning:main Mar 30, 2026
5 of 6 checks passed
dgunning added a commit that referenced this pull request Mar 30, 2026
Features: FilingViewer (SEC Interactive Data Viewer), ConceptGraph
(XBRL knowledge graph), BDC non-accrual extraction, to_markdown() for
LLM drill-down, compare_context() cross-validation, MetaLinks.json parser.

Fixes: iXBRL abbreviation spacing (#734), TOC part metadata (#737),
533 ruff code quality issues including LinkBlock f-string bug (#740).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: section.item and section.part missing in some 10-K filings

2 participants