dgunning
left a comment
Review
The parse_section_name regex generalization is clean and correct. The _infer_part_from_row_context approach is right in principle — walking backward through <tr> siblings to find standalone PART headers is a solid way to handle this TOC pattern.
However, there's one issue that needs fixing before merge:
text_content() on <tr> concatenates all cells without spaces
The regex ^\s*PART\s+([IVX]+)\b uses \b (word boundary) after the roman numeral. Real TOC rows often have page numbers in adjacent cells:
<tr><td>PART I</td><td>3</td></tr>
For this row, text_content() returns "PART I3". The \b between I and 3 won't match because both are \w characters, so the part header silently goes undetected.
The test uses clean single-cell rows that don't exercise this case. A fix could be to check <td> cell text individually rather than the whole row's text_content(), or adjust the regex.
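A minimal sketch of the failure and the per-cell fix, using only the standard library. The regex is the one quoted above; the helper names part_from_row_text and part_from_cells are hypothetical, and joining the cell strings with no separator is an assumption standing in for what text_content() returns on a <tr>:

```python
import re

# Pattern quoted in the review.
PART_RE = re.compile(r"^\s*PART\s+([IVX]+)\b")


def part_from_row_text(cells):
    """Match against the whole row, with cell texts joined by no separator."""
    m = PART_RE.match("".join(cells))
    return m.group(1) if m else None


def part_from_cells(cells):
    """Match each cell on its own and return the first roman numeral found."""
    for text in cells:
        m = PART_RE.match(text)
        if m:
            return m.group(1)
    return None


row = ["PART I", "3"]            # <tr><td>PART I</td><td>3</td></tr>
print(part_from_row_text(row))   # None
print(part_from_cells(row))      # I
```

Checking cells individually keeps the \b anchor meaningful, since the end of a cell's text is always a boundary for a trailing roman numeral.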
Minor notes
- The backward sibling search in _infer_part_from_row_context is unbounded (walks all previous siblings). Consider adding a limit for consistency with the upward traversal limit of 10 in the same method.
- Test anchors in test_toc_analyzer_part_context.py all point to #i1; distinct anchors per item would be cleaner.
- No integration test with the actual MSFT filing that motivated this. Worth verifying the fix works end-to-end on the real filing.
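As a rough illustration of the first note, a bounded backward search could look like the sketch below. It models rows as lists of cell strings rather than real <tr> elements; infer_part and max_lookback are hypothetical names, and the default of 10 is only an assumption mirroring the upward traversal limit mentioned above:

```python
import re

PART_RE = re.compile(r"^\s*PART\s+([IVX]+)\b")


def infer_part(rows, index, max_lookback=10):
    """Scan at most max_lookback rows before rows[index] for a PART header."""
    start = max(0, index - max_lookback)
    # Walk backward from the nearest preceding row.
    for row in reversed(rows[start:index]):
        for cell in row:
            m = PART_RE.match(cell)
            if m:
                return m.group(1)
    return None


toc = [
    ["PART I", "3"],
    ["Item 1.", "Business", "4"],
    ["PART II", "20"],
    ["Item 5.", "Market", "21"],
]
print(infer_part(toc, 3))  # II
print(infer_part(toc, 1))  # I
```

The nearest preceding header wins, and a row with no header inside the window yields None instead of inheriting a distant part.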
Thanks for the detailed review. All requested items are addressed in 850cf7d.
dgunning
left a comment
Thanks a lot for your changes. This PR is approved and will be included in the next release.
Dwight
Features: FilingViewer (SEC Interactive Data Viewer), ConceptGraph (XBRL knowledge graph), BDC non-accrual extraction, to_markdown() for LLM drill-down, compare_context() cross-validation, MetaLinks.json parser. Fixes: iXBRL abbreviation spacing (#734), TOC part metadata (#737), 533 ruff code quality issues including LinkBlock f-string bug (#740). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #736
Fix MSFT 10-K TOC metadata mapping: propagate PART context from standalone TOC rows and support display-style section names so section.item/section.part are populated consistently.