idx=1: freeze (532 records) under round-2 parser #74
Adds idx=1 (LICENSE AND OPTION AGREEMENT between Momenta Pharmaceuticals and CSL Behring, January 4, 2017) as the second verified frozen baseline on top of the round-2 parser landed in redo/idx-0 (commit dc0d69e).

Five idx=1-specific defects resolved by the round-2 parser:
1. Cover-page tagline ("BY AND BETWEEN MOMENTA... DATED...") at L1
2. Sections 13.12, 13.13, 13.14 each at L2 (content-loss bug)
3. 235 numbered N.M sections (146 in Article 1 alone) broken out at L2
4. Subdocs: title-only L1 + body-only L2 (+1 penalty), SCHEDULE 5.2(b) promoted from inside SCHEDULE 1.135 as a peer L1
5. Signature page: one L2 block containing both parties' sig lines (per Arthur's annotation; doc2dict natural HTML grouping preserved)

idx=1 freeze stats:
- 532 records, distribution {L0:1, L1:33, L2:382, L3:116}
- Reconstruction: word_coverage=95.4% (≥ 90% gate), char_ratio=89.6%

Regress: both idx=0 (75 records) and idx=1 (532 records) OK.

One known caveat: SCHEDULE 5.2(b) has no L2 body record because doc2dict attributed its [***] body content to SCHEDULE 1.135's subtree. This is a doc2dict structural artifact, not a parser bug; recovering it would require either a doc2dict config change or title-only-schedule body-remnant detection post-processing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
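The post-processing idea mentioned in the caveat could start from a check like the one below. This is only a sketch under assumptions: the record shape (a flat, ordered list of dicts with `level` and `title` keys) and the function name `find_title_only_records` are hypothetical and are not taken from the parser or from doc2dict.

```python
def find_title_only_records(records):
    """Return titles of level-1 records that have no deeper record right after them.

    Assumption: `records` is an ordered list of dicts with integer `level`
    and string `title` keys. This shape is illustrative, not the real schema.
    """
    orphans = []
    for i, rec in enumerate(records):
        if rec["level"] != 1:
            continue
        nxt = records[i + 1] if i + 1 < len(records) else None
        # A title-only L1 is one whose next record is missing or not deeper.
        if nxt is None or nxt["level"] <= 1:
            orphans.append(rec["title"])
    return orphans


# Illustrative data only: a schedule with a body, then a schedule without one.
sample = [
    {"level": 1, "title": "SCHEDULE 1.135"},
    {"level": 2, "title": ""},
    {"level": 1, "title": "SCHEDULE 5.2(b)"},
]
orphans = find_title_only_records(sample)
```

A check like this only flags the gap; deciding which neighbouring subtree holds the missing body would still need the doc2dict-side fix or a separate heuristic.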
No actionable comments were generated in the recent review.

Recent review info
Run configuration: Configuration used: Organization UI; Review profile: ASSERTIVE; Plan: Pro
Files selected for processing: 2

Additional context used (Remote MCP)

Document parsing framework context
The doc2dict parser is a fast, algorithmic document parsing system designed to convert HTML, PDF, XML, and text documents into structured dictionaries, emphasizing performance and customizability while supporting high-throughput batch processing. The library serves as a core component for the datamule project and is particularly well-suited for parsing structured documents such as SEC filings, reports, and other hierarchical documents. The doc2dict package converts HTML and PDF documents into hierarchical dictionaries while preserving hierarchy and supports table extraction for HTML files.

Relevant to PR changes
The PR's idx=1 freeze involves hierarchical document parsing with multiple level depths (L0:1, L1:33, L2:382, L3:116). This aligns with doc2dict's goal of creating a fast, generalized, algorithmic parser that can be easily tweaked depending on the document.

Validation of test approach
The doc2dict package processes HTML at 500 pages per second and PDF at 200 pages per second, with multithreading limitations due to PDFium. This provides context for understanding the performance of the parser used in the PR's testing (the test ran with --limit 2 and completed successfully).

Configuration and customization
The doc2dict package uses a simplified representation of documents as lists of dictionaries and converts them to a hierarchical dictionary using predetermined rules, with plans for modular "mapping dicts" for customization. This is directly relevant to the PR's caveat about SCHEDULE 5.2(b), where recovering the missing L2 body would require "a doc2dict config change or post-processing detection."

Additional comments (1)
Walkthrough
This PR updates the level-freeze state tracking file, expanding the set of frozen indices.
Changes: Level freeze state tracking update
Estimated code review effort: 1 (Trivial), ~3 minutes
Pre-merge checks: 5 passed
Code Review
This pull request updates the level freeze state by adding index 1 to the frozen list and appending several new entries to the history log. The reviewer noted that the history has become cluttered with redundant entries and that a descriptive note explaining the signature page rule revision for index 0 was removed. Feedback suggests consolidating the history to preserve important documentation and maintain chronological order.
{
"ts": "2026-05-17T04:55:00",
"ts": "2026-05-17T02:30:35",
"action": "freeze",
"idx": 1,
"n_records": 299
},
{
"ts": "2026-05-17T02:33:07",
"action": "freeze",
"idx": 1,
"n_records": 298
},
{
"ts": "2026-05-17T04:36:21",
"action": "freeze",
"idx": 1,
"n_records": 532
},
{
"ts": "2026-05-17T04:36:35",
"action": "freeze",
"idx": 0,
"n_records": 75
},
{
"ts": "2026-05-17T04:38:25",
"action": "freeze",
"idx": 0,
"n_records": 75,
"note": "sig-page rule revised: preserve doc2dict natural grouping at depth 2 (no per-line explosion). Company side as one L2 block (per worked example); per-line records 71-74 retire. Subdoc structure also rewritten in same parser commit but only idx=0 impact is sig page."
"n_records": 75
},
{
"ts": "2026-05-17T04:38:25",
"action": "freeze",
"idx": 1,
"n_records": 532
}
The history block has become cluttered with multiple intermediate freeze attempts and redundant entries. More importantly, the descriptive note for the idx: 0 freeze (which explained the signature page rule revision) has been lost, and the timestamps are no longer in chronological order relative to the base branch. It is recommended to clean up the history to preserve the original idx: 0 entry and include only the final successful freeze for idx: 1.
{
"ts": "2026-05-17T04:55:00",
"action": "freeze",
"idx": 0,
"n_records": 75,
"note": "sig-page rule revised: preserve doc2dict natural grouping at depth 2 (no per-line explosion). Company side as one L2 block (per worked example); per-line records 71-74 retire. Subdoc structure also rewritten in same parser commit but only idx=0 impact is sig page."
},
{
"ts": "2026-05-17T04:38:25",
"action": "freeze",
"idx": 1,
"n_records": 532
}
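The consolidation the reviewer asks for can be sketched roughly as follows. This is a minimal illustration, not the repository's tooling: the function name `consolidate_history` is hypothetical, the sample entries are abbreviated, and carrying an older entry's note forward onto a newer entry that lacks one is an assumption about the desired behaviour.

```python
def consolidate_history(history):
    """Keep one freeze entry per idx (the latest by timestamp), in chronological order.

    Assumptions: each entry has `ts` (ISO-8601 string, so string order equals
    time order), `idx`, and `n_records`; `note` is optional. If the newest
    entry for an idx has no note, the most recent earlier note is carried over.
    """
    latest = {}
    for entry in sorted(history, key=lambda e: e["ts"]):
        merged = dict(entry)
        previous = latest.get(entry["idx"])
        if "note" not in merged and previous is not None and "note" in previous:
            merged["note"] = previous["note"]
        latest[entry["idx"]] = merged
    # Sort by timestamp, then idx, so ties have a deterministic order.
    return sorted(latest.values(), key=lambda e: (e["ts"], e["idx"]))


# Abbreviated, illustrative history with intermediate freeze attempts.
history = [
    {"ts": "2026-05-17T02:30:35", "action": "freeze", "idx": 1, "n_records": 299},
    {"ts": "2026-05-17T02:33:07", "action": "freeze", "idx": 1, "n_records": 298},
    {"ts": "2026-05-17T04:36:21", "action": "freeze", "idx": 1, "n_records": 532},
    {"ts": "2026-05-17T04:36:35", "action": "freeze", "idx": 0, "n_records": 75,
     "note": "sig-page rule revised"},
    {"ts": "2026-05-17T04:38:25", "action": "freeze", "idx": 0, "n_records": 75},
    {"ts": "2026-05-17T04:38:25", "action": "freeze", "idx": 1, "n_records": 532},
]
result = consolidate_history(history)
```

With this sample, `result` holds two entries, one per idx, and the idx=0 entry keeps its note even though the newest idx=0 freeze did not include one.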
Triage agent: PR #74 comment review (read-only pass, no code changes)
1 inline comment reviewed.
Triage only; no code changes made this round.
User description
Summary
Second stacked PR in the corpus rebuild. Adds idx=1 (LICENSE AND OPTION AGREEMENT between Momenta Pharmaceuticals, Inc. and CSL Behring Recombinant Facility AG, January 4, 2017) as the second verified frozen baseline on top of idx=0 (PR #73).
Stacked on
redo/idx-0 (which contains the round-2 parser + new sig-page rule + updated idx=0 freeze of 75 records).
This PR's diff is purely the idx=1 freeze + state.json update; all parser code and rubric changes live on the base branch.
What changed for idx=1
The round-2 parser (landed on redo/idx-0 in commit dc0d69e) resolved 5 idx=1-specific defects identified during line-by-line manual comparison against the SEC source:
idx=1 freeze stats
{L0:1, L1:33, L2:382, L3:116}
Signature area (verbatim)
Subdoc area (verbatim)
Known caveat
SCHEDULE 5.2(b) (order=529) has no L2 body record. The [***] body that should be its content was attributed by doc2dict to SCHEDULE 1.135's subtree. This is a doc2dict structural artifact, not a parser bug; recovering it would require either a doc2dict config change or title-only-schedule body-remnant detection post-processing.
Test plan
- uv run scripts/parse_doc2dict_with_config.py --limit 2 --no-truncate --output-dir data/auto_parse exits 0 with ok 2
- uv run scripts/level_loop/freeze.py 1 reports word_coverage ≥ 90%
- uv run scripts/level_loop/regress.py reports both idx=0: OK (75 records) and idx=1: OK (532 records)
Source
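To make the freeze numbers concrete, the sketch below shows one way a level distribution and a word-coverage gate could be computed. It is an illustration under assumptions, not the logic of freeze.py or regress.py: the record fields `level` and `text`, the function names, and the multiset definition of word coverage are all hypothetical.

```python
import re
from collections import Counter


def level_distribution(records):
    """Count records per level, e.g. {0: 1, 1: 33, 2: 382, 3: 116}."""
    return dict(sorted(Counter(r["level"] for r in records).items()))


def word_coverage(source_text, records):
    """Fraction of source word occurrences also present in the records' text.

    Assumption: coverage is measured on lowercase word multisets; the real
    metric may be defined differently.
    """
    def tokenize(s):
        return re.findall(r"\w+", s.lower())

    src = Counter(tokenize(source_text))
    rec = Counter(tokenize(" ".join(r["text"] for r in records)))
    total = sum(src.values())
    if total == 0:
        return 1.0
    covered = sum(min(n, rec[w]) for w, n in src.items())
    return covered / total


# The distribution reported for idx=1 adds up to the reported record count.
reported = {0: 1, 1: 33, 2: 382, 3: 116}
total_records = sum(reported.values())

# Tiny illustrative document: the second record drops the word "Waiver".
source = "Section 13.12 Notices. Section 13.13 Waiver."
records = [
    {"level": 1, "text": "Section 13.12 Notices."},
    {"level": 2, "text": "Section 13.13"},
]
coverage = word_coverage(source, records)
passes_gate = coverage >= 0.90
```

In the toy example, seven of eight source words are covered, so the 90% gate fails; the idx=1 freeze reports 95.4% against the same threshold.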
http://www.sec.gov/Archives/edgar/data/1235010/000123501017000012/mnta1q201710-qexh101.htm
🤖 Generated with Claude Code
CodeAnt-AI Description
Add the verified frozen baseline for idx=1
What Changed
Impact
✅ A second verified baseline for contract parsing
✅ Fewer missing corpus snapshots
✅ Clearer freeze history for rebuild checks