idx=1: freeze (532 records) under round-2 parser #74

Open
arthrod wants to merge 1 commit into redo/idx-0 from redo/idx-1

Conversation

@arthrod (Owner) commented May 17, 2026

User description

Summary

Second stacked PR in the corpus rebuild. Adds idx=1 (LICENSE AND OPTION AGREEMENT between Momenta Pharmaceuticals, Inc. and CSL Behring Recombinant Facility AG, January 4, 2017) as the second verified frozen baseline on top of idx=0 (PR #73).

Stacked on redo/idx-0 (which contains the round-2 parser + new sig-page rule + updated idx=0 freeze of 75 records).

This PR's diff is purely the idx=1 freeze + state.json update — all parser code and rubric changes live on the precedent branch.

What changed for idx=1

The round-2 parser (landed on redo/idx-0 in commit dc0d69e) resolved 5 idx=1-specific defects identified during line-by-line manual comparison against the SEC source:

  1. Cover-page tagline ("BY AND BETWEEN MOMENTA... DATED AS OF JANUARY 4, 2017") now at L1 — previously dropped
  2. Sections 13.12, 13.13, 13.14 each at L2 — previously missing (content-loss bug)
  3. 235 numbered N.M sections (146 in Article 1 alone) broken out at L2 — previously buried inside 47KB Article 1 body
  4. Subdocs split: title-only L1 + body-only L2 with +1 penalty; SCHEDULE 5.2(b) promoted from inside SCHEDULE 1.135 as a peer L1 — previously all 4 subdocs consolidated to 1 L1 each
  5. Signature page as one L2 block containing both parties (matches doc2dict's natural HTML grouping; per Arthur's annotation) — previously over-split into 8 per-line records

idx=1 freeze stats

  • 532 records, distribution {L0:1, L1:33, L2:382, L3:116}
  • Reconstruction: word_coverage 95.4% (above the 90% blocking gate), char_ratio 89.6%
  • Max depth: 3
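
The stats above can be cross-checked with a few lines of Python. This is a minimal sketch, not code from this repo: the helper name `check_freeze_stats` is hypothetical, and it assumes the 90% gate applies to word_coverage, as the test plan suggests.

```python
def check_freeze_stats(distribution, expected_total, word_coverage, gate=0.90):
    """Summarise a freeze: record total, max depth, and gate status.

    `distribution` maps level labels such as "L2" to record counts.
    The gate threshold and its target metric are assumptions.
    """
    total = sum(distribution.values())
    # Level labels look like "L0".."L3"; strip the prefix to get the depth.
    max_depth = max(int(level.lstrip("L")) for level in distribution)
    return {
        "total": total,
        "total_ok": total == expected_total,
        "max_depth": max_depth,
        "gate_ok": word_coverage >= gate,
    }


result = check_freeze_stats(
    {"L0": 1, "L1": 33, "L2": 382, "L3": 116},
    expected_total=532,
    word_coverage=0.954,
)
print(result)
```

The per-level counts sum to 532 and the deepest level is 3, which matches the figures reported in this description.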

Signature area (verbatim)

order=521 L1: IN WITNESS WHEREOF, the Parties hereto have set their hand as of the Execution Date.
order=522 L2: MOMENTA PHARMACEUTICALS, INC.
              By: /s/ Craig A. Wheeler
              Name: Craig A. Wheeler
              Title: President and Chief Executive Officer
              CSL BEHRING RECOMBINANT FACILITY AG by its duly authorized attorney
              By: /s/ David Lamont
              Name: David Lamont
              Title: Chief Financial Officer

Subdoc area (verbatim)

order=523 L1: SCHEDULE 1.33 CALCULATION OF LABOR COSTS, EXPENSE ALLOCATION AND RELATED MATTERS
order=524 L2: [***]
order=525 L1: SCHEDULE 1.93 MOMENTA PATENT RIGHTS
order=526 L2: [***]
order=527 L1: SCHEDULE 1.135 TECHNOLOGY TRANSFER PLAN – OUTLINE OF CMC ASPECTS
order=528 L2: [***]\n[***]
order=529 L1: SCHEDULE 5.2(b) OUTLINE OF INITIAL DEVELOPMENT PLAN FOR FIRST PRODUCT
order=530 L1: EXHIBIT 8.2 INITIAL PRESS RELEASE
order=531 L2: <11,788 chars of press release body>

Known caveat

SCHEDULE 5.2(b) (order=529) has no L2 body record. The [***] body that should be its content was attributed by doc2dict to SCHEDULE 1.135's subtree. This is a doc2dict structural artifact, not a parser bug — recovering it would require either a doc2dict config change or title-only-schedule body-remnant detection post-processing.
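
The detection half of the post-processing option could look roughly like the sketch below. It only flags title-only subdocs and does not attempt to reassign any body text. The record field names (`order`, `level`, `text`), the title regex, and the function name `title_only_subdocs` are assumptions for illustration, not the frozen JSONL schema.

```python
import re

# Assumed title shape: "SCHEDULE <id> ..." or "EXHIBIT <id> ...".
SUBDOC_TITLE = re.compile(r"^(SCHEDULE|EXHIBIT)\s+\S+")


def title_only_subdocs(records):
    """Return the `order` of each L1 subdoc title with no L2 body right after it."""
    ordered = sorted(records, key=lambda r: r["order"])
    missing = []
    for i, rec in enumerate(ordered):
        if rec["level"] != 1 or not SUBDOC_TITLE.match(rec["text"]):
            continue
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is None or nxt["level"] != 2:
            missing.append(rec["order"])
    return missing


# Records abbreviated from the subdoc area listed above.
records = [
    {"order": 527, "level": 1, "text": "SCHEDULE 1.135 TECHNOLOGY TRANSFER PLAN"},
    {"order": 528, "level": 2, "text": "[***]\n[***]"},
    {"order": 529, "level": 1, "text": "SCHEDULE 5.2(b) OUTLINE OF INITIAL DEVELOPMENT PLAN"},
    {"order": 530, "level": 1, "text": "EXHIBIT 8.2 INITIAL PRESS RELEASE"},
    {"order": 531, "level": 2, "text": "press release body"},
]
print(title_only_subdocs(records))
```

On these records only order=529 is flagged, which is the caveat described here.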

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 2 --no-truncate --output-dir data/auto_parse exits 0 with ok 2
  • uv run scripts/level_loop/freeze.py 1 reports word_coverage ≥ 90%
  • uv run scripts/level_loop/regress.py reports both idx=0: OK (75 records) and idx=1: OK (532 records)
  • Manual visual verification of all 5 defects by independent inspector agent (PASS verdict)

Source

http://www.sec.gov/Archives/edgar/data/1235010/000123501017000012/mnta1q201710-qexh101.htm

🤖 Generated with Claude Code


CodeAnt-AI Description

Add the verified frozen baseline for idx=1

What Changed

  • Adds idx=1 as a frozen baseline alongside idx=0
  • Saves the new idx=1 parsed output and updates freeze history and state to reflect it
  • Keeps both frozen baselines available for future runs

Impact

✅ A second verified baseline for contract parsing
✅ Fewer missing corpus snapshots
✅ Clearer freeze history for rebuild checks

Adds idx=1 (LICENSE AND OPTION AGREEMENT between Momenta Pharmaceuticals
and CSL Behring, January 4, 2017) as the second verified frozen baseline
on top of the round-2 parser landed in redo/idx-0 (commit dc0d69e).

Five idx=1-specific defects resolved by the round-2 parser:
  1. Cover-page tagline ("BY AND BETWEEN MOMENTA... DATED...") at L1
  2. Sections 13.12, 13.13, 13.14 each at L2 (content-loss bug)
  3. 235 numbered N.M sections (146 in Article 1 alone) broken out at L2
  4. Subdocs: title-only L1 + body-only L2 (+1 penalty), SCHEDULE 5.2(b)
     promoted from inside SCHEDULE 1.135 as a peer L1
  5. Signature page: one L2 block containing both parties' sig lines
     (per Arthur's annotation; doc2dict natural HTML grouping preserved)

idx=1 freeze stats:
  - 532 records, distribution {L0:1, L1:33, L2:382, L3:116}
  - Reconstruction: word_coverage=95.4% (≥ 90% gate), char_ratio=89.6%

Regress: both idx=0 (75 records) and idx=1 (532 records) OK.

One known caveat: SCHEDULE 5.2(b) has no L2 body record because
doc2dict attributed its [***] body content to SCHEDULE 1.135's
subtree. This is a doc2dict structural artifact, not a parser bug —
recovering it would require either a doc2dict config change or
title-only-schedule body-remnant detection post-processing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7f1f5b1c-28e2-4f54-b181-221c1ed47d62

📥 Commits

Reviewing files that changed from the base of the PR and between dc0d69e and e2a6054.

📒 Files selected for processing (2)
  • data/auto_parse/level_freeze/frozen/idx_1.jsonl
  • data/auto_parse/level_freeze/state.json
📜 Recent review details
🧰 Additional context used
🔍 Remote MCP

Based on my search results, I now have sufficient context about the document parsing framework being used. Let me provide a comprehensive summary of the additional relevant facts for reviewing this PR.

Summary of Additional Context for PR Review

Document Parsing Framework Context

The doc2dict parser is a fast, algorithmic document parsing system designed to convert HTML, PDF, XML, and text documents into structured dictionaries, emphasizing performance and customizability while supporting high-throughput batch processing. The library serves as a core component for the datamule project and is particularly well-suited for parsing structured documents such as SEC filings, reports, and other hierarchical documents. [::web_search::]

The doc2dict package converts HTML and PDF documents into hierarchical dictionaries while preserving hierarchy and supports table extraction for HTML files. [::web_search::]

Relevant to PR Changes

The PR's idx=1 freeze involves hierarchical document parsing with multiple level depths (L0:1, L1:33, L2:382, L3:116). This aligns with doc2dict's goal of creating a fast, generalized, algorithmic parser that can be easily tweaked depending on the document. [::web_search::]

Validation of Test Approach

The doc2dict package processes HTML at 500 pages per second and PDF at 200 pages per second, with multithreading limitations due to PDFium. [::web_search::] This provides context for understanding the performance of the parser used in the PR's testing (the test ran with --limit 2 and completed successfully).

Configuration and Customization

The doc2dict package uses a simplified representation of documents as lists of dictionaries and converts them to a hierarchical dictionary using predetermined rules, with plans for modular "mapping dicts" for customization. [::web_search::] This is directly relevant to the PR's caveat about SCHEDULE 5.2(b), where recovering the missing L2 body would require "a doc2dict config change or post-processing detection."

🔇 Additional comments (1)
data/auto_parse/level_freeze/state.json (1)

4-5: LGTM!

Also applies to: 44-78


📝 Walkthrough

Summary by CodeRabbit

  • Chores
    • Updated internal state configuration and freeze tracking settings.

Walkthrough

This PR updates the level-freeze state tracking file, expanding the frozen indices from only 0 to include 1, and appends a detailed chronological audit trail of freeze and toggle operations documenting how this expanded state was reached.

Changes

Level freeze state tracking update

Layer / File(s) | Summary
Frozen indices expansion and history audit trail (data/auto_parse/level_freeze/state.json) | The frozen set is expanded to track both index 0 and 1. The history array replaces a prior single freeze record with multiple timestamped freeze and toggle events, documenting operations on both indices with record counts, spanning 2026-05-17 from 02:30:35 through 04:38:25.

Possibly related PRs

  • arthrod/clause-extract#42: Updates the same state file by extending the frozen set and adding corresponding history entries, though for a different index value.
  • arthrod/clause-extract#67: Modifies the same frozen and history structures in the level-freeze state tracking file.
  • arthrod/clause-extract#39: Updates the same level-freeze state file with freeze and toggle events, tied to the same state machinery.

Suggested labels

Feat2

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🐰 A single file hops and bounds,
With frozen states and audit sounds,
Index one now joins the dance,
History records each advance,
Level freeze tracks every prance! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: adding idx=1 freeze with 532 records using the round-2 parser, which matches the PR's primary objective.
Description check | ✅ Passed | The description is directly related to the changeset, providing comprehensive context about the idx=1 freeze, defects resolved, statistics, test results, and known caveats.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@codeant-ai codeant-ai Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the level freeze state by adding index 1 to the frozen list and appending several new entries to the history log. The reviewer noted that the history has become cluttered with redundant entries and that a descriptive note explaining the signature page rule revision for index 0 was removed. Feedback suggests consolidating the history to preserve important documentation and maintain chronological order.

Comment on lines 43 to 78
  {
- "ts": "2026-05-17T04:55:00",
+ "ts": "2026-05-17T02:30:35",
  "action": "freeze",
+ "idx": 1,
+ "n_records": 299
+ },
+ {
+ "ts": "2026-05-17T02:33:07",
+ "action": "freeze",
+ "idx": 1,
+ "n_records": 298
+ },
+ {
+ "ts": "2026-05-17T04:36:21",
+ "action": "freeze",
+ "idx": 1,
+ "n_records": 532
+ },
+ {
+ "ts": "2026-05-17T04:36:35",
+ "action": "freeze",
+ "idx": 0,
+ "n_records": 75
+ },
+ {
+ "ts": "2026-05-17T04:38:25",
+ "action": "freeze",
  "idx": 0,
- "n_records": 75,
- "note": "sig-page rule revised: preserve doc2dict natural grouping at depth 2 (no per-line explosion). Company side as one L2 block (per worked example); per-line records 71-74 retire. Subdoc structure also rewritten in same parser commit but only idx=0 impact is sig page."
+ "n_records": 75
+ },
+ {
+ "ts": "2026-05-17T04:38:25",
+ "action": "freeze",
+ "idx": 1,
+ "n_records": 532
  }

medium

The history block has become cluttered with multiple intermediate freeze attempts and redundant entries. More importantly, the descriptive note for the idx: 0 freeze (which explained the signature page rule revision) has been lost, and the timestamps are no longer in chronological order relative to the base branch. It is recommended to clean up the history to preserve the original idx: 0 entry and include only the final successful freeze for idx: 1.

    {
      "ts": "2026-05-17T04:55:00",
      "action": "freeze",
      "idx": 0,
      "n_records": 75,
      "note": "sig-page rule revised: preserve doc2dict natural grouping at depth 2 (no per-line explosion). Company side as one L2 block (per worked example); per-line records 71-74 retire. Subdoc structure also rewritten in same parser commit but only idx=0 impact is sig page."
    },
    {
      "ts": "2026-05-17T04:38:25",
      "action": "freeze",
      "idx": 1,
      "n_records": 532
    }
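
A consolidation along the lines suggested here could be sketched as follows. This is an illustration under assumptions, not part of the repo: the function name `consolidate_history` is hypothetical, non-freeze events are dropped, and an earlier `note` is carried forward when the newest freeze for an index lacks one. The event keys (`ts`, `action`, `idx`, `n_records`, `note`) are the ones visible in state.json.

```python
def consolidate_history(history):
    """Keep only the latest freeze event per idx, in chronological order."""
    latest = {}
    for event in history:
        if event.get("action") != "freeze":
            continue  # assumption: only freeze events are retained
        prev = latest.get(event["idx"])
        # ISO-8601 timestamps in one format compare correctly as strings.
        if prev is None or event["ts"] >= prev["ts"]:
            merged = dict(event)
            if prev is not None and "note" in prev and "note" not in merged:
                merged["note"] = prev["note"]  # preserve earlier documentation
            latest[event["idx"]] = merged
    return sorted(latest.values(), key=lambda e: (e["ts"], e["idx"]))


history = [
    {"ts": "2026-05-17T02:30:35", "action": "freeze", "idx": 1, "n_records": 299},
    {"ts": "2026-05-17T02:33:07", "action": "freeze", "idx": 1, "n_records": 298},
    {"ts": "2026-05-17T04:36:21", "action": "freeze", "idx": 1, "n_records": 532},
    {"ts": "2026-05-17T04:36:35", "action": "freeze", "idx": 0, "n_records": 75},
    {"ts": "2026-05-17T04:38:25", "action": "freeze", "idx": 0, "n_records": 75},
    {"ts": "2026-05-17T04:38:25", "action": "freeze", "idx": 1, "n_records": 532},
]
consolidated = consolidate_history(history)
for event in consolidated:
    print(event)
```

Whether intermediate attempts should be discarded at all is a project decision; an append-only log keeps them on purpose.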

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

@arthrod (Owner, Author) commented May 17, 2026

Triage agent — PR #74 comment review (read-only pass, no code changes)

1 inline comment reviewed:

  1. gemini-code-assist @ state.json:78: history block cluttered, missing note (WONT-FIX)
    The history array is append-only bookkeeping. regress.py reads the frozen JSONL files in frozen/, not the history, so cleanup is cosmetic. The descriptive note for the idx=0 sig-page rule revision is present in the PR description and in arthrod's comment on PR #73 (idx=0: full re-parse + foundation infra under new rubric). No structural impact; no change needed.

Triage only — no code changes made this round.
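
To illustrate why the history array does not affect regression, here is a minimal sketch of a check that looks only at the frozen JSONL files. What regress.py actually compares is not shown in this PR, so the record-count comparison and the names `count_frozen_records` and `regress` are assumptions. Only the `idx_<n>.jsonl` naming comes from the changed files listed above, and the demo runs on tiny stand-in files in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def count_frozen_records(frozen_dir, idx):
    """Count non-empty lines (one record each) in frozen/idx_<idx>.jsonl."""
    path = Path(frozen_dir) / f"idx_{idx}.jsonl"
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def regress(frozen_dir, expected):
    """Map each idx to True when its frozen record count matches `expected`."""
    return {
        idx: count_frozen_records(frozen_dir, idx) == n
        for idx, n in expected.items()
    }


# Demo with small stand-in files; the real baselines hold 75 and 532 records.
with tempfile.TemporaryDirectory() as tmp:
    for idx, n in {0: 3, 1: 5}.items():
        lines = [json.dumps({"order": i, "level": 1, "text": "x"}) for i in range(n)]
        (Path(tmp) / f"idx_{idx}.jsonl").write_text(
            "\n".join(lines) + "\n", encoding="utf-8"
        )
    status = regress(tmp, {0: 3, 1: 5})
print(status)
```

Nothing in this sketch reads state.json, so reordering or pruning the history would not change its result.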
