idx=2: freeze (422 records) — Pandora Media + KKR Investment Agreement#75
arthrod wants to merge 1 commit into
… post-title cover-block fixes

INVESTMENT AGREEMENT (Pandora Media / KKR Classic Investors), idx=2:
- word_coverage=91.3% char_ratio=93.7% (>= 90% bar)
- 1 L0, 23 L1, 192 L2, 148 L3, 48 L4, 10 L5 (max depth 5)

Four parser fixes, all purely structural (no phrase blocklists):

1. _fix_post_signature_envelope: extend the "envelope is really a real subdoc" detection to ALSO fire when the envelope row has a section/article ANCESTOR (Case B), not just a sibling. doc2dict mis-parented ANNEX I CERTIFICATE OF DESIGNATIONS under ARTICLE VIII Miscellaneous, so its only siblings were promoted text leaves; the sibling-only check left ANNEX I marked as the SEC envelope and dropped its entire 35% of source content. The ancestor check restores ANNEX I as a real subdoc.
2. _reparent_orphan_top_level_into_enclosing_subdoc: new pass that re-parents mis-parented descendants of real subdocs by source position. doc2dict's HTML walker flattens ANNEX-II/ARTICLE-I/Section-1.1 chains to parent_node_id=None or to a mid-tree body sibling. Any node whose node_id falls strictly between two consecutive real-subdoc headers AND whose parent is outside that span gets re-parented to the enclosing subdoc.
3. _drop_toc_chrome_records: the TABLE OF CONTENTS header and the bare roman-numeral page-marker records (i, ii, iii) whose bodies are TOC continuation entries are dropped as layout chrome. The LAST page marker's body sometimes carries the real preamble; when it does, the title is cleared (page-number prefix) but the body is preserved so it emits as the L1 preamble record.
4. _rescue_post_title_cover_block: mirrors the duplicate-title _rescue_cover_preamble_block but for filings whose L0 title is followed (rather than preceded) by a "by and among / parties / dated" cover block. Consecutive empty-body predicted-header lines after the title, terminating at the first body record or structural header, collapse into ONE synthetic L1 cover-preamble record. The originals are marked envelope so the parquet keeps them for audit.

idx=0 (75 records) and idx=1 (532 records) byte-identical under regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
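A minimal sketch of how the TOC-chrome test in fix 3 could look. The two regexes are the ones shown in the PR diff, and the "3+ matches" and ">75% short lines" thresholds come from the PR description. The function name `looks_like_toc_body` and the 80-character cutoff for a "short" line are assumptions made for illustration, not the parser's actual code.

```python
import re

# Patterns as they appear in the PR diff.
TOC_BODY_SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
TOC_BODY_ARTICLE_RE = re.compile(
    r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE
)

MIN_MARKER_HITS = 3           # "3+ matches" per the PR description
MIN_SHORT_LINE_RATIO = 0.75   # ">75% short lines" per the PR description
SHORT_LINE_MAX_CHARS = 80     # ASSUMPTION: the PR does not define "short"


def looks_like_toc_body(body: str) -> bool:
    """Hypothetical helper: True when a record body reads like TOC entries."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    if not lines:
        return False
    # Count Section/ARTICLE markers anywhere in the body.
    hits = len(TOC_BODY_SECTION_RE.findall(body)) + len(
        TOC_BODY_ARTICLE_RE.findall(body)
    )
    # Share of non-empty lines that are short enough to be TOC entries.
    short = sum(1 for ln in lines if len(ln) <= SHORT_LINE_MAX_CHARS)
    return hits >= MIN_MARKER_HITS and short / len(lines) > MIN_SHORT_LINE_RATIO


toc_body = (
    "ARTICLE I\n"
    "Section 1.1 Definitions 1\n"
    "Section 1.2 Interpretation 9\n"
    "ARTICLE II\n"
    "Section 2.1 Purchase and Sale 10\n"
)
prose_body = (
    "This INVESTMENT AGREEMENT is entered into by and among the parties "
    "named herein, subject to the terms set out in Section 1.1 and the "
    "other provisions of this Agreement."
)
```

Requiring both signals is what keeps a preamble that merely cites one section from being dropped as chrome.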
Code Review
This pull request enhances the document parsing pipeline in scripts/parse_doc2dict_with_config.py by introducing ancestor-based envelope detection, re-parenting orphan nodes into their enclosing subdocs, and removing Table of Contents (TOC) layout chrome. It also adds a rescue mechanism for cover-page blocks that appear after the main title and updates the freeze history in state.json. Review feedback suggests refining the _TOC_BODY_ARTICLE_RE regex to be less restrictive, allowing it to match article markers followed by descriptive titles to improve TOC detection density.
_TOC_TITLE_RE = re.compile(r"^\s*TABLE\s+OF\s+CONTENTS\b", re.IGNORECASE)
_PAGE_MARKER_TITLE_RE = re.compile(r"^\s*[ivxlcdm]+\s*$", re.IGNORECASE)
_TOC_BODY_SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
_TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
The regex _TOC_BODY_ARTICLE_RE is very restrictive because it only matches lines that contain nothing but the article marker (e.g., "ARTICLE I"). In most Table of Contents, these markers are followed by descriptive titles on the same line (e.g., "ARTICLE I DEFINITIONS"). To improve detection density, consider allowing trailing text after the marker.
- _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
+ _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.MULTILINE | re.IGNORECASE)
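To make the difference between the two patterns concrete, here is a small comparison on a made-up TOC fragment. The sample text and the variable names `strict`, `relaxed`, and `toc_body` are illustrative assumptions; only the two regex strings come from the review thread.

```python
import re

# Current pattern: the article marker must be the only thing on the line.
strict = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
# Suggested pattern: the marker only has to end at a word boundary.
relaxed = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.MULTILINE | re.IGNORECASE)

# Hypothetical TOC body mixing bare markers and markers with titles.
toc_body = "ARTICLE I DEFINITIONS\nARTICLE II\nARTICLE VIII Miscellaneous\n"

strict_hits = strict.findall(toc_body)
relaxed_hits = relaxed.findall(toc_body)

print(strict_hits)   # ['ARTICLE II']
print(relaxed_hits)  # ['ARTICLE I', 'ARTICLE II', 'ARTICLE VIII']
```

One caveat with the relaxed form: because of `re.IGNORECASE`, a following word made up entirely of roman-numeral letters (for example "MIX") would also count as a marker, so the density threshold still matters.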
@coderabbitai conejo review plz with extra care the files you missed
Let me pull the full list of changed files and their diffs to make sure I review everything carefully.
The diff was truncated. Let me fetch the remaining files.
What I checked
Nitpicks and small fixes
Verdict
If you want, I can open a tiny follow-up to update the docstring and the “TWO things” comment. Say the word and tell me whether you prefer a GitHub issue or a quick PR.
What I verified
Small, actionable nits
Otherwise, looks solid. Want me to open a tiny follow-up PR to fix the docstring/comment mismatches, or would you prefer a GitHub issue for tracking?
env_cls = (env_row.get("cls") or "")
if env_cls not in _SUBDOC_CLASSES:
    continue
Suggestion: The new class gate skips envelope fixes for rows where is_envelope=True but cls is not in subdoc classes (for example introduction rows detected from EXHIBIT ... body text). Those rows no longer get unmarked even when they are clearly post-body attachments, so valid agreement content can be silently dropped from JSONL. Apply the sibling/ancestor envelope correction based on structure, not only on _SUBDOC_CLASSES. [incorrect condition logic]
Severity Level: Critical 🚨
- ❌ Attached annex/exhibit text dropped from JSONL output.
- ⚠️ Downstream clause-extraction misses post-signature attachments.
- ⚠️ Corpus freeze comparisons can hide lost attachment content.
Steps of Reproduction ✅
1. Run `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), which
iterates the corpus and calls `parse_one(idx, raw)` for each row at line 3995.
2. Inside `parse_one()` (lines 1771–1916), after `walk_sections()` flattens the doc2dict
tree, `sections = _fix_post_signature_envelope(sections)` is executed at line 1808, so
every document's section list flows through `_fix_post_signature_envelope()` (lines
971–1070).
3. Construct a representative `rows` list for `_fix_post_signature_envelope()` where:
- There is a body section node `S` with `cls="section"`, `node_id=10`.
- A later node `E` has `node_id=20`, `parent_node_id=10`, `cls="introduction"`,
`is_envelope=True`, and a non-empty `body_direct` carrying the first line of an
attached ANNEX/EXHIBIT that doc2dict mis-parented under `S` (consistent with the
mis-parenting pattern described in the `_fix_post_signature_envelope` docstring at
lines 974–989).
- Other sibling/ancestor structure is such that `_has_body_ancestor(E)` (lines
1002–1014) would return True (E is inside a `section`/`article` ancestor), so
structurally E is post-body attachment, not the SEC envelope.
4. Call `_fix_post_signature_envelope(rows)` (line 971) with this synthetic `rows`:
- `E` is included in `envelope_rows` (line 991) because `is_envelope` is True.
- In the loop at lines 1018–1024, `env_cls = (env_row.get("cls") or "")` yields
`"introduction"` and the guard `if env_cls not in _SUBDOC_CLASSES: continue` (lines
1022–1024) fires, so the sibling/ancestor logic below never runs for `E`.
- As a result, `E["is_envelope"]` remains True and `fixed` is never set because only
subdoc-class envelopes are considered.
- Downstream, the JSONL writer in `main()` (lines 2013–2029) skips any section with
`is_envelope=True` (lines 2016–2017), so E's ANNEX/EXHIBIT body is silently dropped
from JSONL even though it is clearly post-body attachment content by structure.
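The structure-first check the reviewer asks for could be sketched as follows. This is an illustration under assumptions, not the parser's code: `has_body_ancestor` and `unmark_structural_false_envelopes` are hypothetical names, `BODY_CLASSES` is an assumed set, and the row fields (`node_id`, `parent_node_id`, `cls`, `is_envelope`) are the ones named in the reproduction steps.

```python
from typing import Any

Row = dict[str, Any]

# ASSUMPTION: ancestors of these classes mean "inside agreement body".
BODY_CLASSES = {"section", "article"}


def has_body_ancestor(row: Row, by_node_id: dict[int, Row]) -> bool:
    """Walk the parent chain and report whether any ancestor is body content."""
    seen: set[int] = set()
    pid = row.get("parent_node_id")
    while pid is not None and pid in by_node_id and pid not in seen:
        seen.add(pid)  # guard against accidental parent cycles
        parent = by_node_id[pid]
        if (parent.get("cls") or "") in BODY_CLASSES:
            return True
        pid = parent.get("parent_node_id")
    return False


def unmark_structural_false_envelopes(rows: list[Row]) -> int:
    """Clear is_envelope on rows nested under body content, whatever their cls."""
    by_node_id = {r["node_id"]: r for r in rows}
    fixed = 0
    for r in rows:
        if r.get("is_envelope") and has_body_ancestor(r, by_node_id):
            r["is_envelope"] = False
            fixed += 1
    return fixed


# Synthetic rows mirroring the reproduction: S is body, E is mis-parented under S.
rows = [
    {"node_id": 10, "parent_node_id": None, "cls": "section", "is_envelope": False},
    {"node_id": 20, "parent_node_id": 10, "cls": "introduction", "is_envelope": True},
    {"node_id": 30, "parent_node_id": None, "cls": "exhibit", "is_envelope": True},
]
fixed_count = unmark_structural_false_envelopes(rows)
```

Here the decision depends only on the parent chain, so the `introduction` row under a `section` is unmarked while the top-level envelope row is left alone.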
for i, sub in enumerate(real_subdoc_rows):
    start = sub["node_id"]
    if i + 1 < len(real_subdoc_rows):
        end = real_subdoc_rows[i + 1]["node_id"]
    else:
        end = max_nid + 1
    subdoc_spans.append((start, end, sub))

def _is_in_span(nid: int, start: int, end: int) -> bool:
    return start < nid < end

# For each enclosing subdoc, compute the node_id span [start, end).
# A record at node_id N is "inside subdoc S" iff S.start <= N < S.end.
enclosing_span_of: dict[int, tuple[int, int]] = {}
for start, end, sub in subdoc_spans:
    enclosing_span_of[sub["node_id"]] = (start, end)

for r in rows:
    nid = r["node_id"]
    # Skip the real subdoc headers themselves (they are boundary
    # markers, never re-parented).
    if nid in subdoc_id_set:
        continue
    # Skip records before the first real subdoc.
    if nid <= subdoc_node_ids[0]:
        continue
    # Skip L0 title and envelope nodes.
    if r.get("is_envelope") and r.get("depth", 1) == 0:
        continue
    if r.get("depth") == 0 and not r.get("is_envelope"):
        continue
    # Find the enclosing subdoc span by node_id.
    enclosing: dict[str, Any] | None = None
    for start, end, sub in subdoc_spans:
        if _is_in_span(nid, start, end):
            enclosing = sub
            break
    if enclosing is None:
        continue
    enc_start, enc_end = enclosing_span_of[enclosing["node_id"]]
    # Check current parent: if already inside (or IS) the enclosing
    # subdoc by node-id span, leave it alone.
    cur_pid = r.get("parent_node_id")
    if cur_pid is not None and cur_pid in by_node_id:
        if cur_pid == enclosing["node_id"]:
            continue
        if enc_start <= cur_pid < enc_end:
            continue
    # Re-parent into the enclosing subdoc.
    r["parent_node_id"] = enclosing["node_id"]
Suggestion: The last subdoc span is defined through max_nid + 1, so every later node is considered inside that subdoc window and can be reparented to it, including unrelated trailing/signature records. This can incorrectly pull non-subdoc content into agreement scope and alter downstream depth/scope behavior. Restrict reparents to orphan patterns only (e.g., root/sibling misparents with expected classes) instead of all nodes in the terminal numeric range. [logic error]
Severity Level: Major ⚠️
- ⚠️ Signature/trailer records misattached under last subdocument.
- ⚠️ Annex bodies polluted with unrelated trailing agreement text.
- ⚠️ Scope-rule behavior diverges from intended trailer semantics.
Steps of Reproduction ✅
1. During `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), each
document is parsed via `parse_one(idx, raw)` at line 3995. Inside `parse_one()`, after
`walk_sections()`, `_reparent_orphan_top_level_into_enclosing_subdoc(sections)` is invoked
at lines 1817–1822, so every document's node list flows through this reparenting logic.
2. Within `_reparent_orphan_top_level_into_enclosing_subdoc()` (lines 1073–1190), real
subdocument headers are collected into `real_subdoc_rows` via `_is_real_subdoc_title()`
(lines 1118–1125). Assume a realistic structure:
- A real subdoc header `ANNEX I CERTIFICATE OF DESIGNATIONS` with `node_id = 200`, `cls
= "annex"`, `is_envelope = False`, so it is included in `real_subdoc_rows`.
- Several **trailing signature or notices records** after the annex that belong to the
main agreement trailer, not to the annex, e.g. a record `T` with `node_id = 260`, `cls
= "predicted header"`, `title = "IN WITNESS WHEREOF"` or similar, `parent_node_id =
None`, `depth = 1`, `is_envelope = False`.
- No further real subdoc headers after `ANNEX I`, so `ANNEX I` is the last real subdoc.
3. When `_reparent_orphan_top_level_into_enclosing_subdoc()` runs:
- `max_nid` is computed as the maximum `node_id` in `rows`, so `max_nid >= 260` (line
1137).
- For the last subdoc `ANNEX I`, the span `(start, end)` is constructed as `(200,
max_nid + 1)` at lines 1139–1145, and stored in `subdoc_spans` and `enclosing_span_of`.
- For the trailing record `T` with `nid = 260`, the loop at lines 1156–1188:
* Skips none of the early guards (it is not the subdoc header itself, `nid >
subdoc_node_ids[0]`, and `depth != 0`).
* Finds `enclosing` as the last subdoc because `_is_in_span(nid, start, end)` (lines
1147–1148, 1172–1175) returns True for `200 < 260 < max_nid + 1`.
* Looks at `cur_pid = T["parent_node_id"]` (line 1181), which is `None` or some
pre-subdoc parent outside the `[enc_start, enc_end)` interval, so neither `cur_pid ==
enclosing["node_id"]` nor `enc_start <= cur_pid < enc_end` (lines 1183–1186) holds.
* Falls through to `r["parent_node_id"] = enclosing["node_id"]` at line 1188,
reparenting `T` under `ANNEX I` even though it is a signature/trailer record, not
structurally part of the annex.
4. This reparenting affects downstream behavior in `parse_one()`:
- `_apply_scope_rule()` (lines 535–656), called after reparenting at line 1823, now
sees `T` as a descendant of a real subdoc (via `_is_descendant_of_subdoc`, lines
573–584), and thus forces `scope="agreement"` for `T` even if it would otherwise be
trailer content.
- Subsequent subdoc consolidation in `_consolidate_real_subdocs()` (lines 1290–1637)
may fold `T`'s body into the annex body record due to its new parent/descendant
relationship (see the body-folding walker `_walk_for_body` at lines 1568–1597), causing
unrelated signature/trailing text to appear inside the annex's consolidated L2 body.
- The JSONL writer in `main()` (lines 2013–2030) then emits that signature/trailer text
as part of the annex content for all affected documents, altering both depth and scope
semantics of the last subdoc.
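One way to act on this suggestion is to treat the open-ended terminal span more conservatively than the bounded ones. The sketch below is an assumption-laden illustration: `build_spans`, `should_reparent`, and the `ORPHAN_CLASSES` set are hypothetical, and the rule "in the terminal span, only root-level rows of an expected class are moved" is one possible reading of "orphan patterns only", not the fix adopted in the PR.

```python
from typing import Any

Row = dict[str, Any]
Span = tuple[int, int, bool]  # (start, end, is_terminal)

# ASSUMPTION: classes that doc2dict is expected to orphan out of a subdoc.
ORPHAN_CLASSES = {"article", "section"}


def build_spans(subdoc_ids: list[int], max_nid: int) -> list[Span]:
    """Spans between consecutive subdoc headers; the last one is open-ended."""
    spans: list[Span] = []
    for i, start in enumerate(subdoc_ids):
        is_terminal = i + 1 == len(subdoc_ids)
        end = max_nid + 1 if is_terminal else subdoc_ids[i + 1]
        spans.append((start, end, is_terminal))
    return spans


def should_reparent(row: Row, span: Span) -> bool:
    """Decide whether a row may be re-parented to the subdoc owning this span."""
    start, end, is_terminal = span
    nid = row["node_id"]
    if not (start < nid < end):
        return False
    pid = row.get("parent_node_id")
    if pid is not None and start <= pid < end:
        return False  # parent is already inside the span
    if is_terminal:
        # The terminal span has no real upper bound, so require a clear
        # orphan pattern instead of trusting the numeric range alone.
        return pid is None and (row.get("cls") or "") in ORPHAN_CLASSES
    return True


spans = build_spans([100, 200], max_nid=300)
trailer = {"node_id": 260, "parent_node_id": None, "cls": "predicted header"}
orphan_article = {"node_id": 210, "parent_node_id": None, "cls": "article"}
mid_orphan = {"node_id": 150, "parent_node_id": 40, "cls": "predicted header"}
```

With this guard the trailing "IN WITNESS WHEREOF"-style record from the reproduction stays where it is, while an orphaned article inside the last annex is still recovered. The trade-off is that sibling-style mis-parents in the terminal span are no longer repaired.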
Triage agent — PR #75 comment review (read-only pass, no code changes) 3 inline comments reviewed:
WILL-DEFER items (3): Triage only — no code changes made this round. |
User description
Summary
Third stacked PR in the corpus rebuild. Adds idx=2 (INVESTMENT AGREEMENT between Pandora Media, Inc. and KKR Classic Investors LLC + other purchasers, dated May 8, 2017) as the third verified frozen baseline on top of idx=1 (PR #74).
Stacked on
redo/idx-1. PR #74 must merge before this one. The parser code changes here also live on the precedent, but the diff vs redo/idx-1 is scripts/parse_doc2dict_with_config.py + data/auto_parse/level_freeze/frozen/idx_2.jsonl + state.json.
Parser changes for idx=2
Four shape-based changes (one extended function, three new functions); all preserve idx=0 and idx=1 byte-identical:
1. _fix_post_signature_envelope extended with an ancestor-based check: if a subdoc-class node's parent chain contains any section/article ancestor, it is inside body content and cannot be the SEC envelope. Prevents ANNEX I being misclassified.
2. _reparent_orphan_top_level_into_enclosing_subdoc (new): walks the real-subdoc node_id sequence and re-parents records whose node_id falls between two consecutive subdoc-header node_ids but whose parent is outside that span. Recovers ANNEX II body fragments orphaned by doc2dict.
3. _drop_toc_chrome_records (new): marks "TABLE OF CONTENTS" titles and bare-roman-numeral page markers (whose bodies match the TOC-density pattern: 3+ "Section N.M"/"ARTICLE X" matches, >75% short lines) as envelope. Preserves the real preamble body when it is buried in the last page marker.
4. _rescue_post_title_cover_block (new): mirrors _rescue_cover_preamble_block for the title-first cover layout. Consecutive empty-body predicted-header lines after L0, terminating at the first body record / TOC / structural header / subdoc class, collapse into one synthetic L1 record.
Verified output for idx=2
Top-level structure
Known structural-debt items (acceptable; not blocking — inspector concurs)
Test plan
- uv run scripts/parse_doc2dict_with_config.py --limit 3 --no-truncate --output-dir data/auto_parse exits 0 with ok 3
- uv run scripts/level_loop/freeze.py 2 --force reports word_coverage ≥ 90%
- uv run scripts/level_loop/regress.py reports all 3 frozen idxs OK
Source
http://www.sec.gov/Archives/edgar/data/1230276/000123027617000041/a101classicinvestmentagree.htm
🤖 Generated with Claude Code
CodeAnt-AI Description
Fix document parsing for attached sections, table of contents pages, and cover-page blocks
What Changed
Impact
- ✅ Fewer missing annex sections
- ✅ Cleaner agreement outlines
- ✅ Fewer broken cover-page records