idx=2: freeze (422 records) — Pandora Media + KKR Investment Agreement by arthrod · Pull Request #75 · arthrod/clause-extract

arthrod · 2026-05-17T09:33:35Z

User description

Summary

Third stacked PR in the corpus rebuild. Adds idx=2 (INVESTMENT AGREEMENT between Pandora Media, Inc. and KKR Classic Investors LLC + other purchasers, dated May 8, 2017) as the third verified frozen baseline on top of idx=1 (PR #74).

Stacked on redo/idx-1; PR #74 must merge before this one. The parser code touched here also exists on the predecessor branch, but the diff against redo/idx-1 is confined to three files: scripts/parse_doc2dict_with_config.py, data/auto_parse/level_freeze/frozen/idx_2.jsonl, and data/auto_parse/level_freeze/state.json.

Parser changes for idx=2

Four shape-based parser changes (one extended function, three new ones); all keep idx=0 and idx=1 byte-identical:

_fix_post_signature_envelope extended with ancestor-based check — if a subdoc-class node's parent chain contains any section/article ancestor, it's inside body content and cannot be the SEC envelope. Prevents ANNEX I being misclassified.
_reparent_orphan_top_level_into_enclosing_subdoc (new) — walks the real-subdoc node_id sequence; re-parents records whose node_id falls between two consecutive subdoc-header node_ids but whose parent is outside that span. Recovers ANNEX II body fragments orphaned by doc2dict.
_drop_toc_chrome_records (new) — marks "TABLE OF CONTENTS" titles + bare-roman-numeral page markers (whose bodies match TOC-density pattern: 3+ "Section N.M"/"ARTICLE X" matches, >75% short lines) as envelope. Preserves real preamble body when buried in the last page marker.
_rescue_post_title_cover_block (new) — mirrors _rescue_cover_preamble_block for the title-first cover layout. Consecutive empty-body predicted-header lines after L0, terminating at the first body record / TOC / structural header / subdoc class, collapse into one synthetic L1 record.
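
The TOC-density test described for _drop_toc_chrome_records can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function name looks_like_toc_body, the 60-character cutoff for a "short" line, the article pattern, and the sample text are all assumptions. Only the "3+ matches" and ">75% short lines" thresholds come from the description above.

```python
import re

# Patterns modeled on the PR description; the parser's own regexes may differ.
SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.IGNORECASE | re.MULTILINE)

# Assumption: the source does not say how long a "short" line may be.
SHORT_LINE_MAX_CHARS = 60


def looks_like_toc_body(body: str, min_matches: int = 3,
                        min_short_ratio: float = 0.75) -> bool:
    """Return True when a page-marker body reads like TOC continuation text."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    if not lines:
        return False
    # Count "Section N.M" and "ARTICLE X" style entries.
    matches = len(SECTION_RE.findall(body)) + len(ARTICLE_RE.findall(body))
    # Count lines short enough to be TOC entries rather than prose.
    short = sum(1 for ln in lines if len(ln) <= SHORT_LINE_MAX_CHARS)
    return matches >= min_matches and short / len(lines) > min_short_ratio


# Made-up sample text, for illustration only.
toc_page = (
    "ARTICLE I\n"
    "Definitions\n"
    "Section 1.01 Definitions 1\n"
    "Section 1.02 General Interpretive Principles 12\n"
    "ARTICLE II\n"
    "Section 2.01 Purchase and Sale 13"
)
preamble = (
    "INVESTMENT AGREEMENT, dated as of May 8, 2017, by and among the Company "
    "and the purchasers named herein, with terms defined in Section 1.01."
)

is_toc = looks_like_toc_body(toc_page)
is_preamble_toc = looks_like_toc_body(preamble)
```

Requiring both conditions is what lets a real preamble survive: a prose paragraph may mention a section number once, but it rarely has three or more such hits on mostly short lines.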

Verified output for idx=2

422 records across 6 depths (L0=1, L1=23, L2=192, L3=148, L4=48, L5=10)
Reconstruction: word_coverage 91.3%, char_ratio 93.7% (above 90% blocking gate)
Max depth: 5 (well below 7 ceiling)

Top-level structure

o=0   L0 : INVESTMENT AGREEMENT
o=1   L1 : by and among / PANDORA MEDIA, INC., / KKR CLASSIC INVESTORS LLC / and / THE OTHER PURCHASERS HERETO / Dated as of May 8, 2017
o=2   L1 : INVESTMENT AGREEMENT, dated as of May 8, 2017 ... (real preamble, rescued from TOC's last page marker body)
o=3   L1 : Article I Definitions ...
o=11  L1 : ARTICLE II Purchase and Sale ...
o=21  L1 : ARTICLE III Representations and Warranties of the Company ...
o=54  L1 : ARTICLE IV Representations and Warranties of the Purchasers ...
o=64  L1 : ARTICLE V Additional Agreements ...
o=140 L1 : ARTICLE VI Conditions to Closing ...
o=168 L1 : ARTICLE VII Termination; Survival ...
o=177 L1 : ARTICLE VIII Miscellaneous ...
o=192 L1 : [Remainder of page intentionally left blank] IN WITNESS WHEREOF ...
o=193 L1 : ANNEX I CERTIFICATE OF DESIGNATIONS              (161-record subtree)
o=354 L1 : ANNEX II REGISTRATION RIGHTS AGREEMENT           (58-record subtree)
o=412 L1 : EXHIBIT A DEFINED TERMS                          (7-record subtree)
o=419 L1 : ANNEX III ANNOUNCEMENT                           (3-record subtree)
o=420 L2 : PANDORA MEDIA, INC. / By: /s/ Naveen Chopra / ... (main-agreement sig block, positionally at end)
o=421 L2 : KKR CLASSIC INVESTORS LLC / By: /s/ Nicole Macarchuk / ... (sig block)
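
The o=1 record above is the collapsed cover block. A rough sketch of how such a collapse could work is shown here; it is not the PR's _rescue_post_title_cover_block. The function name collapse_cover_block, the "cover" class label, and the " / " separator are assumptions, while the field names (cls, depth, title, body_direct, is_envelope) follow the ones quoted in the review comments.

```python
from typing import Any


def collapse_cover_block(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge the run of empty-body header lines that follows the L0 title."""
    if not records or records[0].get("depth") != 0:
        return records
    run: list[dict[str, Any]] = []
    for rec in records[1:]:
        is_bare_header = (
            rec.get("cls") == "predicted header" and not rec.get("body_direct")
        )
        if not is_bare_header:
            break  # first body record or structural header ends the cover block
        run.append(rec)
    if len(run) < 2:
        return records  # nothing worth collapsing
    synthetic = {
        "cls": "cover",  # hypothetical class label
        "depth": 1,
        "title": " / ".join(rec["title"] for rec in run),
        "body_direct": "",
        "is_envelope": False,
    }
    # Keep the originals, marked as envelope, so they remain available for audit.
    audited = [{**rec, "is_envelope": True} for rec in run]
    return [records[0], synthetic, *audited, *records[1 + len(run):]]


def _header(title: str) -> dict[str, Any]:
    return {"cls": "predicted header", "depth": 1, "title": title, "body_direct": ""}


sample = [
    {"cls": "title", "depth": 0, "title": "INVESTMENT AGREEMENT", "body_direct": ""},
    _header("by and among"),
    _header("PANDORA MEDIA, INC.,"),
    _header("Dated as of May 8, 2017"),
    {"cls": "article", "depth": 1, "title": "Article I Definitions", "body_direct": ""},
]
result = collapse_cover_block(sample)
emitted_titles = [rec["title"] for rec in result if not rec.get("is_envelope")]
```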

Known structural-debt items (acceptable; not blocking — inspector concurs)

ANNEX III body content not in doc2dict output (~155 unique words missing) — doc2dict HTML-level loss, not parser bug
3 orphan "provided, however..." paragraphs emit at L1 (o=95, o=174, o=415) instead of nesting under their parent section
Notice address blocks (Sidley Austin, KKR Credit Advisors) at L1 instead of inside Section 8.10 Notices
IWW operating clause carries "[Remainder of page intentionally left blank]" prefix in its span
Main-agreement signature blocks positionally land at end of file (after ANNEX III) instead of immediately after IWW — preserves doc2dict's natural ordering per the rubric

Test plan

uv run scripts/parse_doc2dict_with_config.py --limit 3 --no-truncate --output-dir data/auto_parse exits 0 with ok 3
uv run scripts/level_loop/freeze.py 2 --force reports word_coverage ≥ 90%
uv run scripts/level_loop/regress.py reports all 3 frozen idxs OK
Manual visual verification of all 4 fixes by independent inspector agent (PASS verdict)
All 4 subdocs (ANNEX I/II/III, EXHIBIT A) confirmed at L1
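
The structural invariants behind this test plan (contiguous order, a single L0, bounded depth) can also be checked in a few lines of Python. The sketch below is illustrative and not part of the repository's scripts: check_frozen is a hypothetical helper, and it assumes each JSONL line carries the keys idx, order, level, and span, as noted in the review comments.

```python
import json
from collections import Counter


def check_frozen(lines: list[str], max_depth: int = 7) -> Counter:
    """Validate frozen JSONL lines and return the level histogram."""
    records = [json.loads(ln) for ln in lines if ln.strip()]
    orders = [rec["order"] for rec in records]
    if orders != list(range(len(records))):
        raise ValueError("order is not contiguous from 0")
    levels = Counter(rec["level"] for rec in records)
    if levels[0] != 1:
        raise ValueError(f"expected exactly one L0 record, found {levels[0]}")
    if max(levels) > max_depth:
        raise ValueError(f"max depth {max(levels)} exceeds ceiling {max_depth}")
    return levels


# Tiny synthetic sample; real input would be the lines of idx_2.jsonl.
sample = [
    json.dumps({"idx": 2, "order": i, "level": lvl, "span": f"record {i}"})
    for i, lvl in enumerate([0, 1, 2, 2, 1])
]
histogram = check_frozen(sample)

# The histogram reported in this PR should add up to the record count.
reported_idx2 = {0: 1, 1: 23, 2: 192, 3: 148, 4: 48, 5: 10}
total_idx2 = sum(reported_idx2.values())
```

Against the real file, the same helper could be fed the lines of data/auto_parse/level_freeze/frozen/idx_2.jsonl.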

Source

http://www.sec.gov/Archives/edgar/data/1230276/000123027617000041/a101classicinvestmentagree.htm

🤖 Generated with Claude Code

CodeAnt-AI Description

Fix document parsing for attached sections, table of contents pages, and cover-page blocks

What Changed

Attached sections that were being treated as the main agreement header are now kept inside the correct annex or exhibit
Table of contents headers and page-number-only lines are removed when they are just layout chrome, while real preamble text on those pages is kept
Cover-page party/date lines that appear after the title are now combined into one normal section instead of several tiny fragments
The Pandora Media / KKR Investment Agreement baseline for idx=2 is now frozen with the corrected output

Impact

✅ Fewer missing annex sections
✅ Cleaner agreement outlines
✅ Fewer broken cover-page records

… post-title cover-block fixes

INVESTMENT AGREEMENT (Pandora Media / KKR Classic Investors), idx=2:
- word_coverage=91.3% char_ratio=93.7% (>= 90% bar)
- 1 L0, 23 L1, 192 L2, 148 L3, 48 L4, 10 L5 (max depth 5)

Four parser fixes, all purely structural (no phrase blocklists):

1. _fix_post_signature_envelope: extend the "envelope is really a real subdoc" detection to ALSO fire when the envelope row has a section/article ANCESTOR (Case B), not just sibling. doc2dict mis-parented ANNEX I CERTIFICATE OF DESIGNATIONS under ARTICLE VIII Miscellaneous, so its only siblings were promoted text leaves; the sibling-only check left ANNEX I marked as the SEC envelope and dropped its entire 35% of source content. The ancestor check restores ANNEX I as a real subdoc.
2. _reparent_orphan_top_level_into_enclosing_subdoc: new pass that re-parents mis-parented descendants of real subdocs by source position. doc2dict's HTML walker flattens ANNEX-II/ARTICLE-I/Section-1.1 chains to parent_node_id=None or to a mid-tree body sibling. Any node whose node_id falls strictly between two consecutive real-subdoc headers AND whose parent is outside that span gets re-parented to the enclosing subdoc.
3. _drop_toc_chrome_records: TABLE OF CONTENTS header + bare roman-numeral page-marker records (i, ii, iii) whose bodies are TOC continuation entries are dropped as layout chrome. The LAST page marker's body sometimes carries the real preamble; when so, the title is cleared (page-number prefix) but the body is preserved so it emits as the L1 preamble record.
4. _rescue_post_title_cover_block: mirrors the duplicate-title _rescue_cover_preamble_block but for filings whose L0 title is followed (rather than preceded) by a "by and among / parties / dated" cover block. Consecutive empty-body predicted-header lines after the title, terminating at the first body record or structural header, collapse into ONE synthetic L1 cover-preamble record. The originals are marked envelope so the parquet keeps them for audit.

idx=0 (75 records) and idx=1 (532 records) byte-identical under regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
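
The "Case B" ancestor check in fix 1 amounts to walking a row's parent chain and stopping at the first section or article. The sketch below shows one way to write that walk; it is an illustration under assumptions, not the helper in the parser. The name has_body_ancestor, the BODY_CLASSES set, and the sample rows are hypothetical, and the cycle guard is an addition of this sketch.

```python
from typing import Any

# Assumption: class labels that mark body structure in the flattened rows.
BODY_CLASSES = {"section", "article"}


def has_body_ancestor(row: dict[str, Any],
                      by_node_id: dict[int, dict[str, Any]]) -> bool:
    """Return True if any ancestor of `row` is a section or article."""
    seen: set[int] = set()
    pid = row.get("parent_node_id")
    while pid is not None and pid in by_node_id and pid not in seen:
        seen.add(pid)  # guard against accidental parent cycles
        parent = by_node_id[pid]
        if parent.get("cls") in BODY_CLASSES:
            return True
        pid = parent.get("parent_node_id")
    return False


# Synthetic rows: an annex mis-parented under an article, and a root-level envelope.
rows = [
    {"node_id": 1, "parent_node_id": None, "cls": "annex", "is_envelope": True},
    {"node_id": 10, "parent_node_id": None, "cls": "article"},
    {"node_id": 11, "parent_node_id": 10, "cls": "text"},
    {"node_id": 12, "parent_node_id": 10, "cls": "annex", "is_envelope": True},
]
by_id = {r["node_id"]: r for r in rows}

annex_inside_body = has_body_ancestor(by_id[12], by_id)
root_envelope_inside_body = has_body_ancestor(by_id[1], by_id)
```

A row that sits under body content cannot be the filing envelope, so under this rule only the root-level row would stay marked.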

codeant-ai · 2026-05-17T09:33:39Z

CodeAnt AI is reviewing your PR.

coderabbitai · 2026-05-17T09:33:41Z

Warning

Rate limit exceeded

@arthrod has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 39 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 151bda92-8ff3-45e8-bbb9-15dd247aa5ea

📥 Commits

Reviewing files that changed from the base of the PR and between e2a6054 and 9f9c160.

📒 Files selected for processing (3)

data/auto_parse/level_freeze/frozen/idx_2.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py

gemini-code-assist

Code Review

This pull request enhances the document parsing pipeline in scripts/parse_doc2dict_with_config.py by introducing ancestor-based envelope detection, re-parenting orphan nodes into their enclosing subdocs, and removing Table of Contents (TOC) layout chrome. It also adds a rescue mechanism for cover-page blocks that appear after the main title and updates the freeze history in state.json. Review feedback suggests refining the _TOC_BODY_ARTICLE_RE regex to be less restrictive, allowing it to match article markers followed by descriptive titles to improve TOC detection density.

gemini-code-assist · 2026-05-17T09:35:21Z

+    _TOC_TITLE_RE = re.compile(r"^\s*TABLE\s+OF\s+CONTENTS\b", re.IGNORECASE)
+    _PAGE_MARKER_TITLE_RE = re.compile(r"^\s*[ivxlcdm]+\s*$", re.IGNORECASE)
+    _TOC_BODY_SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
+    _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)


The regex _TOC_BODY_ARTICLE_RE is very restrictive: it matches only lines that contain nothing but the article marker (e.g., "ARTICLE I"). In most tables of contents, the marker is followed by a descriptive title on the same line (e.g., "ARTICLE I DEFINITIONS"). To improve detection density, consider allowing trailing text after the marker.

Suggested change

-    _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
+    _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.MULTILINE | re.IGNORECASE)
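
To see the difference between the two patterns, they can be run side by side. The snippet below is a small sketch using made-up text: the TOC fragment and the body sentence are illustrative, not taken from the filing. It also shows the cost of the looser form, which matches any line that merely begins with an article reference.

```python
import re

# The current pattern and the suggested one, copied from the review comment.
STRICT = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
LOOSE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.MULTILINE | re.IGNORECASE)

# Made-up TOC fragment mixing bare markers and markers with titles.
toc = (
    "ARTICLE I\n"
    "ARTICLE II PURCHASE AND SALE\n"
    "Section 2.01 Purchase and Sale\n"
    "ARTICLE III Representations"
)
# Made-up body sentence that happens to start with an article reference.
body_line = "Article V hereof shall survive the Closing."

strict_hits = STRICT.findall(toc)
loose_hits = LOOSE.findall(toc)
loose_matches_body = LOOSE.search(body_line) is not None
strict_matches_body = STRICT.search(body_line) is not None
```

The strict pattern finds only the bare "ARTICLE I" line, while the loose one finds all three article entries. The loose one also matches the body sentence, so a density threshold or a similar guard would still be needed to keep prose from counting as TOC content.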

arthrod · 2026-05-17T09:36:24Z

@coderabbitai rabbit, please review with extra care the files you missed

coderabbitai · 2026-05-17T09:39:50Z

🧩 Analysis chain

Let me pull the full list of changed files and their diffs to make sure I review everything carefully.

🏁 Script executed:

#!/bin/bash
# Get the list of files changed in PR `#75`
gh pr view 75 --repo arthrod/clause-extract --json files --jq '.files[].path'
echo "---"
# Get the PR diff
gh pr diff 75 --repo arthrod/clause-extract

Length of output: 263054

The diff was truncated. Let me fetch the remaining files.

🏁 Script executed:

#!/bin/bash
# Get state.json changes
gh pr diff 75 --repo arthrod/clause-extract -- data/auto_parse/level_freeze/state.json

echo "====PARSER SCRIPT DIFF START===="
# Get parser script diff
gh pr diff 75 --repo arthrod/clause-extract -- scripts/parse_doc2dict_with_config.py

Length of output: 107

🏁 Script executed:

#!/bin/bash
# Get the state.json file content from the PR branch
cat data/auto_parse/level_freeze/state.json

echo "====PARSER SCRIPT START===="
# Get the parser script - show the full file
cat scripts/parse_doc2dict_with_config.py

Length of output: 178456

@arthrod Thanks for the nudge — I pulled the full diff and reviewed the files I missed with extra care. Here’s a focused pass:

What I checked

data/auto_parse/level_freeze/frozen/idx_2.jsonl
- 422 records present; order is contiguous 0..421; single L0; max level 5.
- Top-level anchors present as expected: ANNEX I, ANNEX II, EXHIBIT A, ANNEX III; signature lines with “/s/” appear at the end as L2.
data/auto_parse/level_freeze/state.json
- Freeze history shows successive attempts, ending with two stable freezes for idx=2 at 422 records. Frozen list includes [0,1,2]. current_idx remains 0; if that field is only informational for the level_loop, fine — otherwise consider bumping or clarifying.
scripts/parse_doc2dict_with_config.py (deep skim + spot reading of the new/extended routines)
- All 4 parser changes listed in the PR description are present and implemented as structural rules (no phrase blocklists), and they are sequenced in main as described:
  1. _fix_post_signature_envelope extends the envelope fix to sibling- and ancestor-based detection — prevents ANNEX I being mistaken for the SEC envelope.
  2. _reparent_orphan_top_level_into_enclosing_subdoc reparents subdoc body orphans by node_id span — matches your ANNEX II orphan recovery note.
  3. _drop_toc_chrome_records drops “TABLE OF CONTENTS” and roman page markers when their bodies are TOC-like, but keeps a real preamble if found in the last page marker — aligns with the Pandora rescue.
  4. _rescue_post_title_cover_block collapses consecutive empty-body predicted-header lines after the L0 title into a single cover block — mirrors your cover-page rescue.
- Signature-page logic is structurally driven and places the IWW clause at L1 and per-party blocks at L2, then consolidates body-only fragments. Good fit with the rubric and the known “banner” and “By:/Name:/Title:” shapes.

Nitpicks and small fixes

Docstring mismatch: The script’s top-of-file JSONL schema still claims “only 3 keys per line” (idx, level, span), but the writer now emits idx, order, level, span. Suggest updating the docstring header and the “Outputs” section to list 4 keys and remove “ONLY 3” to avoid future confusion.
Minor doc inconsistency: In _consolidate_real_subdocs the comment says “This pass does TWO things” but lists three — trivial, but easy to tidy.
Robustness note: _split_inline_section_markers sets new parent_node_id to the original record’s parent and new depth to 1 + subdoc_penalty. That’s correct for top-level N. sections extracted out of mispacked table text. If you ever encounter N. segments genuinely meant to remain under a non-root ancestor, you might need to guard on the original record’s depth or title context; for idx=2 this looks fine.
State file clarity: state.json current_idx is 0 while frozen includes [0,1,2]. If tools don’t consume current_idx, no action. If they do, consider setting it to 2 or annotating its meaning in a comment to avoid reader confusion.

Verdict

The four new structural fixes are implemented as described and sequenced sensibly. The idx=2 freeze output (422 records) matches your summary: single L0, clean L1 outline with ANNEX/EXHIBIT placements, and signature lines as L2 at the end.
Only actionable item I’d ask to address in this PR is the JSONL schema docstring so the script docs match reality.

If you want, I can open a tiny follow-up to update the docstring and the “TWO things” comment — say the word and tell me whether you prefer GitHub issue or a quick PR.

🐇

🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

echo "== Count lines =="
wc -l data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Validate contiguous order starting at 0 =="
awk '
  {
    if (match($0, /"order": *([0-9]+)/, a)) {
      if (a[1] != NR-1) { 
        printf("mismatch at line %d: saw order=%d\n", NR, a[1]); 
        exit 1 
      }
    } else { 
      printf("no order field at line %d\n", NR); 
      exit 1 
    }
  }
  END { print "order_ok total_lines=" NR }
' data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Level distribution =="
jq -r '.level' data/auto_parse/level_freeze/frozen/idx_2.jsonl | sort -n | uniq -c

echo "== First and last 3 records =="
head -n 3 data/auto_parse/level_freeze/frozen/idx_2.jsonl
tail -n 3 data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Check L0 title contains INVESTMENT AGREEMENT =="
head -n1 data/auto_parse/level_freeze/frozen/idx_2.jsonl | jq -r '.span,.level'

echo "== Find key anchors =="
rg -n 'ANNEX I|ANNEX II|EXHIBIT A|ANNEX III' data/auto_parse/level_freeze/frozen/idx_2.jsonl || true
rg -n '/s/' data/auto_parse/level_freeze/frozen/idx_2.jsonl | tail -n 4 || true

echo "== Max level present =="
jq -r '.level' data/auto_parse/level_freeze/frozen/idx_2.jsonl | sort -nr | head -n1

Length of output: 2923

@arthrod Thanks — I reran the checks and here’s the tight verdict:

What I verified

idx_2.jsonl
- 422 lines; order is contiguous 0..421; exactly one L0; max level 5.
- Level histogram matches your summary: L0=1, L1=23, L2=192, L3=148, L4=48, L5=10.
- Anchors present: ANNEX I/II/III and EXHIBIT A at L1; two sig blocks at the end as L2 with “/s/”.
state.json
- Freeze history shows multiple attempts culminating in stable idx=2 at 422. frozen = [0,1,2]. Looks good.
scripts/parse_doc2dict_with_config.py
- The four structural fixes you listed are implemented and sequenced correctly in main; behavior lines up with the Pandora outcome.

Small, actionable nits

JSONL schema docstring: it still says “ONLY 3 keys per line” but the writer now emits 4 (idx, order, level, span). Update that to avoid confusion.
Truncation numbers in the top docstring mention 4000/3000/1000, but code uses 1000/800/200. Align the docs with the constants.
Minor comment typo: “This pass does TWO things” lists three items in _consolidate_real_subdocs.
Optional: if current_idx in state.json is meant to track latest freeze, consider bumping to 2 or clarifying its meaning in a comment.

Otherwise, looks solid. Want me to open a tiny follow-up PR to fix the docstring/comment mismatches, or would you prefer a GH issue for tracking?

codeant-ai · 2026-05-17T09:42:47Z

+        env_cls = (env_row.get("cls") or "")
+        if env_cls not in _SUBDOC_CLASSES:
+            continue


Suggestion: The new class gate skips envelope fixes for rows where is_envelope=True but cls is not in subdoc classes (for example introduction rows detected from EXHIBIT ... body text). Those rows no longer get unmarked even when they are clearly post-body attachments, so valid agreement content can be silently dropped from JSONL. Apply the sibling/ancestor envelope correction based on structure, not only on _SUBDOC_CLASSES. [incorrect condition logic]

Severity Level: Critical 🚨

- ❌ Attached annex/exhibit text dropped from JSONL output.
- ⚠️ Downstream clause-extraction misses post-signature attachments.
- ⚠️ Corpus freeze comparisons can hide lost attachment content.

Steps of Reproduction ✅

1. Run `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), which iterates the corpus and calls `parse_one(idx, raw)` for each row at line 3995.
2. Inside `parse_one()` (lines 1771–1916), after `walk_sections()` flattens the doc2dict tree, `sections = _fix_post_signature_envelope(sections)` is executed at line 1808, so every document's section list flows through `_fix_post_signature_envelope()` (lines 971–1070).
3. Construct a representative `rows` list for `_fix_post_signature_envelope()` where:
   - There is a body section node `S` with `cls="section"`, `node_id=10`.
   - A later node `E` has `node_id=20`, `parent_node_id=10`, `cls="introduction"`, `is_envelope=True`, and a non-empty `body_direct` carrying the first line of an attached ANNEX/EXHIBIT that doc2dict mis-parented under `S` (consistent with the mis-parenting pattern described in the `_fix_post_signature_envelope` docstring at lines 974–989).
   - Other sibling/ancestor structure is such that `_has_body_ancestor(E)` (lines 1002–1014) would return True (E is inside a `section`/`article` ancestor), so structurally E is post-body attachment, not the SEC envelope.
4. Call `_fix_post_signature_envelope(rows)` (line 971) with this synthetic `rows`:
   - `E` is included in `envelope_rows` (line 991) because `is_envelope` is True.
   - In the loop at lines 1018–1024, `env_cls = (env_row.get("cls") or "")` yields `"introduction"` and the guard `if env_cls not in _SUBDOC_CLASSES: continue` (lines 1022–1024) fires, so the sibling/ancestor logic below never runs for `E`.
   - As a result, `E["is_envelope"]` remains True and `fixed` is never set because only subdoc-class envelopes are considered.
   - Downstream, the JSONL writer in `main()` (lines 2013–2029) skips any section with `is_envelope=True` (lines 2016–2017), so E's ANNEX/EXHIBIT body is silently dropped from JSONL even though it is clearly post-body attachment content by structure.
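
The scenario in these steps can be reduced to a toy model. The code below is a deliberately simplified stand-in, not the parser's _fix_post_signature_envelope: unmark_envelopes, make_rows, and the contents of SUBDOC_CLASSES are assumptions made for illustration. It only shows how a class gate placed before the structural check changes the outcome for the row E; it does not establish that such rows occur in the corpus.

```python
from typing import Any

# Assumption: stand-ins for the parser's class sets.
SUBDOC_CLASSES = {"annex", "exhibit", "schedule", "appendix"}
BODY_CLASSES = {"section", "article"}


def _inside_body(row: dict[str, Any], by_id: dict[int, dict[str, Any]]) -> bool:
    """Walk the parent chain looking for a section or article ancestor."""
    seen: set[int] = set()
    pid = row.get("parent_node_id")
    while pid in by_id and pid not in seen:
        seen.add(pid)
        if by_id[pid].get("cls") in BODY_CLASSES:
            return True
        pid = by_id[pid].get("parent_node_id")
    return False


def unmark_envelopes(rows: list[dict[str, Any]],
                     require_subdoc_class: bool) -> list[int]:
    """Clear is_envelope on rows that sit inside body content; return their ids."""
    by_id = {r["node_id"]: r for r in rows}
    fixed: list[int] = []
    for row in rows:
        if not row.get("is_envelope"):
            continue
        if require_subdoc_class and (row.get("cls") or "") not in SUBDOC_CLASSES:
            continue  # the class gate under discussion
        if _inside_body(row, by_id):
            row["is_envelope"] = False
            fixed.append(row["node_id"])
    return fixed


def make_rows() -> list[dict[str, Any]]:
    """Rows S and E from the reproduction steps."""
    return [
        {"node_id": 10, "parent_node_id": None, "cls": "section"},
        {"node_id": 20, "parent_node_id": 10, "cls": "introduction",
         "is_envelope": True, "body_direct": "EXHIBIT ..."},
    ]


fixed_with_gate = unmark_envelopes(make_rows(), require_subdoc_class=True)
fixed_without_gate = unmark_envelopes(make_rows(), require_subdoc_class=False)
```

With the gate on, E stays marked as envelope; with a purely structural check, it is unmarked.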

codeant-ai · 2026-05-17T09:42:47Z

+    for i, sub in enumerate(real_subdoc_rows):
+        start = sub["node_id"]
+        if i + 1 < len(real_subdoc_rows):
+            end = real_subdoc_rows[i + 1]["node_id"]
+        else:
+            end = max_nid + 1
+        subdoc_spans.append((start, end, sub))
+
+    def _is_in_span(nid: int, start: int, end: int) -> bool:
+        return start < nid < end
+
+    # For each enclosing subdoc, compute the node_id span [start, end).
+    # A record at node_id N is "inside subdoc S" iff S.start <= N < S.end.
+    enclosing_span_of: dict[int, tuple[int, int]] = {}
+    for start, end, sub in subdoc_spans:
+        enclosing_span_of[sub["node_id"]] = (start, end)
+
+    for r in rows:
+        nid = r["node_id"]
+        # Skip the real subdoc headers themselves (they are boundary
+        # markers, never re-parented).
+        if nid in subdoc_id_set:
+            continue
+        # Skip records before the first real subdoc.
+        if nid <= subdoc_node_ids[0]:
+            continue
+        # Skip L0 title and envelope nodes.
+        if r.get("is_envelope") and r.get("depth", 1) == 0:
+            continue
+        if r.get("depth") == 0 and not r.get("is_envelope"):
+            continue
+        # Find the enclosing subdoc span by node_id.
+        enclosing: dict[str, Any] | None = None
+        for start, end, sub in subdoc_spans:
+            if _is_in_span(nid, start, end):
+                enclosing = sub
+                break
+        if enclosing is None:
+            continue
+        enc_start, enc_end = enclosing_span_of[enclosing["node_id"]]
+        # Check current parent: if already inside (or IS) the enclosing
+        # subdoc by node-id span, leave it alone.
+        cur_pid = r.get("parent_node_id")
+        if cur_pid is not None and cur_pid in by_node_id:
+            if cur_pid == enclosing["node_id"]:
+                continue
+            if enc_start <= cur_pid < enc_end:
+                continue
+        # Re-parent into the enclosing subdoc.
+        r["parent_node_id"] = enclosing["node_id"]


Suggestion: The last subdoc span is defined through max_nid + 1, so every later node is considered inside that subdoc window and can be reparented to it, including unrelated trailing/signature records. This can incorrectly pull non-subdoc content into agreement scope and alter downstream depth/scope behavior. Restrict reparents to orphan patterns only (e.g., root/sibling misparents with expected classes) instead of all nodes in the terminal numeric range. [logic error]

Severity Level: Major ⚠️

- ⚠️ Signature/trailer records misattached under last subdocument.
- ⚠️ Annex bodies polluted with unrelated trailing agreement text.
- ⚠️ Scope-rule behavior diverges from intended trailer semantics.

Steps of Reproduction ✅

1. During `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), each document is parsed via `parse_one(idx, raw)` at line 3995. Inside `parse_one()`, after `walk_sections()`, `_reparent_orphan_top_level_into_enclosing_subdoc(sections)` is invoked at lines 1817–1822, so every document's node list flows through this reparenting logic.
2. Within `_reparent_orphan_top_level_into_enclosing_subdoc()` (lines 1073–1190), real subdocument headers are collected into `real_subdoc_rows` via `_is_real_subdoc_title()` (lines 1118–1125). Assume a realistic structure:
   - A real subdoc header `ANNEX I CERTIFICATE OF DESIGNATIONS` with `node_id = 200`, `cls = "annex"`, `is_envelope = False`, so it is included in `real_subdoc_rows`.
   - Several trailing signature or notices records after the annex that belong to the main agreement trailer, not to the annex, e.g. a record `T` with `node_id = 260`, `cls = "predicted header"`, `title = "IN WITNESS WHEREOF"` or similar, `parent_node_id = None`, `depth = 1`, `is_envelope = False`.
   - No further real subdoc headers after `ANNEX I`, so `ANNEX I` is the last real subdoc.
3. When `_reparent_orphan_top_level_into_enclosing_subdoc()` runs:
   - `max_nid` is computed as the maximum `node_id` in `rows`, so `max_nid >= 260` (line 1137).
   - For the last subdoc `ANNEX I`, the span `(start, end)` is constructed as `(200, max_nid + 1)` at lines 1139–1145, and stored in `subdoc_spans` and `enclosing_span_of`.
   - For the trailing record `T` with `nid = 260`, the loop at lines 1156–1188:
     * Skips none of the early guards (it is not the subdoc header itself, `nid > subdoc_node_ids[0]`, and `depth != 0`).
     * Finds `enclosing` as the last subdoc because `_is_in_span(nid, start, end)` (lines 1147–1148, 1172–1175) returns True for `200 < 260 < max_nid + 1`.
     * Looks at `cur_pid = T["parent_node_id"]` (line 1181), which is `None` or some pre-subdoc parent outside the `[enc_start, enc_end)` interval, so neither `cur_pid == enclosing["node_id"]` nor `enc_start <= cur_pid < enc_end` (lines 1183–1186) holds.
     * Falls through to `r["parent_node_id"] = enclosing["node_id"]` at line 1188, reparenting `T` under `ANNEX I` even though it is a signature/trailer record, not structurally part of the annex.
4. This reparenting affects downstream behavior in `parse_one()`:
   - `_apply_scope_rule()` (lines 535–656), called after reparenting at line 1823, now sees `T` as a descendant of a real subdoc (via `_is_descendant_of_subdoc`, lines 573–584), and thus forces `scope="agreement"` for `T` even if it would otherwise be trailer content.
   - Subsequent subdoc consolidation in `_consolidate_real_subdocs()` (lines 1290–1637) may fold `T`'s body into the annex body record due to its new parent/descendant relationship (see the body-folding walker `_walk_for_body` at lines 1568–1597), causing unrelated signature/trailing text to appear inside the annex's consolidated L2 body.
   - The JSONL writer in `main()` (lines 2013–2030) then emits that signature/trailer text as part of the annex content for all affected documents, altering both depth and scope semantics of the last subdoc.
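
The span arithmetic in these steps is easy to reproduce in isolation. The sketch below mirrors the span construction and the strict `start < nid < end` test from the quoted diff, but it is a standalone illustration: the names subdoc_spans and enclosing_subdoc and the node ids are assumptions that follow the reviewer's example, not values from a real document.

```python
from typing import Optional


def subdoc_spans(subdoc_ids: list[int], max_nid: int) -> list[tuple[int, int]]:
    """Build (start, end) spans; the last span runs to max_nid + 1, as in the diff."""
    spans: list[tuple[int, int]] = []
    for i, start in enumerate(subdoc_ids):
        end = subdoc_ids[i + 1] if i + 1 < len(subdoc_ids) else max_nid + 1
        spans.append((start, end))
    return spans


def enclosing_subdoc(nid: int, spans: list[tuple[int, int]]) -> Optional[int]:
    """Return the start id of the span that strictly contains nid, if any."""
    for start, end in spans:
        if start < nid < end:
            return start
    return None


# Reviewer's scenario: ANNEX I at node_id 200 is the last subdoc,
# and a trailer record T sits at node_id 260.
spans = subdoc_spans([200], max_nid=260)
trailer_owner = enclosing_subdoc(260, spans)
pre_annex_owner = enclosing_subdoc(150, spans)
```

Because the final span has no natural upper bound, every later node id falls inside it. Whether that matters depends on whether trailer records ever carry node ids above the last subdoc header, which is the open question the triage comment defers.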

codeant-ai · 2026-05-17T09:42:52Z

CodeAnt AI finished reviewing your PR.

arthrod · 2026-05-17T11:54:17Z

Triage agent — PR #75 comment review (read-only pass, no code changes)

3 inline comments reviewed:

gemini-code-assist @ parse_doc2dict...py:2796 — _TOC_BODY_ARTICLE_RE too restrictive (WILL-DEFER)
Valid observation: the regex only matches bare article markers without titles (e.g. "ARTICLE I" standalone), missing the more common "ARTICLE I — Definitions" form in TOC lines. However, changing this regex risks matching non-TOC content. Needs a fixture-driven test before landing. Targeted for a future polish PR after the 40-PR foundation batch completes.
codeant-ai @ parse_doc2dict...py:1024 — envelope fix skips non-subdoc is_envelope=True rows (WILL-DEFER)
Legitimate edge case: introduction rows tagged is_envelope=True with non-subdoc cls values may not get the envelope-unmark pass. Worth investigating once the full freeze sequence is stable. Deferred to polish PR.
codeant-ai @ parse_doc2dict...py:1188 — last subdoc span extends to max_nid + 1 (WILL-DEFER)
The unbounded trailing span is a real risk for pulling trailing/signature records into the final subdoc. Needs a concrete failing idx before fixing to avoid regressing current freezes. Deferred to polish PR.

WILL-DEFER items (3): _TOC_BODY_ARTICLE_RE expansion, envelope-unmark for non-subdoc is_envelope rows, subdoc span upper-bound tightening.

Triage only — no code changes made this round.

codeant-ai Bot added the size:XL This PR changes 500-999 lines, ignoring generated files label May 17, 2026

arthrod mentioned this pull request May 17, 2026

idx=3: freeze (102 records) — Savara Loan & Security Agreement, L0 sig-shape guard #76

Open

5 tasks

arthrod mentioned this pull request May 17, 2026

idx=12: freeze (412 records) — Triton Container Ninth Restated Agreement (widened IWW + ancestor up-walk) #85

Open

6 tasks
