
idx=2: freeze (422 records) — Pandora Media + KKR Investment Agreement#75

Open
arthrod wants to merge 1 commit into redo/idx-1 from redo/idx-2

Conversation

@arthrod
Owner

@arthrod arthrod commented May 17, 2026

User description

Summary

Third stacked PR in the corpus rebuild. Adds idx=2 (INVESTMENT AGREEMENT between Pandora Media, Inc. and KKR Classic Investors LLC + other purchasers, dated May 8, 2017) as the third verified frozen baseline on top of idx=1 (PR #74).

Stacked on redo/idx-1. PR #74 must merge before this one. The parser code changes here also live on the precedent — but the diff vs redo/idx-1 is scripts/parse_doc2dict_with_config.py + data/auto_parse/level_freeze/frozen/idx_2.jsonl + state.json.

Parser changes for idx=2

Four shape-based parser changes (one extended function, three new ones); all keep idx=0 and idx=1 byte-identical:

  1. _fix_post_signature_envelope extended with ancestor-based check — if a subdoc-class node's parent chain contains any section/article ancestor, it's inside body content and cannot be the SEC envelope. Prevents ANNEX I being misclassified.
  2. _reparent_orphan_top_level_into_enclosing_subdoc (new) — walks the real-subdoc node_id sequence; re-parents records whose node_id falls between two consecutive subdoc-header node_ids but whose parent is outside that span. Recovers ANNEX II body fragments orphaned by doc2dict.
  3. _drop_toc_chrome_records (new) — marks "TABLE OF CONTENTS" titles + bare-roman-numeral page markers (whose bodies match TOC-density pattern: 3+ "Section N.M"/"ARTICLE X" matches, >75% short lines) as envelope. Preserves real preamble body when buried in the last page marker.
  4. _rescue_post_title_cover_block (new) — mirrors _rescue_cover_preamble_block for the title-first cover layout. Consecutive empty-body predicted-header lines after L0, terminating at the first body record / TOC / structural header / subdoc class, collapse into one synthetic L1 record.
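
The TOC-density test described in item 3 can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the function name `looks_like_toc_body`, the 60-character cutoff for a "short" line, and the way marker hits are summed are all assumptions. Only the two thresholds (3+ marker matches, more than 75% short lines) come from the description above, and the two regexes are the ones quoted in the review of this PR. The sample strings are made up.

```python
import re

# Regexes as quoted in the review of this PR.
SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)

SHORT_LINE_MAX_CHARS = 60  # assumption: the PR does not define "short"


def looks_like_toc_body(body: str) -> bool:
    """Return True when a body reads like table-of-contents entries."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    if not lines:
        return False
    marker_hits = len(SECTION_RE.findall(body)) + len(ARTICLE_RE.findall(body))
    short_ratio = sum(len(ln) <= SHORT_LINE_MAX_CHARS for ln in lines) / len(lines)
    return marker_hits >= 3 and short_ratio > 0.75


# Made-up sample inputs for illustration only.
toc_body = (
    "ARTICLE I\n"
    "Section 1.01 Definitions 1\n"
    "Section 1.02 Interpretation 14\n"
    "ARTICLE II"
)
preamble = (
    "INVESTMENT AGREEMENT, dated as of May 8, 2017, by and among the Company "
    "and the Purchasers, each acting on the terms and conditions set out below."
)
print(looks_like_toc_body(toc_body))  # TOC-like body
print(looks_like_toc_body(preamble))  # long prose line, no markers
```

Requiring both conditions is what lets a real preamble survive: a long prose paragraph fails the short-line ratio even if it happens to cite a few sections.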

Verified output for idx=2

  • 422 records across 6 depths (L0=1, L1=23, L2=192, L3=148, L4=48, L5=10)
  • Reconstruction: word_coverage 91.3%, char_ratio 93.7% (above 90% blocking gate)
  • Max depth: 5 (well below 7 ceiling)
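
The depth counts above can be re-derived from the frozen JSONL with a few lines of Python. The sketch below is hedged: it assumes each line is a JSON object carrying an integer `level` key, and it treats the depth ceiling of 7 as exclusive, which the description does not spell out. The helper names `level_histogram` and `check_depth` are hypothetical, and the sample records are made up so the block runs without the real file.

```python
import json
from collections import Counter
from typing import Iterable


def level_histogram(jsonl_lines: Iterable[str]) -> Counter:
    """Count records per `level` value, skipping blank lines."""
    return Counter(json.loads(line)["level"] for line in jsonl_lines if line.strip())


def check_depth(hist: Counter, ceiling: int = 7) -> bool:
    """True when there is exactly one L0 and the max level is below the ceiling."""
    return hist.get(0, 0) == 1 and max(hist) < ceiling


# Tiny in-memory stand-in for idx_2.jsonl (made-up records).
sample = [
    '{"idx": 2, "order": 0, "level": 0, "span": "INVESTMENT AGREEMENT"}',
    '{"idx": 2, "order": 1, "level": 1, "span": "by and among ..."}',
    '{"idx": 2, "order": 2, "level": 1, "span": "INVESTMENT AGREEMENT, dated ..."}',
]
hist = level_histogram(sample)
print(dict(hist), check_depth(hist))

# The per-level counts reported above do add up to the 422 records.
reported = {0: 1, 1: 23, 2: 192, 3: 148, 4: 48, 5: 10}
print(sum(reported.values()))
```

Against the real file, the same function could be fed `open(path)` directly, since a file object iterates line by line.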

Top-level structure

o=0   L0 : INVESTMENT AGREEMENT
o=1   L1 : by and among / PANDORA MEDIA, INC., / KKR CLASSIC INVESTORS LLC / and / THE OTHER PURCHASERS HERETO / Dated as of May 8, 2017
o=2   L1 : INVESTMENT AGREEMENT, dated as of May 8, 2017 ... (real preamble, rescued from TOC's last page marker body)
o=3   L1 : Article I Definitions ...
o=11  L1 : ARTICLE II Purchase and Sale ...
o=21  L1 : ARTICLE III Representations and Warranties of the Company ...
o=54  L1 : ARTICLE IV Representations and Warranties of the Purchasers ...
o=64  L1 : ARTICLE V Additional Agreements ...
o=140 L1 : ARTICLE VI Conditions to Closing ...
o=168 L1 : ARTICLE VII Termination; Survival ...
o=177 L1 : ARTICLE VIII Miscellaneous ...
o=192 L1 : [Remainder of page intentionally left blank] IN WITNESS WHEREOF ...
o=193 L1 : ANNEX I CERTIFICATE OF DESIGNATIONS              (161-record subtree)
o=354 L1 : ANNEX II REGISTRATION RIGHTS AGREEMENT           (58-record subtree)
o=412 L1 : EXHIBIT A DEFINED TERMS                          (7-record subtree)
o=419 L1 : ANNEX III ANNOUNCEMENT                           (3-record subtree)
o=420 L2 : PANDORA MEDIA, INC. / By: /s/ Naveen Chopra / ... (main-agreement sig block, positionally at end)
o=421 L2 : KKR CLASSIC INVESTORS LLC / By: /s/ Nicole Macarchuk / ... (sig block)

Known structural-debt items (acceptable; not blocking — inspector concurs)

  1. ANNEX III body content not in doc2dict output (~155 unique words missing) — doc2dict HTML-level loss, not parser bug
  2. 3 orphan "provided, however..." paragraphs emit at L1 (o=95, o=174, o=415) instead of nesting under their parent section
  3. Notice address blocks (Sidley Austin, KKR Credit Advisors) at L1 instead of inside Section 8.10 Notices
  4. IWW operating clause carries "[Remainder of page intentionally left blank]" prefix in its span
  5. Main-agreement signature blocks positionally land at end of file (after ANNEX III) instead of immediately after IWW — preserves doc2dict's natural ordering per the rubric

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 3 --no-truncate --output-dir data/auto_parse exits 0 with ok 3
  • uv run scripts/level_loop/freeze.py 2 --force reports word_coverage ≥ 90%
  • uv run scripts/level_loop/regress.py reports all 3 frozen idxs OK
  • Manual visual verification of all 4 fixes by independent inspector agent (PASS verdict)
  • All 4 subdocs (ANNEX I/II/III, EXHIBIT A) confirmed at L1

Source

http://www.sec.gov/Archives/edgar/data/1230276/000123027617000041/a101classicinvestmentagree.htm

🤖 Generated with Claude Code


CodeAnt-AI Description

Fix document parsing for attached sections, table of contents pages, and cover-page blocks

What Changed

  • Attached sections that were being treated as the main agreement header are now kept inside the correct annex or exhibit
  • Table of contents headers and page-number-only lines are removed when they are just layout chrome, while real preamble text on those pages is kept
  • Cover-page party/date lines that appear after the title are now combined into one normal section instead of several tiny fragments
  • The Pandora Media / KKR Investment Agreement baseline for idx=2 is now frozen with the corrected output
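
The cover-page merge can be pictured with a small sketch. It is an illustration under stated assumptions rather than the code in this PR: the function name `collapse_post_title_cover`, the `STOP_CLASSES` set, the `"cover"` class label, and the simplified row shape are hypothetical. The field names `cls`, `depth`, `body_direct`, and `is_envelope` and the `"predicted header"` class appear in the review comments on this PR, and the ` / ` joiner matches the frozen output shown in the summary.

```python
from typing import Any

STOP_CLASSES = {"article", "section", "annex", "exhibit"}  # assumption


def collapse_post_title_cover(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge empty-body header lines that directly follow the L0 title."""
    if not rows or rows[0].get("depth") != 0:
        return rows
    run = []
    for row in rows[1:]:
        # Stop at the first body record or structural header.
        if row.get("body_direct") or row.get("cls") in STOP_CLASSES:
            break
        if row.get("cls") != "predicted header":
            break
        run.append(row)
    if len(run) < 2:
        return rows  # nothing worth merging
    for row in run:
        row["is_envelope"] = True  # originals stay available for audit
    synthetic = {
        "title": " / ".join(r["title"] for r in run),
        "body_direct": "",
        "cls": "cover",  # hypothetical label for the merged record
        "depth": 1,
        "is_envelope": False,
    }
    return [rows[0], synthetic, *rows[1:]]


def header(title: str) -> dict[str, Any]:
    return {"title": title, "body_direct": "", "cls": "predicted header", "depth": 1}


rows = [
    {"title": "INVESTMENT AGREEMENT", "body_direct": "", "cls": "title", "depth": 0},
    header("by and among"),
    header("PANDORA MEDIA, INC."),
    header("Dated as of May 8, 2017"),
    {"title": "ARTICLE I", "body_direct": "", "cls": "article", "depth": 1},
]
out = collapse_post_title_cover(rows)
print(out[1]["title"])
```

Note that the sketch mutates the original header rows in place when it marks them as envelope.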

Impact

✅ Fewer missing annex sections
✅ Cleaner agreement outlines
✅ Fewer broken cover-page records

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

… post-title cover-block fixes

INVESTMENT AGREEMENT (Pandora Media / KKR Classic Investors), idx=2:
- word_coverage=91.3% char_ratio=93.7% (>= 90% bar)
- 1 L0, 23 L1, 192 L2, 148 L3, 48 L4, 10 L5 (max depth 5)

Four parser fixes, all purely structural (no phrase blocklists):

1. _fix_post_signature_envelope: extend the "envelope is really a real
   subdoc" detection to ALSO fire when the envelope row has a
   section/article ANCESTOR (Case B), not just sibling. doc2dict
   mis-parented ANNEX I CERTIFICATE OF DESIGNATIONS under
   ARTICLE VIII Miscellaneous, so its only siblings were promoted
   text leaves; the sibling-only check left ANNEX I marked as the SEC
   envelope and dropped all of its content, roughly 35% of the source.
   The ancestor check restores ANNEX I as a real subdoc.

2. _reparent_orphan_top_level_into_enclosing_subdoc: new pass that
   re-parents mis-parented descendants of real subdocs by source
   position. doc2dict's HTML walker flattens
   ANNEX-II/ARTICLE-I/Section-1.1 chains to parent_node_id=None or
   to a mid-tree body sibling. Any node whose node_id falls strictly
   between two consecutive real-subdoc headers AND whose parent is
   outside that span gets re-parented to the enclosing subdoc.

3. _drop_toc_chrome_records: TABLE OF CONTENTS header + bare
   roman-numeral page-marker records (i, ii, iii) whose bodies are
   TOC continuation entries are dropped as layout chrome. The LAST
   page marker's body sometimes carries the real preamble — when so,
   the title is cleared (page-number prefix) but the body is preserved
   so it emits as the L1 preamble record.

4. _rescue_post_title_cover_block: mirrors the duplicate-title
   _rescue_cover_preamble_block but for filings whose L0 title is
   followed (rather than preceded) by a "by and among / parties /
   dated" cover block. Consecutive empty-body predicted-header lines
   after the title, terminating at the first body record or
   structural header, collapse into ONE synthetic L1 cover-preamble
   record. The originals are marked envelope so the parquet keeps
   them for audit.

idx=0 (75 records) and idx=1 (532 records) byte-identical under regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@blocksorg

blocksorg Bot commented May 17, 2026

Mention Blocks like a regular teammate with your question or request:

@blocks review this pull request
@blocks make the following changes ...
@blocks create an issue from what was mentioned in the following comment ...
@blocks explain the following code ...
@blocks are there any security or performance concerns?

Run @blocks /help for more information.


@sourcery-ai sourcery-ai Bot left a comment


Hi @arthrod! 👋

Your private repo does not have access to Sourcery.

Please upgrade to continue using Sourcery ✨

@qodo-code-review

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI is reviewing your PR.

@coderabbitai

coderabbitai Bot commented May 17, 2026

Warning

Rate limit exceeded

@arthrod has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 39 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 151bda92-8ff3-45e8-bbb9-15dd247aa5ea

📥 Commits

Reviewing files that changed from the base of the PR and between e2a6054 and 9f9c160.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_2.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py

Comment @coderabbitai help to get the list of available commands and usage tips.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances the document parsing pipeline in scripts/parse_doc2dict_with_config.py by introducing ancestor-based envelope detection, re-parenting orphan nodes into their enclosing subdocs, and removing Table of Contents (TOC) layout chrome. It also adds a rescue mechanism for cover-page blocks that appear after the main title and updates the freeze history in state.json. Review feedback suggests refining the _TOC_BODY_ARTICLE_RE regex to be less restrictive, allowing it to match article markers followed by descriptive titles to improve TOC detection density.

_TOC_TITLE_RE = re.compile(r"^\s*TABLE\s+OF\s+CONTENTS\b", re.IGNORECASE)
_PAGE_MARKER_TITLE_RE = re.compile(r"^\s*[ivxlcdm]+\s*$", re.IGNORECASE)
_TOC_BODY_SECTION_RE = re.compile(r"Section\s+\d+\.\d+", re.IGNORECASE)
_TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)

medium

The regex _TOC_BODY_ARTICLE_RE is very restrictive because it only matches lines that contain nothing but the article marker (e.g., "ARTICLE I"). In most Table of Contents, these markers are followed by descriptive titles on the same line (e.g., "ARTICLE I DEFINITIONS"). To improve detection density, consider allowing trailing text after the marker.

Suggested change
- _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", re.MULTILINE | re.IGNORECASE)
+ _TOC_BODY_ARTICLE_RE = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", re.MULTILINE | re.IGNORECASE)
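
To see what the suggested change does in practice, the two patterns can be compared side by side. This is a small sketch on made-up sample strings; the variable names `current` and `suggested` are only labels for the two regexes in the suggestion.

```python
import re

FLAGS = re.MULTILINE | re.IGNORECASE
current = re.compile(r"^ARTICLE\s+[IVXLCDM]+\s*$", FLAGS)
suggested = re.compile(r"^ARTICLE\s+[IVXLCDM]+\b", FLAGS)

# Made-up TOC fragment: one bare marker, one marker with a title.
toc_body = "ARTICLE I\nARTICLE II PURCHASE AND SALE\nSection 2.01 Purchase and Sale"
print(len(current.findall(toc_body)))    # only the bare "ARTICLE I" line
print(len(suggested.findall(toc_body)))  # both ARTICLE lines

# The looser pattern also fires on ordinary body text that opens a line
# with an article reference, so it needs a guard or a fixture-based test.
body_line = "Article V shall survive the Closing."
print(bool(current.search(body_line)), bool(suggested.search(body_line)))
```

The second half shows the trade-off: the word-boundary form raises TOC detection density but can also count non-TOC lines, so the density thresholds would need re-checking against the frozen baselines.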

@codeant-ai codeant-ai Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) May 17, 2026
@arthrod
Owner Author

arthrod commented May 17, 2026

@coderabbitai conejo review plz with extra care the files you missed

@coderabbitai

coderabbitai Bot commented May 17, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Let me pull the full list of changed files and their diffs to make sure I review everything carefully.


🏁 Script executed:

#!/bin/bash
# Get the list of files changed in PR `#75`
gh pr view 75 --repo arthrod/clause-extract --json files --jq '.files[].path'
echo "---"
# Get the PR diff
gh pr diff 75 --repo arthrod/clause-extract

Length of output: 263054


The diff was truncated. Let me fetch the remaining files.


🏁 Script executed:

#!/bin/bash
# Get state.json changes
gh pr diff 75 --repo arthrod/clause-extract -- data/auto_parse/level_freeze/state.json

echo "====PARSER SCRIPT DIFF START===="
# Get parser script diff
gh pr diff 75 --repo arthrod/clause-extract -- scripts/parse_doc2dict_with_config.py

Length of output: 107


🏁 Script executed:

#!/bin/bash
# Get the state.json file content from the PR branch
cat data/auto_parse/level_freeze/state.json

echo "====PARSER SCRIPT START===="
# Get the parser script - show the full file
cat scripts/parse_doc2dict_with_config.py

Length of output: 178456


@arthrod Thanks for the nudge — I pulled the full diff and reviewed the files I missed with extra care. Here’s a focused pass:

What I checked

  • data/auto_parse/level_freeze/frozen/idx_2.jsonl
    • 422 records present; order is contiguous 0..421; single L0; max level 5.
    • Top-level anchors present as expected: ANNEX I, ANNEX II, EXHIBIT A, ANNEX III; signature lines with “/s/” appear at the end as L2.
  • data/auto_parse/level_freeze/state.json
    • Freeze history shows successive attempts, ending with two stable freezes for idx=2 at 422 records. Frozen list includes [0,1,2]. current_idx remains 0; if that field is only informational for the level_loop, fine — otherwise consider bumping or clarifying.
  • scripts/parse_doc2dict_with_config.py (deep skim + spot reading of the new/extended routines)
    • All 4 parser changes listed in the PR description are present and implemented as structural rules (no phrase blocklists), and they are sequenced in main as described:
      1. _fix_post_signature_envelope extends the envelope fix to sibling- and ancestor-based detection — prevents ANNEX I being mistaken for the SEC envelope.
      2. _reparent_orphan_top_level_into_enclosing_subdoc reparents subdoc body orphans by node_id span — matches your ANNEX II orphan recovery note.
      3. _drop_toc_chrome_records drops “TABLE OF CONTENTS” and roman page markers when their bodies are TOC-like, but keeps a real preamble if found in the last page marker — aligns with the Pandora rescue.
      4. _rescue_post_title_cover_block collapses consecutive empty-body predicted-header lines after the L0 title into a single cover block — mirrors your cover-page rescue.
    • Signature-page logic is structurally driven and places the IWW clause at L1 and per-party blocks at L2, then consolidates body-only fragments. Good fit with the rubric and the known “banner” and “By:/Name:/Title:” shapes.

Nitpicks and small fixes

  • Docstring mismatch: The script’s top-of-file JSONL schema still claims “only 3 keys per line” (idx, level, span), but the writer now emits idx, order, level, span. Suggest updating the docstring header and the “Outputs” section to list 4 keys and remove “ONLY 3” to avoid future confusion.
  • Minor doc inconsistency: In _consolidate_real_subdocs the comment says “This pass does TWO things” but lists three — trivial, but easy to tidy.
  • Robustness note: _split_inline_section_markers sets new parent_node_id to the original record’s parent and new depth to 1 + subdoc_penalty. That’s correct for top-level N. sections extracted out of mispacked table text. If you ever encounter N. segments genuinely meant to remain under a non-root ancestor, you might need to guard on the original record’s depth or title context; for idx=2 this looks fine.
  • State file clarity: state.json current_idx is 0 while frozen includes [0,1,2]. If tools don’t consume current_idx, no action. If they do, consider setting it to 2 or annotating its meaning in a comment to avoid reader confusion.

Verdict

  • The four new structural fixes are implemented as described and sequenced sensibly. The idx=2 freeze output (422 records) matches your summary: single L0, clean L1 outline with ANNEX/EXHIBIT placements, and signature lines as L2 at the end.
  • Only actionable item I’d ask to address in this PR is the JSONL schema docstring so the script docs match reality.

If you want, I can open a tiny follow-up to update the docstring and the “TWO things” comment — say the word and tell me whether you prefer GitHub issue or a quick PR.

🐇


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

echo "== Count lines =="
wc -l data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Validate contiguous order starting at 0 =="
awk '
  {
    if (match($0, /"order": *([0-9]+)/, a)) {
      if (a[1] != NR-1) { 
        printf("mismatch at line %d: saw order=%d\n", NR, a[1]); 
        exit 1 
      }
    } else { 
      printf("no order field at line %d\n", NR); 
      exit 1 
    }
  }
  END { print "order_ok total_lines=" NR }
' data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Level distribution =="
jq -r '.level' data/auto_parse/level_freeze/frozen/idx_2.jsonl | sort -n | uniq -c

echo "== First and last 3 records =="
head -n 3 data/auto_parse/level_freeze/frozen/idx_2.jsonl
tail -n 3 data/auto_parse/level_freeze/frozen/idx_2.jsonl

echo "== Check L0 title contains INVESTMENT AGREEMENT =="
head -n1 data/auto_parse/level_freeze/frozen/idx_2.jsonl | jq -r '.span,.level'

echo "== Find key anchors =="
rg -n 'ANNEX I|ANNEX II|EXHIBIT A|ANNEX III' data/auto_parse/level_freeze/frozen/idx_2.jsonl || true
rg -n '/s/' data/auto_parse/level_freeze/frozen/idx_2.jsonl | tail -n 4 || true

echo "== Max level present =="
jq -r '.level' data/auto_parse/level_freeze/frozen/idx_2.jsonl | sort -nr | head -n1

Length of output: 2923


@arthrod Thanks — I reran the checks and here’s the tight verdict:

What I verified

  • idx_2.jsonl
    • 422 lines; order is contiguous 0..421; exactly one L0; max level 5.
    • Level histogram matches your summary: L0=1, L1=23, L2=192, L3=148, L4=48, L5=10.
    • Anchors present: ANNEX I/II/III and EXHIBIT A at L1; two sig blocks at the end as L2 with “/s/”.
  • state.json
    • Freeze history shows multiple attempts culminating in stable idx=2 at 422. frozen = [0,1,2]. Looks good.
  • scripts/parse_doc2dict_with_config.py
    • The four structural fixes you listed are implemented and sequenced correctly in main; behavior lines up with the Pandora outcome.

Small, actionable nits

  • JSONL schema docstring: it still says “ONLY 3 keys per line” but the writer now emits 4 (idx, order, level, span). Update that to avoid confusion.
  • Truncation numbers in the top docstring mention 4000/3000/1000, but code uses 1000/800/200. Align the docs with the constants.
  • Minor comment typo: “This pass does TWO things” lists three items in _consolidate_real_subdocs.
  • Optional: if current_idx in state.json is meant to track latest freeze, consider bumping to 2 or clarifying its meaning in a comment.

Otherwise, looks solid. Want me to open a tiny follow-up PR to fix the docstring/comment mismatches, or would you prefer a GH issue for tracking?

Comment on lines +1022 to +1024
env_cls = (env_row.get("cls") or "")
if env_cls not in _SUBDOC_CLASSES:
    continue

Suggestion: The new class gate skips envelope fixes for rows where is_envelope=True but cls is not in subdoc classes (for example introduction rows detected from EXHIBIT ... body text). Those rows no longer get unmarked even when they are clearly post-body attachments, so valid agreement content can be silently dropped from JSONL. Apply the sibling/ancestor envelope correction based on structure, not only on _SUBDOC_CLASSES. [incorrect condition logic]

Severity Level: Critical 🚨
- ❌ Attached annex/exhibit text dropped from JSONL output.
- ⚠️ Downstream clause-extraction misses post-signature attachments.
- ⚠️ Corpus freeze comparisons can hide lost attachment content.
Steps of Reproduction ✅
1. Run `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), which
iterates the corpus and calls `parse_one(idx, raw)` for each row at line 3995.

2. Inside `parse_one()` (lines 1771–1916), after `walk_sections()` flattens the doc2dict
tree, `sections = _fix_post_signature_envelope(sections)` is executed at line 1808, so
every document's section list flows through `_fix_post_signature_envelope()` (lines
971–1070).

3. Construct a representative `rows` list for `_fix_post_signature_envelope()` where:

   - There is a body section node `S` with `cls="section"`, `node_id=10`.

   - A later node `E` has `node_id=20`, `parent_node_id=10`, `cls="introduction"`,
   `is_envelope=True`, and a non-empty `body_direct` carrying the first line of an
   attached ANNEX/EXHIBIT that doc2dict mis-parented under `S` (consistent with the
   mis-parenting pattern described in the `_fix_post_signature_envelope` docstring at
   lines 974–989).

   - Other sibling/ancestor structure is such that `_has_body_ancestor(E)` (lines
   1002–1014) would return True (E is inside a `section`/`article` ancestor), so
   structurally E is post-body attachment, not the SEC envelope.

4. Call `_fix_post_signature_envelope(rows)` (line 971) with this synthetic `rows`:

   - `E` is included in `envelope_rows` (line 991) because `is_envelope` is True.

   - In the loop at lines 1018–1024, `env_cls = (env_row.get("cls") or "")` yields
   `"introduction"` and the guard `if env_cls not in _SUBDOC_CLASSES: continue` (lines
   1022–1024) fires, so the sibling/ancestor logic below never runs for `E`.

   - As a result, `E["is_envelope"]` remains True and `fixed` is never set because only
   subdoc-class envelopes are considered.

   - Downstream, the JSONL writer in `main()` (lines 2013–2029) skips any section with
   `is_envelope=True` (lines 2016–2017), so E's ANNEX/EXHIBIT body is silently dropped
   from JSONL even though it is clearly post-body attachment content by structure.
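
The reproduction steps above can be condensed into a runnable sketch. It is a simplified stand-in, not the function under review: `unmark_envelopes`, `has_body_ancestor`, the `require_subdoc_cls` switch, and the contents of `SUBDOC_CLASSES` are hypothetical, and the real pass also has a sibling-based branch that is omitted here. The two rows mirror nodes `S` and `E` from the steps.

```python
from typing import Any

SUBDOC_CLASSES = {"annex", "exhibit", "schedule", "appendix"}  # assumption
BODY_CLASSES = {"section", "article"}


def has_body_ancestor(row: dict[str, Any], by_id: dict[int, dict[str, Any]]) -> bool:
    """Walk parent_node_id links; True if any ancestor is a section/article."""
    seen: set[int] = set()
    pid = row.get("parent_node_id")
    while pid is not None and pid in by_id and pid not in seen:
        seen.add(pid)  # guards against accidental parent cycles
        parent = by_id[pid]
        if parent.get("cls") in BODY_CLASSES:
            return True
        pid = parent.get("parent_node_id")
    return False


def unmark_envelopes(rows: list[dict[str, Any]], require_subdoc_cls: bool) -> list[int]:
    """Clear is_envelope on rows nested in body content; return fixed node_ids."""
    by_id = {r["node_id"]: r for r in rows}
    fixed = []
    for row in rows:
        if not row.get("is_envelope"):
            continue
        if require_subdoc_cls and (row.get("cls") or "") not in SUBDOC_CLASSES:
            continue  # the class gate questioned in this comment
        if has_body_ancestor(row, by_id):
            row["is_envelope"] = False
            fixed.append(row["node_id"])
    return fixed


def make_rows() -> list[dict[str, Any]]:
    return [
        {"node_id": 10, "parent_node_id": None, "cls": "section", "is_envelope": False},
        {"node_id": 20, "parent_node_id": 10, "cls": "introduction", "is_envelope": True},
    ]


print(unmark_envelopes(make_rows(), require_subdoc_cls=True))   # gate on
print(unmark_envelopes(make_rows(), require_subdoc_cls=False))  # gate off
```

With the gate on, row 20 keeps `is_envelope=True`; with it off, the ancestor check clears the flag. Whether dropping the gate is safe for the frozen baselines is a separate question that this sketch does not answer.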



Comment on lines +1139 to +1188
for i, sub in enumerate(real_subdoc_rows):
    start = sub["node_id"]
    if i + 1 < len(real_subdoc_rows):
        end = real_subdoc_rows[i + 1]["node_id"]
    else:
        end = max_nid + 1
    subdoc_spans.append((start, end, sub))

def _is_in_span(nid: int, start: int, end: int) -> bool:
    return start < nid < end

# For each enclosing subdoc, compute the node_id span [start, end).
# A record at node_id N is "inside subdoc S" iff S.start <= N < S.end.
enclosing_span_of: dict[int, tuple[int, int]] = {}
for start, end, sub in subdoc_spans:
    enclosing_span_of[sub["node_id"]] = (start, end)

for r in rows:
    nid = r["node_id"]
    # Skip the real subdoc headers themselves (they are boundary
    # markers, never re-parented).
    if nid in subdoc_id_set:
        continue
    # Skip records before the first real subdoc.
    if nid <= subdoc_node_ids[0]:
        continue
    # Skip L0 title and envelope nodes.
    if r.get("is_envelope") and r.get("depth", 1) == 0:
        continue
    if r.get("depth") == 0 and not r.get("is_envelope"):
        continue
    # Find the enclosing subdoc span by node_id.
    enclosing: dict[str, Any] | None = None
    for start, end, sub in subdoc_spans:
        if _is_in_span(nid, start, end):
            enclosing = sub
            break
    if enclosing is None:
        continue
    enc_start, enc_end = enclosing_span_of[enclosing["node_id"]]
    # Check current parent: if already inside (or IS) the enclosing
    # subdoc by node-id span, leave it alone.
    cur_pid = r.get("parent_node_id")
    if cur_pid is not None and cur_pid in by_node_id:
        if cur_pid == enclosing["node_id"]:
            continue
        if enc_start <= cur_pid < enc_end:
            continue
    # Re-parent into the enclosing subdoc.
    r["parent_node_id"] = enclosing["node_id"]

Suggestion: The last subdoc span is defined through max_nid + 1, so every later node is considered inside that subdoc window and can be reparented to it, including unrelated trailing/signature records. This can incorrectly pull non-subdoc content into agreement scope and alter downstream depth/scope behavior. Restrict reparents to orphan patterns only (e.g., root/sibling misparents with expected classes) instead of all nodes in the terminal numeric range. [logic error]

Severity Level: Major ⚠️
- ⚠️ Signature/trailer records misattached under last subdocument.
- ⚠️ Annex bodies polluted with unrelated trailing agreement text.
- ⚠️ Scope-rule behavior diverges from intended trailer semantics.
Steps of Reproduction ✅
1. During `main()` in `scripts/parse_doc2dict_with_config.py` (lines 1919–2059), each
document is parsed via `parse_one(idx, raw)` at line 3995. Inside `parse_one()`, after
`walk_sections()`, `_reparent_orphan_top_level_into_enclosing_subdoc(sections)` is invoked
at lines 1817–1822, so every document's node list flows through this reparenting logic.

2. Within `_reparent_orphan_top_level_into_enclosing_subdoc()` (lines 1073–1190), real
subdocument headers are collected into `real_subdoc_rows` via `_is_real_subdoc_title()`
(lines 1118–1125). Assume a realistic structure:

   - A real subdoc header `ANNEX I CERTIFICATE OF DESIGNATIONS` with `node_id = 200`, `cls
   = "annex"`, `is_envelope = False`, so it is included in `real_subdoc_rows`.

   - Several **trailing signature or notices records** after the annex that belong to the
   main agreement trailer, not to the annex, e.g. a record `T` with `node_id = 260`, `cls
   = "predicted header"`, `title = "IN WITNESS WHEREOF"` or similar, `parent_node_id =
   None`, `depth = 1`, `is_envelope = False`.

   - No further real subdoc headers after `ANNEX I`, so `ANNEX I` is the last real subdoc.

3. When `_reparent_orphan_top_level_into_enclosing_subdoc()` runs:

   - `max_nid` is computed as the maximum `node_id` in `rows`, so `max_nid >= 260` (line
   1137).

   - For the last subdoc `ANNEX I`, the span `(start, end)` is constructed as `(200,
   max_nid + 1)` at lines 1139–1145, and stored in `subdoc_spans` and `enclosing_span_of`.

   - For the trailing record `T` with `nid = 260`, the loop at lines 1156–1188:

     * Skips none of the early guards (it is not the subdoc header itself, `nid >
     subdoc_node_ids[0]`, and `depth != 0`).

     * Finds `enclosing` as the last subdoc because `_is_in_span(nid, start, end)` (lines
     1147–1148, 1172–1175) returns True for `200 < 260 < max_nid + 1`.

     * Looks at `cur_pid = T["parent_node_id"]` (line 1181), which is `None` or some
     pre-subdoc parent outside the `[enc_start, enc_end)` interval, so neither `cur_pid ==
     enclosing["node_id"]` nor `enc_start <= cur_pid < enc_end` (lines 1183–1186) holds.

     * Falls through to `r["parent_node_id"] = enclosing["node_id"]` at line 1188,
     reparenting `T` under `ANNEX I` even though it is a signature/trailer record, not
     structurally part of the annex.

4. This reparenting affects downstream behavior in `parse_one()`:

   - `_apply_scope_rule()` (lines 535–656), called after reparenting at line 1823, now
   sees `T` as a descendant of a real subdoc (via `_is_descendant_of_subdoc`, lines
   573–584), and thus forces `scope="agreement"` for `T` even if it would otherwise be
   trailer content.

   - Subsequent subdoc consolidation in `_consolidate_real_subdocs()` (lines 1290–1637)
   may fold `T`'s body into the annex body record due to its new parent/descendant
   relationship (see the body-folding walker `_walk_for_body` at lines 1568–1597), causing
   unrelated signature/trailing text to appear inside the annex's consolidated L2 body.

   - The JSONL writer in `main()` (lines 2013–2030) then emits that signature/trailer text
   as part of the annex content for all affected documents, altering both depth and scope
   semantics of the last subdoc.
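
The open-ended last span is easy to see in isolation. The sketch below re-creates only the span arithmetic from the steps above, using the same illustrative node ids (200 for the annex header, 260 for the trailing record); `subdoc_spans` and `enclosing_header` are hypothetical helper names, not functions from the parser.

```python
from typing import Optional


def subdoc_spans(header_ids: list[int], max_nid: int) -> list[tuple[int, int]]:
    """Pair each subdoc header id with the next one; the last span is open-ended."""
    ordered = sorted(header_ids)
    ends = ordered[1:] + [max_nid + 1]
    return list(zip(ordered, ends))


def enclosing_header(nid: int, spans: list[tuple[int, int]]) -> Optional[int]:
    """Return the header id whose span strictly contains nid, else None."""
    for start, end in spans:
        if start < nid < end:
            return start
    return None


spans = subdoc_spans([200], max_nid=260)
print(spans)
print(enclosing_header(260, spans))  # trailing record falls inside the last span
print(enclosing_header(150, spans))  # record before the first subdoc is untouched
```

Because the final span ends at `max_nid + 1`, every later node id qualifies by position alone; any tighter bound would have to come from structure (class, parent, or an explicit end marker) rather than from node ids.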



@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

@arthrod
Owner Author

arthrod commented May 17, 2026

Triage agent — PR #75 comment review (read-only pass, no code changes)

3 inline comments reviewed:

  1. gemini-code-assist @ parse_doc2dict...py:2796 — _TOC_BODY_ARTICLE_RE too restrictive (WILL-DEFER)
    Valid observation: the regex only matches bare article markers without titles (e.g. "ARTICLE I" standalone), missing the more common "ARTICLE I — Definitions" form in TOC lines. However, changing this regex risks matching non-TOC content. Needs a fixture-driven test before landing. Targeted for a future polish PR after the 40-PR foundation batch completes.

  2. codeant-ai @ parse_doc2dict...py:1024 — envelope fix skips non-subdoc is_envelope=True rows (WILL-DEFER)
    Legitimate edge case: introduction rows tagged is_envelope=True with non-subdoc cls values may not get the envelope-unmark pass. Worth investigating once the full freeze sequence is stable. Deferred to polish PR.

  3. codeant-ai @ parse_doc2dict...py:1188 — last subdoc span extends to max_nid + 1 (WILL-DEFER)
    The unbounded trailing span is a real risk for pulling trailing/signature records into the final subdoc. Needs a concrete failing idx before fixing to avoid regressing current freezes. Deferred to polish PR.

WILL-DEFER items (3): _TOC_BODY_ARTICLE_RE expansion, envelope-unmark for non-subdoc is_envelope rows, subdoc span upper-bound tightening.

Triage only — no code changes made this round.

