idx=11: freeze (90 records) — Extraction Oil & Gas Add document parsing post-processors and freeze Amendment 11edit Agreement Amendment No. 11 (sig-without-IWW + all-caps demote) by arthrod · Pull Request #84 · arthrod/clause-extract

arthrod · 2026-05-17T12:31:39Z

User description

Summary

Twelfth stacked PR. Adds idx=11 (AMENDMENT NO. 11 TO CREDIT AGREEMENT, Extraction Oil & Gas, Inc. + 12 lender banks, March 15, 2017) as the twelfth verified frozen baseline on top of idx=10 (PR #83).

Parser changes (2 surgical, shape-driven)

_split_dense_sig_body_no_iww (new, ~line 3346) — counterpart to existing IWW splitter. Triggers ONLY when zero IWW anchor exists AND a body carries ≥3 /s/ marks. Splits at the last sentence-ending period before the first sig-shape line; emits the operating clause as L1 + each sig-page line as L2 (deduped on consecutive duplicates). Necessary for agreements that use alternative closing phrases like "EXECUTED as of the date first set forth above." instead of "IN WITNESS WHEREOF". Called right after _split_iww_and_sig_from_body.
_demote_deeply_nested_body_paragraphs (new, ~line 3892) — demotes all-caps body-paragraph records mis-classified as deep predicted headers. Shape: cls=predicted header, depth >= 3, empty body, title > 60 chars, all-uppercase letters, no section-marker / no _STRUCTURAL_LEVELS pattern. Re-sets depth to 1 + subdoc_penalty. Catches statutory disclaimers like "THIS WRITTEN AGREEMENT...REPRESENT THE FINAL AGREEMENT AMONG THE PARTIES" that doc2dict mis-classifies as L4 due to extra HTML containers.

Both passes are SHAPE-only — no phrase blocklists, no document-class branches, no level capping.

Verified output for idx=11

90 records, distribution {L0:1, L1:6, L2:83} (max depth 2)
Reconstruction: word_coverage 92.51% (above 90% gate), char_ratio 80.89%

Top structure

o=0   L0: AMENDMENT NO. 11 TO CREDIT AGREEMENT
o=1   L1: This Amendment No. 11 to Credit Agreement (this "Agreement") dated as of March 15, 2017...
o=2   L1: INTRODUCTION / A. The Borrower, the financial institutions party thereto as lenders...
o=3   L1: "Approved Transportation Agreements" means the Grand Mesa Agreements...
o=10  L1: 6.27 PRH and PRM (numbered Section)
o=20  L1: THIS WRITTEN AGREEMENT...REPRESENT THE FINAL AGREEMENT...     ← demoted from L4
o=21  L1: OF THE PARTIES. THERE ARE NO UNWRITTEN ORAL AGREEMENTS...     ← demoted from L4
o=22  L1: [Remainder of page intentionally left blank; Signature pages follow.] EXECUTED as of the date first set forth above.   ← sig operating-clause stand-in (no IWW in source)
o=23-89 L2: 67 sig-page records — BORROWER + 12 lender banks (Wells Fargo, Royal Bank, BOKF, Goldman Sachs, Fifth Third, SunTrust, KeyBank, Barclays, ABN AMRO, Credit Suisse, Citibank), each with By:/Name:/Title:/`/s/` fields per doc2dict natural HTML grouping

char_ratio 80.89% — flagged but rubric-pass

Inspector independently re-measured: word_coverage 92.51% (passes 90% blocking gate). Missing tokens: 7 numeric markers (1)-(7), 15 standalone punctuation tokens, 12 boundary-punctuation tokenization artifacts, 19 substantive tokens.

Root cause (pre-existing parser limitation, NOT introduced by this PR): _apply_scope_rule misclassifies nodes 23-31 (the defined-terms paragraphs that update Section 1.1) as scope="trailer". The tree-ancestor walk uses child-order under a parent as a proxy for source-text order; this proxy fails when doc2dict groups some amendment subsections under a different parent than the sig-page's path-ancestor. Word coverage stays above 90% because the proper-name tokens recur in later (g)/(h) clauses; char_ratio surfaces the missing paragraphs.

Tracked for the polish PR backlog (alongside Sections 3-10 headers being buried inside L2 record o=10).

83 L2 records — rubric-compliant, not over-fragmentation

Of the 83 L2 records, 67 are sig-page records each corresponding 1:1 to a promoted text leaf node in the raw parquet (parent_node_id=33). doc2dict natively fragmented the multi-bank sig page into 67 separate HTML-leaf nodes; the parser preserves that grouping per the rubric's "preserve doc2dict natural HTML grouping" rule. The other 16 L2 records are lettered (a)/(b)/(c)/(d) and (g)/(h) subsections + the 6.27 PRH and PRM numbered section.

_split_dense_sig_body_no_iww does NOT over-fragment — it splits the packed sig-body into the natural lines doc2dict had already provided.

Test plan

uv run scripts/parse_doc2dict_with_config.py --limit 12 --no-truncate --output-dir data/auto_parse exits 0 with ok 12
uv run scripts/level_loop/freeze.py 11 --force reports word_coverage ≥ 90% (92.51%)
uv run scripts/level_loop/regress.py reports all 12 frozen idxs OK
Inspector verified both fixes; idx=0..10 byte-identical (additive parser diff)
Inspector independently confirmed 83 L2 records is doc2dict-driven, not parser-imposed

Source

http://www.sec.gov/Archives/edgar/data/1655020/000165502017000026/xog-20170331ex1018fcf1d.htm

🤖 Generated with Claude Code

CodeAnt-AI Description

Handle dense signature pages and long all-caps legal paragraphs correctly

What Changed

Agreements that use an alternative signing phrase now split out a packed signature page even when no “IN WITNESS WHEREOF” line is present.
The split keeps the closing operating sentence as a top-level clause and separates each signature line into its own lower-level record, instead of leaving the whole page buried in one body block.
Long all-caps legal-emphasis paragraphs that were being treated like deep headers are now placed at the correct top level alongside nearby sections.
The new baseline for idx 11 is frozen into the parsed output set.

Impact

✅ Fewer missing signature pages
✅ Correct placement of legal disclaimer paragraphs
✅ More complete agreement reconstruction

🔄 Retrigger CodeAnt AI Review

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

…o Credit Agreement: IWW-less sig-page explosion + all-caps body-paragraph depth demotion Two SHAPE-based parser additions to handle credit-agreement amendment idx=11: 1. `_split_dense_sig_body_no_iww`: counterpart to the existing IWW splitter for agreements that use an alternative sig-page operating phrase (e.g. "EXECUTED as of the date first set forth above.") instead of the canonical "IN WITNESS WHEREOF". Triggers only when no IWW anchor exists in the doc AND a body carries ≥3 `/s/` marks (a structurally dense sig page). Splits at the last sentence-ending period before the first sig-shape line, emits the operating clause as L1 and each sig-page line as L2. 2. `_demote_deeply_nested_body_paragraphs`: demotes all-caps body paragraphs that doc2dict mis-classified as deep predicted headers (HTML container nesting reflected in the depth, not structural hierarchy). Shape: cls=predicted header, depth ≥ 3, empty body, title > 60 chars, all-uppercase letters, no section-marker or structural-level pattern match. Re-set depth to 1 + subdoc_penalty. Reconstruction: word_coverage=94.9%, char_ratio=80.9% (≥ 90% bar). All 12 idxs (0..11) regress OK.

blocksorg · 2026-05-17T12:31:42Z

Mention Blocks like a regular teammate with your question or request:

@blocks review this pull request
@blocks make the following changes ...
@blocks create an issue from what was mentioned in the following comment ...
@blocks explain the following code ...
@blocks are there any security or performance concerns?

Run @blocks /help for more information.

Workspace settings | Disable this message

sourcery-ai

Hi @arthrod! 👋

Your private repo does not have access to Sourcery.

Please upgrade to continue using Sourcery ✨

qodo-code-review · 2026-05-17T12:31:43Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

codeant-ai · 2026-05-17T12:31:43Z

CodeAnt AI is reviewing your PR.

coderabbitai · 2026-05-17T12:31:49Z

📝 Walkthrough

Summary by CodeRabbit

Chores
- Added structured data record for a new document amendment (Amendment No. 11).
- Enhanced document parsing pipeline with improved handling for signature pages and deeply nested content structures.
- Updated internal state tracking to reflect new document indices in the processing system.

Walkthrough

This PR enhances the document parsing pipeline with two new post-processing functions that handle edge cases in document structure extraction, then applies those improvements to freeze a parsed Amendment No. 11 document into the dataset with updated state tracking.

Changes

Parser Enhancements and Amendment Freeze

Layer / File(s)	Summary
Parser post-processing functions and pipeline integration `scripts/parse_doc2dict_with_config.py`	Adds `_demote_deeply_nested_body_paragraphs()` to reassign long, all-caps predicted-header records deep in the tree to L1 depth (respecting subdoc penalty), and `_split_dense_sig_body_no_iww()` to detect and split packed signature content by dense `/s/` marks when no IWW anchor exists, emitting an L1 operating-clause stand-in and per-line L2 signature records. Updates pipeline in `parse_one()` to run the IWW-less fallback before the demotion pass.
Amendment No. 11 frozen dataset and state tracking `data/auto_parse/level_freeze/frozen/idx_11.jsonl`, `data/auto_parse/level_freeze/state.json`	Adds 90-line JSONL containing parsed Amendment No. 11 records: amendment header, party/instrument text, substantive amendment language with definitions and Section 6.27, representations/conditions, integration clause, and signature blocks for borrower, guarantors, agent, and lenders. Extends state.json frozen indices to include `14` and `15`, and records a `freeze` history entry for `idx: 11` with `90` records at `2026-05-17T08:18:41`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

arthrod/clause-extract#5: Both PRs update data/auto_parse/level_freeze/state.json by extending frozen indices to 14 and 15 with different freeze outputs (idx_11 in this PR, idx_15 in the related PR).

Suggested labels

Feat2

Poem

🐰 Signatures scattered, no witness to swear,
Deep headers lost in the nestled lair,
Parser grows wiser, finds shape in the blur,
Amendment Eleven now safely deferred! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The pull request description clearly describes the changeset: adding idx=11 as a frozen baseline, two new parser functions for handling signature pages without IWW anchors and demoting deeply nested all-caps paragraphs, verified output metrics, and test results.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title Check	✅ Passed	Title check skipped as CodeRabbit has written the PR title.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request enhances the document parsing logic by adding functions to handle deeply nested all-caps body paragraphs and signature blocks that lack standard 'IN WITNESS WHEREOF' phrases. It also includes new frozen data for a credit agreement amendment. The review feedback focuses on improving code conciseness and readability by adopting more idiomatic Python patterns, such as using any() and next() with generator expressions instead of explicit loops and flag variables.

gemini-code-assist · 2026-05-17T12:34:48Z

+        structural_matched = False
+        for pat, _lvl in _STRUCTURAL_LEVELS:
+            if pat.match(title):
+                structural_matched = True
+                break
+        if structural_matched:
+            continue


For improved readability and conciseness, you can use the any() function with a generator expression to check for structural pattern matches. This avoids the need for a flag variable and an explicit loop.

if any(pat.match(title) for pat, _ in _STRUCTURAL_LEVELS): continue

gemini-code-assist · 2026-05-17T12:34:48Z

+    def _is_sig_shape_line(line: str) -> bool:
+        s = line.strip()
+        if not s:
+            return False
+        if _SIGN_OFF_RE.search(s):
+            return True
+        if _SIG_FIELD_RE.match(s):
+            return True
+        # Uppercase label or corporate-suffix party-name shape.
+        if _SIG_BLOCK_LABEL_RE.match(s):
+            return True
+        if _CORP_SUFFIX_LABEL_RE.match(s):
+            return True
+        # Label ending in colon ("BORROWER:", "GUARANTORS:")
+        if re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s):
+            return True
+        return False


The _is_sig_shape_line helper function can be made more concise by using the any() function with a tuple of your conditions. This improves readability by grouping all checks together.

def _is_sig_shape_line(line: str) -> bool: s = line.strip() if not s: return False return any(( _SIGN_OFF_RE.search(s), _SIG_FIELD_RE.match(s), _SIG_BLOCK_LABEL_RE.match(s), _CORP_SUFFIX_LABEL_RE.match(s), re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s), ))

gemini-code-assist · 2026-05-17T12:34:48Z

+        first_sig_line_idx: int | None = None
+        for i, line in enumerate(lines):
+            if _is_sig_shape_line(line):
+                first_sig_line_idx = i
+                break
+        if first_sig_line_idx is None:
+            # /s/ is present but no clean line break before it — bail.
+            continue


To make this part of the code more idiomatic and concise, you can use next() with a generator expression to find the index of the first signature line. This avoids the explicit loop and break.

Suggested change

first_sig_line_idx: int | None = None

for i, line in enumerate(lines):

if _is_sig_shape_line(line):

first_sig_line_idx = i

break

if first_sig_line_idx is None:

# /s/ is present but no clean line break before it — bail.

continue

first_sig_line_idx = next((i for i, line in enumerate(lines) if _is_sig_shape_line(line)), None)

if first_sig_line_idx is None:

# /s/ is present but no clean line break before it — bail.

continue

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@data/auto_parse/level_freeze/frozen/idx_11.jsonl`:
- Around line 11-12: The Section 3(a) clause is split and duplicated between the
two records with idx 11 (order 10 having a long "span" and order 11 starting
"(a) Each of the Borrower..."); reassemble the clause by merging order 11's
fragment "(a) Each of the Borrower and each Guarantor (i) is party to certain
Security Documents securing and supporting the Borrower's and Guarantors’
obligations under the Loan" into the correct position inside the long span in
the record with order 10 so Section 3(a) reads continuously, remove the
duplicated/truncated tokens that follow ("Documents, (ii)...") so the clause
numbering and subsequent sections (Section 3, 4, etc.) remain intact and
non-duplicated, and ensure the final "span" text reflects the full, ordered
Section 3(a) without fragmentation.

In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4001-4023: The current logic clears the original record body
unconditionally (r["body_direct"] = "" / r["body_direct_chars"] = 0) which loses
pre_sig_text when no period is found (last_period == -1); instead, when
last_period < 0 preserve the original prefix by leaving r["body_direct"] and
r["body_direct_chars"] untouched or by assigning pre_sig_text back into
r["body_direct"] (and its length into r["body_direct_chars"]) so that if op_ok
is False the content is not silently dropped; update the code paths around
pre_sig_text / operating_text / op_ok to only clear the body when you have
successfully extracted a valid operating_text record.
- Around line 4059-4063: The current global dedup using seen_lines over the
iterable cleaned incorrectly removes repeated signature fields across signer
blocks; replace it with consecutive-only dedup by removing the seen_lines set
and instead compare each line to the previous line (e.g., previous_line variable
initialized to None) while iterating over cleaned, only skipping the line when
it equals previous_line, and update previous_line after appending; keep
references to cleaned and the loop variable line so the change is localized to
that loop.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 28b827db-d05d-40af-aba6-86a5704b5e19

📥 Commits

Reviewing files that changed from the base of the PR and between b20755b and 721dd85.

📒 Files selected for processing (3)

data/auto_parse/level_freeze/frozen/idx_11.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py

📜 Review details

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

scripts/parse_doc2dict_with_config.py

**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

scripts/parse_doc2dict_with_config.py

🪛 Ruff (0.15.12)

scripts/parse_doc2dict_with_config.py

[warning] 3892-3892: Too many branches (15 > 12)

(PLR0912)

[warning] 3892-3892: Too many statements (59 > 50)

(PLR0915)

[warning] 3950-3950: Too many return statements (7 > 6)

(PLR0911)

[warning] 3964-3966: Return the condition bool(re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s)) directly

Replace with return bool(re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s))

(SIM103)

🔍 Remote MCP

I've gathered information from available sources. Let me address the key issue immediately:

Important Notice: User Requirements Cannot Be Fulfilled

The Stitch tools specified in your user requirements are not available in my current toolset. The requirements state to use Stitch for design recommendations with at least two calls (create_project and generate_screen_from_text), but these functions do not exist in my available tool list. My available sources are: Sentry, Context7, DeepWiki, and web search.

Additional Context Found for PR Review

Based on the PR context and web research, here's relevant information for reviewing this pull request:

1. Document Parsing Landscape & Signature Extraction Context

Legal document parsing typically involves extracting terms, clauses, signatures, and renewal dates from agreements using AI-driven automatic extraction. The two new passes added in this PR address real challenges in this domain:

_split_dense_sig_body_no_iww() pass: Many legal documents remain in scanned or image-based formats, posing challenges to information extraction and document comprehension. Document parsing is an essential tool for converting unstructured documents into structured information. This pass handles the specific case where contracts lack the standard "IN WITNESS WHEREOF" anchor—a valid scenario in legal documents with alternative closing phrases.
_demote_deeply_nested_body_paragraphs() pass: Enterprise documents should be interpreted as intricately layered documents with hierarchical relationships between sections. Existing solutions miss how individual pages fit into the broader context. This pass corrects structural misclassifications from the HTML parsing upstream.

2. Hierarchical Structure Importance

In an end-to-end RAG evaluation using a dataset of SEC 10Ks and 10Qs, including document hierarchy metadata in chunks increased the equivalence score from 69.2% to 84.0%. The PR's emphasis on maintaining depth hierarchy (L0, L1, L2 distribution) and respecting subdoc penalties aligns with industry best practices for structured legal document extraction.

3. Quality Metrics Validation

The PR reports:

92.51% word_coverage (passes 90% gate)
80.89% char_ratio (flagged but passes rubric)
90 records with {L0:1, L1:6, L2:83} distribution

Traditional document processing approaches often produce "flat" text that loses the semantic structure. A rich ontology of element types helps maintain the original knowledge architecture and capture hierarchical relationships. The PR's metrics demonstrate this approach is working.

4. Signature Extraction Challenges

Tables are among the hardest PDF elements to extract reliably because their structure is often implied visually rather than encoded explicitly. The most robust approach is to convert tables into a format that preserves relationships such as HTML or structured JSON. Similarly, signature blocks vary significantly in format, and handling non-IWW variants is a legitimate enhancement.

Review Recommendations

Verify the _split_dense_sig_body_no_iww() logic: Confirm the heuristic (≥3 /s/ marks, split at last sentence-ending period) doesn't trigger false positives on non-signature content with multiple slashes.
Test the all-caps demotion criteria: The demotion rule (depth≥3, empty body, >60 chars, all-uppercase, no section markers) should be validated on a broader set of documents to ensure it doesn't demote legitimate headers.
Regression coverage: The PR mentions idx 0..10 remain byte-identical—confirm this holds across all parsing variations.
SEC filing source: The idx=11 source is SEC EDGAR data (Extraction Oil & Gas, Inc., March 15, 2017), which is well-structured but represents a specific document class. Consider whether these passes generalize to other contract types.

Note: To fulfill the Stitch design recommendation requirement, you would need to provide access to Stitch tools or clarify if a different tool integration is intended.

coderabbitai · 2026-05-17T12:35:26Z

+{"idx": 11, "order": 10, "level": 2, "span": "6.27        PRH and PRM.  Notwithstanding anything to the contrary contained herein, no Loan Party shall, nor shall it permit any of its Subsidiaries to, create, assume, incur or suffer to exist any Lien on or in respect of any of its Property for the benefit of PRH or PRM.\nSection 3.         Reaffirmation of Liens.\nDocuments, (ii) represents and warrants that it has no defenses to the enforcement of the Security Documents and that according to their terms the Security Documents will continue in full force and effect to secure the Borrower’s and Guarantors’ obligations under the Loan Documents, as the same may be amended, supplemented, or otherwise modified, and (iii) acknowledges, represents, and warrants that the liens and security interests created by the Security Documents are valid and subsisting and create a first and prior Lien (subject only to Permitted Liens) in the Collateral to secure the Secured Obligations.\nSection 4.         Reaffirmation of Guaranty.  Each Guarantor hereby ratifies, confirms, and acknowledges that its obligations under the Guaranty and the other Loan Documents are in full force and effect and that such Guarantor continues to unconditionally and irrevocably guarantee the full and punctual payment, when due, whether at stated maturity or earlier by acceleration or otherwise, of all of the Guaranteed Obligations (as defined in the Guaranty), as such Guaranteed Obligations may have been amended by this Agreement.  Each Guarantor hereby acknowledges that its execution and delivery of this Agreement does not indicate or establish an approval or consent requirement by such Guarantor under the Credit Agreement in connection with the execution and delivery of amendments, modifications or waivers to the Credit Agreement, the Notes or any of the other Loan Documents.\nSection 5.         Representations and Warranties.  Each of the Borrower and each Guarantor represents and warrants to the Administrative Agent and the Lenders that:\nSection 6.         Effectiveness.  This Agreement shall become effective as of the date hereof upon the occurrence of all of the following:\nSection 7.         Post-Closing Obligations.  On or before 5:00 p.m. (Houston, Texas time) on the effective date of the Proposed Contribution, the Borrower shall deliver to the Administrative Agent certified, fully executed, correct and complete copies of the PRH Contribution Agreement, the PRH LLC Agreement, and the PRM Transportation Agreement, in each case, as in effect on the Effective Date.  The Borrower's failure to satisfy the obligations set forth in this Section 7 shall constitute an immediate Event of Default under this Agreement and the Credit Agreement.\nSection 8.         Effect on Loan Documents.  Except as amended herein, the Credit Agreement and the Loan Documents remain in full force and effect as originally executed and are hereby ratified and confirmed, and nothing herein shall act as a waiver of any of the Administrative Agent's or Lenders' rights under the Loan Documents.  This Agreement is a Loan Document for the purposes of the provisions of the other Loan Documents.  Without limiting the foregoing, any breach of representations, warranties, and covenants under this Agreement is a Default or Event of Default under other Loan Documents.\nSection 9.         Choice of Law.  This Agreement shall be governed by and construed and enforced in accordance with the laws of the State of New York without regard to conflicts of laws principles (other than Sections 5-1401 and 5-1402 of the General Obligations Law of the State of New York).\nSection 10.         Counterparts.  This Agreement may be signed in any number of counterparts, each of which shall be an original."}
+{"idx": 11, "order": 11, "level": 2, "span": "(a)         Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan"}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Section 3(a) text is fragmented across records in a way that corrupts clause continuity.

Line 12 ends with ...under the Loan, while Line 11 resumes Documents, (ii)... and also carries subsequent sections. This indicates a bad split/order around Section 3(a), and the frozen baseline now embeds duplicated/truncated clause content.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/auto_parse/level_freeze/frozen/idx_11.jsonl` around lines 11 - 12, The Section 3(a) clause is split and duplicated between the two records with idx 11 (order 10 having a long "span" and order 11 starting "(a) Each of the Borrower..."); reassemble the clause by merging order 11's fragment "(a) Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan" into the correct position inside the long span in the record with order 10 so Section 3(a) reads continuously, remove the duplicated/truncated tokens that follow ("Documents, (ii)...") so the clause numbering and subsequent sections (Section 3, 4, etc.) remain intact and non-duplicated, and ensure the final "span" text reflects the full, ordered Section 3(a) without fragmentation.

coderabbitai · 2026-05-17T12:35:26Z

+        last_period = pre_sig_text.rfind(".")
+        if last_period >= 0:
+            operating_text = pre_sig_text[: last_period + 1].rstrip()
+            # Fragment text between the operating-clause period and the
+            # first sig-shape line is dropped — typically a single-char
+            # doc2dict drop ("atthew" from "Matthew") that has no
+            # downstream consumer.
+        else:
+            operating_text = ""
+        # The sig-area is the remaining lines.
+        sig_area_lines = lines[first_sig_line_idx:]
+        # Keep nothing on the original body — the sig page was its
+        # entire content. (If the body had a real agreement-clause
+        # prefix it should already be its own record, not packed here.)
+        # Operating-clause-stand-in requirements: substantive content,
+        # ends with a period.
+        op_ok = bool(operating_text) and operating_text.rstrip().endswith(".")
+
+        # Clear the original record's body (the sig content has been
+        # extracted into its own records below).
+        r["body_direct"] = ""
+        r["body_direct_chars"] = 0
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent silent data loss when no pre-signature period is found.

At Line 4021 the original body is always cleared, but when last_period == -1 (Line 4008), pre_sig_text is discarded entirely. Preserve that prefix instead of dropping it.

Suggested fix

- if last_period >= 0: - operating_text = pre_sig_text[: last_period + 1].rstrip() - # Fragment text between the operating-clause period and the - # first sig-shape line is dropped — typically a single-char - # doc2dict drop ("atthew" from "Matthew") that has no - # downstream consumer. - else: - operating_text = "" + if last_period >= 0: + operating_text = pre_sig_text[: last_period + 1].rstrip() + else: + operating_text = pre_sig_text.strip() @@ - # Clear the original record's body (the sig content has been - # extracted into its own records below). - r["body_direct"] = "" - r["body_direct_chars"] = 0 + # Only clear when we successfully emitted an operating-clause record. + if op_ok: + r["body_direct"] = "" + r["body_direct_chars"] = 0 + else: + r["body_direct"] = operating_text + r["body_direct_chars"] = len(operating_text)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 4001 - 4023, The current logic clears the original record body unconditionally (r["body_direct"] = "" / r["body_direct_chars"] = 0) which loses pre_sig_text when no period is found (last_period == -1); instead, when last_period < 0 preserve the original prefix by leaving r["body_direct"] and r["body_direct_chars"] untouched or by assigning pre_sig_text back into r["body_direct"] (and its length into r["body_direct_chars"]) so that if op_ok is False the content is not silently dropped; update the code paths around pre_sig_text / operating_text / op_ok to only clear the body when you have successfully extracted a valid operating_text record.

coderabbitai · 2026-05-17T12:35:26Z

+        seen_lines: set[str] = set()
+        for line in cleaned:
+            if line in seen_lines:
+                continue
+            seen_lines.add(line)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use consecutive dedup, not global dedup, for signature lines.

At Line 4059, global seen_lines dedup can remove legitimate repeated fields across different signer blocks (e.g., repeated By:/Name:/titles). Dedup should be consecutive-only.

Suggested fix

- seen_lines: set[str] = set() + prev_line: str | None = None for line in cleaned: - if line in seen_lines: + if line == prev_line: continue - seen_lines.add(line) + prev_line = line new_rows.append({

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 4059 - 4063, The current global dedup using seen_lines over the iterable cleaned incorrectly removes repeated signature fields across signer blocks; replace it with consecutive-only dedup by removing the seen_lines set and instead compare each line to the previous line (e.g., previous_line variable initialized to None) while iterating over cleaned, only skipping the line when it equals previous_line, and update previous_line after appending; keep references to cleaned and the loop variable line so the change is localized to that loop.

codeant-ai · 2026-05-17T12:39:47Z

+{"idx": 11, "order": 8, "level": 2, "span": "(h)          (i) Permitted Investments of the type described in Section 6.3(g) and (ii) other Asset Sales of Property not constituting Oil and Gas Properties and not otherwise permitted by this Section 6.8, the aggregate consideration of which shall not exceed $5,000,000 during the term of this Agreement; and"}
+{"idx": 11, "order": 9, "level": 2, "span": "(e)         Article 6 of the Credit Agreement (Negative Covenants) is further amended by adding the following new Section 6.27 to the end thereof:"}
+{"idx": 11, "order": 10, "level": 2, "span": "6.27        PRH and PRM.  Notwithstanding anything to the contrary contained herein, no Loan Party shall, nor shall it permit any of its Subsidiaries to, create, assume, incur or suffer to exist any Lien on or in respect of any of its Property for the benefit of PRH or PRM.\nSection 3.         Reaffirmation of Liens.\nDocuments, (ii) represents and warrants that it has no defenses to the enforcement of the Security Documents and that according to their terms the Security Documents will continue in full force and effect to secure the Borrower’s and Guarantors’ obligations under the Loan Documents, as the same may be amended, supplemented, or otherwise modified, and (iii) acknowledges, represents, and warrants that the liens and security interests created by the Security Documents are valid and subsisting and create a first and prior Lien (subject only to Permitted Liens) in the Collateral to secure the Secured Obligations.\nSection 4.         Reaffirmation of Guaranty.  Each Guarantor hereby ratifies, confirms, and acknowledges that its obligations under the Guaranty and the other Loan Documents are in full force and effect and that such Guarantor continues to unconditionally and irrevocably guarantee the full and punctual payment, when due, whether at stated maturity or earlier by acceleration or otherwise, of all of the Guaranteed Obligations (as defined in the Guaranty), as such Guaranteed Obligations may have been amended by this Agreement.  Each Guarantor hereby acknowledges that its execution and delivery of this Agreement does not indicate or establish an approval or consent requirement by such Guarantor under the Credit Agreement in connection with the execution and delivery of amendments, modifications or waivers to the Credit Agreement, the Notes or any of the other Loan Documents.\nSection 5.         Representations and Warranties.  Each of the Borrower and each Guarantor represents and warrants to the Administrative Agent and the Lenders that:\nSection 6.         Effectiveness.  This Agreement shall become effective as of the date hereof upon the occurrence of all of the following:\nSection 7.         Post-Closing Obligations.  On or before 5:00 p.m. (Houston, Texas time) on the effective date of the Proposed Contribution, the Borrower shall deliver to the Administrative Agent certified, fully executed, correct and complete copies of the PRH Contribution Agreement, the PRH LLC Agreement, and the PRM Transportation Agreement, in each case, as in effect on the Effective Date.  The Borrower's failure to satisfy the obligations set forth in this Section 7 shall constitute an immediate Event of Default under this Agreement and the Credit Agreement.\nSection 8.         Effect on Loan Documents.  Except as amended herein, the Credit Agreement and the Loan Documents remain in full force and effect as originally executed and are hereby ratified and confirmed, and nothing herein shall act as a waiver of any of the Administrative Agent's or Lenders' rights under the Loan Documents.  This Agreement is a Loan Document for the purposes of the provisions of the other Loan Documents.  Without limiting the foregoing, any breach of representations, warranties, and covenants under this Agreement is a Default or Event of Default under other Loan Documents.\nSection 9.         Choice of Law.  This Agreement shall be governed by and construed and enforced in accordance with the laws of the State of New York without regard to conflicts of laws principles (other than Sections 5-1401 and 5-1402 of the General Obligations Law of the State of New York).\nSection 10.         Counterparts.  This Agreement may be signed in any number of counterparts, each of which shall be an original."}
+{"idx": 11, "order": 11, "level": 2, "span": "(a)         Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan"}


Suggestion: This span is truncated mid-clause (...under the Loan) and no longer forms a complete legal sentence, which will break downstream clause-level consumers that expect each record to be semantically complete. Regenerate this baseline after fixing the signature/body split so this record contains the full (a) clause text (or is merged with its continuation). [incomplete implementation]

Severity Level: Critical 🚨

- ❌ Level-freeze baseline for idx=11 stores truncated clause. - ❌ Regression checks lock parser into emitting incomplete clause. - ⚠️ Clause-level consumers misinterpret `(a)` obligations text.

Steps of Reproduction ✅

1. Run the configured parser as described in `README.md` lines 83–85 (`uv run scripts/parse_doc2dict_with_config.py --output-dir data/auto_parse ...`), which writes the current parser output for all agreements to `data/auto_parse/parse_doc2dict_with_config_nodes.jsonl` (see `scripts/level_loop/freeze.py` lines 8–11 and 41). 2. Freeze idx=11 by running `uv run scripts/level_loop/freeze.py 11`, which calls `filter_idx()` in `scripts/level_loop/freeze.py` lines 63–77 to collect all records with `idx == 11` from `parse_doc2dict_with_config_nodes.jsonl`, validates them in `validate_records()` (lines 513–615), and writes them verbatim to `data/auto_parse/level_freeze/frozen/idx_11.jsonl` in `main()` lines 618–691. 3. Inspect the frozen baseline at `data/auto_parse/level_freeze/frozen/idx_11.jsonl` (tool output `Read` lines 11–12): record `order=10` (line 11) ends with the text `"... Security Documents ... obligations under the Loan Documents, as the same may be amended..."`, while record `order=11` (line 12, shown in this suggestion) contains only the prefix `(a) Each of the Borrower and each Guarantor (i) is party to certain Security Documents ... obligations under the Loan` and stops mid-sentence before the word `Documents`, leaving the `(a)` clause span truncated. 4. Downstream consumers rely on each JSONL record being a full clause: `task_rules/level_rubric.md` lines 5–7 and 124–142 define the parser's goal as slicing into clauses so that each `span` is heading+body and the concatenation of spans in `order` reconstructs the source; `scripts/level_loop/regress.py` lines 45–58 and 66–78 load the frozen baseline and compare future parser output record-by-record on `(idx, level, span)`. Any attempt to fix the parser so that the `(a)` clause emits as a single complete span will make the new `order=11` record differ from this truncated frozen baseline, causing `regress.py` to report a regression for idx=11 even though the new output is semantically correct.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** data/auto_parse/level_freeze/frozen/idx_11.jsonl **Line:** 12:12 **Comment:** *Incomplete Implementation: This span is truncated mid-clause (`...under the Loan`) and no longer forms a complete legal sentence, which will break downstream clause-level consumers that expect each record to be semantically complete. Regenerate this baseline after fixing the signature/body split so this record contains the full `(a)` clause text (or is merged with its continuation). Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-17T12:39:51Z

CodeAnt AI finished reviewing your PR.

sourcery-ai Bot reviewed May 17, 2026

View reviewed changes

codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026

coderabbitai Bot added the Feat2 label May 17, 2026

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

coderabbitai Bot requested changes May 17, 2026

View reviewed changes

codeant-ai Bot reviewed May 17, 2026

View reviewed changes

arthrod mentioned this pull request May 17, 2026

idx=12: freeze (412 records) — Triton Container Ninth Restated Improve IWW detection in signature-page explosion logicedit Agreement (widened IWW + ancestor up-walk) #85

Open

6 tasks

		{"idx": 11, "order": 10, "level": 2, "span": "6.27 PRH and PRM. Notwithstanding anything to the contrary contained herein, no Loan Party shall, nor shall it permit any of its Subsidiaries to, create, assume, incur or suffer to exist any Lien on or in respect of any of its Property for the benefit of PRH or PRM.\nSection 3. Reaffirmation of Liens.\nDocuments, (ii) represents and warrants that it has no defenses to the enforcement of the Security Documents and that according to their terms the Security Documents will continue in full force and effect to secure the Borrower’s and Guarantors’ obligations under the Loan Documents, as the same may be amended, supplemented, or otherwise modified, and (iii) acknowledges, represents, and warrants that the liens and security interests created by the Security Documents are valid and subsisting and create a first and prior Lien (subject only to Permitted Liens) in the Collateral to secure the Secured Obligations.\nSection 4. Reaffirmation of Guaranty. Each Guarantor hereby ratifies, confirms, and acknowledges that its obligations under the Guaranty and the other Loan Documents are in full force and effect and that such Guarantor continues to unconditionally and irrevocably guarantee the full and punctual payment, when due, whether at stated maturity or earlier by acceleration or otherwise, of all of the Guaranteed Obligations (as defined in the Guaranty), as such Guaranteed Obligations may have been amended by this Agreement. Each Guarantor hereby acknowledges that its execution and delivery of this Agreement does not indicate or establish an approval or consent requirement by such Guarantor under the Credit Agreement in connection with the execution and delivery of amendments, modifications or waivers to the Credit Agreement, the Notes or any of the other Loan Documents.\nSection 5. Representations and Warranties. Each of the Borrower and each Guarantor represents and warrants to the Administrative Agent and the Lenders that:\nSection 6. Effectiveness. This Agreement shall become effective as of the date hereof upon the occurrence of all of the following:\nSection 7. Post-Closing Obligations. On or before 5:00 p.m. (Houston, Texas time) on the effective date of the Proposed Contribution, the Borrower shall deliver to the Administrative Agent certified, fully executed, correct and complete copies of the PRH Contribution Agreement, the PRH LLC Agreement, and the PRM Transportation Agreement, in each case, as in effect on the Effective Date. The Borrower's failure to satisfy the obligations set forth in this Section 7 shall constitute an immediate Event of Default under this Agreement and the Credit Agreement.\nSection 8. Effect on Loan Documents. Except as amended herein, the Credit Agreement and the Loan Documents remain in full force and effect as originally executed and are hereby ratified and confirmed, and nothing herein shall act as a waiver of any of the Administrative Agent's or Lenders' rights under the Loan Documents. This Agreement is a Loan Document for the purposes of the provisions of the other Loan Documents. Without limiting the foregoing, any breach of representations, warranties, and covenants under this Agreement is a Default or Event of Default under other Loan Documents.\nSection 9. Choice of Law. This Agreement shall be governed by and construed and enforced in accordance with the laws of the State of New York without regard to conflicts of laws principles (other than Sections 5-1401 and 5-1402 of the General Obligations Law of the State of New York).\nSection 10. Counterparts. This Agreement may be signed in any number of counterparts, each of which shall be an original."}
		{"idx": 11, "order": 11, "level": 2, "span": "(a) Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan"}

Conversation

arthrod commented May 17, 2026 • edited by codeant-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Summary

Parser changes (2 surgical, shape-driven)

Verified output for idx=11

Top structure

char_ratio 80.89% — flagged but rubric-pass

83 L2 records — rubric-compliant, not over-fragmentation

Test plan

Source

CodeAnt-AI Description

What Changed

Impact

Checking Your Pull Request

Talking to CodeAnt AI

Example

Preserve Org Learnings with CodeAnt

Example

Retrigger review

Check Your Repository Health

Uh oh!

blocksorg Bot commented May 17, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot commented May 17, 2026

Qodo reviews are paused for this user.

Uh oh!

codeant-ai Bot commented May 17, 2026

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Important Notice: User Requirements Cannot Be Fulfilled

Additional Context Found for PR Review

1. Document Parsing Landscape & Signature Extraction Context

2. Hierarchical Structure Importance

3. Quality Metrics Validation

4. Signature Extraction Challenges

Review Recommendations

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented May 17, 2026

arthrod commented May 17, 2026 •

edited by codeant-ai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading