idx=11: freeze (90 records) — Extraction Oil & Gas Add document parsing post-processors and freeze Amendment 11edit Agreement Amendment No. 11 (sig-without-IWW + all-caps demote)#84

Open: arthrod wants to merge 1 commit into redo/idx-10 from redo/idx-11

@arthrod arthrod commented May 17, 2026

User description

Summary

Twelfth stacked PR. Adds idx=11 (AMENDMENT NO. 11 TO CREDIT AGREEMENT, Extraction Oil & Gas, Inc. + 12 lender banks, March 15, 2017) as the twelfth verified frozen baseline on top of idx=10 (PR #83).

Parser changes (2 surgical, shape-driven)

  1. _split_dense_sig_body_no_iww (new, ~line 3892): counterpart to the existing IWW splitter. Triggers ONLY when no IWW anchor exists AND a body carries ≥3 /s/ marks. Splits at the last sentence-ending period before the first sig-shape line; emits the operating clause as L1 and each sig-page line as L2 (deduped on consecutive duplicates). Necessary for agreements that use alternative closing phrases like "EXECUTED as of the date first set forth above." instead of "IN WITNESS WHEREOF". Called right after _split_iww_and_sig_from_body.

  2. _demote_deeply_nested_body_paragraphs (new, ~line 3346): demotes all-caps body-paragraph records mis-classified as deep predicted headers. Shape: cls=predicted header, depth >= 3, empty body, title > 60 chars, all-uppercase letters, no section-marker and no _STRUCTURAL_LEVELS pattern match. Resets depth to 1 + subdoc_penalty. Catches statutory disclaimers like "THIS WRITTEN AGREEMENT...REPRESENT THE FINAL AGREEMENT AMONG THE PARTIES" that doc2dict mis-classifies as L4 due to extra HTML containers.

Both passes are SHAPE-only — no phrase blocklists, no document-class branches, no level capping.
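
The trigger and split described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/parse_doc2dict_with_config.py: `split_dense_sig_body`, `IWW_RE`, `SIGN_OFF_RE`, `SIG_FIELD_RE`, `LABEL_RE`, and the simplified `is_sig_shape_line` are hypothetical stand-ins, and the real helpers use more patterns.

```python
import re

# Hypothetical stand-ins for the parser's real patterns.
IWW_RE = re.compile(r"IN WITNESS WHEREOF", re.IGNORECASE)
SIGN_OFF_RE = re.compile(r"/s/")
SIG_FIELD_RE = re.compile(r"^(By|Name|Title):")
LABEL_RE = re.compile(r"^[A-Z][A-Z .,&'\-/]{1,60}:?$")


def is_sig_shape_line(line: str) -> bool:
    s = line.strip()
    return bool(s) and bool(
        SIGN_OFF_RE.search(s) or SIG_FIELD_RE.match(s) or LABEL_RE.match(s)
    )


def split_dense_sig_body(doc_text: str, body: str):
    """Return (operating_clause, sig_lines), or None when the shape does not match."""
    # Trigger: no IWW anchor anywhere in the document AND >= 3 "/s/" marks in this body.
    if IWW_RE.search(doc_text) or len(SIGN_OFF_RE.findall(body)) < 3:
        return None
    lines = body.splitlines()
    first_sig = next((i for i, ln in enumerate(lines) if is_sig_shape_line(ln)), None)
    if first_sig is None:
        return None
    pre_sig = "\n".join(lines[:first_sig])
    last_period = pre_sig.rfind(".")
    if last_period < 0:
        return None  # no operating clause to promote; leave the body untouched
    operating = pre_sig[: last_period + 1].strip()
    # Drop blank lines and consecutive duplicates only.
    sig_lines = []
    for ln in (raw.strip() for raw in lines[first_sig:]):
        if ln and (not sig_lines or sig_lines[-1] != ln):
            sig_lines.append(ln)
    return operating, sig_lines
```

Returning None whenever any part of the shape is missing keeps the pass additive: bodies that do not look like a packed signature page are left exactly as they were.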

Verified output for idx=11

  • 90 records, distribution {L0:1, L1:6, L2:83} (max depth 2)
  • Reconstruction: word_coverage 92.51% (above 90% gate), char_ratio 80.89%

Top structure

o=0   L0: AMENDMENT NO. 11 TO CREDIT AGREEMENT
o=1   L1: This Amendment No. 11 to Credit Agreement (this "Agreement") dated as of March 15, 2017...
o=2   L1: INTRODUCTION / A. The Borrower, the financial institutions party thereto as lenders...
o=3   L1: "Approved Transportation Agreements" means the Grand Mesa Agreements...
o=10  L1: 6.27 PRH and PRM (numbered Section)
o=20  L1: THIS WRITTEN AGREEMENT...REPRESENT THE FINAL AGREEMENT...     ← demoted from L4
o=21  L1: OF THE PARTIES. THERE ARE NO UNWRITTEN ORAL AGREEMENTS...     ← demoted from L4
o=22  L1: [Remainder of page intentionally left blank; Signature pages follow.] EXECUTED as of the date first set forth above.   ← sig operating-clause stand-in (no IWW in source)
o=23-89 L2: 67 sig-page records — BORROWER + 12 lender banks (Wells Fargo, Royal Bank, BOKF, Goldman Sachs, Fifth Third, SunTrust, KeyBank, Barclays, ABN AMRO, Credit Suisse, Citibank), each with By:/Name:/Title:/`/s/` fields per doc2dict natural HTML grouping

char_ratio 80.89% — flagged but rubric-pass

Inspector independently re-measured: word_coverage 92.51% (passes 90% blocking gate). Missing tokens: 7 numeric markers (1)-(7), 15 standalone punctuation tokens, 12 boundary-punctuation tokenization artifacts, 19 substantive tokens.

Root cause (pre-existing parser limitation, NOT introduced by this PR): _apply_scope_rule misclassifies nodes 23-31 (the defined-terms paragraphs that update Section 1.1) as scope="trailer". The tree-ancestor walk uses child-order under a parent as a proxy for source-text order; this proxy fails when doc2dict groups some amendment subsections under a different parent than the sig-page's path-ancestor. Word coverage stays above 90% because the proper-name tokens recur in later (g)/(h) clauses; char_ratio surfaces the missing paragraphs.

Tracked for the polish PR backlog (alongside Sections 3-10 headers being buried inside L2 record o=10).
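
Why the two metrics diverge can be illustrated with a small sketch. The definitions below are assumptions, since this PR does not spell out how freeze.py computes them: `word_coverage` is taken as the share of source tokens that appear anywhere in the reconstruction, and `char_ratio` as the ratio of non-whitespace character counts.

```python
def word_coverage(source: str, reconstructed: str) -> float:
    """Assumed definition: share of source tokens found anywhere in the output."""
    src_tokens = source.split()
    if not src_tokens:
        return 1.0
    out_vocab = set(reconstructed.split())
    return sum(tok in out_vocab for tok in src_tokens) / len(src_tokens)


def char_ratio(source: str, reconstructed: str) -> float:
    """Assumed definition: output length over source length, ignoring whitespace."""
    src_len = len("".join(source.split()))
    out_len = len("".join(reconstructed.split()))
    return out_len / src_len if src_len else 1.0


# A dropped sentence whose proper names recur in a later sentence.
source = "Grand Mesa pays. Grand Mesa ships."
output = "Grand Mesa ships."
print(f"word_coverage={word_coverage(source, output):.2%}")
print(f"char_ratio={char_ratio(source, output):.2%}")
```

Under these assumed definitions, five of the six source tokens still count as covered because "Grand" and "Mesa" recur in the surviving sentence, while the character ratio drops to roughly half. That mirrors the pattern reported above: recurring tokens mask a dropped paragraph in word coverage, and only the length-based metric exposes it.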

83 L2 records — rubric-compliant, not over-fragmentation

Of the 83 L2 records, 67 are sig-page records each corresponding 1:1 to a promoted text leaf node in the raw parquet (parent_node_id=33). doc2dict natively fragmented the multi-bank sig page into 67 separate HTML-leaf nodes; the parser preserves that grouping per the rubric's "preserve doc2dict natural HTML grouping" rule. The other 16 L2 records are lettered (a)/(b)/(c)/(d) and (g)/(h) subsections + the 6.27 PRH and PRM numbered section.

_split_dense_sig_body_no_iww does NOT over-fragment — it splits the packed sig-body into the natural lines doc2dict had already provided.

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 12 --no-truncate --output-dir data/auto_parse exits 0 with ok 12
  • uv run scripts/level_loop/freeze.py 11 --force reports word_coverage ≥ 90% (92.51%)
  • uv run scripts/level_loop/regress.py reports all 12 frozen idxs OK
  • Inspector verified both fixes; idx=0..10 byte-identical (additive parser diff)
  • Inspector independently confirmed 83 L2 records is doc2dict-driven, not parser-imposed
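
The "byte-identical" claim for idx=0..10 could be spot-checked with a hash comparison like the one below. This is a sketch under assumptions: it is not what scripts/level_loop/regress.py does, and `changed_baselines` plus the two directory arguments (a copy of the frozen folder from before and after the parser change) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_baselines(before_dir: str, after_dir: str) -> list[str]:
    """Names of idx_*.jsonl files in before_dir that are missing or differ in after_dir."""
    changed = []
    for old in sorted(Path(before_dir).glob("idx_*.jsonl")):
        new = Path(after_dir) / old.name
        if not new.exists() or sha256_of(old) != sha256_of(new):
            changed.append(old.name)
    return changed
```

An empty result means every earlier baseline survived the parser change unchanged; a new idx_11.jsonl in the second directory is ignored because only files present in the first directory are compared.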

Source

http://www.sec.gov/Archives/edgar/data/1655020/000165502017000026/xog-20170331ex1018fcf1d.htm

🤖 Generated with Claude Code


CodeAnt-AI Description

Handle dense signature pages and long all-caps legal paragraphs correctly

What Changed

  • Agreements that use an alternative signing phrase now split out a packed signature page even when no “IN WITNESS WHEREOF” line is present.
  • The split keeps the closing operating sentence as a top-level clause and separates each signature line into its own lower-level record, instead of leaving the whole page buried in one body block.
  • Long all-caps legal-emphasis paragraphs that were being treated like deep headers are now placed at the correct top level alongside nearby sections.
  • The new baseline for idx 11 is frozen into the parsed output set.

Impact

✅ Fewer missing signature pages
✅ Correct placement of legal disclaimer paragraphs
✅ More complete agreement reconstruction

…o Credit Agreement: IWW-less sig-page explosion + all-caps body-paragraph depth demotion

Two SHAPE-based parser additions to handle credit-agreement amendment idx=11:

1. `_split_dense_sig_body_no_iww`: counterpart to the existing IWW splitter
   for agreements that use an alternative sig-page operating phrase (e.g.
   "EXECUTED as of the date first set forth above.") instead of the
   canonical "IN WITNESS WHEREOF". Triggers only when no IWW anchor
   exists in the doc AND a body carries ≥3 `/s/` marks (a structurally
   dense sig page). Splits at the last sentence-ending period before
   the first sig-shape line, emits the operating clause as L1 and each
   sig-page line as L2.

2. `_demote_deeply_nested_body_paragraphs`: demotes all-caps body
   paragraphs that doc2dict mis-classified as deep predicted headers
   (HTML container nesting reflected in the depth, not structural
   hierarchy). Shape: cls=predicted header, depth ≥ 3, empty body,
   title > 60 chars, all-uppercase letters, no section-marker or
   structural-level pattern match. Re-set depth to 1 + subdoc_penalty.

Reconstruction: word_coverage=94.9%, char_ratio=80.9% (≥ 90% bar).
All 12 idxs (0..11) regress OK.
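
The demotion shape from item 2 of the commit message can be expressed as a single predicate. The sketch below is illustrative only: the record keys (`cls`, `depth`, `body`, `title`), the function names, and `SECTION_MARKER_RE` (a simplified stand-in for the section-marker and `_STRUCTURAL_LEVELS` checks) are assumptions, not the parser's actual data model.

```python
import re

# Hypothetical stand-in for the parser's section-marker / _STRUCTURAL_LEVELS checks.
SECTION_MARKER_RE = re.compile(r"^(SECTION|ARTICLE)\s+\S+|^\d+(\.\d+)*\s")


def should_demote(rec: dict) -> bool:
    title = rec.get("title", "")
    letters = [c for c in title if c.isalpha()]
    return (
        rec.get("cls") == "predicted header"
        and rec.get("depth", 0) >= 3
        and not rec.get("body", "")
        and len(title) > 60
        and bool(letters)
        and all(c.isupper() for c in letters)
        and not SECTION_MARKER_RE.match(title)
    )


def demote(records: list[dict], subdoc_penalty: int = 0) -> list[dict]:
    """Return a new list in which matching records are reset to depth 1 + subdoc_penalty."""
    out = []
    for rec in records:
        if should_demote(rec):
            rec = {**rec, "depth": 1 + subdoc_penalty}
        out.append(rec)
    return out
```

Every condition must hold at once, so a long all-caps title that also starts with a section marker keeps its depth.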

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Chores
    • Added structured data record for a new document amendment (Amendment No. 11).
    • Enhanced document parsing pipeline with improved handling for signature pages and deeply nested content structures.
    • Updated internal state tracking to reflect new document indices in the processing system.

Walkthrough

This PR enhances the document parsing pipeline with two new post-processing functions that handle edge cases in document structure extraction, then applies those improvements to freeze a parsed Amendment No. 11 document into the dataset with updated state tracking.

Changes

Parser Enhancements and Amendment Freeze

Layer: Parser post-processing functions and pipeline integration
File(s): scripts/parse_doc2dict_with_config.py
Summary: Adds _demote_deeply_nested_body_paragraphs() to reassign long, all-caps predicted-header records deep in the tree to L1 depth (respecting subdoc penalty), and _split_dense_sig_body_no_iww() to detect and split packed signature content by dense /s/ marks when no IWW anchor exists, emitting an L1 operating-clause stand-in and per-line L2 signature records. Updates pipeline in parse_one() to run the IWW-less fallback before the demotion pass.

Layer: Amendment No. 11 frozen dataset and state tracking
File(s): data/auto_parse/level_freeze/frozen/idx_11.jsonl, data/auto_parse/level_freeze/state.json
Summary: Adds 90-line JSONL containing parsed Amendment No. 11 records: amendment header, party/instrument text, substantive amendment language with definitions and Section 6.27, representations/conditions, integration clause, and signature blocks for borrower, guarantors, agent, and lenders. Extends state.json frozen indices to include 14 and 15, and records a freeze history entry for idx: 11 with 90 records at 2026-05-17T08:18:41.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • arthrod/clause-extract#5: Both PRs update data/auto_parse/level_freeze/state.json by extending frozen indices to 14 and 15 with different freeze outputs (idx_11 in this PR, idx_15 in the related PR).

Suggested labels

Feat2

Poem

🐰 Signatures scattered, no witness to swear,
Deep headers lost in the nestled lair,
Parser grows wiser, finds shape in the blur,
Amendment Eleven now safely deferred!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description check: ✅ Passed. The pull request description clearly describes the changeset: adding idx=11 as a frozen baseline, two new parser functions for handling signature pages without IWW anchors and demoting deeply nested all-caps paragraphs, verified output metrics, and test results.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Title Check: ✅ Passed. Title check skipped as CodeRabbit has written the PR title.

@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot changed the title from "idx=11: freeze (90 records) — Extraction Oil & Gas Credit Agreement Amendment No. 11 (sig-without-IWW + all-caps demote)" to "idx=11: freeze (90 records) — Extraction Oil & Gas Add document parsing post-processors and freeze Amendment 11edit Agreement Amendment No. 11 (sig-without-IWW + all-caps demote)" on May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the document parsing logic by adding functions to handle deeply nested all-caps body paragraphs and signature blocks that lack standard 'IN WITNESS WHEREOF' phrases. It also includes new frozen data for a credit agreement amendment. The review feedback focuses on improving code conciseness and readability by adopting more idiomatic Python patterns, such as using any() and next() with generator expressions instead of explicit loops and flag variables.

Comment on lines +3397 to +3403
    structural_matched = False
    for pat, _lvl in _STRUCTURAL_LEVELS:
        if pat.match(title):
            structural_matched = True
            break
    if structural_matched:
        continue

medium

For improved readability and conciseness, you can use the any() function with a generator expression to check for structural pattern matches. This avoids the need for a flag variable and an explicit loop.

        if any(pat.match(title) for pat, _ in _STRUCTURAL_LEVELS):
            continue
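
A quick way to confirm that the suggested rewrite preserves behavior is to run both forms over the same titles. The snippet is a sketch: `STRUCTURAL_LEVELS` is a hypothetical two-entry miniature of `_STRUCTURAL_LEVELS`, assumed to hold (compiled pattern, level) pairs as the loop above implies.

```python
import re

# Hypothetical miniature of _STRUCTURAL_LEVELS: (compiled pattern, level) pairs.
STRUCTURAL_LEVELS = [
    (re.compile(r"^ARTICLE\s+[IVXLC]+"), 1),
    (re.compile(r"^Section\s+\d+"), 2),
]


def matches_with_flag(title: str) -> bool:
    # Original style: flag variable plus explicit break.
    matched = False
    for pat, _lvl in STRUCTURAL_LEVELS:
        if pat.match(title):
            matched = True
            break
    return matched


def matches_with_any(title: str) -> bool:
    # Suggested style: any() stops at the first truthy match, like the break above.
    return any(pat.match(title) for pat, _ in STRUCTURAL_LEVELS)


titles = ["ARTICLE IV DEFINITIONS", "Section 3. Reaffirmation of Liens", "THIS WRITTEN AGREEMENT"]
assert [matches_with_flag(t) for t in titles] == [matches_with_any(t) for t in titles]
```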

Comment on lines +3950 to +3966
    def _is_sig_shape_line(line: str) -> bool:
        s = line.strip()
        if not s:
            return False
        if _SIGN_OFF_RE.search(s):
            return True
        if _SIG_FIELD_RE.match(s):
            return True
        # Uppercase label or corporate-suffix party-name shape.
        if _SIG_BLOCK_LABEL_RE.match(s):
            return True
        if _CORP_SUFFIX_LABEL_RE.match(s):
            return True
        # Label ending in colon ("BORROWER:", "GUARANTORS:")
        if re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s):
            return True
        return False

medium

The _is_sig_shape_line helper function can be made more concise by using the any() function with a tuple of your conditions. This improves readability by grouping all checks together.

    def _is_sig_shape_line(line: str) -> bool:
        s = line.strip()
        if not s:
            return False
        return any((
            _SIGN_OFF_RE.search(s),
            _SIG_FIELD_RE.match(s),
            _SIG_BLOCK_LABEL_RE.match(s),
            _CORP_SUFFIX_LABEL_RE.match(s),
            re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s),
        ))
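
One caveat with the tuple form above: Python builds the tuple before calling any(), so all five regexes run on every line, whereas the original chain of early returns stops at the first hit. A variant that keeps the short-circuit is sketched below; the three patterns in `CHECKS` are simplified, hypothetical stand-ins for the parser's `_SIGN_OFF_RE`, `_SIG_FIELD_RE`, and label patterns.

```python
import re

# Bound search/match methods, evaluated lazily one at a time by any().
# The patterns are hypothetical simplifications of the parser's real ones.
CHECKS = (
    re.compile(r"/s/").search,
    re.compile(r"^(By|Name|Title):").match,
    re.compile(r"^[A-Z][A-Z .,&'\-/]{1,60}:$").match,
)


def is_sig_shape_line(line: str) -> bool:
    s = line.strip()
    # The generator calls each check only until one returns a match.
    return bool(s) and any(check(s) for check in CHECKS)
```

Storing the bound methods rather than their results is what defers the work. It also addresses the Ruff PLR0911 warning about too many return statements without changing evaluation order.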

Comment on lines +3985 to +3992
    first_sig_line_idx: int | None = None
    for i, line in enumerate(lines):
        if _is_sig_shape_line(line):
            first_sig_line_idx = i
            break
    if first_sig_line_idx is None:
        # /s/ is present but no clean line break before it — bail.
        continue

medium

To make this part of the code more idiomatic and concise, you can use next() with a generator expression to find the index of the first signature line. This avoids the explicit loop and break.

Suggested change
    first_sig_line_idx: int | None = None
    for i, line in enumerate(lines):
        if _is_sig_shape_line(line):
            first_sig_line_idx = i
            break
    if first_sig_line_idx is None:
        # /s/ is present but no clean line break before it — bail.
        continue

    first_sig_line_idx = next((i for i, line in enumerate(lines) if _is_sig_shape_line(line)), None)
    if first_sig_line_idx is None:
        # /s/ is present but no clean line break before it — bail.
        continue
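
The next() idiom in this suggestion relies on the second argument as the "not found" default. A standalone sketch follows, with `first_index` as a hypothetical helper name. The guard has to stay `is None` rather than a truthiness test, because index 0 is a valid result.

```python
from typing import Callable, Optional, Sequence


def first_index(lines: Sequence[str], predicate: Callable[[str], bool]) -> Optional[int]:
    """Index of the first line satisfying predicate, or None when none does."""
    return next((i for i, line in enumerate(lines) if predicate(line)), None)


lines = ["/s/ Jane Doe", "Name: Jane Doe"]
idx = first_index(lines, lambda ln: "/s/" in ln)
# idx is 0 here: `if not idx` would wrongly treat this hit as a miss.
assert idx is not None
```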

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@data/auto_parse/level_freeze/frozen/idx_11.jsonl`:
- Around line 11-12: The Section 3(a) clause is split and duplicated between the
two records with idx 11 (order 10 having a long "span" and order 11 starting
"(a) Each of the Borrower..."); reassemble the clause by merging order 11's
fragment "(a) Each of the Borrower and each Guarantor (i) is party to certain
Security Documents securing and supporting the Borrower's and Guarantors’
obligations under the Loan" into the correct position inside the long span in
the record with order 10 so Section 3(a) reads continuously, remove the
duplicated/truncated tokens that follow ("Documents, (ii)...") so the clause
numbering and subsequent sections (Section 3, 4, etc.) remain intact and
non-duplicated, and ensure the final "span" text reflects the full, ordered
Section 3(a) without fragmentation.

In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4001-4023: The current logic clears the original record body
unconditionally (r["body_direct"] = "" / r["body_direct_chars"] = 0) which loses
pre_sig_text when no period is found (last_period == -1); instead, when
last_period < 0 preserve the original prefix by leaving r["body_direct"] and
r["body_direct_chars"] untouched or by assigning pre_sig_text back into
r["body_direct"] (and its length into r["body_direct_chars"]) so that if op_ok
is False the content is not silently dropped; update the code paths around
pre_sig_text / operating_text / op_ok to only clear the body when you have
successfully extracted a valid operating_text record.
- Around line 4059-4063: The current global dedup using seen_lines over the
iterable cleaned incorrectly removes repeated signature fields across signer
blocks; replace it with consecutive-only dedup by removing the seen_lines set
and instead compare each line to the previous line (e.g., previous_line variable
initialized to None) while iterating over cleaned, only skipping the line when
it equals previous_line, and update previous_line after appending; keep
references to cleaned and the loop variable line so the change is localized to
that loop.
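
The difference between the two dedup strategies named in the last finding can be seen on a small signature-page sample. This is a sketch with hypothetical function names and sample lines; it is not the parser's loop, only the behavior the finding contrasts.

```python
from itertools import groupby


def dedup_global(lines: list[str]) -> list[str]:
    """Drop every repeat of a line seen anywhere earlier (the behavior being flagged)."""
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out


def dedup_consecutive(lines: list[str]) -> list[str]:
    """Drop a line only when it equals the line immediately before it."""
    # groupby groups adjacent equal items only, so later repeats survive.
    return [key for key, _group in groupby(lines)]


sig_lines = ["By:", "By:", "Title: Director", "By:", "Title: Director"]
print(dedup_global(sig_lines))
print(dedup_consecutive(sig_lines))
```

With global dedup the second signer's "By:" and "Title: Director" lines disappear; consecutive-only dedup keeps both signer blocks intact while still collapsing the doubled first line.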

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 28b827db-d05d-40af-aba6-86a5704b5e19

📥 Commits

Reviewing files that changed from the base of the PR and between b20755b and 721dd85.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_11.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py

[warning] 3892-3892: Too many branches (15 > 12)

(PLR0912)


[warning] 3892-3892: Too many statements (59 > 50)

(PLR0915)


[warning] 3950-3950: Too many return statements (7 > 6)

(PLR0911)


[warning] 3964-3966: Return the condition bool(re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s)) directly

Replace with return bool(re.match(r"^[A-Z][A-Z .,&'\-/]{1,60}:$", s))

(SIM103)

🔍 Remote MCP

Additional Context Found for PR Review

Based on the PR context and web research, here's relevant information for reviewing this pull request:

1. Document Parsing Landscape & Signature Extraction Context

Legal document parsing typically involves extracting terms, clauses, signatures, and renewal dates from agreements using AI-driven automatic extraction. The two new passes added in this PR address real challenges in this domain:

  • _split_dense_sig_body_no_iww() pass: Many legal documents remain in scanned or image-based formats, posing challenges to information extraction and document comprehension. Document parsing is an essential tool for converting unstructured documents into structured information. This pass handles the specific case where contracts lack the standard "IN WITNESS WHEREOF" anchor—a valid scenario in legal documents with alternative closing phrases.

  • _demote_deeply_nested_body_paragraphs() pass: Enterprise documents should be interpreted as intricately layered documents with hierarchical relationships between sections. Existing solutions miss how individual pages fit into the broader context. This pass corrects structural misclassifications from the HTML parsing upstream.

2. Hierarchical Structure Importance

In an end-to-end RAG evaluation using a dataset of SEC 10Ks and 10Qs, including document hierarchy metadata in chunks increased the equivalence score from 69.2% to 84.0%. The PR's emphasis on maintaining depth hierarchy (L0, L1, L2 distribution) and respecting subdoc penalties aligns with industry best practices for structured legal document extraction.

3. Quality Metrics Validation

The PR reports:

  • 92.51% word_coverage (passes 90% gate)
  • 80.89% char_ratio (flagged but passes rubric)
  • 90 records with {L0:1, L1:6, L2:83} distribution

Traditional document processing approaches often produce "flat" text that loses the semantic structure. A rich ontology of element types helps maintain the original knowledge architecture and capture hierarchical relationships. The PR's metrics demonstrate this approach is working.

4. Signature Extraction Challenges

Tables are among the hardest PDF elements to extract reliably because their structure is often implied visually rather than encoded explicitly. The most robust approach is to convert tables into a format that preserves relationships such as HTML or structured JSON. Similarly, signature blocks vary significantly in format, and handling non-IWW variants is a legitimate enhancement.


Review Recommendations

  1. Verify the _split_dense_sig_body_no_iww() logic: Confirm the heuristic (≥3 /s/ marks, split at last sentence-ending period) doesn't trigger false positives on non-signature content with multiple slashes.

  2. Test the all-caps demotion criteria: The demotion rule (depth≥3, empty body, >60 chars, all-uppercase, no section markers) should be validated on a broader set of documents to ensure it doesn't demote legitimate headers.

  3. Regression coverage: The PR mentions idx 0..10 remain byte-identical—confirm this holds across all parsing variations.

  4. SEC filing source: The idx=11 source is SEC EDGAR data (Extraction Oil & Gas, Inc., March 15, 2017), which is well-structured but represents a specific document class. Consider whether these passes generalize to other contract types.



Comment on lines +11 to +12
{"idx": 11, "order": 10, "level": 2, "span": "6.27        PRH and PRM.  Notwithstanding anything to the contrary contained herein, no Loan Party shall, nor shall it permit any of its Subsidiaries to, create, assume, incur or suffer to exist any Lien on or in respect of any of its Property for the benefit of PRH or PRM.\nSection 3.         Reaffirmation of Liens.\nDocuments, (ii) represents and warrants that it has no defenses to the enforcement of the Security Documents and that according to their terms the Security Documents will continue in full force and effect to secure the Borrower’s and Guarantors’ obligations under the Loan Documents, as the same may be amended, supplemented, or otherwise modified, and (iii) acknowledges, represents, and warrants that the liens and security interests created by the Security Documents are valid and subsisting and create a first and prior Lien (subject only to Permitted Liens) in the Collateral to secure the Secured Obligations.\nSection 4.         Reaffirmation of Guaranty.  Each Guarantor hereby ratifies, confirms, and acknowledges that its obligations under the Guaranty and the other Loan Documents are in full force and effect and that such Guarantor continues to unconditionally and irrevocably guarantee the full and punctual payment, when due, whether at stated maturity or earlier by acceleration or otherwise, of all of the Guaranteed Obligations (as defined in the Guaranty), as such Guaranteed Obligations may have been amended by this Agreement.  Each Guarantor hereby acknowledges that its execution and delivery of this Agreement does not indicate or establish an approval or consent requirement by such Guarantor under the Credit Agreement in connection with the execution and delivery of amendments, modifications or waivers to the Credit Agreement, the Notes or any of the other Loan Documents.\nSection 5.         Representations and Warranties.  
Each of the Borrower and each Guarantor represents and warrants to the Administrative Agent and the Lenders that:\nSection 6.         Effectiveness.  This Agreement shall become effective as of the date hereof upon the occurrence of all of the following:\nSection 7.         Post-Closing Obligations.  On or before 5:00 p.m. (Houston, Texas time) on the effective date of the Proposed Contribution, the Borrower shall deliver to the Administrative Agent certified, fully executed, correct and complete copies of the PRH Contribution Agreement, the PRH LLC Agreement, and the PRM Transportation Agreement, in each case, as in effect on the Effective Date.  The Borrower's failure to satisfy the obligations set forth in this Section 7 shall constitute an immediate Event of Default under this Agreement and the Credit Agreement.\nSection 8.         Effect on Loan Documents.  Except as amended herein, the Credit Agreement and the Loan Documents remain in full force and effect as originally executed and are hereby ratified and confirmed, and nothing herein shall act as a waiver of any of the Administrative Agent's or Lenders' rights under the Loan Documents.  This Agreement is a Loan Document for the purposes of the provisions of the other Loan Documents.  Without limiting the foregoing, any breach of representations, warranties, and covenants under this Agreement is a Default or Event of Default under other Loan Documents.\nSection 9.         Choice of Law.  This Agreement shall be governed by and construed and enforced in accordance with the laws of the State of New York without regard to conflicts of laws principles (other than Sections 5-1401 and 5-1402 of the General Obligations Law of the State of New York).\nSection 10.         Counterparts.  This Agreement may be signed in any number of counterparts, each of which shall be an original."}
{"idx": 11, "order": 11, "level": 2, "span": "(a)         Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan"}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Section 3(a) text is fragmented across records in a way that corrupts clause continuity.

Line 12 ends with ...under the Loan, while Line 11 resumes Documents, (ii)... and also carries subsequent sections. This indicates a bad split/order around Section 3(a), and the frozen baseline now embeds duplicated/truncated clause content.
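
A truncation like this can be flagged mechanically before a baseline is frozen. The following is a minimal sketch, not part of the PR: `looks_truncated`, `find_truncated`, and the terminator list are hypothetical names and heuristics, assuming only that each JSONL record carries the `order` and `span` fields shown above.

```python
import json

# Hypothetical heuristic: a clause-level span normally ends with terminal
# punctuation, or with a list connector such as "; and" / "; or".
TERMINATORS = (".", ":", ";", "; and", "; or", '."', ".\u201d")


def looks_truncated(span: str) -> bool:
    """Return True when a non-empty span stops without a clause terminator."""
    text = span.rstrip()
    return bool(text) and not text.endswith(TERMINATORS)


def find_truncated(jsonl_lines):
    """Return the `order` of every record whose span looks cut off."""
    flagged = []
    for raw in jsonl_lines:
        record = json.loads(raw)
        if looks_truncated(record["span"]):
            flagged.append(record["order"])
    return flagged
```

On the records quoted in this thread, the `order=11` span ending in `under the Loan` would be flagged, while spans ending in `; and` or `:` would pass. The heuristic is deliberately loose and would need tuning against real baselines.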

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/auto_parse/level_freeze/frozen/idx_11.jsonl` around lines 11 - 12, The
Section 3(a) clause is split and duplicated between the two records with idx 11
(order 10 having a long "span" and order 11 starting "(a) Each of the
Borrower..."); reassemble the clause by merging order 11's fragment "(a) Each of
the Borrower and each Guarantor (i) is party to certain Security Documents
securing and supporting the Borrower's and Guarantors’ obligations under the
Loan" into the correct position inside the long span in the record with order 10
so Section 3(a) reads continuously, remove the duplicated/truncated tokens that
follow ("Documents, (ii)...") so the clause numbering and subsequent sections
(Section 3, 4, etc.) remain intact and non-duplicated, and ensure the final
"span" text reflects the full, ordered Section 3(a) without fragmentation.

Comment on lines +4001 to +4023
        last_period = pre_sig_text.rfind(".")
        if last_period >= 0:
            operating_text = pre_sig_text[: last_period + 1].rstrip()
            # Fragment text between the operating-clause period and the
            # first sig-shape line is dropped — typically a single-char
            # doc2dict drop ("atthew" from "Matthew") that has no
            # downstream consumer.
        else:
            operating_text = ""
        # The sig-area is the remaining lines.
        sig_area_lines = lines[first_sig_line_idx:]
        # Keep nothing on the original body — the sig page was its
        # entire content. (If the body had a real agreement-clause
        # prefix it should already be its own record, not packed here.)
        # Operating-clause-stand-in requirements: substantive content,
        # ends with a period.
        op_ok = bool(operating_text) and operating_text.rstrip().endswith(".")

        # Clear the original record's body (the sig content has been
        # extracted into its own records below).
        r["body_direct"] = ""
        r["body_direct_chars"] = 0


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent silent data loss when no pre-signature period is found.

At Line 4021 the original body is always cleared, but when last_period == -1 (Line 4008), pre_sig_text is discarded entirely. Preserve that prefix instead of dropping it.

Suggested fix
-        if last_period >= 0:
-            operating_text = pre_sig_text[: last_period + 1].rstrip()
-            # Fragment text between the operating-clause period and the
-            # first sig-shape line is dropped — typically a single-char
-            # doc2dict drop ("atthew" from "Matthew") that has no
-            # downstream consumer.
-        else:
-            operating_text = ""
+        if last_period >= 0:
+            operating_text = pre_sig_text[: last_period + 1].rstrip()
+        else:
+            operating_text = pre_sig_text.strip()

@@
-        # Clear the original record's body (the sig content has been
-        # extracted into its own records below).
-        r["body_direct"] = ""
-        r["body_direct_chars"] = 0
+        # Only clear when we successfully emitted an operating-clause record.
+        if op_ok:
+            r["body_direct"] = ""
+            r["body_direct_chars"] = 0
+        else:
+            r["body_direct"] = operating_text
+            r["body_direct_chars"] = len(operating_text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parse_doc2dict_with_config.py` around lines 4001 - 4023, The current
logic clears the original record body unconditionally (r["body_direct"] = "" /
r["body_direct_chars"] = 0) which loses pre_sig_text when no period is found
(last_period == -1); instead, when last_period < 0 preserve the original prefix
by leaving r["body_direct"] and r["body_direct_chars"] untouched or by assigning
pre_sig_text back into r["body_direct"] (and its length into
r["body_direct_chars"]) so that if op_ok is False the content is not silently
dropped; update the code paths around pre_sig_text / operating_text / op_ok to
only clear the body when you have successfully extracted a valid operating_text
record.
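
The intended behavior can be sketched in isolation. This is an illustrative stand-alone version, not the code in `scripts/parse_doc2dict_with_config.py`: `split_operating_clause` and `apply_split` are hypothetical helper names, and only the `body_direct` / `body_direct_chars` keys come from the snippet under review.

```python
def split_operating_clause(pre_sig_text: str) -> tuple[str, str]:
    """Split the text before the signature block.

    Returns (operating_text, leftover_body). When a sentence-ending period
    exists, everything up to it becomes the operating clause and the short
    trailing fragment is dropped, as in the reviewed code. When no period
    exists, nothing is promoted and the whole prefix is kept as leftover.
    """
    last_period = pre_sig_text.rfind(".")
    if last_period >= 0:
        return pre_sig_text[: last_period + 1].rstrip(), ""
    return "", pre_sig_text.strip()


def apply_split(record: dict, pre_sig_text: str) -> str:
    """Update the record body and return the operating clause (may be empty)."""
    operating_text, leftover = split_operating_clause(pre_sig_text)
    # The body is cleared only when an operating clause was extracted;
    # otherwise the prefix stays on the record instead of being lost.
    record["body_direct"] = leftover
    record["body_direct_chars"] = len(leftover)
    return operating_text
```

With a prefix such as `EXECUTED as of the date first set forth above. atthew`, the operating clause is returned and the body is cleared. With the same text lacking its period, the operating clause is empty and the prefix remains in `body_direct`.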

Comment on lines +4059 to +4063
        seen_lines: set[str] = set()
        for line in cleaned:
            if line in seen_lines:
                continue
            seen_lines.add(line)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use consecutive dedup, not global dedup, for signature lines.

At Line 4059, global seen_lines dedup can remove legitimate repeated fields across different signer blocks (e.g., repeated By:/Name:/titles). Dedup should be consecutive-only.

Suggested fix
-        seen_lines: set[str] = set()
+        prev_line: str | None = None
         for line in cleaned:
-            if line in seen_lines:
+            if line == prev_line:
                 continue
-            seen_lines.add(line)
+            prev_line = line
             new_rows.append({
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parse_doc2dict_with_config.py` around lines 4059 - 4063, The current
global dedup using seen_lines over the iterable cleaned incorrectly removes
repeated signature fields across signer blocks; replace it with consecutive-only
dedup by removing the seen_lines set and instead compare each line to the
previous line (e.g., previous_line variable initialized to None) while iterating
over cleaned, only skipping the line when it equals previous_line, and update
previous_line after appending; keep references to cleaned and the loop variable
line so the change is localized to that loop.
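
The difference between the two strategies is easy to see on a toy signature page. The sketch below is illustrative only: the function names and the signer lines are made up for the example and do not come from the parser or from the agreement.

```python
def dedup_global(lines: list[str]) -> list[str]:
    """Drop every repeat of a line, wherever it occurs (the reviewed behavior)."""
    seen: set[str] = set()
    out: list[str] = []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        out.append(line)
    return out


def dedup_consecutive(lines: list[str]) -> list[str]:
    """Drop a line only when it repeats the line immediately before it."""
    out: list[str] = []
    prev_line: str | None = None
    for line in lines:
        if line == prev_line:
            continue
        prev_line = line
        out.append(line)
    return out


# Hypothetical signature page: two signer blocks that share the same title,
# plus one accidental back-to-back duplicate inside the first block.
SIG_LINES = [
    "BANK A",
    "By: /s/ Jane Doe",
    "Title: Vice President",
    "Title: Vice President",
    "BANK B",
    "By: /s/ John Roe",
    "Title: Vice President",
]
```

On `SIG_LINES`, `dedup_consecutive` removes only the back-to-back duplicate and keeps the second signer's title, whereas `dedup_global` also strips the title from the second block.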

{"idx": 11, "order": 8, "level": 2, "span": "(h)          (i) Permitted Investments of the type described in Section 6.3(g) and (ii) other Asset Sales of Property not constituting Oil and Gas Properties and not otherwise permitted by this Section 6.8, the aggregate consideration of which shall not exceed $5,000,000 during the term of this Agreement; and"}
{"idx": 11, "order": 9, "level": 2, "span": "(e)         Article 6 of the Credit Agreement (Negative Covenants) is further amended by adding the following new Section 6.27 to the end thereof:"}
{"idx": 11, "order": 10, "level": 2, "span": "6.27        PRH and PRM.  Notwithstanding anything to the contrary contained herein, no Loan Party shall, nor shall it permit any of its Subsidiaries to, create, assume, incur or suffer to exist any Lien on or in respect of any of its Property for the benefit of PRH or PRM.\nSection 3.         Reaffirmation of Liens.\nDocuments, (ii) represents and warrants that it has no defenses to the enforcement of the Security Documents and that according to their terms the Security Documents will continue in full force and effect to secure the Borrower’s and Guarantors’ obligations under the Loan Documents, as the same may be amended, supplemented, or otherwise modified, and (iii) acknowledges, represents, and warrants that the liens and security interests created by the Security Documents are valid and subsisting and create a first and prior Lien (subject only to Permitted Liens) in the Collateral to secure the Secured Obligations.\nSection 4.         Reaffirmation of Guaranty.  Each Guarantor hereby ratifies, confirms, and acknowledges that its obligations under the Guaranty and the other Loan Documents are in full force and effect and that such Guarantor continues to unconditionally and irrevocably guarantee the full and punctual payment, when due, whether at stated maturity or earlier by acceleration or otherwise, of all of the Guaranteed Obligations (as defined in the Guaranty), as such Guaranteed Obligations may have been amended by this Agreement.  Each Guarantor hereby acknowledges that its execution and delivery of this Agreement does not indicate or establish an approval or consent requirement by such Guarantor under the Credit Agreement in connection with the execution and delivery of amendments, modifications or waivers to the Credit Agreement, the Notes or any of the other Loan Documents.\nSection 5.         Representations and Warranties.  
Each of the Borrower and each Guarantor represents and warrants to the Administrative Agent and the Lenders that:\nSection 6.         Effectiveness.  This Agreement shall become effective as of the date hereof upon the occurrence of all of the following:\nSection 7.         Post-Closing Obligations.  On or before 5:00 p.m. (Houston, Texas time) on the effective date of the Proposed Contribution, the Borrower shall deliver to the Administrative Agent certified, fully executed, correct and complete copies of the PRH Contribution Agreement, the PRH LLC Agreement, and the PRM Transportation Agreement, in each case, as in effect on the Effective Date.  The Borrower's failure to satisfy the obligations set forth in this Section 7 shall constitute an immediate Event of Default under this Agreement and the Credit Agreement.\nSection 8.         Effect on Loan Documents.  Except as amended herein, the Credit Agreement and the Loan Documents remain in full force and effect as originally executed and are hereby ratified and confirmed, and nothing herein shall act as a waiver of any of the Administrative Agent's or Lenders' rights under the Loan Documents.  This Agreement is a Loan Document for the purposes of the provisions of the other Loan Documents.  Without limiting the foregoing, any breach of representations, warranties, and covenants under this Agreement is a Default or Event of Default under other Loan Documents.\nSection 9.         Choice of Law.  This Agreement shall be governed by and construed and enforced in accordance with the laws of the State of New York without regard to conflicts of laws principles (other than Sections 5-1401 and 5-1402 of the General Obligations Law of the State of New York).\nSection 10.         Counterparts.  This Agreement may be signed in any number of counterparts, each of which shall be an original."}
{"idx": 11, "order": 11, "level": 2, "span": "(a)         Each of the Borrower and each Guarantor (i) is party to certain Security Documents securing and supporting the Borrower's and Guarantors’ obligations under the Loan"}

Suggestion: This span is truncated mid-clause (...under the Loan) and no longer forms a complete legal sentence, which will break downstream clause-level consumers that expect each record to be semantically complete. Regenerate this baseline after fixing the signature/body split so this record contains the full (a) clause text (or is merged with its continuation). [incomplete implementation]

Severity Level: Critical 🚨
- ❌ Level-freeze baseline for idx=11 stores truncated clause.
- ❌ Regression checks lock parser into emitting incomplete clause.
- ⚠️ Clause-level consumers misinterpret `(a)` obligations text.
Steps of Reproduction ✅
1. Run the configured parser as described in `README.md` lines 83–85 (`uv run
scripts/parse_doc2dict_with_config.py --output-dir data/auto_parse ...`), which writes the
current parser output for all agreements to
`data/auto_parse/parse_doc2dict_with_config_nodes.jsonl` (see
`scripts/level_loop/freeze.py` lines 8–11 and 41).

2. Freeze idx=11 by running `uv run scripts/level_loop/freeze.py 11`, which calls
`filter_idx()` in `scripts/level_loop/freeze.py` lines 63–77 to collect all records with
`idx == 11` from `parse_doc2dict_with_config_nodes.jsonl`, validates them in
`validate_records()` (lines 513–615), and writes them verbatim to
`data/auto_parse/level_freeze/frozen/idx_11.jsonl` in `main()` lines 618–691.

3. Inspect the frozen baseline at `data/auto_parse/level_freeze/frozen/idx_11.jsonl` (tool
output `Read` lines 11–12): record `order=10` (line 11) contains the continuation text
`"Documents, (ii) ... Security Documents ... obligations under the Loan Documents, as the
same may be amended..."`, while record `order=11` (line 12, shown in this suggestion)
contains only the prefix `(a) Each of the Borrower and each Guarantor (i) is party to
certain Security Documents ... obligations under the Loan` and stops mid-sentence before
the word `Documents`, leaving the `(a)` clause span truncated and its continuation
stored in the preceding record.

4. Downstream consumers rely on each JSONL record being a full clause:
`task_rules/level_rubric.md` lines 5–7 and 124–142 define the parser's goal as slicing
into clauses so that each `span` is heading+body and the concatenation of spans in `order`
reconstructs the source; `scripts/level_loop/regress.py` lines 45–58 and 66–78 load the
frozen baseline and compare future parser output record-by-record on `(idx, level, span)`.
Any attempt to fix the parser so that the `(a)` clause emits as a single complete span
will make the new `order=11` record differ from this truncated frozen baseline, causing
`regress.py` to report a regression for idx=11 even though the new output is semantically
correct.


Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** data/auto_parse/level_freeze/frozen/idx_11.jsonl
**Line:** 12:12
**Comment:**
	*Incomplete Implementation: This span is truncated mid-clause (`...under the Loan`) and no longer forms a complete legal sentence, which will break downstream clause-level consumers that expect each record to be semantically complete. Regenerate this baseline after fixing the signature/body split so this record contains the full `(a)` clause text (or is merged with its continuation).

Validate the correctness of the flagged issue. If it is correct, explain how to resolve it. If you propose a fix, implement it and keep it concise.
Once the fix is implemented, also check the other comments on the same PR and ask the user whether to fix those as well. If the user says yes, fetch all the comments, validate their correctness, and implement a minimal fix for each.

@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

Labels

Feat2 size:L This PR changes 100-499 lines, ignoring generated files
