idx=7: freeze (25 records) — Ingevity/WestRock Amendment No.1, multi-line title + page-footer exhibit strip by arthrod · Pull Request #80 · arthrod/clause-extract

arthrod · 2026-05-17T11:19:04Z

User description

Summary

Eighth stacked PR. Adds idx=7 (AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR SOAP SKIMMINGS AGREEMENT between Ingevity Corporation and WestRock Shared Services / WestRock MWV, March 2017) as the eighth verified frozen baseline on top of idx=6 (PR #79).

Parser changes (2 surgical, shape-driven)

_merge_multiline_l0_title (~line 2317): handles multi-line agreement titles. When the L0 record has a preceding sibling (same parent_node_id, smaller node_id) that is an UPPERCASE predicted-header continuation line, the pass merges that sibling's title into the L0 title. A disqualifier regex blocks ARTICLE/SECTION/EXHIBIT/SCHEDULE/APPENDIX/ANNEX/WITNESSETH/WHEREAS/RECITALS/NOW THEREFORE/IN WITNESS WHEREOF/numbered/lettered patterns. The inspector noted that doc2dict had already pre-merged idx=7's title at node_id=3, so for this idx the new pass mainly suppresses the duplicate first-line sibling. The more general pattern covers future cases where doc2dict fails to merge.
_strip_page_footer_exhibit_titles (~line 3258) — handles doc2dict promoting repeating page-header tags ("Exhibit 10.7" stamped on every page) into spurious cls=exhibit records. Without the fix, those bare-identifier titles match the FIRST source occurrence (the cover-page header) via the sort probe, dragging records to document start where they get dropped as pre-title chrome. Fix: clears the title on cls in {exhibit, schedule, appendix, annex} records that are NOT is_envelope, NOT real subdocs (per _is_real_subdoc_title), have a bare-identifier title, AND have substantive body (≥60 chars). The 60-char threshold protects idx=5's tiny "Schedule 1.1" + "Ex. B-98" trailing artifact.
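The strip conditions listed above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `_BARE_IDENTIFIER` pattern, the function name, and the `is_real_subdoc_title` callback are hypothetical stand-ins, and measuring the body after `strip()` is an assumption. Only the field names (`cls`, `title`, `body_direct`, `is_envelope`) and the 60-char threshold come from the description.

```python
import re

# Hypothetical stand-in for the bare-identifier test; the PR's real pattern is not shown.
_BARE_IDENTIFIER = re.compile(
    r"^(exhibit|schedule|appendix|annex)\s+[A-Za-z0-9][A-Za-z0-9.\-]*$",
    re.IGNORECASE,
)
_STRIP_CLASSES = {"exhibit", "schedule", "appendix", "annex"}
_MIN_BODY_CHARS = 60  # threshold quoted in the PR description


def strip_page_header_titles(records, is_real_subdoc_title=lambda title: False):
    """Clear bare exhibit-style titles on records that carry substantive body text."""
    for rec in records:
        title = (rec.get("title") or "").strip()
        body = (rec.get("body_direct") or "").strip()  # assumption: length is measured after strip
        if (
            rec.get("cls") in _STRIP_CLASSES
            and not rec.get("is_envelope")
            and not is_real_subdoc_title(title)
            and _BARE_IDENTIFIER.match(title)
            and len(body) >= _MIN_BODY_CHARS
        ):
            rec["title"] = ""  # a position sorter can then fall back to the body text
    return records
```

Under these assumptions a "Schedule 1.1" record whose body is only page chrome keeps its title, because the body falls short of the threshold.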

Verified output for idx=7

25 records, distribution {L0:1, L1:9, L2:12, L3:3}
Reconstruction: word_coverage 97.0%, char_ratio 97.7%
Max depth: 3
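
The PR does not define how word_coverage and char_ratio are computed. As a rough illustration only, one plausible reading is "share of source words recovered" and "reconstructed length over source length"; both function bodies below are assumptions, not the repo's metrics.

```python
from collections import Counter


def word_coverage(source: str, reconstructed: str) -> float:
    """Fraction of source words (counted with multiplicity) present in the reconstruction."""
    src = Counter(source.split())
    rec = Counter(reconstructed.split())
    total = sum(src.values())
    if total == 0:
        return 1.0
    covered = sum(min(count, rec[word]) for word, count in src.items())
    return covered / total


def char_ratio(source: str, reconstructed: str) -> float:
    """Length of the reconstruction relative to the source."""
    return len(reconstructed) / len(source) if source else 1.0
```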

Top-level structure (verbatim L0 + sig area)

o=0  L0: AMENDMENT NO.1 TO CRUDE TALL OIL AND
         BLACK LIQUOR SOAP SKIMMINGS AGREEMENT
o=1  L1: This Amendment No.1 (this "Amendment") to the Supply Agreement, dated as of March 1, 2017...
...
o=12 L1: IN WITNESS WHEREOF...                              (IWW operating clause)
o=13-24 L2: INGEVITY CORPORATION / WESTROCK SHARED / SERVICES, LLC / By:_/S/ S. Edward Woodcock, Jr. / By:_/S/ John D. Stakel / Name: ... / Title: ... / Date: March 1, 2017 / Date: March 8, 2017 / WESTROCK MWV, LLC (doc2dict natural per-line grouping)

Test plan

uv run scripts/parse_doc2dict_with_config.py --limit 8 --no-truncate --output-dir data/auto_parse exits 0 with ok 8
uv run scripts/level_loop/freeze.py 7 --force reports word_coverage ≥ 90% (97.0%)
uv run scripts/level_loop/regress.py reports all 8 frozen idxs OK
Inspector verified multi-line title verbatim, no spurious "Exhibit 10.7" records, idx=5 trailing artifact preserved (60-char threshold works), idx=2 real subdocs unaffected

Source

http://www.sec.gov/Archives/edgar/data/1653477/000165347717000051/ex107amendmentno1toctoagre.htm

Why this matters for the corpus

Multi-line title support: many SEC EX-10 filings have titles that wrap across two lines. The merge pass handles both cases — when doc2dict succeeds at merging (and we just need to drop the duplicate sibling) and when doc2dict fails (and we need to do the merge ourselves).
Page-footer exhibit-tag strip: SEC filings that stamp the exhibit identifier ("Exhibit 10.7") on every page get this tag promoted by doc2dict into spurious cls=exhibit sections with substantive body. Without the strip, those records get sort-ordered to document start and dropped as pre-title chrome — losing the body content. The strip clears the title while preserving the body at its true position.
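
To see why a bare title pulls a record to the document start, here is a small sketch of a first-occurrence position probe. It assumes the sorter tries the title before the body and uses a plain substring search; `source_position` is a hypothetical name, not the actual `_sort_records_by_source_position` logic.

```python
def source_position(record: dict, source_text: str) -> int:
    """Return the offset of the first probe found in the source, or -1 if none match."""
    for probe in (record.get("title"), record.get("body_direct")):
        probe = (probe or "").strip()
        if probe:
            pos = source_text.find(probe)  # str.find returns the FIRST occurrence
            if pos != -1:
                return pos
    return -1


source = "Exhibit 10.7\nTITLE\nfirst page text\nExhibit 10.7\nsecond page text"
record = {"title": "Exhibit 10.7", "body_direct": "second page text"}

with_title = source_position(record, source)     # anchors at the cover-page header
record["title"] = ""
without_title = source_position(record, source)  # anchors at the body's real offset
```

With the title in place the record lands at offset 0; once the title is cleared, the body probe locates it on the second page.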

🤖 Generated with Claude Code

CodeAnt-AI Description

Freeze idx=7: preserve a multi-line agreement title and keep exhibit pages in the right place

What Changed

Multi-line agreement titles now stay together as one top-level title, instead of splitting the first line into a separate section
Repeating page-header exhibit labels no longer pull exhibit pages to the start of the document, so their body text stays in place
Added a frozen baseline for idx=7

Impact

✅ Correct agreement titles
✅ Fewer lost exhibit sections
✅ More complete document extractions


…line L0 title merge + page-footer exhibit-title strip

idx=7 is an Amendment No.1 to the Crude Tall Oil and Black Liquor Soap Skimmings Agreement between Ingevity Corp (Buyer) and WestRock (Seller, two affiliates). Title is typeset across two visual lines:

AMENDMENT NO.1 TO CRUDE TALL OIL AND
BLACK LIQUOR SOAP SKIMMINGS AGREEMENT

Two parser additions, both purely structural (SHAPE-based, no phrase blocklists):

1. `_merge_multiline_l0_title` (new): when the L0 has a preceding sibling at the same parent (same parent_node_id, smaller node_id) that is a predicted-header with empty body, an UPPERCASE title (≥60% uppercase alphabetic chars), no structural-header / section / recital / signature pattern, and no sentence-terminator, it is a title-continuation line. Merge it into the L0 title (continuation first, L0 second) joined by newline; mark the continuation record as is_envelope=True so it drops from JSONL but stays in parquet. Runs BEFORE `_split_l0_title_from_preamble` so the preamble split sees the complete title. The disqualifier regex blocks ARTICLE / SECTION / EXHIBIT / WHEREAS / numbered-marker / lettered-marker / etc. patterns so body lines after the title aren't absorbed.

2. `_strip_page_footer_exhibit_titles` (new): doc2dict's HTML walker promotes each repeating page-header tag ("Exhibit 10.7" stamped on every page) into its own cls=exhibit section. The FIRST one is the legitimate SEC envelope (is_envelope=True already). Subsequent ones carry substantive body content from the page they head, but their title is the bare exhibit identifier, which becomes the highest-priority probe in `_sort_records_by_source_position` and matches the FIRST source occurrence, dragging the record to the document start where `_drop_pre_title_position_records` would drop it as pre-title chrome.

Fix: clear the title on cls=exhibit/schedule/appendix/annex records that are not is_envelope, not real subdocs (per `_is_real_subdoc_title`), have a bare-identifier title, and have substantive body (≥60 chars). The sorter falls back to the body probe and locates the record at its true source position. The 60-char threshold protects records like idx=5's "Schedule 1.1" + " \n \nEx. B-98" body (page chrome, not substantive) from being stripped.

Runs:
- `_merge_multiline_l0_title` before `_split_l0_title_from_preamble`.
- `_strip_page_footer_exhibit_titles` after `_explode_signature_block_lines` and before `_sort_records_by_source_position`.

idx=7 final stats:
- 25 records (1 L0, 9 L1, 12 L2, 3 L3)
- reconstruction: word_coverage 97.0%, char_ratio 97.7% (≥ 90% bar)
- L0 = "AMENDMENT NO.1 TO CRUDE TALL OIL AND\nBLACK LIQUOR SOAP SKIMMINGS AGREEMENT"

Regression: idx=0..7 all OK (75, 532, 422, 102, 77, 18, 69, 25 records).
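
The backward sibling walk described for `_merge_multiline_l0_title` might look roughly like the sketch below. It is a simplified illustration under assumptions: records are plain dicts, the continuation predicate is passed in as `looks_like_continuation`, and the function name `merge_title_continuations` is hypothetical.

```python
def merge_title_continuations(records, l0, looks_like_continuation):
    """Prepend qualifying preceding siblings to the L0 title (illustrative sketch)."""
    siblings = sorted(
        (
            r for r in records
            if r is not l0
            and r.get("parent_node_id") == l0.get("parent_node_id")
            and r["node_id"] < l0["node_id"]
        ),
        key=lambda r: r["node_id"],
        reverse=True,  # nearest preceding sibling first
    )
    continuation = []
    for sib in siblings:
        if (sib.get("cls") or "") != "predicted header":
            break
        if (sib.get("body_direct") or "").strip():
            break  # a continuation line carries no body of its own
        if not looks_like_continuation((sib.get("title") or "").strip()):
            break
        continuation.append(sib)
    if not continuation:
        return l0
    continuation.reverse()  # restore source order
    parts = [(c.get("title") or "").strip() for c in continuation]
    l0["title"] = "\n".join([*parts, (l0.get("title") or "").strip()])
    for c in continuation:
        c["is_envelope"] = True  # hidden from JSONL, still available elsewhere
    return l0
```

The walk stops at the first sibling that fails any check, so only an unbroken run of title-shaped lines directly above the L0 is absorbed.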

coderabbitai · 2026-05-17T11:19:16Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added new contract amendment data to the frozen dataset.
Chores
- Enhanced document parsing pipeline with improved multi-line title detection and merging for agreement headers.
- Refined exhibit and schedule identifier processing to eliminate spurious page-header markers, improving document structure clarity.

Walkthrough

The PR enhances the document parsing pipeline with two new post-processing heuristics: one merges multi-line agreement titles by detecting uppercase-dominant continuations, and another strips spurious page-header exhibit identifiers before positional sorting. Both are integrated into parse_one() and validated by a new frozen amendment sample dataset.

Changes

Parsing Pipeline Enhancements

Layer / File(s)	Summary
Multi-line title continuation merging `scripts/parse_doc2dict_with_config.py`	Adds `_looks_like_title_continuation` predicate to identify uppercase, non-header agreement title fragments and the `_merge_multiline_l0_title` function to prepend those fragments to the L0 record while marking them as envelopes. Integrated before `_split_l0_title_from_preamble` in `parse_one`.
Page footer exhibit title stripping `scripts/parse_doc2dict_with_config.py`	Adds `_strip_page_footer_exhibit_titles` to detect and clear bare exhibit-style titles (e.g., "EXHIBIT 10.7") in non-envelope records with substantive bodies, preventing incorrect early positional anchoring. Integrated before `_sort_records_by_source_position` in `parse_one`.
Frozen amendment sample data `data/auto_parse/level_freeze/frozen/idx_7.jsonl`, `data/auto_parse/level_freeze/state.json`	Adds a new JSONL sample document (idx: 7, 25 records) containing an amendment to a crude tall oil and black liquor soap skimmings agreement, with state tracking recording the freeze event at timestamp `2026-05-17T07:10:24`.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

arthrod/clause-extract#12: Modifies the same parse_doc2dict_with_config.py file to merge multi-line agreement L0 title fragments before fallback logic.
arthrod/clause-extract#18: Adjusts title-fragment L0 merging and strips SEC page-header exhibit identifiers in the same post-processing stage.
arthrod/clause-extract#20: Adds post-processing section heuristics and envelope handling into the parse_one pipeline around title-fragment merging.

Suggested Labels

Feat2

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: freezing idx=7 (an Ingevity/WestRock amendment) and the two key parser improvements (multi-line title merging and page-footer exhibit identifier stripping).
Description check	✅ Passed	The description comprehensively covers the changeset: documents the new idx=7 baseline, explains both parser heuristics in detail with line references, provides verification metrics (25 records, 97.0% word coverage), documents test results, and explains the impact for the broader corpus.
Docstring Coverage	✅ Passed	Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


gemini-code-assist

Code Review

This pull request introduces logic to handle multi-line agreement titles and spurious page-header exhibit titles in the document parsing pipeline. Specifically, it adds functions to detect and merge multi-line titles into a single L0 record and to strip bare exhibit identifiers from page footers to ensure correct sorting and retention of substantive body content. Additionally, it updates the frozen state and adds a new test record for document index 7. I have no feedback to provide as there were no review comments.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

data/auto_parse/level_freeze/frozen/idx_7.jsonl (1)
26-26: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Remove trailing empty line.

JSONL files should end after the last record without an additional empty line. The trailing newline serves no purpose and may cause parsing issues in tools that count lines.
♻️ Proposed fix
 {"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/auto_parse/level_freeze/frozen/idx_7.jsonl` at line 26, The file has an
extra trailing empty line after the final JSON record; remove that blank line so
the file ends immediately after the last JSON object (no additional empty
newline), ensuring the final JSONL record is the last line in the file.
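
Whether a trailing empty line matters depends on the consumer. As a small sketch (the helper name `read_jsonl` is hypothetical, not something in the repo), a reader that skips blank lines is unaffected either way:

```python
import json


def read_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines such as a trailing empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


sample = '{"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}\n\n'
rows = read_jsonl(sample)
```

Tools that count raw lines instead of parsed records would still see the extra line, which is the case the nitpick is concerned with.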

📥 Commits

Reviewing files that changed from the base of the PR and between 79dcef3 and ec36b24.

📒 Files selected for processing (3)

data/auto_parse/level_freeze/frozen/idx_7.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py

📜 Review details

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

scripts/parse_doc2dict_with_config.py

**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

scripts/parse_doc2dict_with_config.py

🪛 Ruff (0.15.12)

scripts/parse_doc2dict_with_config.py

[warning] 2343-2343: Too many return statements (8 > 6)

(PLR0911)

[warning] 2377-2379: Return the condition not stripped.endswith(('.', ':', ';', '?', '!')) directly

Replace with return not stripped.endswith(('.', ':', ';', '?', '!'))

(SIM103)

[warning] 2382-2382: Too many branches (13 > 12)

(PLR0912)

[warning] 2475-2475: Consider [*continuation_titles, existing_l0_title] instead of concatenation

Replace with [*continuation_titles, existing_l0_title]

(RUF005)

🔍 Remote MCP

PR Review Context Summary

Code Changes Analysis

File: scripts/parse_doc2dict_with_config.py [275 new lines added]

The PR introduces two complementary post-processing heuristics to the EX-10 document parsing pipeline:

_merge_multiline_l0_title() - Addresses multi-line agreement titles
- Problem solved: Agreement titles spanning multiple lines (typical in legal documents) were being treated as separate records
- Solution: Detects preceding sibling with UPPERCASE-dominant text marked as predicted-header
- Key details:
  - Runs BEFORE _split_l0_title_from_preamble() to process full merged title
  - Uses disqualifier regexes to avoid incorrectly merging section headers (ARTICLE, SECTION, EXHIBIT, WHEREAS patterns)
  - Marks merged continuation records as is_envelope=True to suppress duplicate JSONL output while preserving parquet records
- Validation: Tested against idx=7 document; title successfully merged
_strip_page_footer_exhibit_titles() - Removes spurious page headers
- Problem solved: Page-footer/header exhibit identifiers (e.g., "EXHIBIT 10.7") repeating across pages were being treated as high-priority title probes, causing incorrect document repositioning
- Solution: Clears bare-identifier titles on exhibits/schedules/appendices/annexes
- Key constraints:
  - Only targets non-envelope, non-subdoc records (preserves legitimate structural headers)
  - Requires substantive body content (≥60 characters) to avoid removing legitimate small artifacts
  - Runs BEFORE the source-position sorter so the system falls back to body content analysis
- Validation: Preserves idx=5 artifacts; removes spurious "Exhibit 10.7" records from idx=7

Data Changes

Files: data/auto_parse/level_freeze/ [33 lines total, +8/-1 in state.json]

idx_7.jsonl: 25 records representing the AMENDMENT document (Ingevity/WestRock, March 2017)
- Composition: L0:1 (root), L1:9, L2:12, L3:3 (max depth 3)
- Quality metrics: 97.0% word_coverage, 97.7% char_ratio (indicating accurate reconstruction)
state.json: Updated freeze history with new event tracking idx=7's 25-record baseline

Related PR Overlaps

The related PRs (#12, #15, #18, #20) all modify the same parse_doc2dict_with_config.py file with overlapping post-processing logic:

PR #12: Title-fragment merging (similar multiline title handling)
PR #15: Pipeline heuristics insertion around envelope/title handling
PR #18: Title merging + SEC page-header exhibit stripping (direct overlap)
PR #20: Envelope handling before L0/title merging

Observation: These PRs appear to be iterative refinements to the same parsing pipeline stage. Code review should verify:

Merge order consistency across related PRs to avoid conflicts
That heuristics don't duplicate or contradict each other
Whether these changes should be consolidated

Testing & Validation

Per PR description, validation completed:

✅ Parsing tests passed
✅ Freeze baseline tests passed
✅ Regression tests passed (idx=0..7)
✅ Inspector verification: multiline title merge confirmed, spurious "Exhibit 10.7" removed, idx=5 artifacts preserved
✅ Source document URL provided

🔇 Additional comments (2)

data/auto_parse/level_freeze/state.json (1)

10-11: LGTM!

Also applies to: 169-174
scripts/parse_doc2dict_with_config.py (1)
2317-2484: The coding guideline exempts private helper functions from strict test coverage requirements.

The new functions _merge_multiline_l0_title and _strip_page_footer_exhibit_titles are private (prefixed with _), not exported public APIs. The guideline states: "Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods" — private functions are explicitly outside this requirement. Additionally, scripts/parse_doc2dict_with_config.py is a utility script file (not part of the main clause_extract package), and the functions are called from parse_one (lines 4350, 4409), providing indirect integration coverage.
> Likely an incorrect or invalid review comment.

coderabbitai · 2026-05-17T11:22:22Z

+def _looks_like_title_continuation(title: str) -> bool:
+    """Return True if `title` looks like an upper-line of a multi-line
+    agreement title (not a section heading, not a body fragment).
+
+    A title-continuation line:
+      - Has alphabetic content (at least one word of 2+ letters).
+      - Is predominantly uppercase letters (≥ 60% of alphabetic chars
+        are uppercase) — agreement titles are typeset in ALL CAPS.
+      - Doesn't match any structural-header / section / recital /
+        signature pattern.
+      - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`)
+        — agreement titles usually break mid-phrase across lines, often
+        ending with conjunctions ("AND", "OF") or nouns.
+    """
+    if not title:
+        return False
+    stripped = title.strip()
+    if not stripped:
+        return False
+    # Must not match any structural / section / recital / signature pattern.
+    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
+        return False
+    # Must have at least one alphabetic word of 2+ letters.
+    if not re.search(r"[A-Za-z]{2,}", stripped):
+        return False
+    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
+    alpha = [c for c in stripped if c.isalpha()]
+    if not alpha:
+        return False
+    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
+    if upper_ratio < 0.60:
+        return False
+    # Must not end with a sentence-terminating punctuation. Title lines
+    # are noun-phrases that wrap visually; bodies end with periods.
+    if stripped.endswith(('.', ':', ';', '?', '!')):
+        return False
+    return True
+


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider simplifying the final return statement.

The function has a straightforward boolean condition that can be returned directly.

♻️ Proposed simplification

-    if stripped.endswith(('.', ':', ';', '?', '!')):
-        return False
-    return True
+    return not stripped.endswith(('.', ':', ';', '?', '!'))

Suggested change

def _looks_like_title_continuation(title: str) -> bool:
    """Return True if `title` looks like an upper-line of a multi-line
    agreement title (not a section heading, not a body fragment).

    A title-continuation line:
      - Has alphabetic content (at least one word of 2+ letters).
      - Is predominantly uppercase letters (≥ 60% of alphabetic chars
        are uppercase) — agreement titles are typeset in ALL CAPS.
      - Doesn't match any structural-header / section / recital /
        signature pattern.
      - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`)
        — agreement titles usually break mid-phrase across lines, often
        ending with conjunctions ("AND", "OF") or nouns.
    """
    if not title:
        return False
    stripped = title.strip()
    if not stripped:
        return False
    # Must not match any structural / section / recital / signature pattern.
    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
        return False
    # Must have at least one alphabetic word of 2+ letters.
    if not re.search(r"[A-Za-z]{2,}", stripped):
        return False
    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
    alpha = [c for c in stripped if c.isalpha()]
    if not alpha:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    if upper_ratio < 0.60:
        return False
    # Must not end with a sentence-terminating punctuation. Title lines
    # are noun-phrases that wrap visually; bodies end with periods.
    return not stripped.endswith(('.', ':', ';', '?', '!'))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 2343 - 2380, The function _looks_like_title_continuation contains multiple early-return checks that can be collapsed into a single boolean expression; replace the sequence of if-blocks with one return combining: non-empty stripped, not matching _TITLE_CONTINUATION_DISQUALIFIERS, re.search(r"[A-Za-z]{2,}", stripped), existence of alphabetic chars and uppercase ratio >= 0.60 (compute upper_ratio from alphabetic chars), and not stripped.endswith(('.', ':', ';', '?', '!')) so the function returns that combined boolean directly while preserving the same helper names (stripped, _TITLE_CONTINUATION_DISQUALIFIERS, alpha/upper_ratio logic).
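
For reference, the single-expression form that this prompt asks for could be sketched as below. The disqualifier regex here is an abbreviated, hypothetical stand-in (the PR's real `_TITLE_CONTINUATION_DISQUALIFIERS` covers more patterns), and the function drops the leading underscore to mark it as an illustration rather than the PR's code.

```python
import re

# Hypothetical, abbreviated disqualifier; the PR's real regex covers more patterns.
_TITLE_CONTINUATION_DISQUALIFIERS = re.compile(
    r"^(ARTICLE|SECTION|EXHIBIT|SCHEDULE|APPENDIX|ANNEX|WITNESSETH|WHEREAS|RECITALS)\b",
    re.IGNORECASE,
)


def looks_like_title_continuation(title: str) -> bool:
    """Collapsed version of the predicate: one boolean expression, same checks."""
    stripped = (title or "").strip()
    alpha = [c for c in stripped if c.isalpha()]
    return bool(
        stripped
        and not _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped)
        and re.search(r"[A-Za-z]{2,}", stripped)
        and alpha  # short-circuits before the division below
        and sum(c.isupper() for c in alpha) / len(alpha) >= 0.60
        and not stripped.endswith((".", ":", ";", "?", "!"))
    )
```

The collapsed form clears the return-count warning, though the early-return version arguably reads more easily with its per-check comments; which to prefer is a style call.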

coderabbitai · 2026-05-17T11:22:23Z

+    continuation.reverse()  # source order (oldest first)
+    continuation_titles = [(c.get("title") or "").strip() for c in continuation]
+    existing_l0_title = (l0.get("title") or "").strip()
+    combined = "\n".join(continuation_titles + [existing_l0_title])


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider using unpacking instead of concatenation.

Python unpacking is more idiomatic and slightly more efficient for list construction.

♻️ Proposed refactor

-    combined = "\n".join(continuation_titles + [existing_l0_title])
+    combined = "\n".join([*continuation_titles, existing_l0_title])

Suggested change

combined = "\n".join(continuation_titles + [existing_l0_title])

combined = "\n".join([*continuation_titles, existing_l0_title])

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 2475-2475: Consider [*continuation_titles, existing_l0_title] instead of concatenation

Replace with [*continuation_titles, existing_l0_title]

(RUF005)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` at line 2475, The line building combined uses list concatenation; replace it with unpacking to be more idiomatic and efficient by constructing the join input as "\n".join([*continuation_titles, existing_l0_title]) — update the expression where combined is assigned (variable combined, and the identifiers continuation_titles and existing_l0_title) accordingly.
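A quick check that the two spellings build the same string. The title fragments are illustrative values only, not the actual split found in the idx=7 source.

```python
# Illustrative inputs; the real continuation lines come from sibling records.
continuation_titles = ["AMENDMENT NO.1 TO CRUDE TALL OIL AND"]
existing_l0_title = "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT"

# Current form: list concatenation.
via_concat = "\n".join(continuation_titles + [existing_l0_title])

# RUF005 form: iterable unpacking inside a list display.
via_unpack = "\n".join([*continuation_titles, existing_l0_title])

# Both build one new list and join it, so the result is identical.
assert via_concat == via_unpack
```

Either form leaves `continuation_titles` unmodified, so the choice is stylistic.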

codeant-ai · 2026-05-17T11:25:45Z

+        if (sib.get("cls") or "") != "predicted header":
+            break
+        if (sib.get("body_direct") or "").strip():
+            break
+        title = (sib.get("title") or "").strip()
+        if not _looks_like_title_continuation(title):
+            break
+        continuation.append(sib)


Suggestion: The title-continuation heuristic is broad enough to classify uppercase cover metadata (for example, registrant/company lines immediately before the agreement title) as a continuation line, so those rows can be merged into the L0 title and then hidden as envelope records. This corrupts the agreement title and prevents the normal pre-title cover cleanup from handling those records correctly. Tighten the continuation predicate to explicitly exclude company/party-name style lines (or require stronger title-shape signals than uppercase ratio alone). [incorrect condition logic]

Severity Level: Major ⚠️

- ❌ Agreement titles include registrant/company names for affected docs.
- ⚠️ Downstream consumers mis-index or group agreements by title.

Steps of Reproduction ✅

1. In a Python shell, import the helper from `scripts/parse_doc2dict_with_config.py` lines 103–204: `from scripts.parse_doc2dict_with_config import _merge_multiline_l0_title`.
2. Construct a `rows` list with at least two section dicts sharing the same `parent_node_id`: (a) an L0 agreement title record `l0` with `depth=0`, `scope="agreement"`, non-empty `title`, and `node_id=20`; (b) a preceding cover-metadata sibling with `node_id=19`, `cls="predicted header"`, `scope="agreement"`, `is_envelope=False`, empty `body_direct`, and an all-caps registrant-style title like `"INGEVITY CORPORATION"`.
3. Call `_merge_multiline_l0_title(rows)` (implementation at lines 138–204). In the backward sibling walk at lines 169–186, the cover line passes all guards: it is not envelope/trailer (lines 174–177), has `cls=="predicted header"` (line 179), has empty `body_direct` (line 181), and `_looks_like_title_continuation()` at lines 64–100 returns True because the title is all caps, not matched by `_TITLE_CONTINUATION_DISQUALIFIERS` (lines 41–60), and has no sentence-ending punctuation.
4. Observe that the function appends this cover line into the L0 title (lines 191–197) and marks the sibling `is_envelope=True` (lines 199–202). Downstream, `_drop_pre_title_cover_records()` at lines 3230–3256 skips envelope records and can no longer treat this row as cover metadata, so the final JSONL agreement title incorrectly contains the registrant/company line.

Prompt for AI Agent 🤖

This is a comment left during a code review.
**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 2458:2465
**Comment:** *Incorrect Condition Logic: The title-continuation heuristic is broad enough to classify uppercase cover metadata (for example, registrant/company lines immediately before the agreement title) as a continuation line, so those rows can be merged into the L0 title and then hidden as envelope records. This corrupts the agreement title and prevents the normal pre-title cover cleanup from handling those records correctly. Tighten the continuation predicate to explicitly exclude company/party-name style lines (or require stronger title-shape signals than uppercase ratio alone).*
Validate the correctness of the flagged issue. If it is correct, how can I resolve it? If you propose a fix, implement it and keep it concise. Once the fix is implemented, also check the other comments on the same PR and ask the user whether they want the rest fixed as well. If yes, fetch all the comments, validate their correctness, and implement a minimal fix.
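One possible shape for the tightened predicate, sketched under assumptions. The suffix list and the names `looks_like_party_name` and `collect_title_continuation` are hypothetical, and `title.isupper()` stands in for the real `_looks_like_title_continuation` check. This is not the fix adopted in the PR, only an illustration of the guard the comment asks for.

```python
import re

# ASSUMPTION: hypothetical corporate-suffix list, not taken from the parser.
_PARTY_NAME_SUFFIX = re.compile(
    r"\b(?:CORPORATION|CORP|INCORPORATED|INC|LLC|L\.L\.C|LTD|LIMITED"
    r"|LP|L\.P|COMPANY|CO|N\.A|PLC)\.?\s*$",
    re.IGNORECASE,
)


def looks_like_party_name(line: str) -> bool:
    """True when the line ends in a corporate suffix (registrant-style)."""
    return bool(_PARTY_NAME_SUFFIX.search(line.strip()))


def collect_title_continuation(rows, l0):
    """Walk backward from the L0 record over same-parent siblings and
    collect uppercase header lines, stopping at the first non-match."""
    siblings = sorted(
        (
            r for r in rows
            if r.get("parent_node_id") == l0.get("parent_node_id")
            and r["node_id"] < l0["node_id"]
        ),
        key=lambda r: r["node_id"],
        reverse=True,
    )
    continuation = []
    for sib in siblings:
        title = (sib.get("title") or "").strip()
        if (sib.get("cls") or "") != "predicted header":
            break
        if (sib.get("body_direct") or "").strip():
            break
        # Extra guard: a party/registrant line ends the walk.
        if not title.isupper() or looks_like_party_name(title):
            break
        continuation.append(sib)
    continuation.reverse()  # source order
    return continuation


def _row(node_id, title):
    return {"node_id": node_id, "parent_node_id": 1,
            "cls": "predicted header", "body_direct": "", "title": title}


rows = [
    _row(18, "INGEVITY CORPORATION"),
    _row(19, "AMENDMENT NO.1 TO CRUDE TALL OIL AND"),
    _row(20, "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT"),
]
l0 = rows[-1]
merged_ids = [r["node_id"] for r in collect_title_continuation(rows, l0)]
```

With these illustrative rows only node 19 is collected, so the registrant line stays available to the pre-title cover cleanup. A suffix guard would also reject a genuine title line that ends in a company name, so it would need the cover-page test case mentioned in the comment before being relied on.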

codeant-ai · 2026-05-17T11:25:50Z

CodeAnt AI finished reviewing your PR.

arthrod · 2026-05-17T11:55:33Z

Triage agent — PR #80 comment review (read-only pass, no code changes)

3 inline comments reviewed:

coderabbitai @ parse_doc2dict...py:2380 — simplify boolean return (WONT-FIX)
Nitpick/Low value: returning `not stripped.endswith(...)` directly vs. an if/else. No semantic difference. Current form is equally readable. Skipping cosmetic refactors during active stacking.
coderabbitai @ parse_doc2dict...py:2475 — use list unpacking instead of concatenation (WONT-FIX)
Nitpick/Low value: [*list1, item] vs list1 + [item]. Both are idiomatic Python. No correctness or performance concern at the scale used here.
codeant-ai @ parse_doc2dict...py:2465 — title-continuation heuristic too broad, can absorb cover metadata (WILL-DEFER)
Legitimate concern: uppercase registrant/company lines appearing immediately before the agreement title could be classified as continuation lines and merged into the L0 title, then hidden as envelope records. This could corrupt L0. Needs a test case with a cover-page document before fixing. Deferred to polish PR.

WILL-DEFER items (1): Title-continuation heuristic — add guard to exclude registrant/company lines from L0 merge.

Triage only — no code changes made this round.

sourcery-ai Bot reviewed May 17, 2026

codeant-ai Bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) May 17, 2026

coderabbitai Bot added the Feat2 label May 17, 2026

gemini-code-assist Bot reviewed May 17, 2026

coderabbitai Bot requested changes May 17, 2026

codeant-ai Bot reviewed May 17, 2026

arthrod mentioned this pull request May 17, 2026

idx=8: freeze (26 records) — Allegiance Bancshares Restricted Stock Agreement template (IWW without /s/) #81

Open

4 tasks

qodo-code-review Bot commented May 17, 2026

Qodo reviews are paused for this user.
