idx=7: freeze (25 records) — Ingevity/WestRock Amendment No.1, multi-line title + page-footer exhibit strip #80

Open
arthrod wants to merge 1 commit into redo/idx-6 from redo/idx-7

Conversation

@arthrod
Owner

@arthrod arthrod commented May 17, 2026

User description

Summary

Eighth stacked PR. Adds idx=7 (AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR SOAP SKIMMINGS AGREEMENT between Ingevity Corporation and WestRock Shared Services / WestRock MWV, March 2017) as the eighth verified frozen baseline on top of idx=6 (PR #79).

Parser changes (2 surgical, shape-driven)

  1. _merge_multiline_l0_title (~line 2317) — handles multi-line agreement titles. When the L0 record has a preceding sibling (same parent_node_id, smaller node_id) that's an UPPERCASE predicted-header continuation line, merges them. Disqualifier regex blocks ARTICLE/SECTION/EXHIBIT/SCHEDULE/APPENDIX/ANNEX/WITNESSETH/WHEREAS/RECITALS/NOW THEREFORE/IN WITNESS WHEREOF/numbered/lettered patterns. Inspector noted doc2dict already pre-merged idx=7's title at node_id=3; the new pass mainly suppresses the duplicate first-line sibling for this idx. The more general pattern handles future cases where doc2dict fails to merge.

  2. _strip_page_footer_exhibit_titles (~line 3258) — handles doc2dict promoting repeating page-header tags ("Exhibit 10.7" stamped on every page) into spurious cls=exhibit records. Without the fix, those bare-identifier titles match the FIRST source occurrence (the cover-page header) via the sort probe, dragging records to document start where they get dropped as pre-title chrome. Fix: clears the title on cls in {exhibit, schedule, appendix, annex} records that are NOT is_envelope, NOT real subdocs (per _is_real_subdoc_title), have a bare-identifier title, AND have substantive body (≥60 chars). The 60-char threshold protects idx=5's tiny "Schedule 1.1" + "Ex. B-98" trailing artifact.
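The first pass above (title merge) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the record shape (dicts with `level`, `node_id`, `parent_node_id`, `title`, `body`), the function name `merge_title_continuations`, and the simplified `is_upper_title` predicate are all assumptions. It shows the general case where the first title line arrives as a separate sibling.

```python
def is_upper_title(title):
    """Hypothetical stand-in for the PR's continuation predicate."""
    text = title.strip()
    return text.isupper() and not text.endswith((".", ":", ";", "?", "!"))


def merge_title_continuations(records, looks_like_continuation):
    """Prepend qualifying preceding siblings to the L0 title, in source order."""
    l0 = next((r for r in records if r.get("level") == 0), None)
    if l0 is None:
        return records
    # Preceding siblings: same parent, smaller node_id, nearest first.
    siblings = sorted(
        (
            r for r in records
            if r is not l0
            and r.get("parent_node_id") == l0.get("parent_node_id")
            and r["node_id"] < l0["node_id"]
        ),
        key=lambda r: r["node_id"],
        reverse=True,
    )
    continuation = []
    for sib in siblings:
        if sib.get("body") or not looks_like_continuation(sib.get("title") or ""):
            break  # stop at the first line that is not a bare title fragment
        continuation.append(sib)
    continuation.reverse()  # restore source order (oldest first)
    if continuation:
        titles = [c["title"].strip() for c in continuation]
        l0["title"] = "\n".join([*titles, l0["title"].strip()])
        for c in continuation:
            c["is_envelope"] = True  # suppress the duplicate in JSONL output
    return records


records = [
    {"node_id": 2, "parent_node_id": 1,
     "title": "AMENDMENT NO.1 TO CRUDE TALL OIL AND", "body": ""},
    {"node_id": 3, "parent_node_id": 1, "level": 0,
     "title": "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT", "body": ""},
]
merge_title_continuations(records, is_upper_title)
print(records[1]["title"])
```

Walking backwards from the L0 and stopping at the first non-matching sibling is a design assumption here; it keeps body lines that happen to be uppercase but sit further up the page from being absorbed.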

Verified output for idx=7

  • 25 records, distribution {L0:1, L1:9, L2:12, L3:3}
  • Reconstruction: word_coverage 97.0%, char_ratio 97.7%
  • Max depth: 3
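The PR does not define `word_coverage` or `char_ratio`, so the following is only one plausible reading, offered as a sketch: word coverage as the fraction of source words (counted with multiplicity) that reappear in the reconstruction, and char ratio as reconstructed length over source length. Both function bodies and the sample strings are assumptions for illustration.

```python
from collections import Counter


def word_coverage(source: str, reconstructed: str) -> float:
    """Fraction of source words (with multiplicity) present in the reconstruction."""
    src = Counter(source.split())
    rec = Counter(reconstructed.split())
    total = sum(src.values())
    if total == 0:
        return 1.0
    covered = sum(min(count, rec[word]) for word, count in src.items())
    return covered / total


def char_ratio(source: str, reconstructed: str) -> float:
    """Length of the reconstruction relative to the source."""
    return len(reconstructed) / len(source) if source else 1.0


# Placeholder strings: the source carries a page-header tag the output drops.
source = "Exhibit 10.7 AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR"
recon = "AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR"
print(round(word_coverage(source, recon), 3), round(char_ratio(source, recon), 3))
```

Under this reading, values slightly below 100% are expected whenever envelope or page-chrome text is deliberately excluded from the output.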

Top-level structure (verbatim L0 + sig area)

o=0  L0: AMENDMENT NO.1 TO CRUDE TALL OIL AND
         BLACK LIQUOR SOAP SKIMMINGS AGREEMENT
o=1  L1: This Amendment No.1 (this "Amendment") to the Supply Agreement, dated as of March 1, 2017...
...
o=12 L1: IN WITNESS WHEREOF...                              (IWW operating clause)
o=13-24 L2: INGEVITY CORPORATION / WESTROCK SHARED / SERVICES, LLC / By:_/S/ S. Edward Woodcock, Jr. / By:_/S/ John D. Stakel / Name: ... / Title: ... / Date: March 1, 2017 / Date: March 8, 2017 / WESTROCK MWV, LLC (doc2dict natural per-line grouping)

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 8 --no-truncate --output-dir data/auto_parse exits 0 with ok 8
  • uv run scripts/level_loop/freeze.py 7 --force reports word_coverage ≥ 90% (97.0%)
  • uv run scripts/level_loop/regress.py reports all 8 frozen idxs OK
  • Inspector verified multi-line title verbatim, no spurious "Exhibit 10.7" records, idx=5 trailing artifact preserved (60-char threshold works), idx=2 real subdocs unaffected

Source

http://www.sec.gov/Archives/edgar/data/1653477/000165347717000051/ex107amendmentno1toctoagre.htm

Why this matters for the corpus

  • Multi-line title support: many SEC EX-10 filings have titles that wrap across two lines. The merge pass handles both cases — when doc2dict succeeds at merging (and we just need to drop the duplicate sibling) and when doc2dict fails (and we need to do the merge ourselves).
  • Page-footer exhibit-tag strip: SEC filings that stamp the exhibit identifier ("Exhibit 10.7") on every page get this tag promoted by doc2dict into spurious cls=exhibit sections with substantive body. Without the strip, those records get sort-ordered to document start and dropped as pre-title chrome — losing the body content. The strip clears the title while preserving the body at its true position.
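The failure mode described above follows from first-occurrence matching. A minimal sketch, assuming a sorter that probes the source text with the title first and the body second (the `locate` helper and the placeholder page text are hypothetical, not the PR's `_sort_records_by_source_position`):

```python
# Placeholder source: the exhibit tag is stamped at the top of each page.
source = (
    "Exhibit 10.7\n"
    "first page body text\n"
    "Exhibit 10.7\n"
    "second page body text\n"
)


def locate(record, text):
    """Probe by title, then body; return the offset of the first match."""
    for probe in (record.get("title"), record.get("body")):
        if probe:
            pos = text.find(probe)  # str.find returns the FIRST occurrence
            if pos != -1:
                return pos
    return -1


rec = {"title": "Exhibit 10.7", "body": "second page body text"}
before = locate(rec, source)   # title probe anchors at the cover-page tag
rec["title"] = ""              # what the strip pass does
after = locate(rec, source)    # body probe finds the true position
print(before, after)
```

With the title cleared, the record sorts at its body's position instead of at offset 0, so it is no longer mistaken for pre-title chrome.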

🤖 Generated with Claude Code


CodeAnt-AI Description

Freeze idx=7: preserve a multi-line agreement title and keep exhibit pages in the right place

What Changed

  • Multi-line agreement titles now stay together as one top-level title, instead of splitting the first line into a separate section
  • Repeating page-header exhibit labels no longer pull exhibit pages to the start of the document, so their body text stays in place
  • Added a frozen baseline for idx=7

Impact

✅ Correct agreement titles
✅ Fewer lost exhibit sections
✅ More complete document extractions

…line L0 title merge + page-footer exhibit-title strip

idx=7 is an Amendment No.1 to the Crude Tall Oil and Black Liquor Soap
Skimmings Agreement between Ingevity Corp (Buyer) and WestRock (Seller,
two affiliates). Title is typeset across two visual lines:

    AMENDMENT NO.1 TO CRUDE TALL OIL AND
    BLACK LIQUOR SOAP SKIMMINGS AGREEMENT

Two parser additions, both purely structural (SHAPE-based, no phrase
blocklists):

1. `_merge_multiline_l0_title` (new): when the L0 has a preceding
   sibling at the same parent (same parent_node_id, smaller node_id)
   that is a predicted-header with empty body, an UPPERCASE title (≥60%
   uppercase alphabetic chars), no structural-header / section /
   recital / signature pattern, and no sentence-terminator, it is a
   title-continuation line. Merge it into the L0 title (continuation
   first, L0 second) joined by newline; mark the continuation record
   as is_envelope=True so it drops from JSONL but stays in parquet.
   Runs BEFORE `_split_l0_title_from_preamble` so the preamble split
   sees the complete title. The disqualifier regex blocks ARTICLE /
   SECTION / EXHIBIT / WHEREAS / numbered-marker / lettered-marker /
   etc. patterns so body lines after the title aren't absorbed.

2. `_strip_page_footer_exhibit_titles` (new): doc2dict's HTML walker
   promotes each repeating page-header tag ("Exhibit 10.7" stamped on
   every page) into its own cls=exhibit section. The FIRST one is the
   legitimate SEC envelope (is_envelope=True already). Subsequent ones
   carry substantive body content from the page they head but their
   title is the bare exhibit identifier — which becomes the highest-
   priority probe in `_sort_records_by_source_position` and matches
   the FIRST source occurrence, dragging the record to the document
   start where `_drop_pre_title_position_records` would drop it as
   pre-title chrome. Fix: clear the title on cls=exhibit/schedule/
   appendix/annex records that are not is_envelope, not real subdocs
   (per `_is_real_subdoc_title`), have a bare-identifier title, and
   have substantive body (≥60 chars). The sorter falls back to the
   body probe and locates the record at its true source position.
   The 60-char threshold protects records like idx=5's "Schedule 1.1"
   + " \\n \\nEx. B-98" body (page chrome, not substantive) from being
   stripped.
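The strip condition in item 2 can be sketched as a single predicate. This is an illustration under assumptions: the record shape, the `BARE_IDENTIFIER` regex, and the function name are hypothetical, and the `_is_real_subdoc_title` check from the real implementation is not reproduced.

```python
import re

# Hypothetical bare-identifier pattern: "Exhibit 10.7", "Schedule 1.1", "Annex A".
BARE_IDENTIFIER = re.compile(
    r"^(exhibit|schedule|appendix|annex)\s+[A-Za-z0-9][A-Za-z0-9.\-]*$",
    re.IGNORECASE,
)
STRIP_CLASSES = {"exhibit", "schedule", "appendix", "annex"}
MIN_BODY_CHARS = 60  # threshold quoted in the commit message


def strip_bare_exhibit_titles(records):
    """Clear bare-identifier titles on non-envelope records with substantive body."""
    for rec in records:
        title = (rec.get("title") or "").strip()
        body = (rec.get("body") or "").strip()
        if (
            rec.get("cls") in STRIP_CLASSES
            and not rec.get("is_envelope")
            and BARE_IDENTIFIER.match(title)
            and len(body) >= MIN_BODY_CHARS
        ):
            rec["title"] = ""  # sorter falls back to the body probe
    return records


records = [
    # Legitimate SEC envelope: left alone.
    {"cls": "exhibit", "title": "Exhibit 10.7", "body": "", "is_envelope": True},
    # Repeated page tag with real body text: title cleared.
    {"cls": "exhibit", "title": "Exhibit 10.7", "body": "x" * 80, "is_envelope": False},
    # idx=5-style page chrome: body too short, title preserved.
    {"cls": "schedule", "title": "Schedule 1.1", "body": " \n \nEx. B-98", "is_envelope": False},
]
strip_bare_exhibit_titles(records)
print([r["title"] for r in records])
```

Measuring the body after stripping whitespace is a choice made for this sketch; the commit message only states the 60-character threshold.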

Runs:
- `_merge_multiline_l0_title` before `_split_l0_title_from_preamble`.
- `_strip_page_footer_exhibit_titles` after `_explode_signature_block_lines`
  and before `_sort_records_by_source_position`.

idx=7 final stats:
- 25 records (1 L0, 9 L1, 12 L2, 3 L3)
- reconstruction: word_coverage 97.0%, char_ratio 97.7% (≥ 90% bar)
- L0 = "AMENDMENT NO.1 TO CRUDE TALL OIL AND\\nBLACK LIQUOR SOAP SKIMMINGS AGREEMENT"

Regression: idx=0..7 all OK (75, 532, 422, 102, 77, 18, 69, 25 records).

@coderabbitai

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added new contract amendment data to the frozen dataset.
  • Chores

    • Enhanced document parsing pipeline with improved multi-line title detection and merging for agreement headers.
    • Refined exhibit and schedule identifier processing to eliminate spurious page-header markers, improving document structure clarity.

Walkthrough

The PR enhances the document parsing pipeline with two new post-processing heuristics: one merges multi-line agreement titles by detecting uppercase-dominant continuations, and another strips spurious page-header exhibit identifiers before positional sorting. Both are integrated into parse_one() and validated by a new frozen amendment sample dataset.

Changes

Parsing Pipeline Enhancements

| Layer / File(s) | Summary |
| --- | --- |
| Multi-line title continuation merging (scripts/parse_doc2dict_with_config.py) | Adds _looks_like_title_continuation predicate to identify uppercase, non-header agreement title fragments and the _merge_multiline_l0_title function to prepend those fragments to the L0 record while marking them as envelopes. Integrated before _split_l0_title_from_preamble in parse_one. |
| Page footer exhibit title stripping (scripts/parse_doc2dict_with_config.py) | Adds _strip_page_footer_exhibit_titles to detect and clear bare exhibit-style titles (e.g., "EXHIBIT 10.7") in non-envelope records with substantive bodies, preventing incorrect early positional anchoring. Integrated before _sort_records_by_source_position in parse_one. |
| Frozen amendment sample data (data/auto_parse/level_freeze/frozen/idx_7.jsonl, data/auto_parse/level_freeze/state.json) | Adds a new JSONL sample document (idx: 7, 25 records) containing an amendment to a crude tall oil and black liquor soap skimmings agreement, with state tracking recording the freeze event at timestamp 2026-05-17T07:10:24. |

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • arthrod/clause-extract#12: Modifies the same parse_doc2dict_with_config.py file to merge multi-line agreement L0 title fragments before fallback logic.
  • arthrod/clause-extract#18: Adjusts title-fragment L0 merging and strips SEC page-header exhibit identifiers in the same post-processing stage.
  • arthrod/clause-extract#20: Adds post-processing section heuristics and envelope handling into the parse_one pipeline around title-fragment merging.

Suggested Labels

Feat2

Poem

🐰 A parsing pipeline grows tall,

With titles merged and headers fall,

The amendments now sing as one whole,

No fragments lost, no spurious toll—

The frozen sample shows it all!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: freezing idx=7 (an Ingevity/WestRock amendment) and the two key parser improvements (multi-line title merging and page-footer exhibit identifier stripping). |
| Description check | ✅ Passed | The description comprehensively covers the changeset: documents the new idx=7 baseline, explains both parser heuristics in detail with line references, provides verification metrics (25 records, 97.0% word coverage), documents test results, and explains the impact for the broader corpus. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces logic to handle multi-line agreement titles and spurious page-header exhibit titles in the document parsing pipeline. Specifically, it adds functions to detect and merge multi-line titles into a single L0 record and to strip bare exhibit identifiers from page footers to ensure correct sorting and retention of substantive body content. Additionally, it updates the frozen state and adds a new test record for document index 7. I have no feedback to provide as there were no review comments.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data/auto_parse/level_freeze/frozen/idx_7.jsonl (1)

26-26: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Remove trailing empty line.

JSONL files should end after the last record, without an additional empty line. The extra blank line serves no purpose and may cause parsing issues in tools that treat every line as a record or that count lines.

♻️ Proposed fix
 {"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}
-
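Whether the blank line actually causes trouble depends on the reader. As a hedged illustration (the `read_jsonl` helper is hypothetical and not part of this repository), a reader that skips blank lines is unaffected, whereas calling `json.loads` on every raw line would raise on the empty one:

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed object per non-blank line."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


# Final record from the proposed fix above, followed by a trailing blank line.
data = io.StringIO(
    '{"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}\n\n'
)
rows = list(read_jsonl(data))
print(len(rows), rows[0]["span"])
```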
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/auto_parse/level_freeze/frozen/idx_7.jsonl` at line 26, The file has an
extra trailing empty line after the final JSON record; remove that blank line so
the file ends immediately after the last JSON object (no additional empty
newline), ensuring the final JSONL record is the last line in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Line 2475: The line building combined uses list concatenation; replace it with
unpacking to be more idiomatic and efficient by constructing the join input as
"\n".join([*continuation_titles, existing_l0_title]) — update the expression
where combined is assigned (variable combined, and the identifiers
continuation_titles and existing_l0_title) accordingly.
- Around line 2343-2380: The function _looks_like_title_continuation contains
multiple early-return checks that can be collapsed into a single boolean
expression; replace the sequence of if-blocks with one return combining:
non-empty stripped, not matching _TITLE_CONTINUATION_DISQUALIFIERS,
re.search(r"[A-Za-z]{2,}", stripped), existence of alphabetic chars and
uppercase ratio >= 0.60 (compute upper_ratio from alphabetic chars), and not
stripped.endswith(('.', ':', ';', '?', '!')) so the function returns that
combined boolean directly while preserving the same helper names (stripped,
_TITLE_CONTINUATION_DISQUALIFIERS, alpha/upper_ratio logic).

---

Outside diff comments:
In `@data/auto_parse/level_freeze/frozen/idx_7.jsonl`:
- Line 26: The file has an extra trailing empty line after the final JSON
record; remove that blank line so the file ends immediately after the last JSON
object (no additional empty newline), ensuring the final JSONL record is the
last line in the file.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 96eabe5a-b4a5-466a-935f-232d64ef6cf3

📥 Commits

Reviewing files that changed from the base of the PR and between 79dcef3 and ec36b24.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_7.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py

[warning] 2343-2343: Too many return statements (8 > 6)

(PLR0911)


[warning] 2377-2379: Return the condition not stripped.endswith(('.', ':', ';', '?', '!')) directly

Replace with return not stripped.endswith(('.', ':', ';', '?', '!'))

(SIM103)


[warning] 2382-2382: Too many branches (13 > 12)

(PLR0912)


[warning] 2475-2475: Consider [*continuation_titles, existing_l0_title] instead of concatenation

Replace with [*continuation_titles, existing_l0_title]

(RUF005)

🔍 Remote MCP

Based on the available information and the tool limitations encountered, let me provide you with a focused analysis of the PR review context:

PR Review Context Summary

Code Changes Analysis

File: scripts/parse_doc2dict_with_config.py [275 new lines added]

The PR introduces two complementary post-processing heuristics to the EX-10 document parsing pipeline:

  1. _merge_multiline_l0_title() - Addresses multi-line agreement titles

    • Problem solved: Agreement titles spanning multiple lines (typical in legal documents) were being treated as separate records
    • Solution: Detects preceding sibling with UPPERCASE-dominant text marked as predicted-header
    • Key details:
      • Runs BEFORE _split_l0_title_from_preamble() to process full merged title
      • Uses disqualifier regexes to avoid incorrectly merging section headers (ARTICLE, SECTION, EXHIBIT, WHEREAS patterns)
      • Marks merged continuation records as is_envelope=True to suppress duplicate JSONL output while preserving parquet records
    • Validation: Tested against idx=7 document; title successfully merged
  2. _strip_page_footer_exhibit_titles() - Removes spurious page headers

    • Problem solved: Page-footer/header exhibit identifiers (e.g., "EXHIBIT 10.7") repeating across pages were being treated as high-priority title probes, causing incorrect document repositioning
    • Solution: Clears bare-identifier titles on exhibits/schedules/appendices/annexes
    • Key constraints:
      • Only targets non-envelope, non-subdoc records (preserves legitimate structural headers)
      • Requires substantive body content (≥60 characters) to avoid removing legitimate small artifacts
      • Runs BEFORE the source-position sorter so the system falls back to body content analysis
    • Validation: Preserves idx=5 artifacts; removes spurious "Exhibit 10.7" records from idx=7

Data Changes

Files: data/auto_parse/level_freeze/ [33 lines total, +8/-1 in state.json]

  • idx_7.jsonl: 25 records representing the AMENDMENT document (Ingevity/WestRock, March 2017)
    • Composition: L0:1 (root), L1:9, L2:12, L3:3 (max depth 3)
    • Quality metrics: 97.0% word_coverage, 97.7% char_ratio (indicating accurate reconstruction)
  • state.json: Updated freeze history with new event tracking idx=7's 25-record baseline

Related PR Overlaps

The related PRs (#12, #15, #18, #20) all modify the same parse_doc2dict_with_config.py file with overlapping post-processing logic:

  • PR #12: Title-fragment merging (similar multiline title handling)
  • PR #15: Pipeline heuristics insertion around envelope/title handling
  • PR #18: Title merging + SEC page-header exhibit stripping (direct overlap)
  • PR #20: Envelope handling before L0/title merging

Observation: These PRs appear to be iterative refinements to the same parsing pipeline stage. Code review should verify:

  1. Merge order consistency across related PRs to avoid conflicts
  2. That heuristics don't duplicate or contradict each other
  3. Whether these changes should be consolidated

Testing & Validation

Per PR description, validation completed:

  • ✅ Parsing tests passed
  • ✅ Freeze baseline tests passed
  • ✅ Regression tests passed (idx=0..7)
  • ✅ Inspector verification: multiline title merge confirmed, spurious "Exhibit 10.7" removed, idx=5 artifacts preserved
  • ✅ Source document URL provided


🔇 Additional comments (2)
data/auto_parse/level_freeze/state.json (1)

10-11: LGTM!

Also applies to: 169-174

scripts/parse_doc2dict_with_config.py (1)

2317-2484: The coding guideline exempts private helper functions from strict test coverage requirements.

The new functions _merge_multiline_l0_title and _strip_page_footer_exhibit_titles are private (prefixed with _), not exported public APIs. The guideline states: "Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods" — private functions are explicitly outside this requirement. Additionally, scripts/parse_doc2dict_with_config.py is a utility script file (not part of the main clause_extract package), and the functions are called from parse_one (lines 4350, 4409), providing indirect integration coverage.

			> Likely an incorrect or invalid review comment.

Comment on lines +2343 to +2380
def _looks_like_title_continuation(title: str) -> bool:
    """Return True if `title` looks like an upper-line of a multi-line
    agreement title (not a section heading, not a body fragment).

    A title-continuation line:
    - Has alphabetic content (at least one word of 2+ letters).
    - Is predominantly uppercase letters (≥ 60% of alphabetic chars
      are uppercase), since agreement titles are typeset in ALL CAPS.
    - Doesn't match any structural-header / section / recital /
      signature pattern.
    - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`);
      agreement titles usually break mid-phrase across lines, often
      ending with conjunctions ("AND", "OF") or nouns.
    """
    if not title:
        return False
    stripped = title.strip()
    if not stripped:
        return False
    # Must not match any structural / section / recital / signature pattern.
    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
        return False
    # Must have at least one alphabetic word of 2+ letters.
    if not re.search(r"[A-Za-z]{2,}", stripped):
        return False
    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
    alpha = [c for c in stripped if c.isalpha()]
    if not alpha:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    if upper_ratio < 0.60:
        return False
    # Must not end with a sentence-terminating punctuation. Title lines
    # are noun-phrases that wrap visually; bodies end with periods.
    if stripped.endswith(('.', ':', ';', '?', '!')):
        return False
    return True


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider simplifying the final return statement.

The function has a straightforward boolean condition that can be returned directly.

♻️ Proposed simplification
-    if stripped.endswith(('.', ':', ';', '?', '!')):
-        return False
-    return True
+    return not stripped.endswith(('.', ':', ';', '?', '!'))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _looks_like_title_continuation(title: str) -> bool:
    """Return True if `title` looks like an upper-line of a multi-line
    agreement title (not a section heading, not a body fragment).

    A title-continuation line:
    - Has alphabetic content (at least one word of 2+ letters).
    - Is predominantly uppercase letters (≥ 60% of alphabetic chars
      are uppercase), since agreement titles are typeset in ALL CAPS.
    - Doesn't match any structural-header / section / recital /
      signature pattern.
    - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`);
      agreement titles usually break mid-phrase across lines, often
      ending with conjunctions ("AND", "OF") or nouns.
    """
    if not title:
        return False
    stripped = title.strip()
    if not stripped:
        return False
    # Must not match any structural / section / recital / signature pattern.
    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
        return False
    # Must have at least one alphabetic word of 2+ letters.
    if not re.search(r"[A-Za-z]{2,}", stripped):
        return False
    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
    alpha = [c for c in stripped if c.isalpha()]
    if not alpha:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    if upper_ratio < 0.60:
        return False
    # Must not end with a sentence-terminating punctuation. Title lines
    # are noun-phrases that wrap visually; bodies end with periods.
    return not stripped.endswith(('.', ':', ';', '?', '!'))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parse_doc2dict_with_config.py` around lines 2343 - 2380, The function
_looks_like_title_continuation contains multiple early-return checks that can be
collapsed into a single boolean expression; replace the sequence of if-blocks
with one return combining: non-empty stripped, not matching
_TITLE_CONTINUATION_DISQUALIFIERS, re.search(r"[A-Za-z]{2,}", stripped),
existence of alphabetic chars and uppercase ratio >= 0.60 (compute upper_ratio
from alphabetic chars), and not stripped.endswith(('.', ':', ';', '?', '!')) so
the function returns that combined boolean directly while preserving the same
helper names (stripped, _TITLE_CONTINUATION_DISQUALIFIERS, alpha/upper_ratio
logic).
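
A minimal sketch of the collapsed form this prompt describes, assuming a much smaller disqualifier regex than the real one. `_DISQUALIFIERS` and `_TERMINATORS` are hypothetical names introduced here; the script's own `_TITLE_CONTINUATION_DISQUALIFIERS` is not shown in this thread and covers more patterns.

```python
import re

# Hypothetical stand-in for _TITLE_CONTINUATION_DISQUALIFIERS; the real
# pattern lives in the script and blocks more header shapes than this.
_DISQUALIFIERS = re.compile(
    r"^(ARTICLE|SECTION|EXHIBIT|SCHEDULE|APPENDIX|ANNEX|WITNESSETH|WHEREAS|RECITALS)\b",
    re.IGNORECASE,
)
_TERMINATORS = (".", ":", ";", "?", "!")


def looks_like_title_continuation(title):
    """Single-expression form of the checks listed in the docstring."""
    stripped = (title or "").strip()
    alpha = [c for c in stripped if c.isalpha()]
    # `and` short-circuits, so the ratio is never computed on an empty list.
    return bool(
        stripped
        and not _DISQUALIFIERS.match(stripped)
        and re.search(r"[A-Za-z]{2,}", stripped)
        and alpha
        and sum(c.isupper() for c in alpha) / len(alpha) >= 0.60
        and not stripped.endswith(_TERMINATORS)
    )
```

This trades eight early returns for one expression. Whether that reads better than the commented step-by-step version is a judgment call.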

continuation.reverse() # source order (oldest first)
continuation_titles = [(c.get("title") or "").strip() for c in continuation]
existing_l0_title = (l0.get("title") or "").strip()
combined = "\n".join(continuation_titles + [existing_l0_title])

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider using unpacking instead of concatenation.

Python unpacking is more idiomatic and slightly more efficient for list construction.

♻️ Proposed refactor
-    combined = "\n".join(continuation_titles + [existing_l0_title])
+    combined = "\n".join([*continuation_titles, existing_l0_title])
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 2475-2475: Consider [*continuation_titles, existing_l0_title] instead of concatenation

Replace with [*continuation_titles, existing_l0_title]

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/parse_doc2dict_with_config.py` at line 2475, The line building
combined uses list concatenation; replace it with unpacking to be more idiomatic
and efficient by constructing the join input as "\n".join([*continuation_titles,
existing_l0_title]) — update the expression where combined is assigned (variable
combined, and the identifiers continuation_titles and existing_l0_title)
accordingly.
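
A quick check that the two spellings build the same string. The way the idx=7 title is split across two lines below is an assumption made for illustration.

```python
# Hypothetical split of the agreement title across two lines.
continuation_titles = ["AMENDMENT NO.1 TO CRUDE TALL OIL AND"]
existing_l0_title = "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT"

# Concatenation allocates a temporary list; unpacking builds the list in one step.
by_concatenation = "\n".join(continuation_titles + [existing_l0_title])
by_unpacking = "\n".join([*continuation_titles, existing_l0_title])
```

Both forms produce identical output, so RUF005 is purely a style preference here.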

Comment on lines +2458 to +2465
if (sib.get("cls") or "") != "predicted header":
break
if (sib.get("body_direct") or "").strip():
break
title = (sib.get("title") or "").strip()
if not _looks_like_title_continuation(title):
break
continuation.append(sib)

Suggestion: The title-continuation heuristic is broad enough to classify uppercase cover metadata (for example, registrant/company lines immediately before the agreement title) as a continuation line, so those rows can be merged into the L0 title and then hidden as envelope records. This corrupts the agreement title and prevents the normal pre-title cover cleanup from handling those records correctly. Tighten the continuation predicate to explicitly exclude company/party-name style lines (or require stronger title-shape signals than uppercase ratio alone). [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ Agreement titles include registrant/company names for affected docs.
- ⚠️ Downstream consumers mis-index or group agreements by title.
Steps of Reproduction ✅
1. In a Python shell, import the helper from `scripts/parse_doc2dict_with_config.py` lines
103–204: `from scripts.parse_doc2dict_with_config import _merge_multiline_l0_title`.

2. Construct a `rows` list with at least two section dicts sharing the same
`parent_node_id`: (a) an L0 agreement title record `l0` with `depth=0`,
`scope="agreement"`, non-empty `title`, and `node_id=20`; (b) a preceding cover-metadata
sibling with `node_id=19`, `cls="predicted header"`, `scope="agreement"`,
`is_envelope=False`, empty `body_direct`, and an all-caps registrant-style title like
`"INGEVITY CORPORATION"`.

3. Call `_merge_multiline_l0_title(rows)` (implementation at lines 138–204). In the
backward sibling walk at lines 169–186, the cover line passes all guards: it is not
envelope/trailer (lines 174–177), has `cls=="predicted header"` (line 179), has empty
`body_direct` (line 181), and `_looks_like_title_continuation()` at lines 64–100 returns
True because the title is all caps, not matched by `_TITLE_CONTINUATION_DISQUALIFIERS`
(lines 41–60), and has no sentence-ending punctuation.

4. Observe that the function appends this cover line into the L0 title (lines 191–197) and
marks the sibling `is_envelope=True` (lines 199–202). Downstream,
`_drop_pre_title_cover_records()` at lines 3230–3256 skips envelope records and can no
longer treat this row as cover metadata, so the final JSONL agreement title incorrectly
contains the registrant/company line.
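
The steps above can be sketched without importing the script. Both functions below are simplified stand-ins written for this illustration, not the script's implementations. The predicate omits the disqualifier regex, stopping the walk at an envelope record is an assumption, and `parent_node_id=1` is an arbitrary value.

```python
import re


def looks_like_title_continuation(title):
    """Stand-in predicate: uppercase-ratio and terminator checks only."""
    stripped = (title or "").strip()
    alpha = [c for c in stripped if c.isalpha()]
    return bool(
        re.search(r"[A-Za-z]{2,}", stripped)
        and sum(c.isupper() for c in alpha) / len(alpha) >= 0.60
        and not stripped.endswith((".", ":", ";", "?", "!"))
    )


def merge_multiline_l0_title(rows, l0):
    """Stand-in for the backward sibling walk quoted in this comment."""
    siblings = sorted(
        (r for r in rows
         if r["parent_node_id"] == l0["parent_node_id"]
         and r["node_id"] < l0["node_id"]),
        key=lambda r: r["node_id"],
        reverse=True,  # nearest preceding sibling first
    )
    continuation = []
    for sib in siblings:
        if sib.get("is_envelope"):  # assumption: envelope rows stop the walk
            break
        if (sib.get("cls") or "") != "predicted header":
            break
        if (sib.get("body_direct") or "").strip():
            break
        if not looks_like_title_continuation((sib.get("title") or "").strip()):
            break
        continuation.append(sib)
    continuation.reverse()  # back to source order
    titles = [(c.get("title") or "").strip() for c in continuation]
    l0["title"] = "\n".join([*titles, (l0.get("title") or "").strip()])
    for c in continuation:
        c["is_envelope"] = True  # merged lines are hidden as envelope records
    return l0


cover = {"node_id": 19, "parent_node_id": 1, "cls": "predicted header",
         "scope": "agreement", "is_envelope": False, "body_direct": "",
         "title": "INGEVITY CORPORATION"}
l0 = {"node_id": 20, "parent_node_id": 1, "depth": 0, "scope": "agreement",
      "title": "AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR SOAP SKIMMINGS AGREEMENT"}

merge_multiline_l0_title([cover, l0], l0)
print(repr(l0["title"]))
```

With these stand-ins the registrant line is prepended to the L0 title and the cover row is flagged as an envelope record, which is the behaviour the comment describes.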


Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 2458:2465
**Comment:**
	Incorrect Condition Logic: The title-continuation heuristic is broad enough to classify uppercase cover metadata (for example, registrant/company lines immediately before the agreement title) as a continuation line, so those rows can be merged into the L0 title and then hidden as envelope records. This corrupts the agreement title and prevents the normal pre-title cover cleanup from handling those records correctly. Tighten the continuation predicate to explicitly exclude company/party-name style lines (or require stronger title-shape signals than uppercase ratio alone).

Validate the correctness of the flagged issue. If it is correct, how can I resolve it? If you propose a fix, implement it and keep it concise.
Once the fix is implemented, also check the other comments on the same PR and ask the user whether they want the rest fixed as well. If yes, fetch all the comments, validate their correctness, and implement a minimal fix.

@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

@arthrod
Owner Author

arthrod commented May 17, 2026

Triage agent — PR #80 comment review (read-only pass, no code changes)

3 inline comments reviewed:

  1. coderabbitai @ parse_doc2dict...py:2380 — simplify boolean return (WONT-FIX)
    Nitpick/Low value: returning the condition not stripped.endswith(...) directly vs. an if/else. No semantic difference. Current form is equally readable. Skipping cosmetic refactors during active stacking.

  2. coderabbitai @ parse_doc2dict...py:2475 — use list unpacking instead of concatenation (WONT-FIX)
    Nitpick/Low value: [*list1, item] vs list1 + [item]. Both are idiomatic Python. No correctness or performance concern at the scale used here.

  3. codeant-ai @ parse_doc2dict...py:2465 — title-continuation heuristic too broad, can absorb cover metadata (WILL-DEFER)
    Legitimate concern: uppercase registrant/company lines appearing immediately before the agreement title could be classified as continuation lines and merged into the L0 title, then hidden as envelope records. This could corrupt L0. Needs a test case with a cover-page document before fixing. Deferred to polish PR.

WILL-DEFER items (1): Title-continuation heuristic — add guard to exclude registrant/company lines from L0 merge.

Triage only — no code changes made this round.
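
One possible shape for the deferred guard, as a sketch only. The suffix list and the names `_PARTY_NAME_SUFFIX` and `looks_like_party_name` are hypothetical, and the pattern would need tuning against a real cover-page fixture before landing.

```python
import re

# Hypothetical corporate-suffix list; not taken from the script.
_PARTY_NAME_SUFFIX = re.compile(
    r"\b(CORPORATION|CORP|INCORPORATED|INC|LLC|L\.L\.C|LTD|LIMITED|LP|L\.P|PLC|COMPANY|CO)\.?,?$",
    re.IGNORECASE,
)


def looks_like_party_name(title):
    """Return True when a line ends like a registrant/company name."""
    return bool(_PARTY_NAME_SUFFIX.search((title or "").strip()))
```

A continuation check could reject any sibling for which `looks_like_party_name` returns True. The trade-off is that an agreement title whose wrapped line legitimately ends in a company name would also be rejected, which is why a cover-page test case is needed first.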


Labels

Feat2 size:L This PR changes 100-499 lines, ignoring generated files
