idx=15: freeze (46 records) — Mast Therapeutics Separation Agreement (multi-line title + synthetic L0 swap)#88

Open
arthrod wants to merge 1 commit into redo/idx-14 from redo/idx-15

Conversation

@arthrod
Owner

@arthrod arthrod commented May 17, 2026

User description

Summary

Sixteenth stacked PR. Adds idx=15 (SEPARATION AGREEMENT AND GENERAL RELEASE OF CLAIMS between Brian M. Culley and Mast Therapeutics, Inc., April 10-13, 2017) as the sixteenth verified frozen baseline on top of idx=14 (PR #87).

This agreement has a quirky HTML structure that exposed 3 parser pathologies, all addressed surgically:

  • Title typeset across 3 centered-bold lines (SEPARATION AGREEMENT / AND / GENERAL RELEASE OF CLAIMS)
  • Standalone bold AGREEMENT body separator between RECITALS and operative sections that doc2dict promoted to L0 instead of the real title
  • Sig-block UP-walk could demote the synthetic separator to L2

Parser changes (3 surgical, shape-driven)

  1. _swap_synthetic_l0_with_real_title (new, ~lines 2411-2559): when L0 has a BARE single-word ^\s*(?:AGREEMENT|PLAN)\s*$ title, finds an earlier sibling in the same scope whose title matches ^.*\b(AGREEMENT|PLAN)\s*$ with a descriptive prefix; swaps depths (earlier→L0, synthetic→L1); sets _swapped_l0 and _synthetic_l0_separator markers. Predicate is shape-tight: silently no-ops if no descriptive earlier title exists or if subdoc_penalty/scope don't match. Inspector verified all prior 15 idxs have descriptive L0 — none fire the swap.

  2. Extended _merge_multiline_l0_title (~lines 2562-2727) with a FORWARD continuation walk gated by _swapped_l0. It collects trailing title-line siblings (AND, GENERAL RELEASE OF CLAIMS) and absorbs the preamble body so that _split_l0_title_from_preamble lifts it to L1. Normal multi-line titles (e.g. idx=7) still use the original backward walk.

  3. Guarded _explode_signature_block_lines UP-walk (~lines 4676-4682) against claiming the synthetic separator as a sig-block ancestor label. Uses parent.get("_synthetic_l0_separator") exact-match check.

All detection is SHAPE-based. No phrase blocklists.
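
The swap predicate from item 1 can be sketched as follows. This is a minimal illustration, not the code in scripts/parse_doc2dict_with_config.py: the flat list-of-dicts record shape, the `title`, `depth`, and `scope` keys, and the function name are assumptions made for the example. Only the two regexes and the `_swapped_l0` / `_synthetic_l0_separator` marker names come from the description above, and the subdoc_penalty check is left out.

```python
import re

# Patterns as quoted in the PR description.
BARE_RE = re.compile(r"^\s*(?:AGREEMENT|PLAN)\s*$")
DESCRIPTIVE_RE = re.compile(r"^.*\b(AGREEMENT|PLAN)\s*$")


def swap_synthetic_l0(records):
    """Swap a bare AGREEMENT/PLAN L0 with an earlier descriptive title.

    `records` is a hypothetical flat list of dicts in document order with
    `title`, `depth`, and `scope` keys. Returns True if a swap happened.
    """
    l0_pos = next((i for i, r in enumerate(records) if r["depth"] == 0), None)
    if l0_pos is None or not BARE_RE.match(records[l0_pos]["title"]):
        return False  # L0 is missing or already descriptive: silent no-op
    l0 = records[l0_pos]
    for cand in records[:l0_pos]:
        title = cand["title"]
        if (
            DESCRIPTIVE_RE.match(title)
            and not BARE_RE.match(title)  # must carry a descriptive prefix
            and cand["scope"] == l0["scope"]
        ):
            cand["depth"], l0["depth"] = 0, 1
            cand["_swapped_l0"] = True
            l0["_synthetic_l0_separator"] = True
            return True
    return False  # no descriptive earlier title: silent no-op
```

Because the first check rejects any L0 whose title is more than the bare word, a document with a descriptive L0 never reaches the candidate loop, which is the property the inspector relied on for the prior 15 idxs.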

Verified output for idx=15

  • 46 records, distribution {L0:1, L1:32, L2:13} (max depth 2)
  • Reconstruction: word_coverage 96.3%, char_ratio 99.5%

L0 (verbatim, multi-line)

SEPARATION AGREEMENT
AND
GENERAL RELEASE OF CLAIMS
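
A rough sketch of how a forward continuation walk could produce this three-line title. The record keys (`title`, `body`, `is_header`) and the stopping rule (stop after the first continuation line that carries body text) are assumptions for illustration; the PR only states that the walk is gated by `_swapped_l0`, collects trailing title lines, and absorbs the preamble body.

```python
def merge_forward_title(records, l0_pos):
    """Merge header siblings that follow a swapped L0 into its title.

    Hypothetical record shape: dicts with `title`, optional `body`, and an
    `is_header` flag. Returns a new list with the merged siblings removed.
    """
    l0 = records[l0_pos]
    if not l0.get("_swapped_l0"):
        return records  # gate: normal titles are left to the backward walk
    lines = [l0["title"]]
    end = l0_pos + 1
    while end < len(records) and records[end].get("is_header"):
        lines.append(records[end]["title"])
        body = records[end].get("body")
        end += 1
        if body:
            # Assumed rule: the last title line carries the preamble body,
            # which is moved onto L0 so a later pass can split it out.
            l0["body"] = body
            break
    l0["title"] = "\n".join(lines)
    return records[: l0_pos + 1] + records[end:]
```

In this sketch a swapped L0 followed by `AND` and `GENERAL RELEASE OF CLAIMS` ends up with the title `SEPARATION AGREEMENT\nAND\nGENERAL RELEASE OF CLAIMS`, and a record without the marker is returned untouched.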

Top structure

o=0  L0: SEPARATION AGREEMENT\nAND\nGENERAL RELEASE OF CLAIMS
o=1  L1: THIS SEPARATION AGREEMENT AND GENERAL RELEASE OF CLAIMS (hereinafter "Agreement") is entered into by...
o=2  L1: RECITALS
o=3-5 L1: A./B./C. recitals
o=6  L1: AGREEMENT                            ← demoted from L0 to L1 by swap fix
o=7-29 L1: 1.-15. numbered operative sections
o=30 L1: IN WITNESS WHEREOF, the undersigned have executed this Agreement on the dates shown below.
o=31-43 L2: employee + counter-sig blocks (Brian Culley + Mast Therapeutics, doc2dict natural grouping)
o=44 L1: - 5 -                                ← page footer leak (deferred to polish; 5 chars)
o=45 L1: Affirmation                          ← attached post-sig form (legitimate L1 sibling)
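
As a cross-check, the depth distribution reported above can be recomputed from the ordinal ranges in this outline. The tuple layout and function name are just a convenience for the example; the ranges themselves are copied from the listing.

```python
from collections import Counter

# (first ordinal, last ordinal, depth) ranges copied from the outline above.
OUTLINE = [
    (0, 0, 0), (1, 1, 1), (2, 2, 1), (3, 5, 1), (6, 6, 1),
    (7, 29, 1), (30, 30, 1), (31, 43, 2), (44, 44, 1), (45, 45, 1),
]


def depth_distribution(outline):
    """Count records per depth level from inclusive ordinal ranges."""
    counts = Counter()
    for first, last, depth in outline:
        counts[f"L{depth}"] += last - first + 1
    return dict(counts)


dist = depth_distribution(OUTLINE)
print(dist, sum(dist.values()))
# {'L0': 1, 'L1': 32, 'L2': 13} 46
```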

Risk assessment (inspector)

All 3 fixes well-scoped:

  • Swap predicate requires bare-AGREEMENT/PLAN AT L0 + descriptive earlier title + matching subdoc_penalty + agreement scope → silent no-op on every prior idx
  • Forward walk gated on _swapped_l0 marker → only fires after the swap → cannot affect normal multi-line titles
  • Sig-block guard uses exact-marker check → cannot bleed into unrelated branches
  • All 15 prior idxs byte-identical via shasum
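
The byte-identical check can be reproduced in Python along these lines. This is a sketch under assumptions: the PR used `shasum` and does not say which algorithm, so SHA-256 is chosen here, and the helper names and the `idx_*.jsonl` glob (modelled on the frozen file name in this PR) are illustrative.

```python
import hashlib
from pathlib import Path


def digest_files(directory, pattern="idx_*.jsonl"):
    """Map each matching file name to the SHA-256 hex digest of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(directory).glob(pattern))
    }


def changed_files(before, after):
    """Names whose digest differs, or that exist on only one side."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Taking one snapshot before the parser change and one after, an empty `changed_files` result for the prior idxs corresponds to the "byte-identical" claim.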

Known minor quirks (polish-deferred)

  • o=44 - 5 - page footer artifact (5 chars). Per rubric §"Common parser failure modes" this is technically out-of-scope chrome. Single-record defect, not blocking.

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 16 --no-truncate --output-dir data/auto_parse exits 0 with ok 16
  • uv run scripts/level_loop/freeze.py 15 --force reports word_coverage ≥ 90% (96.3%)
  • uv run scripts/level_loop/regress.py reports all 16 frozen idxs OK
  • Inspector verified L0 multi-line via verbatim span; all 3 fixes traced to specific line ranges; sig area structurally clean
  • Inspector verified all 15 prior idxs byte-identical (no regression from the new swap/forward/guard logic)

🤖 Generated with Claude Code


CodeAnt-AI Description

Correctly parse a separation agreement with a split title and internal section separator

What Changed

  • The agreement title now stays attached to the real opening title instead of being replaced by a standalone “AGREEMENT” section marker
  • Multi-line titles that continue after the first line are now combined in the right order, including cases where the last title line carries the opening preamble
  • The standalone “AGREEMENT” separator is kept as body content and is no longer treated as part of signature or heading hierarchy

Impact

✅ Accurate contract titles
✅ Fewer missing or misplaced opening sections
✅ Cleaner signature block extraction

… swap doc2dict synthetic single-word L0 ("AGREEMENT") with the earlier multi-line title carrier ("SEPARATION AGREEMENT / AND / GENERAL RELEASE OF CLAIMS"), enable forward continuation walk on swapped L0, mark synthetic separator so /s/ sig-block UP-walk skips it
@coderabbitai

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: da7a2cb5-43bc-4794-a00a-1f1d05a24bc2

📥 Commits

Reviewing files that changed from the base of the PR and between 6313c14 and 17b4135.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_15.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify that at least one corresponding test file exists in the repo with assertions covering the changed symbols. Fail if a changed production file has zero associated test files and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py

[warning] 2417-2417: Too many branches (13 > 12) (PLR0912)
[warning] 2704-2704: Consider iterable unpacking instead of concatenation. Replace with iterable unpacking (RUF005)
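
For context on RUF005: the flagged expression at line 2704 is not shown in this thread, so the following is a generic illustration of the rule rather than the actual code. The variable names are made up for the example.

```python
continuation = ["AND", "GENERAL RELEASE OF CLAIMS"]

# Pattern RUF005 reports: a list literal concatenated with another list.
merged_concat = ["SEPARATION AGREEMENT"] + continuation

# Suggested form: iterable unpacking inside a single list display.
merged_unpacked = ["SEPARATION AGREEMENT", *continuation]

assert merged_concat == merged_unpacked
```

Both forms build the same list; the unpacked form avoids creating the intermediate one-element list.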

🔍 Remote MCP

Summary of Gathered Context

Document Context: Mast Therapeutics Separation Agreement

The idx=15 document is a Separation Agreement and General Release of Claims between Mast Therapeutics, Inc. and employee Brian M. Culley, dated April 10–13, 2017 (as stated in the PR description). The termination was expected to occur on or about April 21, 2017, in connection with the closing of the acquisition of Savara Inc. by Mast. The acquisition was completed on April 27, 2017.

PR Code Changes Review

Three surgical parser fixes were introduced:

  1. _swap_synthetic_l0_with_real_title() function — Detects when doc2dict incorrectly promotes a bare single-word "AGREEMENT"/"PLAN" separator to depth-0 (L0), finds the real descriptive title appearing earlier, and swaps their depth assignments with internal flags (_swapped_l0 and _synthetic_l0_separator).

  2. Extended _merge_multiline_l0_title() — Now performs a forward continuation walk (gated by _swapped_l0 flag) to collect trailing title lines (e.g., "AND / GENERAL RELEASE OF CLAIMS") and absorbs preamble text, while preserving existing backward-walk behavior for normal multi-line titles.

  3. Guarded _explode_signature_block_lines() UP-walk — Explicitly skips synthetic AGREEMENT/PLAN separators (marked with _synthetic_l0_separator) during upward chain walks to prevent misclassification as signature-block party labels.
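
The guard in item 3 might look roughly like this. The node shape (dicts linked through a `parent` key) and the function name are assumptions for illustration; only the `_synthetic_l0_separator` marker name is taken from the PR.

```python
def label_candidates(node):
    """Collect ancestors eligible to act as a signature-block label.

    Hypothetical node shape: dicts with a `title` and a `parent` reference
    (None at the root). Ancestors flagged by the swap pass are skipped.
    """
    eligible = []
    parent = node.get("parent")
    while parent is not None:
        # Marker check: only nodes flagged as synthetic separators are skipped.
        if not parent.get("_synthetic_l0_separator"):
            eligible.append(parent)
        parent = parent.get("parent")
    return eligible
```

Because the check keys on a marker that only the swap pass sets, documents where no swap fired walk their ancestor chain exactly as before.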

Verification & Metrics

  • Output for idx=15: 46 records (1 L0, 32 L1, 13 L2) with maximum depth 2
  • Reconstruction quality: 96.3% word coverage, 99.5% character ratio
  • L0 title preserved verbatim:
    SEPARATION AGREEMENT
    AND
    GENERAL RELEASE OF CLAIMS
    
  • Test results: All 16 idx parsing tests pass; inspector verified prior 15 indices unchanged
  • Known minor issue: Page footer artifact (- 5 -) at record o=44 deferred to future polish

State & Freeze Updates

The state.json file was updated with:

  • New frozen array entry: 15
  • Three new freeze history events for idx=15 (timestamps: 2026-05-17T09:45:14, 09:57:58, 10:01:26)


🔇 Additional comments (4)
data/auto_parse/level_freeze/frozen/idx_15.jsonl (1)

1-46: LGTM!

data/auto_parse/level_freeze/state.json (2)

18-19: LGTM!


231-248: LGTM!

scripts/parse_doc2dict_with_config.py (1)

2403-2559: LGTM!

Also applies to: 2628-2725, 4676-4682, 5032-5052


📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved parsing accuracy for legal agreements with complex layouts, including those with centered section separators and multi-line title sections.
    • Enhanced signature block extraction to correctly identify document structure elements.
  • Chores

    • Updated internal dataset records and pipeline state management.

Walkthrough

This PR introduces a new parsing pass that detects and normalizes a layout pathology where doc2dict incorrectly promotes a bare single-word AGREEMENT/PLAN separator to L0 depth, while the real multi-line agreement title appears earlier. The fix swaps their depth assignments, extends the merge logic to collect forward continuation lines, updates downstream signature-block logic to skip synthetic separators, and adds a frozen output dataset for idx_15.

Changes

Synthetic Agreement Title Normalization

| Layer / File(s) | Summary |
| --- | --- |
| Synthetic L0 Detection and Swap Implementation (scripts/parse_doc2dict_with_config.py) | Regex _BARE_AGREEMENT_PLAN_TITLE_RE identifies bare single-word AGREEMENT/PLAN nodes; new function _swap_synthetic_l0_with_real_title() detects synthetic L0, locates the earlier descriptive title, swaps depth assignments, and marks nodes with internal flags. |
| Forward Continuation Merge and Pipeline Integration (scripts/parse_doc2dict_with_config.py) | _merge_multiline_l0_title() extended with forward continuation walk that merges predicted-header siblings following swapped L0, combines titles in order, transfers body from final line into L0 when present, and marks merged records as envelopes; parse_one pipeline updated to run swap before merge. |
| Signature-Block Expansion Skip Logic (scripts/parse_doc2dict_with_config.py) | _explode_signature_block_lines() upward chain walk now skips parent nodes marked as synthetic separators, preventing synthetic AGREEMENT/PLAN nodes from being misclassified as signature-block party labels. |

Frozen Dataset and State Management

| Layer / File(s) | Summary |
| --- | --- |
| Dataset Freeze and State Tracking (data/auto_parse/level_freeze/frozen/idx_15.jsonl, data/auto_parse/level_freeze/state.json) | New JSONL dataset with 46 records representing parsed separation agreement sections and affirmation; frozen list extended with index 15; three new freeze history events added for idx_15 with timestamps and record counts. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • arthrod/clause-extract#5: Both PRs involve generating and freezing the idx_15 dataset—this PR updates the freeze state while also modifying the parsing logic in parse_doc2dict_with_config.py that affects how agreement title sections are parsed and structured.

Suggested labels

Feat2

Poem

🐰 A title swapped, a depth reversed,
Synthetic separators thoroughly nursed,
Forward merges dance in line,
Signatures skip what's not quite mine,
idx_15 now frozen, crystalline and fine!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: adding idx=15 (a separation agreement dataset) with the key parser fixes (multi-line title and synthetic L0 swap). |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, covering the dataset addition, the three parser pathologies addressed, and detailed verification results. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codeant-ai codeant-ai Bot added the size:L label (This PR changes 100-499 lines, ignoring generated files) May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a structural fix for document parsing to correctly identify agreement titles by swapping generic separators with descriptive titles found earlier in the text. It also updates multi-line title merging to support forward-walking collection and includes data updates for document index 15. Feedback identifies a discrepancy between the docstring and implementation in the new swap function and suggests optimizing regex compilation by moving it to the module level.

Comment on lines +2440 to +2447
1. The current L0 title is BARE: matches exactly "AGREEMENT" or
"PLAN" with nothing else.
2. The current L0 has NO body_direct.
3. There is an EARLIER record (smaller node_id) whose title matches
the AGREEMENT|PLAN end pattern AND has descriptive prefix words
(i.e. title is longer than just the bare word).
4. That earlier record is in agreement scope (not envelope, not
trailer).
medium

The docstring states that one of the conditions for the swap is that 'The current L0 has NO body_direct'. However, the implementation does not check for this, and a comment at line 2487 explicitly says 'Body presence is NOT a disqualifier'. This suggestion updates the docstring to be consistent with the implementation by removing this point and renumbering the list.

Suggested change
- 1. The current L0 title is BARE: matches exactly "AGREEMENT" or
-    "PLAN" with nothing else.
- 2. The current L0 has NO body_direct.
- 3. There is an EARLIER record (smaller node_id) whose title matches
-    the AGREEMENT|PLAN end pattern AND has descriptive prefix words
-    (i.e. title is longer than just the bare word).
- 4. That earlier record is in agreement scope (not envelope, not
-    trailer).
+ 1. The current L0 title is BARE: matches exactly "AGREEMENT" or
+    "PLAN" with nothing else.
+ 2. There is an EARLIER record (smaller node_id) whose title matches
+    the AGREEMENT|PLAN end pattern AND has descriptive prefix words
+    (i.e. title is longer than just the bare word).
+ 3. That earlier record is in agreement scope (not envelope, not
+    trailer).

Comment on lines +2506 to +2509
_AGREEMENT_END_RE = re.compile(
r"^.*\b(AGREEMENT|PLAN)\s*$",
re.IGNORECASE,
)
medium

For better performance, this regular expression should be compiled only once at the module level, for example, right after _BARE_AGREEMENT_PLAN_TITLE_RE. Compiling it on every function call is inefficient, especially since this function is called for each document parsed. You can then use the module-level constant _AGREEMENT_END_RE inside this function.
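
A minimal sketch of the module-level form suggested here. The helper function name is hypothetical; the pattern and flag are the ones quoted in the comment.

```python
import re

# Compiled once at import time and reused by every call.
_AGREEMENT_END_RE = re.compile(r"^.*\b(AGREEMENT|PLAN)\s*$", re.IGNORECASE)


def ends_with_agreement_or_plan(title):
    """True when the title ends in AGREEMENT or PLAN (case-insensitive)."""
    return _AGREEMENT_END_RE.match(title) is not None
```

Note that Python's `re` module keeps an internal cache of recently compiled patterns, so the runtime gain from hoisting is likely small; the clearer benefit is that the pattern sits next to `_BARE_AGREEMENT_PLAN_TITLE_RE` where both can be read together.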

@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.
