
redo/infra: rubric reshape (title→L0, body clauses→L1), order field, 95% reconstruction gate, full freeze reset #72

Open: arthrod wants to merge 2 commits into proper_goals from redo/infra

Conversation

@arthrod
Owner

@arthrod arthrod commented May 12, 2026

User description

Summary

Foundation PR for the redo/idx-N stacked-PR series. Resets the freeze loop's infrastructure so each subsequent per-idx PR works against a single, coherent rubric + schema + gate set.

Rubric reshape

The previous rubric put title+preamble together at L0 and numbered Sections at L2. The new rubric:

  • L0 = agreement title alone (one record per idx)
  • L1 = every direct child of the agreement — preamble, recitals, every top-level body clause (Article when present, else Section), and the signature block
  • L2 = direct children of L1 (Section under Article, or (a)/(b) under top Section)
  • L3 = direct children of L2
  • L4+ = deeper nesting; +1 per subdoc ancestor; ceiling 7
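
The depth arithmetic in the list above can be sketched as follows. This is a minimal illustration, assuming the subdocument penalty is simply added to the nesting depth before the ceiling is applied. The function name `rubric_level` and its parameters are hypothetical and are not taken from freeze.py or the parser.

```python
MAX_LEVEL = 7  # ceiling stated in the rubric


def rubric_level(nesting_depth: int, subdoc_ancestors: int = 0) -> int:
    """Return the rubric level for a node (hypothetical helper).

    nesting_depth: 0 for the agreement title, 1 for its direct children, etc.
    subdoc_ancestors: number of enclosing subdocuments (assumed to add +1 each).
    """
    if nesting_depth < 0 or subdoc_ancestors < 0:
        raise ValueError("depth and ancestor count must be non-negative")
    return min(nesting_depth + subdoc_ancestors, MAX_LEVEL)


print(rubric_level(0))     # agreement title -> 0
print(rubric_level(1))     # preamble, recitals, top-level clause, signature block -> 1
print(rubric_level(2, 1))  # L2 node inside one subdocument -> 3
print(rubric_level(6, 3))  # deep nesting, clamped to the ceiling -> 7
```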

This invalidates all previously frozen baselines. They are stashed (path below).

JSONL schema gains order

Each record now has four keys: {idx, order, level, span}. The order field is a 0-indexed, per-idx sequence number in document order, so downstream consumers can reconstruct the linear sequence without relying on JSON-key ordering.
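
As a rough sketch of how a consumer might use the new field, the snippet below sorts records by order before joining them. The sample spans are made up for illustration, and the helper name `linear_spans` is hypothetical; only the four key names come from the description above.

```python
import json


def linear_spans(jsonl_text: str, idx: int) -> list[str]:
    """Return the spans of one idx in document order, using the order field."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    selected = sorted(
        (r for r in records if r["idx"] == idx),
        key=lambda r: r["order"],
    )
    return [r["span"] for r in selected]


# Hypothetical records, deliberately shuffled and with keys in mixed order.
sample = "\n".join([
    '{"span": "1. Definitions.", "level": 1, "order": 2, "idx": 0}',
    '{"idx": 0, "order": 0, "level": 0, "span": "EXAMPLE AGREEMENT"}',
    '{"level": 1, "idx": 0, "span": "This Agreement is made between...", "order": 1}',
])

for span in linear_spans(sample, idx=0):
    print(span)
```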

Reconstruction-faithfulness gate is now BLOCKING

freeze.py refuses any freeze where word_coverage < 95% (per docs/DECISIONS.md §10). The previous draft had this as a non-blocking warning; the user upgraded it to a hard gate so per-idx baselines cannot sneak below the bar.

The error message includes coverage %, char_ratio, missing-word count, and a sample of missing words so the agent on the next dispatch can localize what got dropped.
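
A minimal sketch of such a gate is shown below. It assumes a set-based word coverage, a lowercase and whitespace-collapsing normalization, and a space-joined reconstruction; the helper names (`measure`, `check_gate`) and the exact message layout are illustrative and may differ from what freeze.py actually does.

```python
import re

BAR = 0.95  # threshold described above


def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip()


def measure(source: str, spans: list[str]) -> dict:
    """Compare the source text against the concatenated spans."""
    source_norm = _normalize(source)
    recon_norm = _normalize(" ".join(spans))
    source_words = set(source_norm.split())
    recon_words = set(recon_norm.split())
    missing = sorted(source_words - recon_words)
    coverage = 1.0 if not source_words else 1 - len(missing) / len(source_words)
    char_ratio = len(recon_norm) / len(source_norm) if source_norm else 1.0
    return {"word_coverage": coverage, "char_ratio": char_ratio, "missing": missing}


def check_gate(source: str, spans: list[str], bar: float = BAR) -> dict:
    """Raise ValueError when coverage is below the bar, else return the metrics."""
    m = measure(source, spans)
    if m["word_coverage"] < bar:
        raise ValueError(
            f"reconstruction word_coverage={m['word_coverage']:.1%} < {bar:.0%} bar; "
            f"char_ratio={m['char_ratio']:.2f}; "
            f"missing={len(m['missing'])}; sample={m['missing'][:5]}"
        )
    return m
```

Calling `check_gate("alpha beta gamma delta", ["alpha beta", "gamma"])` raises, because one of the four source words never appears in the spans.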

Validator updates

freeze.py rubric checks now expect:

  • exactly one depth-0 record (the title alone, not title+preamble)
  • order present on every record, starting at 0 and increasing by exactly 1 from one line to the next
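
The two checks above could look roughly like this. It is a sketch under the assumption that records arrive as a list of dicts in file line order; `validate_records` is a hypothetical name, not the function in freeze.py.

```python
def validate_records(records: list[dict]) -> list[str]:
    """Return a list of rubric violations (empty when the records pass)."""
    errors = []

    # Check 1: exactly one depth-0 record (the title alone).
    depth0 = [r for r in records if r.get("level") == 0]
    if len(depth0) != 1:
        errors.append(f"expected exactly one depth-0 record, found {len(depth0)}")

    # Check 2: order is present and equals the 0-indexed line position.
    for line_no, rec in enumerate(records):
        if "order" not in rec:
            errors.append(f"line {line_no}: missing 'order'")
        elif rec["order"] != line_no:
            errors.append(f"line {line_no}: order={rec['order']}, expected {line_no}")
    return errors


good = [
    {"idx": 0, "order": 0, "level": 0, "span": "TITLE"},
    {"idx": 0, "order": 1, "level": 1, "span": "Preamble"},
]
bad = [
    {"idx": 0, "order": 0, "level": 0, "span": "TITLE"},
    {"idx": 0, "order": 2, "level": 0, "span": "Preamble"},
]
print(validate_records(good))
print(validate_records(bad))
```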

Full freeze reset

  • data/auto_parse/level_freeze/state.json reset to {current_idx: 0, frozen: [], history: [reset]}
  • data/auto_parse/level_freeze/frozen/idx_*.jsonl removed (14 tracked frozens invalidated by the new rubric)
  • Local stash of all 73 baselines + attempts/turns kept at ~/Library/clause-extract-backups/before-redo-20260511T222200/ (166 MB; outside the repo per Q-stash)

md updates

  • level_rubric.md — full rewrite with the new depth table and worked subdoc-penalty arithmetic
  • scope_rule.md — opens with "every kind of agreement is in scope" (private, government, unilateral, international, multilateral); explicit ban on document-class-specific code paths
  • turn_prompt.md, examples_main_agreement.md, examples_with_subdocs.md, freeze_command.md, README.md, advance_command.md, regress_command.md — aligned with the new rubric and the 95% gate
  • Path corrections (repo root is /Users/arthrod/temp/T/clause-extract, not the doubled /clause-extract/clause-extract)

Smoke tests

  • ast.parse clean on freeze.py, prompt.py, parse_doc2dict_with_config.py
  • parser runs on idx=0 → 66 records emitted, all 66 carry order
  • prompt.py renders 540 lines for current_idx=0
  • freeze.py against the smoke-test output correctly refuses with reconstruction word_coverage=88.0% < 95% bar (the parser still emits old-rubric depths until the agent re-tunes during the per-idx redos)

Stack base

This PR is the foundation for the per-idx stacked PR series. Each subsequent redo/idx-N branch (N = 0..72) will be branched from the previous one (idx=0 from redo/infra) and add one frozen baseline per PR.

Test plan

  • Syntax check all modified Python files
  • Parser runs cleanly on idx=0 with new schema
  • Prompt template renders without format errors
  • freeze.py refuses the smoke-test output (reconstruction gate fires correctly)
  • Backup of pre-redo state at known location

Notes

  • The MM indicators on some files in git status were from a pre-existing WIP commit on a separate branch (wip/before-redo-20260511T222208); that branch is local-only and can be deleted after this PR merges.
  • 60 of the 73 backed-up baselines failed the new 95% gate when tested before the rubric change. That informs the redo: many idxs need substantial re-tuning, not just a depth shift.

🤖 Generated with Claude Code


CodeAnt-AI Description

Tighten parser-tuning rules and enforce reconstruction checks

What Changed

  • The tuning loop now treats the agreement title as the only depth-0 record, with the preamble, recitals, top-level clauses, and signature block starting at depth 1.
  • Each parsed span now gets an explicit order number so the original clause sequence can be reconstructed without relying on JSON field order.
  • Freeze now refuses outputs that miss too much source text, and it reports missing words and coverage details when reconstruction falls below the bar.
  • The guidance now applies the same parsing rules to all agreement types and warns against document-type-specific branches and stale parser output.

Impact

✅ Fewer frozen parses with missing contract text
✅ Clearer clause ordering in exported output
✅ Safer tuning runs across government, unilateral, and multilateral agreements

…full freeze reset

Rubric (level = nesting depth):
  L0 = agreement title alone (was: title + preamble combined)
  L1 = preamble paragraph, recitals block, every top-level body
       clause (Article when present, otherwise Section), signature
       block — all direct children of the agreement
  L2 = direct children of L1 (Section under Article, or "(a)/(b)"
       under top Section)
  L3 = direct children of L2
  L4+ = deeper nesting
  +1 to every descendant per subdoc ancestor; ceiling 7

JSONL schema gains "order" field (4 keys: idx, order, level, span):
  - 0-indexed sequence number within idx, in document order
  - guarantees the linear sequence even if downstream loaders shuffle
    JSON key order

Reconstruction-faithfulness gate (BLOCKING):
  - freeze.py refuses on word_coverage < 95% per DECISIONS.md §10
  - error message includes coverage %, char_ratio, missing-word count,
    sample missing words so the agent can localize the gap

freeze.py validator now also checks:
  - "order" present, 0-indexed, monotonic by 1 across all records
  - "exactly one depth-0 record (the title alone)"

Full freeze reset:
  - state.json: current_idx=0, frozen=[], history=[reset]
  - data/auto_parse/level_freeze/frozen/idx_*.jsonl: all 14 tracked
    frozens removed (invalidated by rubric change). 73 total baselines
    on the local machine — 60 of them failed the new 95% gate; all
    stashed at ~/Library/clause-extract-backups/before-redo-<ts>/

md updates:
  - level_rubric.md: NEW rubric with worked depth table
  - scope_rule.md: clarifies all-agreement-types-in-scope (private,
    government, unilateral, international, multilateral); no
    document-class-specific code allowed
  - turn_prompt.md, examples_main_agreement.md, examples_with_subdocs.md,
    freeze_command.md, README.md, advance_command.md, regress_command.md:
    aligned with the new rubric and the 95% gate
  - paths corrected (repo root is /Users/arthrod/temp/T/clause-extract,
    not the doubled /clause-extract/clause-extract)

Smoke tests:
  - parser runs on idx=0 → 66 records emitted, all 66 carry "order"
  - prompt.py renders 540 lines for current_idx=0
  - freeze.py against the smoke-test output correctly refuses with
    "reconstruction word_coverage=88.0% < 95% bar" (parser still
    emits old-rubric depths; agent will re-tune in per-idx redos)

Stash:  ~/Library/clause-extract-backups/before-redo-20260511T222200/
Stack:  this PR is the base for the redo/idx-N stacked PR series
        (one PR per idx 0..72 rebaking under the new rubric)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codeant-ai

codeant-ai Bot commented May 12, 2026

CodeAnt AI is reviewing your PR.

@coderabbitai

coderabbitai Bot commented May 12, 2026

Walkthrough

This PR removes multiple experimental parsing pipeline snapshots and their corresponding frozen JSONL reference data. All changes are full file deletions from the doc2dict level-freeze attempts and frozen directories with no replacement content.

Changes

Experimental Parsing Snapshots and Frozen Reference Data Cleanup

| Layer / File(s) | Summary |
| --- | --- |
| Deleted experimental doc2dict parsing snapshots: `data/auto_parse/level_freeze/attempts/idx_*_attempt*_snapshot.py` (17 files) | Complete removal of all attempt versions of doc2dict HTML-to-tree parsing pipelines. Each script previously implemented EX-10 corpus parsing with HTML extraction, rubric depth remapping, structural scope filtering (agreement vs trailer based on signature blocks), section-tree walking, node promotion, and CLI entry points (parse_one, main) writing to Parquet and JSONL outputs. All 17 variations deleted without replacement. |
| Deleted frozen JSONL reference data: `data/auto_parse/level_freeze/frozen/idx_*.jsonl` (9 files) | Complete removal of frozen JSONL snapshot data files (idx_0, idx_4, idx_6, idx_7, idx_8, idx_9, idx_10, idx_11, idx_13) that served as expected outputs and test references from the parsing snapshots. Each file contained extracted text spans and section records at varying hierarchy levels. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Feat2

Poem

🐰 Whiskers twitch as experiments rest,
Old snapshots bundled, given a test,
Frozen data thawed and cleared away,
Make room for growth another day!
🗑️✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main infrastructure changes: rubric reshape with specific depth mapping, order field addition, reconstruction gate enforcement, and full freeze reset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is detailed and directly relates to the changeset, covering rubric restructuring, schema changes, reconstruction gate enforcement, and comprehensive documentation updates. |


@coderabbitai coderabbitai Bot added the Feat2 label May 12, 2026
@codeant-ai codeant-ai Bot added the size:XL label (this PR changes 500-999 lines, ignoring generated files) May 12, 2026
Comment on lines +503 to +506
uv run scripts/measure_reconstruction.py --idx {current_idx}

Read the word coverage and char ratio. Word coverage < 95%
is a HARD FAIL at freeze time — the freeze gate refuses

🟠 Architect Review — HIGH

The prompt tells agents to run uv run scripts/measure_reconstruction.py --idx {current_idx}, but scripts/measure_reconstruction.py does not define any --idx option or positional idx argument, so this command fails and breaks the documented workflow in normal dispatches.

Suggestion: Either add an idx/--idx option to scripts/measure_reconstruction.py to support per-idx measurement, or update the prompt (and task_rules/turn_prompt.md) to use a valid invocation of the script and describe how to inspect a single idx from its outputs.
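
One way the first option in the suggestion could look is sketched below with argparse. This is an assumption about how scripts/measure_reconstruction.py might be extended, not its current interface; `build_parser` and `select_indices` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an optional --idx filter (hypothetical)."""
    parser = argparse.ArgumentParser(description="Measure reconstruction coverage.")
    parser.add_argument(
        "--idx",
        type=int,
        default=None,
        help="measure only this idx; omit to measure every available idx",
    )
    return parser


def select_indices(all_indices, idx):
    """Return the indices to measure, honoring the optional --idx filter."""
    if idx is None:
        return list(all_indices)
    if idx not in all_indices:
        raise SystemExit(f"idx {idx} not found")
    return [idx]


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--idx", "0"])
print(select_indices(range(3), args.idx))
```

Keeping `--idx` optional means existing invocations without the flag would continue to measure everything.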

Path: scripts/level_loop/prompt.py (lines 503 to 506)

Comment thread scripts/level_loop/freeze.py Outdated
Comment on lines +349 to +353
reconstructed = "".join((r.get("span") or "") for r in records)
source_norm = _normalize_text(source)
recon_norm = _normalize_text(reconstructed)
source_words = set(source_norm.split())
recon_words = set(recon_norm.split())

🟠 Architect Review — HIGH

The reconstruction check in freeze.py concatenates spans with "".join(...), while scripts/measure_reconstruction.py builds concat_text using "\n".join(chunks). Because normalization then tokenizes on whitespace, the empty-string join can fuse the last word of one span with the first word of the next into a single token. The blocking gate's word-coverage calculation can therefore diverge from the standalone measurement script, despite the comment claiming they match.

Suggestion: Align _measure_reconstruction in freeze.py with load_parser_concat/measure in scripts/measure_reconstruction.py (e.g. by sharing a common helper that joins with newlines and normalizes identically) so that the freeze gate's pass/fail decision uses exactly the same reconstruction metric as the diagnostic tool.
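
The effect of the separator is easy to reproduce. The sketch below uses two made-up spans and a hypothetical shared helper, `reconstruct`, that both tools could call so they tokenize the same text.

```python
def reconstruct(spans: list[str], sep: str = "\n") -> str:
    """Join spans with a whitespace separator so boundary words stay separate."""
    return sep.join(s or "" for s in spans)


def word_set(text: str) -> set[str]:
    # Whitespace tokenization, as described in the review comment.
    return set(text.lower().split())


# Illustrative spans: the second record starts with an enumerator.
spans = ["shall be admissible as evidence.", "(g) Notices."]

fused = word_set(reconstruct(spans, sep=""))
separated = word_set(reconstruct(spans, sep="\n"))

print("(g)" in fused)           # False: the token became "evidence.(g)"
print("evidence.(g)" in fused)  # True
print("(g)" in separated)       # True
```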


@codeant-ai

codeant-ai Bot commented May 12, 2026

CodeAnt AI finished reviewing your PR.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes the parse_doc2dict_with_config.py script along with several frozen baseline JSONL files. Feedback from the review highlights that state.json, which is required for the state reset described in the PR, is missing from the commit. Additionally, there is a discrepancy between the number of removed files mentioned in the PR description and those actually present in the diff, suggesting that several indices may have been missed during the cleanup process.

@@ -1,66 +0,0 @@
{"idx": 0, "level": 1, "span": "ULURU Inc."}

high

The PR description mentions that data/auto_parse/level_freeze/state.json was updated to reset the state, but this file is missing from the diff. Given the note about MM indicators in git status, it's possible this file was modified but not staged for the commit. This file is essential for the "full freeze reset" to take effect.

@@ -1,66 +0,0 @@
{"idx": 0, "level": 1, "span": "ULURU Inc."}

medium

The PR description states that 14 tracked frozens were removed, but the diff only shows 9 .jsonl files being removed from the frozen/ directory. Please verify if indices 1, 2, 3, 5, and 12 (which have attempt snapshots removed in this PR) also have corresponding frozen baselines that should be deleted to complete the reset.
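
The gap the reviewer points at can be rechecked with a set difference. This sketch assumes the 14 tracked frozens were idx 0 through 13, which is the reviewer's implied reading and is not confirmed by the PR description.

```python
# Indices whose frozen JSONL files appear as deletions in the diff.
removed_in_diff = {0, 4, 6, 7, 8, 9, 10, 11, 13}

# Assumption: the 14 tracked frozens were idx 0..13.
assumed_tracked = set(range(14))

unaccounted = sorted(assumed_tracked - removed_in_diff)
print(len(removed_in_diff))  # 9
print(unaccounted)           # [1, 2, 3, 5, 12]
```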

…strip + punct drop)

User lowered the reconstruction gate from 95% to 90% after measuring the
actual failure rate across the 21 stashed baselines:

  bar    pass / 21
  ≥95%      3  (14%)
  ≥90%      6  (29%)   ← current
  ≥85%     12  (57%)
  ≥80%     16  (76%)

But ~half the "missing" tokens were metric artifacts, not real content
drops. Three changes to fix that without softening the spirit of the bar:

  1. Boundary fix: concat spans with " " instead of "" when computing the
     reconstruction. Without this, "(g)" at the start of one record fuses
     with the trailing word of the previous record (e.g. "evidence.(g)"
     becomes one token), making "(g)" look missing.

  2. Envelope strip: drop SEC-envelope-marker tokens from the source-side
     word set before comparing. The parser correctly drops the
     `<DOCUMENT>` envelope (e.g. "EXHIBIT 10.25") from JSONL, but
     span_clean still contains it. Tokens removed in the leading ~600
     chars: "exhibit", pure-decimal numbers ("10", "10.25"), filename
     identifiers (e.g. "ex_10-25.htm", "arlz_ex10_1"), and globally
     "confidential treatment requested" marker tokens.

  3. Pure-punctuation drop: tokens with no alphanumeric content (",",
     ".", ";", "(", "“", "_______________", etc.) carry no semantic
     signal — dropped from BOTH source and reconstruction sides.
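
The three changes can be sketched together as below. This is a simplified illustration, not the code in freeze.py: the envelope strip here only handles the "exhibit" token and pure-decimal numbers in the leading window, and it ignores filename identifiers and the confidential-treatment markers. All function names are hypothetical.

```python
import re

ENVELOPE_WINDOW = 600  # "leading ~600 chars" from the list above
_DECIMAL = re.compile(r"\d+(\.\d+)?")


def has_alnum(token: str) -> bool:
    # Fix 3: tokens without any alphanumeric character carry no signal.
    return any(ch.isalnum() for ch in token)


def source_word_set(source: str) -> set[str]:
    words = {t for t in source.lower().split() if has_alnum(t)}
    # Fix 2 (simplified): drop envelope-like tokens seen in the leading window.
    head = set(source[:ENVELOPE_WINDOW].lower().split())
    envelope = {t for t in head if t == "exhibit" or _DECIMAL.fullmatch(t)}
    return words - envelope


def recon_word_set(spans: list[str]) -> set[str]:
    # Fix 1: join with a space so boundary tokens do not fuse.
    return {t for t in " ".join(spans).lower().split() if has_alnum(t)}


def coverage(source: str, spans: list[str]) -> float:
    src = source_word_set(source)
    return 1.0 if not src else len(src & recon_word_set(spans)) / len(src)


source = "EXHIBIT 10.25\nSERVICES AGREEMENT\n( a ) Term ."
spans = ["SERVICES AGREEMENT", "( a ) Term ."]
print(coverage(source, spans))
```

Because the comparison is set-based, a token removed as envelope residue is removed everywhere in the source, which is one reason the window is kept small.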

After all three fixes:

  bar    pass / 21      delta
  ≥95%      4  (19%)   +1
  ≥90%      6  (29%)    same
  ≥85%     15  (71%)   +3
  ≥80%     17  (81%)   +1
  mean coverage:  87.1%  (was 84.8%)
  median:         88.0%  (was 85.5%)
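
The percentages and deltas in the two tables can be rechecked with a few lines of arithmetic. The counts are copied from the tables; nothing else is assumed.

```python
TOTAL = 21  # stashed baselines measured

before = {95: 3, 90: 6, 85: 12, 80: 16}
after = {95: 4, 90: 6, 85: 15, 80: 17}


def pct(count: int, total: int = TOTAL) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * count / total)


for bar in sorted(before, reverse=True):
    delta = after[bar] - before[bar]
    print(
        f">={bar}%: {before[bar]} ({pct(before[bar])}%) -> "
        f"{after[bar]} ({pct(after[bar])}%), delta {delta:+d}"
    )
```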

Idx=0 specifically: 88.0% → 89.7% (just barely under the 90% bar; the
remaining ~150 missing tokens are a real signal — sections 14-21 of the
agreement are dropped by the parser, which is what the per-idx redos
need to fix).

Documentation updated to reflect the 90% bar in level_rubric.md,
turn_prompt.md, freeze_command.md, README.md, prompt.py template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codeant-ai

codeant-ai Bot commented May 16, 2026

CodeAnt AI is running Incremental review

@codeant-ai codeant-ai Bot added the size:XXL label (this PR changes 1000+ lines, ignoring generated files) and removed the size:XL label (this PR changes 500-999 lines, ignoring generated files) May 16, 2026
@codeant-ai

codeant-ai Bot commented May 16, 2026

CodeAnt AI Incremental review completed.
