redo/infra: rubric reshape (title→L0, body clauses→L1), order field, 95% reconstruction gate, full freeze reset#72
redo/infra: rubric reshape (title→L0, body clauses→L1), order field, 95% reconstruction gate, full freeze reset#72arthrod wants to merge 2 commits into
Conversation
…full freeze reset
Rubric (level = nesting depth):
L0 = agreement title alone (was: title + preamble combined)
L1 = preamble paragraph, recitals block, every top-level body
clause (Article when present, otherwise Section), signature
block — all direct children of the agreement
L2 = direct children of L1 (Section under Article, or "(a)/(b)"
under top Section)
L3 = direct children of L2
L4+ = deeper nesting
+1 to every descendant per subdoc ancestor; ceiling 7
JSONL schema gains "order" field (4 keys: idx, order, level, span):
- 0-indexed sequence number within idx, in document order
- guarantees the linear sequence even if downstream loaders shuffle
JSON key order
Reconstruction-faithfulness gate (BLOCKING):
- freeze.py refuses on word_coverage < 95% per DECISIONS.md §10
- error message includes coverage %, char_ratio, missing-word count,
sample missing words so the agent can localize the gap
freeze.py validator now also checks:
- "order" present, 0-indexed, monotonic by 1 across all records
- "exactly one depth-0 record (the title alone)"
Full freeze reset:
- state.json: current_idx=0, frozen=[], history=[reset]
- data/auto_parse/level_freeze/frozen/idx_*.jsonl: all 14 tracked
frozens removed (invalidated by rubric change). 73 total baselines
on the local machine — 60 of them failed the new 95% gate; all
stashed at ~/Library/clause-extract-backups/before-redo-<ts>/
md updates:
- level_rubric.md: NEW rubric with worked depth table
- scope_rule.md: clarifies all-agreement-types-in-scope (private,
government, unilateral, international, multilateral); no
document-class-specific code allowed
- turn_prompt.md, examples_main_agreement.md, examples_with_subdocs.md,
freeze_command.md, README.md, advance_command.md, regress_command.md:
aligned with the new rubric and the 95% gate
- paths corrected (repo root is /Users/arthrod/temp/T/clause-extract,
not the doubled /clause-extract/clause-extract)
Smoke tests:
- parser runs on idx=0 → 66 records emitted, all 66 carry "order"
- prompt.py renders 540 lines for current_idx=0
- freeze.py against the smoke-test output correctly refuses with
"reconstruction word_coverage=88.0% < 95% bar" (parser still
emits old-rubric depths; agent will re-tune in per-idx redos)
Stash: ~/Library/clause-extract-backups/before-redo-20260511T222200/
Stack: this PR is the base for the redo/idx-N stacked PR series
(one PR per idx 0..72 rebaking under the new rubric)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Mention Blocks like a regular teammate with your question or request: @blocks review this pull request Run |
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
|
CodeAnt AI is reviewing your PR. |
📝 WalkthroughWalkthroughThis PR removes multiple experimental parsing pipeline snapshots and their corresponding frozen JSONL reference data. All changes are full file deletions from the doc2dict level-freeze attempts and frozen directories with no replacement content. ChangesExperimental Parsing Snapshots and Frozen Reference Data Cleanup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
| uv run scripts/measure_reconstruction.py --idx {current_idx} | ||
|
|
||
| Read the word coverage and char ratio. Word coverage < 95% | ||
| is a HARD FAIL at freeze time — the freeze gate refuses |
There was a problem hiding this comment.
🟠 Architect Review — HIGH
The prompt tells agents to run uv run scripts/measure_reconstruction.py --idx {current_idx}, but scripts/measure_reconstruction.py does not define any --idx option or positional idx argument, so this command fails and breaks the documented workflow in normal dispatches.
Suggestion: Either add an idx/--idx option to scripts/measure_reconstruction.py to support per-idx measurement, or update the prompt (and task_rules/turn_prompt.md) to use a valid invocation of the script and describe how to inspect a single idx from its outputs.
Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.
**Path:** scripts/level_loop/prompt.py
**Line:** 503:506
**Comment:**
*HIGH: The prompt tells agents to run `uv run scripts/measure_reconstruction.py --idx {current_idx}`, but `scripts/measure_reconstruction.py` does not define any `--idx` option or positional idx argument, so this command fails and breaks the documented workflow in normal dispatches.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| reconstructed = "".join((r.get("span") or "") for r in records) | ||
| source_norm = _normalize_text(source) | ||
| recon_norm = _normalize_text(reconstructed) | ||
| source_words = set(source_norm.split()) | ||
| recon_words = set(recon_norm.split()) |
There was a problem hiding this comment.
🟠 Architect Review — HIGH
The reconstruction check in freeze.py concatenates spans with "".join(...), while scripts/measure_reconstruction.py builds concat_text using "\n".join(chunks); because normalization then tokenizes on whitespace, this difference can fuse boundary words into a single token, making the blocking gate's word-coverage calculation diverge from the standalone measurement script despite the comment claiming they match.
Suggestion: Align _measure_reconstruction in freeze.py with load_parser_concat/measure in scripts/measure_reconstruction.py (e.g. by sharing a common helper that joins with newlines and normalizes identically) so that the freeze gate's pass/fail decision uses exactly the same reconstruction metric as the diagnostic tool.
Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood.
**Path:** scripts/level_loop/freeze.py
**Line:** 349:353
**Comment:**
*HIGH: The reconstruction check in `freeze.py` concatenates spans with `"".join(...)`, while `scripts/measure_reconstruction.py` builds `concat_text` using `"\n".join(chunks)`; because normalization then tokenizes on whitespace, this difference can fuse boundary words into a single token, making the blocking gate's word-coverage calculation diverge from the standalone measurement script despite the comment claiming they match.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
Code Review
This pull request removes the parse_doc2dict_with_config.py script along with several frozen baseline JSONL files. Feedback from the review highlights that state.json, which is required for the state reset described in the PR, is missing from the commit. Additionally, there is a discrepancy between the number of removed files mentioned in the PR description and those actually present in the diff, suggesting that several indices may have been missed during the cleanup process.
| @@ -1,66 +0,0 @@ | |||
| {"idx": 0, "level": 1, "span": "ULURU Inc."} | |||
There was a problem hiding this comment.
The PR description mentions that data/auto_parse/level_freeze/state.json was updated to reset the state, but this file is missing from the diff. Given the note about MM indicators in git status, it's possible this file was modified but not staged for the commit. This file is essential for the "full freeze reset" to take effect.
| @@ -1,66 +0,0 @@ | |||
| {"idx": 0, "level": 1, "span": "ULURU Inc."} | |||
There was a problem hiding this comment.
The PR description states that 14 tracked frozens were removed, but the diff only shows 9 .jsonl files being removed from the frozen/ directory. Please verify if indices 1, 2, 3, 5, and 12 (which have attempt snapshots removed in this PR) also have corresponding frozen baselines that should be deleted to complete the reset.
…strip + punct drop)
User lowered the reconstruction gate from 95% to 90% after measuring the
actual failure rate across the 21 stashed baselines:
bar pass / 21
≥95% 3 (14%)
≥90% 6 (29%) ← current
≥85% 12 (57%)
≥80% 16 (76%)
But ~half the "missing" tokens were metric artifacts, not real content
drops. Three changes to fix that without softening the spirit of the bar:
1. Boundary fix: concat spans with " " instead of "" when computing the
reconstruction. Without this, "(g)" at the start of one record fuses
with the trailing word of the previous record (e.g. "evidence.(g)"
becomes one token), making "(g)" look missing.
2. Envelope strip: drop SEC-envelope-marker tokens from the source-side
word set before comparing. The parser correctly drops the
`<DOCUMENT>` envelope (e.g. "EXHIBIT 10.25") from JSONL, but
span_clean still contains it. Tokens removed in the leading ~600
chars: "exhibit", pure-decimal numbers ("10", "10.25"), filename
identifiers (e.g. "ex_10-25.htm", "arlz_ex10_1"), and globally
"confidential treatment requested" marker tokens.
3. Pure-punctuation drop: tokens with no alphanumeric content (",",
".", ";", "(", "“", "_______________", etc.) carry no semantic
signal — dropped from BOTH source and reconstruction sides.
After all three fixes:
bar pass / 21 delta
≥95% 4 (19%) +1
≥90% 6 (29%) same
≥85% 15 (71%) +3
≥80% 17 (81%) +1
mean coverage: 87.1% (was 84.8%)
median: 88.0% (was 85.5%)
Idx=0 specifically: 88.0% → 89.7% (just barely under the 90% bar; the
remaining ~150 missing tokens are a real signal — sections 14-21 of the
agreement are dropped by the parser, which is what the per-idx redos
need to fix).
Documentation updated to reflect the 90% bar in level_rubric.md,
turn_prompt.md, freeze_command.md, README.md, prompt.py template.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
CodeAnt AI is running Incremental review |
|
CodeAnt AI Incremental review completed. |
User description
Summary
Foundation PR for the redo/idx-N stacked-PR series. Resets the freeze loop's infrastructure so each subsequent per-idx PR works against a single, coherent rubric + schema + gate set.
Rubric reshape
The previous rubric put title+preamble together at L0 and numbered Sections at L2. The new rubric:
(a)/(b)under top Section)This invalidates all previously frozen baselines. They are stashed (path below).
JSONL schema gains
orderEach record now has four keys:
{idx, order, level, span}.orderis a 0-indexed per-idx sequence number in document order, so downstream consumers can reconstruct the linear sequence without relying on JSON-key ordering.Reconstruction-faithfulness gate is now BLOCKING
freeze.pyrefuses any freeze whereword_coverage < 95%(perdocs/DECISIONS.md§10). The previous draft had this as a non-blocking warning; the user upgraded it to a hard gate so per-idx baselines cannot sneak below the bar.The error message includes coverage %, char_ratio, missing-word count, and a sample of missing words so the agent on the next dispatch can localize what got dropped.
Validator updates
freeze.pyrubric checks now expect:orderpresent on every record, 0-indexed, monotonic by 1 across the line orderFull freeze reset
data/auto_parse/level_freeze/state.json→{current_idx: 0, frozen: [], history: [reset]}data/auto_parse/level_freeze/frozen/idx_*.jsonlremoved (14 tracked frozens invalidated by the new rubric)~/Library/clause-extract-backups/before-redo-20260511T222200/(166 MB; outside the repo per Q-stash)md updates
level_rubric.md— full rewrite with the new depth table and worked subdoc-penalty arithmeticscope_rule.md— opens with "every kind of agreement is in scope" (private, government, unilateral, international, multilateral); explicit ban on document-class-specific code pathsturn_prompt.md,examples_main_agreement.md,examples_with_subdocs.md,freeze_command.md,README.md,advance_command.md,regress_command.md— aligned with the new rubric and the 95% gate/Users/arthrod/temp/T/clause-extract, not the doubled/clause-extract/clause-extract)Smoke tests
ast.parseclean onfreeze.py,prompt.py,parse_doc2dict_with_config.pyorderprompt.pyrenders 540 lines for current_idx=0freeze.pyagainst the smoke-test output correctly refuses withreconstruction word_coverage=88.0% < 95% bar(the parser still emits old-rubric depths until the agent re-tunes during the per-idx redos)Stack base
This PR is the foundation for the per-idx stacked PR series. Each subsequent
redo/idx-Nbranch (N = 0..72) will be branched from the previous one (idx=0 fromredo/infra) and add one frozen baseline per PR.Test plan
Notes
MMindicators on some files ingit statuswere from a pre-existing WIP commit on a separate branch (wip/before-redo-20260511T222208); that branch is local-only and can be deleted after this PR merges.🤖 Generated with Claude Code
CodeAnt-AI Description
Tighten parser-tuning rules and enforce reconstruction checks
What Changed
ordernumber so the original clause sequence can be reconstructed without relying on JSON field order.Impact
✅ Fewer frozen parses with missing contract text✅ Clearer clause ordering in exported output✅ Safer tuning runs across government, unilateral, and multilateral agreements🔄 Retrigger CodeAnt AI Review
💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.