idx=14: freeze (381 records) — Aralez Designation of Agent Agreement (tri-party gov contract with embedded SF-1449) #87

Open
arthrod wants to merge 1 commit into redo/idx-13 from redo/idx-14

@arthrod arthrod commented May 17, 2026

User description

Summary

Fifteenth stacked PR. Adds idx=14 (DESIGNATION OF AGENT AGREEMENT between Aralez Pharmaceuticals US Inc., AstraZeneca Pharmaceuticals LP, and the US Government via VA, February 23, 2017) as the fifteenth verified frozen baseline on top of idx=13 (PR #86).

This is the corpus's first tri-party government contract with a complex embedded subdocument structure: the small Designation of Agent Agreement itself is followed by the entire VA National Contract solicitation/SF-30 modification (~145K chars) containing a NOVATION AGREEMENT, multiple SECTIONS (B/C/D with 12 FAR/VAAR clauses + 4 attachments), and a multi-page SMALL BUSINESS SUBCONTRACTING PLAN.

Parser change (1 surgical, shape-driven)

Extended real_subdoc_ids in _apply_scope_rule (~lines 589-608) to also include secondary-agreement carriers — nodes with subdoc_penalty=0 themselves whose direct children carry subdoc_penalty>=1. The depth-walker sets this when a non-subdoc-class section's title matches the AGREEMENT|PLAN structural-level-0 pattern AND the primary L0 already exists. Recovers secondary agreements (NOVATION AGREEMENT at o=75, SMALL BUSINESS SUBCONTRACTING PLAN at o=330) that were previously stranded in scope=trailer.

Pure structural — uses subdoc_penalty arithmetic from the tree walk. No phrase matching.
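The carrier test described above can be sketched in isolation. This is a minimal illustration, not the code in _apply_scope_rule: the record layout (a nested "children" list), the helper names, and the toy node_ids are assumptions; only the condition itself mirrors the PR.

```python
from typing import Any

Record = dict[str, Any]


def penalty(record: Record) -> int:
    """Return subdoc_penalty as an int, treating a missing or None value as 0."""
    return record.get("subdoc_penalty", 0) or 0


def secondary_carrier_ids(records: list[Record]) -> set[str]:
    """Collect node_ids of non-envelope nodes with penalty 0 whose direct children have penalty >= 1."""
    carriers: set[str] = set()
    for r in records:
        children = r.get("children") or []  # hypothetical layout: children nested on the record
        if (
            not r.get("is_envelope")
            and penalty(r) == 0
            and any(penalty(c) > 0 for c in children)
        ):
            carriers.add(r["node_id"])
    return carriers


# Toy records; node_ids are made up for illustration.
records = [
    {"node_id": "main", "subdoc_penalty": 0,
     "children": [{"node_id": "s1", "subdoc_penalty": 0}]},
    {"node_id": "novation", "subdoc_penalty": 0,
     "children": [{"node_id": "n1", "subdoc_penalty": 1}]},
    {"node_id": "envelope", "is_envelope": True, "subdoc_penalty": 0,
     "children": [{"node_id": "e1", "subdoc_penalty": 1}]},
]
print(secondary_carrier_ids(records))
```

Only the "novation" node qualifies here: "main" has no penalized children, and "envelope" is excluded by the is_envelope guard.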

Cross-idx audit: idxs 1, 2, 3, 5, and 13 also have secondary carriers, but those carriers are already in scope=agreement (they were trivially classified). Only idx=14 has a carrier currently in trailer that flips to agreement.

Verified output for idx=14

  • 381 records (was 330 before fix — 51 records recovered from trailer)
  • Distribution {L0:1, L1:107, L2:110, L3:107, L4:45, L5:11} (max depth 5, ≤7 ceiling)
  • Reconstruction: word_coverage 92.4% (above 90% gate)
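The figures above are internally consistent, which a few lines of arithmetic can confirm. This is only a sanity check on the numbers quoted in this description; the variable names are illustrative and nothing here calls the real freeze script.

```python
# Level distribution as quoted in the description.
level_counts = {"L0": 1, "L1": 107, "L2": 110, "L3": 107, "L4": 45, "L5": 11}

total_records = sum(level_counts.values())
recovered = total_records - 330                      # records before the fix
max_depth = max(int(level[1:]) for level in level_counts)

WORD_COVERAGE_GATE = 90.0                            # blocking gate, per the description
DEPTH_CEILING = 7                                    # depth ceiling, per the description
word_coverage = 92.4

print(total_records, recovered, max_depth)
print("coverage gate passed:", word_coverage >= WORD_COVERAGE_GATE)
print("depth within ceiling:", max_depth <= DEPTH_CEILING)
```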

Top structure

o=0   L0: DESIGNATION OF AGENT AGREEMENT
o=1   L1: This Designation of Agent Agreement (...) is entered effective as of February 23, 2017 by and between Aralez Pharmaceuticals US Inc. ('Contractor')...
o=2-8 L1: Sections 1-7 of the main agreement (Designation, Ordering, Construction, No liability, Disclaimer, Term, Assignment)
o=9   L1: IN WITNESS WHEREOF, the parties have caused this Agreement to be duly executed...
o=10  L2: ARALEZ PHARMACEUTICALS US INC. / By: /s/ Eric Trachtenberg / ASTRAZENECA / UNITED STATES GOVERNMENT (tri-party sig block, doc2dict natural grouping)
o=11  L2: SECTION B - CONTINUATION OF SF 1449 BLOCKS — this single record holds 145,034 chars of doc2dict's SF-1449/SF-30 table-cell serialization (see char_ratio caveat below)
o=75  L1: NOVATION AGREEMENT                                    ← recovered from trailer
o=330 L1: SMALL BUSINESS SUBCONTRACTING PLAN                     ← recovered from trailer

⚠️ char_ratio caveat (170.2% — diagnostic only, not a gate)

word_coverage 92.4% passes the 90% blocking gate. char_ratio is informational per freeze_command.md.

The 170.2% char_ratio is a 70-point outlier vs the 14 prior baselines (80.9%–99.9%). Inspector identified the root cause:

The cause is NOT records duplicating each other: a fingerprint check found zero exact duplicates and zero substring containments.

Root cause: o=11 alone holds 135,609 normalized chars (95.6% of the source-of-truth's 141,805 chars) — doc2dict's flattened HTML-table serialization of the SF-1449/SF-30 government contract form. The form HTML uses tables with repeated column headers ("Base Year", "Bottles", "$255.00", etc.) which doc2dict expanded into one massive body_direct value. Inside o=11 alone, "metoprolol succinate" appears 178× (vs 13× in source), "solicitation" 191× (vs 75×), "amendment" 155× (vs 25×).
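To put the repetition in perspective, the inflation factors implied by the counts above can be computed directly. This is a rough sketch: count_term is a hypothetical helper showing one way such counts could be taken (case-insensitive, non-overlapping), not the Inspector's actual method, and the figures are simply the ones quoted in this description.

```python
def count_term(text: str, term: str) -> int:
    """Count non-overlapping, case-insensitive occurrences of term in text."""
    return text.lower().count(term.lower())


def inflation(in_record: int, in_source: int) -> float:
    """How many times more often a term appears in the record than in the source."""
    return in_record / in_source


# (count inside o=11, count in source), as reported above.
reported = {
    "metoprolol succinate": (178, 13),
    "solicitation": (191, 75),
    "amendment": (155, 25),
}
for term, (in_record, in_source) in reported.items():
    print(f"{term}: {inflation(in_record, in_source):.1f}x")

# Share of the source-of-truth characters held by o=11 alone.
share = 135_609 / 141_805
print(f"o=11 share of source chars: {share:.1%}")

# Toy illustration of repeated table headers inflating a count.
print(count_term("Base Year | Base Year | Option Year", "base year"))
```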

The parser correctly preserves doc2dict's natural HTML grouping per the rubric. Re-slicing o=11 to dedupe table content would be synthetic restructuring — explicitly forbidden by level_rubric.md §"Common parser failure modes".

This is an inherent property of the source document (a table-heavy government contract form), not a parser defect. A future polish round may investigate whether doc2dict can be configured to skip repeated table cells.

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 15 --no-truncate --output-dir data/auto_parse exits 0 with ok 15
  • uv run scripts/level_loop/freeze.py 14 --force reports word_coverage ≥ 90% (92.4%)
  • uv run scripts/level_loop/regress.py reports all 15 frozen idxs OK
  • Inspector verified the secondary-agreement-carrier fix recovers NOVATION + SBSP from trailer
  • Inspector verified idx=0..13 byte-identical via shasum (no regression)
  • Inspector identified char_ratio root cause as doc2dict-input bloat, NOT parser-output duplication
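The byte-identical check on idx=0..13 could be reproduced along these lines. It is a sketch under assumptions: the idx_N.jsonl naming follows the frozen file listed in this PR, but the "before" and "after" directories and the function names are hypothetical. SHA-1 is used because that is the default digest of the shasum tool.

```python
import hashlib
from collections.abc import Iterable
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_baselines(before_dir: Path, after_dir: Path, indices: Iterable[int]) -> list[int]:
    """Return the indices whose frozen JSONL differs between the two directories."""
    changed = []
    for idx in indices:
        name = f"idx_{idx}.jsonl"
        if sha1_of(before_dir / name) != sha1_of(after_dir / name):
            changed.append(idx)
    return changed
```

An empty list from changed_baselines(before, after, range(14)) would correspond to the "no regression" result reported above.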

Source

http://www.sec.gov/Archives/edgar/data/1660719/000155837017004016/arlz-20170331ex101c3bb2d.htm

🤖 Generated with Claude Code


CodeAnt-AI Description

Keep attached plan sections in agreement scope so the full document is frozen

What Changed

  • Treats secondary attached sections as part of the main agreement instead of leaving them in the trailer
  • Recovers the missing subcontracting plan content and related nested sections for idx=14
  • Freezes the idx=14 baseline after the recovered records raise the document to the passing range

Impact

✅ Fewer missing contract sections
✅ More complete government contract captures
✅ Higher freeze success for complex attached plans


…Agent Agreement: keep secondary-agreement (AGREEMENT|PLAN) carriers in agreement scope so attached SUBCONTRACTING PLAN subtree doesn't strand in trailer

The Aralez Designation of Agent Agreement is a government contract that
references an embedded VA National Contract solicitation. After the main
agreement title + 7 numbered sections + signatures, the source contains
the full VA contract sections (B/C/D/E/F) including an attached
SMALL BUSINESS SUBCONTRACTING PLAN with its own nested 1–11 outline.

The depth-walker already treats "SMALL BUSINESS SUBCONTRACTING PLAN" as
a secondary L0 root (matches the AGREEMENT|PLAN structural pattern) and
gives its descendants subdoc_penalty=1. But the scope rule's
_is_real_subdoc_title only accepted cls in {exhibit, schedule, appendix,
annex}, so this carrier (cls=predicted header) failed the subdoc test
and the post-sig walk-up marked it as trailer — dropping ~50 records
of legitimate attached subdoc content and pulling reconstruction below
the 90% bar (85.9%).

Fix: extend real_subdoc_ids inside _apply_scope_rule to also include
structural secondary-agreement carriers — nodes whose own subdoc_penalty
is 0 but whose direct children carry subdoc_penalty>=1. The detection is
purely structural (subdoc_penalty arithmetic comes from walk_sections'
is_secondary_agreement branch, not phrase matching).

Cross-idx audit: only idx=14 has a secondary carrier currently in
trailer scope. idxs 1, 2, 3, 5, 13 also have secondary carriers but
they are all already in agreement scope — no behavior change for them.

Result: idx=14 reconstruction 85.9% → 92.4%, freeze passes the 90% bar,
all 15 idxs (0..14) regress OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai

coderabbitai Bot commented May 17, 2026


📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved document parsing accuracy by refining the logic for identifying primary content versus supplementary attachments, reducing misclassification errors in document categorization.
  • Chores

    • Updated internal state management records.

Walkthrough

The PR updates document parsing logic to better identify secondary-agreement carrier nodes in scope classification, ensuring post-signature content is not incorrectly labeled as trailer. It also records this parsing run milestone by freezing index 14 with a new state history entry.

Changes

Scope Classification and State Tracking

Layer / File(s) | Summary
Scope classification for secondary-agreement carriers (scripts/parse_doc2dict_with_config.py) | The _apply_scope_rule function adds structural detection for secondary-agreement carriers: nodes with zero subdoc_penalty whose direct children have penalties >= 1 (and not marked is_envelope) are now classified as real subdocuments, preventing post-signature content from being misclassified as trailer.
Freeze index 14 state tracking (data/auto_parse/level_freeze/state.json) | The frozen indices list is expanded to 13, 14 and a new history record captures the freeze action for index 14 at timestamp 2026-05-17T09:31:29 with 381 records processed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • arthrod/clause-extract#5: Extends the same freeze progression by advancing to idx=15 while also including the 14 index in the frozen set, building on this run's state checkpoint.

Suggested labels

Feat2


🐰 A logic leap for subdocs fair,
Now secondary carriers declare,
Scope dreams of real,
With penalty's seal—
Index fourteen's frozen with care!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately reflects the main change: adding idx=14 (Aralez Designation of Agent Agreement) as a frozen baseline, with specific details about the document type and record count.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the parser modification, the recovered secondary-agreement carriers, and the verified output metrics for idx=14.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates state tracking data and introduces logic to identify secondary-agreement carriers within the document parsing script. The review feedback highlights that the new logic is overly broad and contradicts documented scope rules by potentially protecting bare identifiers from being marked as out-of-scope. A code suggestion was provided to refine the identification criteria by excluding nodes that already belong to subdocument classes.

Comment on lines +603 to +608
            if (
                not r.get("is_envelope")
                and (r.get("subdoc_penalty", 0) or 0) == 0
                and children
                and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
            ):

high

The current logic for identifying secondary-agreement carriers is too broad and contradicts the documented scope rule in task_rules/scope_rule.md.

According to the scope rule, bare identifiers (e.g., "EXHIBIT A" with no descriptive text) are NOT real subdocuments and should be excluded from the JSONL if they appear after the signature block. However, because walk_sections increments the subdoc_penalty for any node with a subdoc class (line 954), a bare exhibit will satisfy the condition (penalty == 0 and children penalty > 0). This causes bare exhibits to be added to real_subdoc_ids, protecting them and their descendants from being marked as trailer scope.

To align with the intent of recovering only secondary agreements (as noted in your comment on lines 590-602), you should ensure the node does not belong to a subdoc class.

Suggested change
             if (
                 not r.get("is_envelope")
+                and r.get("cls") not in _SUBDOC_CLASSES
                 and (r.get("subdoc_penalty", 0) or 0) == 0
                 and children
                 and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
             ):
References
  1. A section is out of scope if it appears after the signature block and is not a descendant of a real subdocument. Real subdocuments must have descriptive text beyond a bare identifier.
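The difference between the two conditions can be shown on toy records. This sketch takes the reviewer's premise at face value (a bare exhibit node with penalty 0 and penalized children) and assumes the subdoc classes are the four named in the commit message; the predicate names and records are illustrative, not the parser's own.

```python
# Assumed to match the classes named in the commit message.
SUBDOC_CLASSES = {"exhibit", "schedule", "appendix", "annex"}


def penalty(record: dict) -> int:
    """Return subdoc_penalty as an int, treating a missing or None value as 0."""
    return record.get("subdoc_penalty", 0) or 0


def is_carrier_original(r: dict, children: list[dict]) -> bool:
    """Condition as merged in this PR."""
    return bool(
        not r.get("is_envelope")
        and penalty(r) == 0
        and children
        and any(penalty(c) > 0 for c in children)
    )


def is_carrier_refined(r: dict, children: list[dict]) -> bool:
    """Same condition, additionally excluding nodes that already have a subdoc class."""
    return r.get("cls") not in SUBDOC_CLASSES and is_carrier_original(r, children)


bare_exhibit = {"node_id": "ex_a", "cls": "exhibit", "subdoc_penalty": 0}
plan = {"node_id": "sbsp", "cls": "predicted header", "subdoc_penalty": 0}
penalized_children = [{"subdoc_penalty": 1}]

for node in (bare_exhibit, plan):
    print(
        node["node_id"],
        is_carrier_original(node, penalized_children),
        is_carrier_refined(node, penalized_children),
    )
```

Under these assumptions the plan carrier passes both predicates, while the bare exhibit passes only the original one, which is the case the review comment is concerned about.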


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 589-609: Replace the "else:" followed by "if" block with a single
"elif" condition to reduce indentation and improve readability while preserving
behavior: change the branch that checks r (the record), subdoc_penalty,
children, and the any(...) child check and then adds r["node_id"] to
real_subdoc_ids so the logic around the is_envelope check,
(r.get("subdoc_penalty", 0) or 0) == 0, children truthiness, and
any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children) remains identical;
update the control flow where the original else and nested if occur so only an
elif with the same combined condition is used.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cd20c1d9-5537-4817-9e06-113a6ad0e9e6

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac070d and 6313c14.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_14.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py

[warning] 589-603: Use elif instead of else then if, to reduce indentation

Convert to elif

(PLR5501)

🔇 Additional comments (1)
data/auto_parse/level_freeze/state.json (1)

17-18: LGTM!

Also applies to: 223-229

Comment on lines +589 to +609
        else:
            # Secondary-agreement carrier: a node whose own subdoc_penalty
            # is 0 but whose direct children carry subdoc_penalty>=1.
            # The depth-walker sets this when a non-subdoc-class section has
            # a title matching the AGREEMENT|PLAN structural-level-0 pattern
            # and the primary L0 has already been emitted (see walk_sections'
            # is_secondary_agreement branch). Structurally this IS an
            # attached subdocument (it has its own subtree with its own
            # depth penalty) even though doc2dict didn't tag its cls as
            # exhibit/schedule/appendix/annex. Treat it as a real subdoc
            # for scope purposes so the post-sig walk-up doesn't strand it
            # in trailer (and bring genuine attached subdoc content with
            # it). The check is purely structural: subdoc_penalty arithmetic
            # comes from the tree walk, not phrase matching.
            if (
                not r.get("is_envelope")
                and (r.get("subdoc_penalty", 0) or 0) == 0
                and children
                and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
            ):
                real_subdoc_ids.add(r["node_id"])

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider using elif to reduce indentation.

The static analysis tool correctly identifies that the else: if pattern can be simplified to elif. This improves readability without changing behavior.

♻️ Suggested refactor
-        else:
-            # Secondary-agreement carrier: a node whose own subdoc_penalty
-            # is 0 but whose direct children carry subdoc_penalty>=1.
-            # The depth-walker sets this when a non-subdoc-class section has
-            # a title matching the AGREEMENT|PLAN structural-level-0 pattern
-            # and the primary L0 has already been emitted (see walk_sections'
-            # is_secondary_agreement branch). Structurally this IS an
-            # attached subdocument (it has its own subtree with its own
-            # depth penalty) even though doc2dict didn't tag its cls as
-            # exhibit/schedule/appendix/annex. Treat it as a real subdoc
-            # for scope purposes so the post-sig walk-up doesn't strand it
-            # in trailer (and bring genuine attached subdoc content with
-            # it). The check is purely structural: subdoc_penalty arithmetic
-            # comes from the tree walk, not phrase matching.
-            if (
-                not r.get("is_envelope")
-                and (r.get("subdoc_penalty", 0) or 0) == 0
-                and children
-                and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
-            ):
-                real_subdoc_ids.add(r["node_id"])
+        # Secondary-agreement carrier: a node whose own subdoc_penalty
+        # is 0 but whose direct children carry subdoc_penalty>=1.
+        # The depth-walker sets this when a non-subdoc-class section has
+        # a title matching the AGREEMENT|PLAN structural-level-0 pattern
+        # and the primary L0 has already been emitted (see walk_sections'
+        # is_secondary_agreement branch). Structurally this IS an
+        # attached subdocument (it has its own subtree with its own
+        # depth penalty) even though doc2dict didn't tag its cls as
+        # exhibit/schedule/appendix/annex. Treat it as a real subdoc
+        # for scope purposes so the post-sig walk-up doesn't strand it
+        # in trailer (and bring genuine attached subdoc content with
+        # it). The check is purely structural: subdoc_penalty arithmetic
+        # comes from the tree walk, not phrase matching.
+        elif (
+            not r.get("is_envelope")
+            and (r.get("subdoc_penalty", 0) or 0) == 0
+            and children
+            and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
+        ):
+            real_subdoc_ids.add(r["node_id"])

{"idx": 14, "order": 46, "level": 2, "span": "(TAILORED). Except as otherwise provided by an express or implied warranty, the Contractor will not be liable in a breach of warranty action to the Government for consequential damages resulting from any defect or deficiencies in accepted items.\n\t\t"}
{"idx": 14, "order": 47, "level": 2, "span": "C.4  52.216-21 REQUIREMENTS (OCT 1995) (MAY 5, 2011 DEVIATION)"}
{"idx": 14, "order": 48, "level": 2, "span": "C.5  52.217-9 OPTION TO EXTEND THE TERM OF THE CONTRACT (MAR 2000)"}
{"idx": 14, "order": 49, "level": 2, "span": "http://www.acquisition.gov/far/index.htmI"}

Suggestion: The FAR reference URL has an OCR typo (index.htmI with uppercase I instead of l), which makes the link invalid for any downstream URL parsing or link-checking logic. Correct it to the canonical FAR URL so consumers can resolve it reliably. [logic error]

Severity Level: Major ⚠️
- ⚠️ FAR reference URL for idx=14 fails link validation.
- ⚠️ Downstream link-checking reports false negatives for FAR.
- ⚠️ Any FAR cross-link index omits this malformed entry.
Steps of Reproduction ✅
1. Load the frozen baseline record for idx=14 from
`data/auto_parse/level_freeze/frozen/idx_14.jsonl` and navigate to entry with `"order":
49`, where the span is `http://www.acquisition.gov/far/index.htmI`.

2. Run any downstream pipeline component that extracts URLs from `span` text (the same
extractor used for other frozen idx_* JSONL files to build outbound-link indexes or to
drive link-checking).

3. Observe that the extractor emits `http://www.acquisition.gov/far/index.htmI` as a URL;
because the path ends with `index.htmI` (uppercase `I` instead of lowercase `l`), an HTTP
client or link checker receives an HTTP 404 / DNS error rather than the intended FAR index
page.

4. Compare behavior with another frozen idx JSONL record that correctly uses
`http://www.acquisition.gov/far/index.html` and verify that consumers successfully resolve
and validate that canonical FAR URL, demonstrating that the OCR-typo variant breaks link
resolution for idx=14.
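A link-check pass could flag this kind of OCR confusion without touching the frozen spans. The following is a heuristic sketch only: the regular expression, the function name, and the two rules (uppercase letters in the host, a ".htmI" or ".htm1" extension) are assumptions for illustration, not part of the pipeline described in this PR.

```python
import re
from urllib.parse import urlsplit

# Rough URL matcher: scheme followed by anything up to whitespace or an angle bracket.
URL_RE = re.compile(r"https?://[^\s<>]+")


def suspicious_urls(span: str) -> list[str]:
    """Return URLs in span that look like OCR casualties (heuristic, may over- or under-flag)."""
    flagged = []
    for raw in URL_RE.findall(span):
        url = raw.rstrip(".,;)")                                # drop trailing sentence punctuation
        parts = urlsplit(url)
        host_has_upper = parts.netloc != parts.netloc.lower()   # hostnames are normally lowercase
        bad_extension = parts.path.endswith((".htmI", ".htm1"))
        if host_has_upper or bad_extension:
            flagged.append(url)
    return flagged


sample = (
    "See http://www.acquisition.gov/far/index.htmI and "
    "https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx. "
    "Compare http://www.va.gov/oal/business/nc/ppv.asp."
)
for url in suspicious_urls(sample):
    print(url)
```

Reporting such URLs rather than rewriting them keeps the frozen record faithful to the source text while still surfacing the broken links.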


{"idx": 14, "order": 97, "level": 1, "span": "(Refer to Schedule of Supplies for package size details and estimates)\nOffered prices shall include a [***] Cost Recovery Fee (See Scope of Contract, paragraph 12). The Government reserves the right not to award a contract on this solicitation should offered prices match or exceed current Federal Supply Schedule prices. Offers for pharmaceuticals sourced from countries not covered by the Trade Agreement Act (TAA) may be given consideration pursuant to Federal Acquisition Regulation (FAR) Part 25. Acknowledgement of Amendments. The following amendments are acknowledged as part of this solicitation. (Please complete if applicable)\nDate Acknowledged by Offeror\nAmendment Number\nThe System for Award Management (SAM) is an online system that replaces CCR/Fed Reg, ORCA, and EPLS.  Contractors should now go to www.sam.gov to find their information. Training tools are available on the SAM website at www.sam.gov for familiarization with the SAM system .  Prospective contractors shall maintain a current and accurate record in the SAM database. SAM updates are required, as necessary, but at least annually. (see 52.212-4(t) and 52.212-l(k)). Subcontracting Plan Requirements: Pursuant to the requirements of Public Law 95-507, all large business concerns are required to have an approved subcontracting plan for contracts valued over $700,000 before the Government can award a contract (see FAR 52.219-9 for details). Offerers must submit a currently approved commercial plan or a new plan for review and approval. Attachment \"D\" includes all of the elements required to be addressed and is included to facilitate the submission of a subcontracting plan. As prescribed in FAR Part 42.15, the VA evaluates contractor performance on all contracts that exceed $150,000, and shares those evaluations with other federal government agencies. 
The FAR requires that the contractor be provided an opportunity to comment on past performance evaluations prior to each report closing. To fulfill this requirement, VA will be using an online database, the Contractor Performance Assessment Reporting System (CPARS). Annual reporting of past performance will be completed at http://www.cpars.gov and uploaded to PPIRS (Past Performance Information Retrieval System)."}
{"idx": 14, "order": 98, "level": 1, "span": "1.  INTRODUCTION"}
{"idx": 14, "order": 99, "level": 1, "span": "2.   EXTENT OF OBLIGATION"}
{"idx": 14, "order": 100, "level": 4, "span": "Government Participants. The contractor shall provide the products specified in the schedule at the prices awarded herein for the facilities/agencies below:\n\t\t\nAll Department of Veterans Affairs (VA) facilities\nAll Ordering Activities under the Department of Defense (DOD) Pharmaceutical Prime Vendor Program\nIndian Health Service (IHS) facilities\nAll Bureau of Prisons (BOP) facilities\nFederal Health Care Center (FHCC)\nAll Option 2 State Veteran Homes (See paragraph 2.2 State Veteran Homes)\nA database of all facilities authorized to use the VA PPV Program may be downloaded from the National Acquisition Center's web site at http://www.va.gov/oal/business/nc/ppv.asp. The database identifies each state veteran home as option 1 or 2. A database for all facilities authorized to use the DOD PPV Program may be downloaded from the DOD's website at https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx."}

Suggestion: The DLA portal URL contains a hostname typo (`medicaI` with uppercase `I`), which creates an invalid domain and breaks hyperlink resolution. Normalize this to the correct host (`medical.dla.mil`) to prevent broken-link behavior in downstream consumers. [logic error]

Severity Level: Major ⚠️
- ⚠️ DOD PPV portal link for idx=14 cannot resolve.
- ⚠️ Automated link harvesters store an invalid DLA hostname.
- ⚠️ Any UI rendering hyperlinks shows a dead PPV overview link.
Steps of Reproduction ✅
1. Load the idx=14 frozen JSONL file at `data/auto_parse/level_freeze/frozen/idx_14.jsonl`
and locate the `"order": 100` record where the span text includes
`https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx`.

2. Run the same URL-extraction or hyperlink-enrichment stage that processes other contract
spans, which will emit this DOD PPV Program URL exactly as written in the span.

3. Feed the extracted URL into any HTTP client or automated link checker used by your
tooling; because the hostname contains `medicaI` (uppercase `I` instead of `l`), DNS
lookup or TLS handshake fails and the request cannot reach the intended `medical.dla.mil`
host.

4. Compare against another idx JSONL or external reference where the correct
`https://www.medical.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx` URL is used,
confirming that the typo uniquely breaks link resolution for this idx=14 baseline.
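
The reproduction steps above could be automated with a small check over the frozen spans. The following is a minimal sketch, not part of this PR: `find_mixed_case_hosts` and `URL_RE` are hypothetical names, and "an uppercase letter in the hostname" is assumed here as a cheap heuristic for the `I`/`l` confusion, since hostnames are case-insensitive and are conventionally written in lowercase.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper, not from the repository: a loose URL matcher
# that stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def find_mixed_case_hosts(span: str) -> list[str]:
    """Return URLs in `span` whose hostname contains an uppercase letter.

    Assumption: an uppercase letter in a hostname is a hint of an
    OCR-style confusion (e.g. `I` for `l`), not proof of one.
    """
    flagged = []
    for match in URL_RE.finditer(span):
        # Drop sentence punctuation that the loose regex swallowed.
        url = match.group(0).rstrip(".,;)")
        # `netloc` keeps the original casing; `hostname` would lowercase it.
        host = urlsplit(url).netloc
        if host != host.lower():
            flagged.append(url)
    return flagged


if __name__ == "__main__":
    sample = (
        "may be downloaded from the DOD's website at "
        "https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx."
    )
    for url in find_mixed_case_hosts(sample):
        print("suspicious hostname:", url)
```

The check only flags candidates for review; whether the frozen span should be corrected or should keep the source document's text verbatim is a separate decision.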

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** data/auto_parse/level_freeze/frozen/idx_14.jsonl
**Line:** 101:101
**Comment:**
Logic Error: The DLA portal URL contains a hostname typo (`medicaI` with uppercase `I`), which creates an invalid domain and breaks hyperlink resolution. Normalize this to the correct host (`medical.dla.mil`) to prevent broken-link behavior in downstream consumers.

Validate the correctness of the flagged issue. If it is correct, how can I resolve it? If you propose a fix, implement it and keep it concise.
Once the fix is implemented, also check the other comments on the same PR and ask the user whether they want the rest of the comments fixed as well. If yes, fetch all the comments, validate their correctness, and implement a minimal fix for each.

{"idx": 14, "order": 201, "level": 2, "span": "(b) Contractors shall ensure that the CAGE code is maintained throughout the life of the contract\nFor contractors registered in the System for Award Management (SAM), the DLA Contractor and Government Entity (CAGE) Branch shall only modify data received from SAM in the CAGE master file if the contractor initiates those changes via update of its SAM registration. Contractors undergoing a novation or change-of-name agreement shall notify the contracting officer in accordance with subpart 42.12. The contractor shall communicate any change to the CAGE code to the contracting officer within 30 days after the change, so that a modification can be issued to update the CAGE code on the contract."}
{"idx": 14, "order": 202, "level": 2, "span": "(c) Contractors located in the United States or its outlying areas that are not registered in SAM shall submit written change requests to the DLA Contractor and Government Entity (CAGE) Branch\nRequests for changes shall be provided on a DD Form 2051, Request for Assignment of a Commercial and Government Entity (CAGE) Code, to the address shown on the back of the DD Form 2051. Change requests to the CAGE master file are accepted from the entity identified by the code."}
{"idx": 14, "order": 203, "level": 2, "span": "(d) Contractors located outside the United States and its outlying areas that are not registered in SAM shall contact the appropriate National Codification Bureau or NSPA to request CAGE changes\nPoints of contact for National Codification Bureaus and NSPA, as well as additional information on obtaining NCAGE codes, are available at http://www.dlis.dla.mil/nato/ObtainCAGE.asp."}
{"idx": 14, "order": 204, "level": 2, "span": "(e) Additional guidance for maintaining CAGE codes is available at http://www.dlis.dla.mil/cage welcome.asp."}

Suggestion: This URL includes a literal space (`cage welcome.asp`) and trailing punctuation, which makes it non-parseable as a valid URL in strict parsers. Replace it with a valid encoded or canonical URL to avoid failures in URL extraction/validation steps. [logic error]

Severity Level: Major ⚠️
- ⚠️ CAGE guidance URL cannot be parsed by strict tools.
- ⚠️ Link-checking and harvesting miss this DLA reference.
- ⚠️ Any rendered hyperlink may truncate before `welcome.asp`.
Steps of Reproduction ✅
1. Open `data/auto_parse/level_freeze/frozen/idx_14.jsonl` and find the record with
`"order": 204`, where the span contains `http://www.dlis.dla.mil/cage welcome.asp`.

2. Run your standard URL extraction logic over this span; typical regex- or RFC-compliant
parsers will either stop at `http://www.dlis.dla.mil/cage` or reject the token entirely
because of the embedded space before `welcome.asp`.

3. Observe that any downstream link checker, documentation generator, or hyperlinking UI
either produces a truncated URL (`.../cage`) or omits this CAGE guidance link, since
`cage welcome.asp` is not a valid path segment unless the space is percent-encoded.

4. Compare with another reference (for example, from the same clause in a different idx
frozen file) that uses a canonical or percent-encoded URL (e.g.,
`http://www.dlis.dla.mil/cage_welcome.asp`), and confirm that only the idx=14 baseline
fails strict URL parsing due to the embedded whitespace.
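
A check for this second pattern could look like the sketch below. It is an illustration only: `find_split_urls` and `SPLIT_URL_RE` are hypothetical names, and the list of file extensions is an assumption chosen to cover the URLs quoted in this review. The sketch reports the two halves and deliberately does not guess the repaired URL, because the source does not confirm whether the original path used an underscore, `%20`, or something else.

```python
import re

# Hypothetical pattern, not from the repository: a URL, one space, then a
# token that looks like a file name. The extension list is an assumption.
SPLIT_URL_RE = re.compile(
    r"(https?://[^\s\"'<>]+) ([\w\-]+\.(?:aspx?|html?|pdf))\b"
)


def find_split_urls(span: str) -> list[tuple[str, str]]:
    """Return (url_head, dangling_file_name) pairs found in `span`."""
    return [(m.group(1), m.group(2)) for m in SPLIT_URL_RE.finditer(span)]


if __name__ == "__main__":
    sample = (
        "(e) Additional guidance for maintaining CAGE codes is available at "
        "http://www.dlis.dla.mil/cage welcome.asp."
    )
    for head, tail in find_split_urls(sample):
        print(f"possible split URL: {head!r} + {tail!r}")
```

Running both checks over every `span` in a frozen file before freezing would surface these two records without relying on a manual read of the diff.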

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** data/auto_parse/level_freeze/frozen/idx_14.jsonl
**Line:** 205:205
**Comment:**
Logic Error: This URL includes a literal space (`cage welcome.asp`) and trailing punctuation, which makes it non-parseable as a valid URL in strict parsers. Replace it with a valid encoded or canonical URL to avoid failures in URL extraction/validation steps.

Validate the correctness of the flagged issue. If it is correct, how can I resolve it? If you propose a fix, implement it and keep it concise.
Once the fix is implemented, also check the other comments on the same PR and ask the user whether they want the rest of the comments fixed as well. If yes, fetch all the comments, validate their correctness, and implement a minimal fix for each.

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

Labels

Feat2 size:L This PR changes 100-499 lines, ignoring generated files
