idx=14: freeze (381 records) — Aralez Designation of Agent Agreement (tri-party gov contract with embedded SF-1449) #87
arthrod wants to merge 1 commit into
…Agent Agreement: keep secondary-agreement (AGREEMENT|PLAN) carriers in agreement scope so attached SUBCONTRACTING PLAN subtree doesn't strand in trailer
The Aralez Designation of Agent Agreement is a government contract that
references an embedded VA National Contract solicitation. After the main
agreement title + 7 numbered sections + signatures, the source contains
the full VA contract sections (B/C/D/E/F) including an attached
SMALL BUSINESS SUBCONTRACTING PLAN with its own nested 1–11 outline.
The depth-walker already treats "SMALL BUSINESS SUBCONTRACTING PLAN" as
a secondary L0 root (matches the AGREEMENT|PLAN structural pattern) and
gives its descendants subdoc_penalty=1. But the scope rule's
_is_real_subdoc_title only accepted cls in {exhibit, schedule, appendix,
annex}, so this carrier (cls=predicted header) failed the subdoc test
and the post-sig walk-up marked it as trailer — dropping ~50 records
of legitimate attached subdoc content and pulling reconstruction below
the 90% bar (85.9%).
Fix: extend real_subdoc_ids inside _apply_scope_rule to also include
structural secondary-agreement carriers — nodes whose own subdoc_penalty
is 0 but whose direct children carry subdoc_penalty>=1. The detection is
purely structural (subdoc_penalty arithmetic comes from walk_sections'
is_secondary_agreement branch, not phrase matching).
Cross-idx audit: only idx=14 has a secondary carrier currently in
trailer scope. idxs 1, 2, 3, 5, 13 also have secondary carriers but
they are all already in agreement scope — no behavior change for them.
Result: idx=14 reconstruction 85.9% → 92.4%, freeze passes the 90% bar,
all 15 idxs (0..14) regress OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
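The structural test described in the commit message can be illustrated on a toy tree. This is a minimal sketch, not the parser's actual code: the `parent_id` field, the two helper names, and the sample records are assumptions made for illustration, while `node_id`, `subdoc_penalty`, and `is_envelope` are the field names quoted in this PR.

```python
from typing import Optional


def find_secondary_carriers(records: list[dict]) -> set[str]:
    """Return node_ids whose own subdoc_penalty is 0 but that have a
    direct child with subdoc_penalty >= 1 (hypothetical helper)."""
    # Group records by parent; `parent_id` is an assumed field name.
    children_of: dict[Optional[str], list[dict]] = {}
    for r in records:
        children_of.setdefault(r.get("parent_id"), []).append(r)

    carriers: set[str] = set()
    for r in records:
        kids = children_of.get(r["node_id"], [])
        if (
            not r.get("is_envelope")
            and (r.get("subdoc_penalty", 0) or 0) == 0
            and kids
            and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in kids)
        ):
            carriers.add(r["node_id"])
    return carriers


def scope_after_signature(
    node_id: str, parent_of: dict[str, Optional[str]], real_subdoc_ids: set[str]
) -> str:
    """Walk up the parent chain: a post-signature node stays in agreement
    scope only if it is, or descends from, a real subdocument."""
    cur: Optional[str] = node_id
    while cur is not None:
        if cur in real_subdoc_ids:
            return "agreement"
        cur = parent_of.get(cur)
    return "trailer"


# Toy post-signature records (illustrative only).
records = [
    {"node_id": "root", "parent_id": None, "subdoc_penalty": 0},
    {"node_id": "plan", "parent_id": None, "subdoc_penalty": 0},  # secondary L0 carrier
    {"node_id": "plan.1", "parent_id": "plan", "subdoc_penalty": 1},
    {"node_id": "footer", "parent_id": None, "subdoc_penalty": 0},
]
parent_of = {r["node_id"]: r["parent_id"] for r in records}
carriers = find_secondary_carriers(records)
```

With these toy records only `plan` qualifies as a carrier, so `plan` and `plan.1` resolve to agreement scope while `footer` falls through to trailer.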
📝 Walkthrough
The PR updates document parsing logic to better identify secondary-agreement carrier nodes in scope classification, ensuring post-signature content is not incorrectly labeled as trailer. It also records this parsing run milestone by freezing index 14 with a new state history entry.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Code Review
This pull request updates state tracking data and introduces logic to identify secondary-agreement carriers within the document parsing script. The review feedback highlights that the new logic is overly broad and contradicts documented scope rules by potentially protecting bare identifiers from being marked as out-of-scope. A code suggestion was provided to refine the identification criteria by excluding nodes that already belong to subdocument classes.
if (
    not r.get("is_envelope")
    and (r.get("subdoc_penalty", 0) or 0) == 0
    and children
    and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
):
The current logic for identifying secondary-agreement carriers is too broad and contradicts the documented scope rule in task_rules/scope_rule.md.
According to the scope rule, bare identifiers (e.g., "EXHIBIT A" with no descriptive text) are NOT real subdocuments and should be excluded from the JSONL if they appear after the signature block. However, because walk_sections increments the subdoc_penalty for any node with a subdoc class (line 954), a bare exhibit will satisfy the condition (penalty == 0 and children penalty > 0). This causes bare exhibits to be added to real_subdoc_ids, protecting them and their descendants from being marked as trailer scope.
To align with the intent of recovering only secondary agreements (as noted in your comment on lines 590-602), you should ensure the node does not belong to a subdoc class.
- if (
-     not r.get("is_envelope")
-     and (r.get("subdoc_penalty", 0) or 0) == 0
-     and children
-     and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
- ):
+ if (
+     not r.get("is_envelope")
+     and r.get("cls") not in _SUBDOC_CLASSES
+     and (r.get("subdoc_penalty", 0) or 0) == 0
+     and children
+     and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
+ ):
References
- A section is out of scope if it appears after the signature block and is not a descendant of a real subdocument. Real subdocuments must have descriptive text beyond a bare identifier.
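The difference between the two predicates in this review can be checked on toy records. This is a sketch under stated assumptions: `_SUBDOC_CLASSES` is taken to hold the four classes named in the commit message (its real definition is not shown here), and the bare-exhibit record simply encodes the reviewer's claim that such a node ends up with penalty 0 and penalized children; `walk_sections` itself is not reproduced.

```python
# Assumed contents, based on the classes listed in the commit message.
_SUBDOC_CLASSES = frozenset({"exhibit", "schedule", "appendix", "annex"})


def _penalty(rec: dict) -> int:
    """Read subdoc_penalty, treating a missing or None value as 0."""
    return rec.get("subdoc_penalty", 0) or 0


def is_carrier_broad(r: dict, children: list[dict]) -> bool:
    """Predicate as written in the PR."""
    return bool(
        not r.get("is_envelope")
        and _penalty(r) == 0
        and children
        and any(_penalty(c) > 0 for c in children)
    )


def is_carrier_guarded(r: dict, children: list[dict]) -> bool:
    """Predicate with the reviewer's extra class guard."""
    return r.get("cls") not in _SUBDOC_CLASSES and is_carrier_broad(r, children)


# Illustrative records only.
bare_exhibit = {"cls": "exhibit", "subdoc_penalty": 0}
plan_header = {"cls": "predicted header", "subdoc_penalty": 0}
kids = [{"subdoc_penalty": 1}]
```

Under these assumptions the broad predicate accepts both records, while the guarded one accepts only the plan header, which is the narrowing the suggestion aims for.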
Actionable comments posted: 1
📒 Files selected for processing (3)
- data/auto_parse/level_freeze/frozen/idx_14.jsonl
- data/auto_parse/level_freeze/state.json
- scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run `<cli> --help`, assert exit code 0. Fail if smoke test fails.
Run `uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q` for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run `uv run ruff check . --diff` for Python linting. Fail if exit code is non-zero and list each violation.
Run `uv run ruff format --check --diff .` for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run `uv run ruff check --select I,F401 .` to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: `uv run pytest --tb=line -q` on origin/main to capture baseline pass/fail counts, and `uv run pytest --tb=short -q` on PR branch. Fail immediately if exit code is non-zero.
Run `uv run typy check` for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare `type: ignore` comments (without error codes) in Python files and `cast()` calls without explanatory comments. Warn for each. Fail if bare `type: ignore` count > 3.
Files:
scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (Custom checks)
For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.
Files:
scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py
[warning] 589-603: Use elif instead of else then if, to reduce indentation
Convert to elif
(PLR5501)
🔇 Additional comments (1)
data/auto_parse/level_freeze/state.json (1)
17-18: LGTM! Also applies to: 223-229
else:
    # Secondary-agreement carrier: a node whose own subdoc_penalty
    # is 0 but whose direct children carry subdoc_penalty>=1.
    # The depth-walker sets this when a non-subdoc-class section has
    # a title matching the AGREEMENT|PLAN structural-level-0 pattern
    # and the primary L0 has already been emitted (see walk_sections'
    # is_secondary_agreement branch). Structurally this IS an
    # attached subdocument (it has its own subtree with its own
    # depth penalty) even though doc2dict didn't tag its cls as
    # exhibit/schedule/appendix/annex. Treat it as a real subdoc
    # for scope purposes so the post-sig walk-up doesn't strand it
    # in trailer (and bring genuine attached subdoc content with
    # it). The check is purely structural: subdoc_penalty arithmetic
    # comes from the tree walk, not phrase matching.
    if (
        not r.get("is_envelope")
        and (r.get("subdoc_penalty", 0) or 0) == 0
        and children
        and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
    ):
        real_subdoc_ids.add(r["node_id"])
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider using elif to reduce indentation.
The static analysis tool correctly identifies that the else: if pattern can be simplified to elif. This improves readability without changing behavior.
♻️ Suggested refactor
- else:
- # Secondary-agreement carrier: a node whose own subdoc_penalty
- # is 0 but whose direct children carry subdoc_penalty>=1.
- # The depth-walker sets this when a non-subdoc-class section has
- # a title matching the AGREEMENT|PLAN structural-level-0 pattern
- # and the primary L0 has already been emitted (see walk_sections'
- # is_secondary_agreement branch). Structurally this IS an
- # attached subdocument (it has its own subtree with its own
- # depth penalty) even though doc2dict didn't tag its cls as
- # exhibit/schedule/appendix/annex. Treat it as a real subdoc
- # for scope purposes so the post-sig walk-up doesn't strand it
- # in trailer (and bring genuine attached subdoc content with
- # it). The check is purely structural: subdoc_penalty arithmetic
- # comes from the tree walk, not phrase matching.
- if (
- not r.get("is_envelope")
- and (r.get("subdoc_penalty", 0) or 0) == 0
- and children
- and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
- ):
- real_subdoc_ids.add(r["node_id"])
+ # Secondary-agreement carrier: a node whose own subdoc_penalty
+ # is 0 but whose direct children carry subdoc_penalty>=1.
+ # The depth-walker sets this when a non-subdoc-class section has
+ # a title matching the AGREEMENT|PLAN structural-level-0 pattern
+ # and the primary L0 has already been emitted (see walk_sections'
+ # is_secondary_agreement branch). Structurally this IS an
+ # attached subdocument (it has its own subtree with its own
+ # depth penalty) even though doc2dict didn't tag its cls as
+ # exhibit/schedule/appendix/annex. Treat it as a real subdoc
+ # for scope purposes so the post-sig walk-up doesn't strand it
+ # in trailer (and bring genuine attached subdoc content with
+ # it). The check is purely structural: subdoc_penalty arithmetic
+ # comes from the tree walk, not phrase matching.
+ elif (
+ not r.get("is_envelope")
+ and (r.get("subdoc_penalty", 0) or 0) == 0
+ and children
+ and any((c.get("subdoc_penalty", 0) or 0) > 0 for c in children)
+ ):
+ real_subdoc_ids.add(r["node_id"])
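That the `else:` plus nested `if` form and the `elif` form behave identically can be checked exhaustively on a reduced model. The sketch below is illustrative only: the leading `if` condition is not shown in this diff, so it is modeled here as an assumed boolean `is_real_title`, and the carrier test is collapsed into a second boolean.

```python
from itertools import product


def nested(is_real_title: bool, is_carrier: bool) -> list[str]:
    """Original shape: else followed by a nested if."""
    added: list[str] = []
    if is_real_title:
        added.append("title")
    else:
        if is_carrier:
            added.append("carrier")
    return added


def flat(is_real_title: bool, is_carrier: bool) -> list[str]:
    """Refactored shape: a single elif with the same condition."""
    added: list[str] = []
    if is_real_title:
        added.append("title")
    elif is_carrier:
        added.append("carrier")
    return added


# Compare both shapes over every combination of the two booleans.
equivalent = all(
    nested(a, b) == flat(a, b) for a, b in product([False, True], repeat=2)
)
```

The equivalence holds only because the `else` branch contains nothing besides the nested `if`; any other statement in that branch would make the refactor change behavior.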
{"idx": 14, "order": 46, "level": 2, "span": "(TAILORED). Except as otherwise provided by an express or implied warranty, the Contractor will not be liable in a breach of warranty action to the Government for consequential damages resulting from any defect or deficiencies in accepted items.\n\t\t"}
{"idx": 14, "order": 47, "level": 2, "span": "C.4 52.216-21 REQUIREMENTS (OCT 1995) (MAY 5, 2011 DEVIATION)"}
{"idx": 14, "order": 48, "level": 2, "span": "C.5 52.217-9 OPTION TO EXTEND THE TERM OF THE CONTRACT (MAR 2000)"}
{"idx": 14, "order": 49, "level": 2, "span": "http://www.acquisition.gov/far/index.htmI"}
Suggestion: The FAR reference URL has an OCR typo (index.htmI with uppercase I instead of l), which makes the link invalid for any downstream URL parsing or link-checking logic. Correct it to the canonical FAR URL so consumers can resolve it reliably. [logic error]
Severity Level: Major ⚠️
- ⚠️ FAR reference URL for idx=14 fails link validation.
- ⚠️ Downstream link-checking reports false negatives for FAR.
- ⚠️ Any FAR cross-link index omits this malformed entry.
Steps of Reproduction ✅
1. Load the frozen baseline record for idx=14 from
`data/auto_parse/level_freeze/frozen/idx_14.jsonl` and navigate to entry with `"order":
49`, where the span is `http://www.acquisition.gov/far/index.htmI`.
2. Run any downstream pipeline component that extracts URLs from `span` text (the same
extractor used for other frozen idx_* JSONL files to build outbound-link indexes or to
drive link-checking).
3. Observe that the extractor emits `http://www.acquisition.gov/far/index.htmI` as a URL;
because the path ends with `index.htmI` (uppercase `I` instead of lowercase `l`), an HTTP
client or link checker receives an HTTP 404 / DNS error rather than the intended FAR index
page.
4. Compare behavior with another frozen idx JSONL record that correctly uses
`http://www.acquisition.gov/far/index.html` and verify that consumers successfully resolve
and validate that canonical FAR URL, demonstrating that the OCR-typo variant breaks link
resolution for idx=14.
{"idx": 14, "order": 97, "level": 1, "span": "(Refer to Schedule of Supplies for package size details and estimates)\nOffered prices shall include a [***] Cost Recovery Fee (See Scope of Contract, paragraph 12). The Government reserves the right not to award a contract on this solicitation should offered prices match or exceed current Federal Supply Schedule prices. Offers for pharmaceuticals sourced from countries not covered by the Trade Agreement Act (TAA) may be given consideration pursuant to Federal Acquisition Regulation (FAR) Part 25. Acknowledgement of Amendments. The following amendments are acknowledged as part of this solicitation. (Please complete if applicable)\nDate Acknowledged by Offeror\nAmendment Number\nThe System for Award Management (SAM) is an online system that replaces CCR/Fed Reg, ORCA, and EPLS. Contractors should now go to www.sam.gov to find their information. Training tools are available on the SAM website at www.sam.gov for familiarization with the SAM system . Prospective contractors shall maintain a current and accurate record in the SAM database. SAM updates are required, as necessary, but at least annually. (see 52.212-4(t) and 52.212-l(k)). Subcontracting Plan Requirements: Pursuant to the requirements of Public Law 95-507, all large business concerns are required to have an approved subcontracting plan for contracts valued over $700,000 before the Government can award a contract (see FAR 52.219-9 for details). Offerers must submit a currently approved commercial plan or a new plan for review and approval. Attachment \"D\" includes all of the elements required to be addressed and is included to facilitate the submission of a subcontracting plan. 
As prescribed in FAR Part 42.15, the VA evaluates contractor performance on all contracts that exceed $150,000, and shares those evaluations with other federal government agencies. The FAR requires that the contractor be provided an opportunity to comment on past performance evaluations prior to each report closing. To fulfill this requirement, VA will be using an online database, the Contractor Performance Assessment Reporting System (CPARS). Annual reporting of past performance will be completed at http://www.cpars.gov and uploaded to PPIRS (Past Performance Information Retrieval System)."}
{"idx": 14, "order": 98, "level": 1, "span": "1. INTRODUCTION"}
{"idx": 14, "order": 99, "level": 1, "span": "2. EXTENT OF OBLIGATION"}
{"idx": 14, "order": 100, "level": 4, "span": "Government Participants. The contractor shall provide the products specified in the schedule at the prices awarded herein for the facilities/agencies below:\n\t\t\nAll Department of Veterans Affairs (VA) facilities\nAll Ordering Activities under the Department of Defense (DOD) Pharmaceutical Prime Vendor Program\nIndian Health Service (IHS) facilities\nAll Bureau of Prisons (BOP) facilities\nFederal Health Care Center (FHCC)\nAll Option 2 State Veteran Homes (See paragraph 2.2 State Veteran Homes)\nA database of all facilities authorized to use the VA PPV Program may be downloaded from the National Acquisition Center's web site at http://www.va.gov/oal/business/nc/ppv.asp. The database identifies each state veteran home as option 1 or 2. A database for all facilities authorized to use the DOD PPV Program may be downloaded from the DOD's website at https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx."}
Suggestion: The DLA portal URL contains a hostname typo (medicaI with uppercase I), which creates an invalid domain and breaks hyperlink resolution. Normalize this to the correct host (medical.dla.mil) to prevent broken-link behavior in downstream consumers. [logic error]
Severity Level: Major ⚠️
- ⚠️ DOD PPV portal link for idx=14 cannot resolve.
- ⚠️ Automated link harvesters store an invalid DLA hostname.
- ⚠️ Any UI rendering hyperlinks shows a dead PPV overview link.
Steps of Reproduction ✅
1. Load the idx=14 frozen JSONL file at `data/auto_parse/level_freeze/frozen/idx_14.jsonl`
and locate the `"order": 100` record where the span text includes
`https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx`.
2. Run the same URL-extraction or hyperlink-enrichment stage that processes other contract
spans, which will emit this DOD PPV Program URL exactly as written in the span.
3. Feed the extracted URL into any HTTP client or automated link checker used by your
tooling; because the hostname contains `medicaI` (uppercase `I` instead of `l`), DNS
lookup or TLS handshake fails and the request cannot reach the intended `medical.dla.mil`
host.
4. Compare against another idx JSONL or external reference where the correct
`https://www.medical.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx` URL is used,
confirming that the typo uniquely breaks link resolution for this idx=14 baseline.
{"idx": 14, "order": 201, "level": 2, "span": "(b) Contractors shall ensure that the CAGE code is maintained throughout the life of the contract\nFor contractors registered in the System for Award Management (SAM), the DLA Contractor and Government Entity (CAGE) Branch shall only modify data received from SAM in the CAGE master file if the contractor initiates those changes via update of its SAM registration. Contractors undergoing a novation or change-of-name agreement shall notify the contracting officer in accordance with subpart 42.12. The contractor shall communicate any change to the CAGE code to the contracting officer within 30 days after the change, so that a modification can be issued to update the CAGE code on the contract."}
{"idx": 14, "order": 202, "level": 2, "span": "(c) Contractors located in the United States or its outlying areas that are not registered in SAM shall submit written change requests to the DLA Contractor and Government Entity (CAGE) Branch\nRequests for changes shall be provided on a DD Form 2051, Request for Assignment of a Commercial and Government Entity (CAGE) Code, to the address shown on the back of the DD Form 2051. Change requests to the CAGE master file are accepted from the entity identified by the code."}
{"idx": 14, "order": 203, "level": 2, "span": "(d) Contractors located outside the United States and its outlying areas that are not registered in SAM shall contact the appropriate National Codification Bureau or NSPA to request CAGE changes\nPoints of contact for National Codification Bureaus and NSPA, as well as additional information on obtaining NCAGE codes, are available at http://www.dlis.dla.mil/nato/ObtainCAGE.asp."}
{"idx": 14, "order": 204, "level": 2, "span": "(e) Additional guidance for maintaining CAGE codes is available at http://www.dlis.dla.mil/cage welcome.asp."}
Suggestion: This URL includes a literal space (cage welcome.asp) and trailing punctuation, which makes it non-parseable as a valid URL in strict parsers. Replace it with a valid encoded or canonical URL to avoid failures in URL extraction/validation steps. [logic error]
Severity Level: Major ⚠️
- ⚠️ CAGE guidance URL cannot be parsed by strict tools.
- ⚠️ Link-checking and harvesting miss this DLA reference.
- ⚠️ Any rendered hyperlink may truncate before `welcome.asp`.
Steps of Reproduction ✅
1. Open `data/auto_parse/level_freeze/frozen/idx_14.jsonl` and find the record with
`"order": 204`, where the span contains `http://www.dlis.dla.mil/cage welcome.asp`.
2. Run your standard URL extraction logic over this span; typical regex- or RFC-compliant
parsers will either stop at `http://www.dlis.dla.mil/cage` or reject the token entirely
because of the embedded space before `welcome.asp`.
3. Observe that any downstream link checker, documentation generator, or hyperlinking UI
either produces a truncated URL (`.../cage`) or omits this CAGE guidance link, since `cage
welcome.asp` is not a valid path fragment without encoding.
4. Compare with another reference (for example, from the same clause in a different idx
frozen file) that uses a canonical or percent-encoded URL (e.g.,
`http://www.dlis.dla.mil/cage_welcome.asp`), and confirm that only the idx=14 baseline
fails strict URL parsing due to the embedded whitespace.
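The three URL findings above share a pattern that a read-only scan over `span` text could surface without editing the frozen baseline. The following is a heuristic sketch, not part of the repository: `suspect_urls`, both regular expressions, and the reason strings are hypothetical, and the checks (uppercase letters in the host, a mixed-case file extension, a filename fragment separated from a URL by whitespace) can produce false positives on legitimate text.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://\S+")
# A URL followed by whitespace and a bare filename hints at a space inside the path.
SPLIT_TAIL_RE = re.compile(r"(https?://\S+)\s+(\w+\.(?:asp|aspx|htm|html))\b")


def suspect_urls(span: str) -> list[tuple[str, str]]:
    """Return (url, reason) pairs for URLs that look OCR-damaged (heuristic)."""
    findings: list[tuple[str, str]] = []
    for m in URL_RE.finditer(span):
        url = m.group(0).rstrip(".,;)")  # drop sentence punctuation
        parts = urlsplit(url)
        # netloc keeps the original case, unlike the lowercased .hostname.
        if parts.netloc != parts.netloc.lower():
            findings.append((url, "uppercase letter in host"))
        last_segment = parts.path.rsplit("/", 1)[-1]
        if "." in last_segment:
            ext = last_segment.rsplit(".", 1)[-1]
            if ext != ext.lower():
                findings.append((url, "mixed-case file extension"))
    for m in SPLIT_TAIL_RE.finditer(span):
        findings.append((f"{m.group(1)} {m.group(2)}", "whitespace inside URL path"))
    return findings


far = suspect_urls("http://www.acquisition.gov/far/index.htmI")
dla = suspect_urls(
    "website at https://www.medicaI.dla.mil/Portal/PrimeVendor/PvPharm/PharmPvOverview.aspx."
)
cage = suspect_urls("available at http://www.dlis.dla.mil/cage welcome.asp.")
```

Reporting rather than rewriting keeps the scan independent of the question of whether a frozen baseline should mirror the source text character for character.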
User description
Summary
Fifteenth stacked PR. Adds idx=14 (DESIGNATION OF AGENT AGREEMENT between Aralez Pharmaceuticals US Inc., AstraZeneca Pharmaceuticals LP, and the US Government via VA, February 23, 2017) as the fifteenth verified frozen baseline on top of idx=13 (PR #86).
This is the corpus's first tri-party government contract with a complex embedded subdocument structure: the small Designation of Agent Agreement itself is followed by the entire VA National Contract solicitation/SF-30 modification (~145K chars) containing a NOVATION AGREEMENT, multiple SECTIONS (B/C/D with 12 FAR/VAAR clauses + 4 attachments), and a multi-page SMALL BUSINESS SUBCONTRACTING PLAN.
Parser change (1 surgical, shape-driven)
Extended `real_subdoc_ids` in `_apply_scope_rule` (~lines 589-608) to also include secondary-agreement carriers: nodes with `subdoc_penalty=0` themselves whose direct children carry `subdoc_penalty>=1`. The depth-walker sets this when a non-subdoc-class section's title matches the `AGREEMENT|PLAN` structural-level-0 pattern AND the primary L0 already exists. Recovers secondary agreements (NOVATION AGREEMENT at o=75, SMALL BUSINESS SUBCONTRACTING PLAN at o=330) that were previously stranded in `scope=trailer`.
Pure structural: uses `subdoc_penalty` arithmetic from the tree walk. No phrase matching.
Cross-idx audit: idxs 1, 2, 3, 5, 13 also have secondary carriers but already in `scope=agreement` (their carriers were trivially classified). Only idx=14 has a carrier currently in trailer that flips to agreement.
Verified output for idx=14
`{L0:1, L1:107, L2:110, L3:107, L4:45, L5:11}` (max depth 5, ≤7 ceiling)
Top structure
word_coverage 92.4% passes the 90% blocking gate. char_ratio is informational per `freeze_command.md`.
The 170.2% char_ratio is a 70-point outlier vs the 14 prior baselines (80.9%–99.9%). Inspector identified the root cause:
NOT records duplicating each other. Zero exact duplicates, zero substring containments via fingerprint check.
Root cause: o=11 alone holds 135,609 normalized chars (95.6% of the source-of-truth's 141,805 chars), which is doc2dict's flattened HTML-table serialization of the SF-1449/SF-30 government contract form. The form HTML uses tables with repeated column headers ("Base Year", "Bottles", "$255.00", etc.) which doc2dict expanded into one massive `body_direct` value. Inside o=11 alone, "metoprolol succinate" appears 178× (vs 13× in source), "solicitation" 191× (vs 75×), "amendment" 155× (vs 25×).
The parser correctly preserves doc2dict's natural HTML grouping per the rubric. Re-slicing o=11 to dedupe table content would be synthetic restructuring, which is explicitly forbidden by `level_rubric.md` §"Common parser failure modes".
This is an inherent property of the source document (table-heavy government contract form), not a parser defect. Future polish round may investigate whether doc2dict can be configured to skip repeated table cells.
Test plan
- `uv run scripts/parse_doc2dict_with_config.py --limit 15 --no-truncate --output-dir data/auto_parse` exits 0 with `ok 15`
- `uv run scripts/level_loop/freeze.py 14 --force` reports word_coverage ≥ 90% (92.4%)
- `uv run scripts/level_loop/regress.py` reports all 15 frozen idxs OK
Source
http://www.sec.gov/Archives/edgar/data/1660719/000155837017004016/arlz-20170331ex101c3bb2d.htm
🤖 Generated with Claude Code
CodeAnt-AI Description
Keep attached plan sections in agreement scope so the full document is frozen
What Changed
Impact
✅ Fewer missing contract sections
✅ More complete government contract captures
✅ Higher freeze success for complex attached plans