Skip to content

Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests.#53

Open
aniket866 wants to merge 5 commits intoAOSSIE-Org:mainfrom
aniket866:chain
Open

Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests.#53
aniket866 wants to merge 5 commits intoAOSSIE-Org:mainfrom
aniket866:chain

Conversation

@aniket866
Copy link
Contributor

@aniket866 aniket866 commented Mar 9, 2026

Addressed Issues:

Closes #35

Step 1: Download dump

Windows / Linux / Mac

uv run python scripts/download_dump.py --wiki simplewiki --date 20260201

Step 2: First preprocessing

Windows / Linux / Mac

uv run python -m openverifiablellm.utils simplewiki-20260201-pages-articles.xml.bz2

Step 3: Backup manifest

Windows (CMD / PowerShell)

copy data\dataset_manifest.json data\manifest_v1.json

Linux / Mac

cp data/dataset_manifest.json data/manifest_v1.json

Step 4: Second preprocessing (creates chain)

Windows / Linux / Mac

uv run python -m openverifiablellm.utils simplewiki-20260201-pages-articles.xml.bz2

Step 5: Check manifest has parent hash

Windows / Linux / Mac

uv run python -c "import json; m = json.load(open('data/dataset_manifest.json')); print('parent_manifest_hash:', m.get('parent_manifest_hash')[:16] + '...')"

Step 6: Verify without chain (should PASS)

Windows / Linux / Mac

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2

Step 7: Verify WITH chain (should PASS - chain is valid)

Windows

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 --previous-manifest data/manifest_v1.json

Linux / Mac

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \\
--previous-manifest data/manifest_v1.json

Step 8: Tamper the manifest

Windows / Linux / Mac

uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); m['tampered'] = True; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print(' Tampered')"

Step 9: Verify with tampered manifest (should FAIL - tampering detected!)

Windows

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 --previous-manifest data/manifest_v1.json

Linux / Mac

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \\
--previous-manifest data/manifest_v1.json

Step 10: Restore manifest

Windows / Linux / Mac

uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); del m['tampered']; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print(' Restored')"

Step 11: Verify restored (should PASS again)

Windows

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 --previous-manifest data/manifest_v1.json

Linux / Mac

uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \\
--previous-manifest data/manifest_v1.json

If all steps above succeed, the feature has been successfully validated.

Screenshots/Recordings:

image

Additional Notes:

Before Tampering verifcation passes

image

After Tampering Verification Fails:

image

Purpose

This PR introduces cryptographically linked dataset manifests to detect tampering in preprocessing outputs.

Each new manifest stores the SHA256 hash of the previous manifest using the field:

parent_manifest_hash

If any earlier manifest is modified, the stored hash will no longer match and verification will fail.


Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Tamper-evident manifest chain: cryptographic, canonical-hash linkage between sequential dataset manifests.
    • CLI accepts an optional --previous-manifest for explicit chain verification; verification reports include the provided path and chain check outcomes.
    • Generated manifests now record a parent_manifest_hash to preserve provenance.
  • Tests

    • Comprehensive test suite covering chain creation, verification, tamper detection, and backward compatibility.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a tamper-evident manifest chain: new manifest_chain.py implements canonical JSON + SHA-256 hashing and verification; generate_manifest() writes parent_manifest_hash; verification flow and CLI accept/verify previous manifests; comprehensive tests added to validate chain integrity and tamper detection.

Changes

Cohort / File(s) Summary
Manifest Chain Module
openverifiablellm/manifest_chain.py
New module providing _canonical_json, compute_manifest_hash, get_parent_manifest_hash, verify_manifest_chain_link, and verify_manifest_chain with parsing, canonicalization, SHA-256 hashing, structured results, and error handling.
Manifest Generation & Utils
openverifiablellm/utils.py
generate_manifest() now computes and embeds parent_manifest_hash (reads prior dataset_manifest.json) and logs it; export_merkle_proof() signature extended with output_path; minor formatting/logging tweaks.
Verification Integration
openverifiablellm/verify.py
Imports manifest_chain utilities, extends VerificationReport with previous_manifest_path, updates verify_preprocessing() and CLI (--previous-manifest) to perform link or awareness checks and record manifest_chain_link / manifest_chain_awareness results.
Chain Verification Tests
tests/test_manifest_chain.py
New tests covering canonical JSON, hash computation, parent-hash retrieval, single-link and chain verification across manifests, tamper scenarios, error cases, and backward compatibility for manifests without parent_manifest_hash.

Sequence Diagram

sequenceDiagram
    participant User
    participant Preprocessing
    participant ManifestChain
    participant FileSystem
    participant Verification

    User->>Preprocessing: run preprocessing
    Preprocessing->>FileSystem: read existing `dataset_manifest.json`?
    FileSystem-->>Preprocessing: manifest JSON or not found
    Preprocessing->>ManifestChain: compute_manifest_hash(previous_manifest)
    ManifestChain->>FileSystem: read previous manifest (if present)
    FileSystem-->>ManifestChain: previous manifest JSON
    ManifestChain->>ManifestChain: canonicalize JSON -> SHA-256
    ManifestChain-->>Preprocessing: parent_manifest_hash
    Preprocessing->>FileSystem: write new `dataset_manifest.json` (includes parent_manifest_hash)

    User->>Verification: run verification (--previous-manifest optional)
    Verification->>ManifestChain: verify_manifest_chain or verify_manifest_chain_link
    ManifestChain->>FileSystem: read current and previous manifests as needed
    FileSystem-->>ManifestChain: manifest JSONs
    ManifestChain->>ManifestChain: compute previous hash, compare to stored parent_manifest_hash
    ManifestChain-->>Verification: chain valid / chain tampered / report details
    Verification-->>User: verification report (includes chain check)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I hop through JSON keys and nibble every line,

I stamp the parent hash so histories align.
A crunchy SHA-256, a carrot for the trail,
Break one little link and the whole chain tells the tale. 🥕🔗

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main change: implementing manifest chain tamper detection using cryptographic linking of dataset manifests.
Linked Issues check ✅ Passed The PR fully implements all requirements from issue #35: adds parent_manifest_hash field to manifests [#35], updates generate_manifest to write the field [#35], implements verify_manifest_chain functions for chain verification [#35], and maintains backward compatibility with existing tests [#35].
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #35: new manifest_chain module, updates to utils.py for manifest generation, updates to verify.py for chain verification, and comprehensive tests for the feature.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The hashing routine currently removes
"parent_manifest_hash" before canonicalizing and hashing; include the full
manifest (do not pop "parent_manifest_hash") so the manifest hash covers the
parent link (adjust the code around the variable hashable and the call to
_canonical_json to stop removing that key), and update the corresponding
assertion in tests/test_manifest_chain.py to expect that changing a manifest's
parent_manifest_hash will break/alter the child's verification (flip the test
assertion accordingly).
- Around line 253-266: The branch that handles an explicit
previous_manifest_path calls verify_manifest_chain_link(previous_manifest_path,
manifest_data) but lets exceptions bubble up; change verify_manifest_chain() to
catch exceptions from verify_manifest_chain_link and return a structured report
instead of raising: call verify_manifest_chain_link in a try/except, set
chain_valid to False on exception, set chain_message to a clear message
including the exception text (e.g., "Chain link broken — could not read or
verify previous manifest: <error>"), and still return has_parent_hash_field and
parent_hash_value alongside chain_valid and chain_message so this exit path
matches the other structured-report returns.

In `@openverifiablellm/verify.py`:
- Around line 267-300: The branch currently treats a provided but missing
previous_manifest_path as if no previous was requested; change the logic so that
when previous_manifest_path is not None but .exists() is False you immediately
add a CheckResult named "manifest_chain_link" with status CheckStatus.FAIL and a
clear detail like "previous manifest provided but cannot be opened/missing",
rather than falling through to the manifest_chain_awareness path; keep the
existing try/except around verify_manifest_chain_link(previous_manifest_path,
manifest) for the exists() True case, and leave
verify_manifest_chain(manifest_path) only for the explicit None case; add a
unit/regression test covering previous_manifest_path provided but missing that
expects manifest_chain_link to be FAIL.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dbe369c3-eae0-49d4-a310-c6231ca347b1

📥 Commits

Reviewing files that changed from the base of the PR and between 35ce9dc and 9c28b5a.

📒 Files selected for processing (4)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/utils.py
  • openverifiablellm/verify.py
  • tests/test_manifest_chain.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
openverifiablellm/manifest_chain.py (1)

90-96: ⚠️ Potential issue | 🟠 Major

Hash the full previous manifest, including its parent link.

parent_manifest_hash is not self-referential here; it authenticates the previous manifest. Dropping it means a child still verifies after someone edits only the previous manifest’s own parent pointer, so ancestry tampering remains invisible.

Suggested fix
-    # Create a copy and remove parent_manifest_hash before hashing
-    # (so the hash doesn't include itself)
-    hashable = manifest_data.copy()
-    hashable.pop("parent_manifest_hash", None)
-
-    canonical = _canonical_json(hashable)
+    canonical = _canonical_json(manifest_data)
     return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/manifest_chain.py` around lines 90 - 96, The code is
removing parent_manifest_hash before computing the manifest hash, which omits
the parent's link and allows ancestry tampering; stop popping
"parent_manifest_hash" so the hash includes the full previous manifest link —
i.e., compute the canonical JSON from manifest_data (or from the "hashable" copy
without removing parent_manifest_hash) and then call _canonical_json(...) and
hashlib.sha256(...) on that result so the resulting hexdigest incorporates
parent_manifest_hash; locate the block that creates "hashable =
manifest_data.copy()", the pop("parent_manifest_hash", None) call, and replace
it by leaving parent_manifest_hash intact before calling _canonical_json and
hashlib.sha256.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 272-282: The current branch in manifest_chain.py sets
chain_valid=True when parent_manifest_hash exists and is non-empty, which lets
unverified/non-root manifests pass; change the logic so that when has_field is
True and parent_hash_value is non-empty you do NOT mark chain_valid=True by
default—instead set chain_valid=False and a message like "parent_manifest_hash
present but previous_manifest_path not provided (cannot verify non-root
manifest)"; only set chain_valid=True for the empty parent_hash_value (root) or
after explicit verification of the previous manifest when previous_manifest_path
(and its verification routine) is present and succeeds. Reference the
variables/logic handling has_field, parent_hash_value, chain_valid and
previous_manifest_path or the verification function to locate and update the
code.

---

Duplicate comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The code is removing parent_manifest_hash before computing
the manifest hash, which omits the parent's link and allows ancestry tampering;
stop popping "parent_manifest_hash" so the hash includes the full previous
manifest link — i.e., compute the canonical JSON from manifest_data (or from the
"hashable" copy without removing parent_manifest_hash) and then call
_canonical_json(...) and hashlib.sha256(...) on that result so the resulting
hexdigest incorporates parent_manifest_hash; locate the block that creates
"hashable = manifest_data.copy()", the pop("parent_manifest_hash", None) call,
and replace it by leaving parent_manifest_hash intact before calling
_canonical_json and hashlib.sha256.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6a8c0e79-a83b-480b-9d53-0b43f2724ad8

📥 Commits

Reviewing files that changed from the base of the PR and between 9c28b5a and e442106.

📒 Files selected for processing (1)
  • openverifiablellm/manifest_chain.py

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 171-175: When loading current_data from current_path in
manifest_chain.py you raise a new ValueError on json.JSONDecodeError without
chaining the original exception; update the exception raise to chain the
original JSONDecodeError (use "from e") so the traceback preserves context—apply
the same pattern used/planned for compute_manifest_hash to the current_data
JSONDecodeError handling.
- Around line 123-129: The except block around compute_manifest_hash(path)
should log the full traceback before re-raising; replace logger.error("Could not
compute parent manifest hash: %s", e) with a logger.exception call (e.g.,
logger.exception("Could not compute parent manifest hash")) so the traceback for
the exception thrown by compute_manifest_hash is captured when parent_hash
computation fails and the exception is re-raised.
- Around line 84-88: The manifest JSON error handling currently raises a new
ValueError without chaining the original JSONDecodeError; modify the except
block where json.load(f) is caught (the manifest_data loading logic using
manifest_path) to re-raise the ValueError using exception chaining (i.e., raise
ValueError(...) from e) so the original JSONDecodeError is preserved in the
traceback.
- Around line 238-247: The except block around reading manifest_path and
json.load currently catches all Exceptions; narrow it to the expected errors
(e.g., FileNotFoundError, PermissionError, OSError for file I/O and
json.JSONDecodeError for parse errors) so unexpected bugs aren't masked; update
the try/except to catch those specific exceptions and return the same failure
payload (using the caught exception message) while letting other exceptions
propagate (or re-raise) so they can be surfaced and debugged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 061af4f1-31d5-4b1d-a3f9-ed2aeedd38e7

📥 Commits

Reviewing files that changed from the base of the PR and between e442106 and 93ba434.

📒 Files selected for processing (2)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/verify.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 9, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
openverifiablellm/manifest_chain.py (3)

174-175: ⚠️ Potential issue | 🟡 Minor

Chain the original exception to preserve traceback context.

Same issue as in compute_manifest_hash — chain from the original JSONDecodeError.

Suggested fix
             except json.JSONDecodeError as e:
-                raise ValueError(f"Malformed current manifest JSON: {e}")
+                raise ValueError(f"Malformed current manifest JSON: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/manifest_chain.py` around lines 174 - 175, The except block
catching json.JSONDecodeError should preserve the original traceback by chaining
the exception; update the handler that currently does "except
json.JSONDecodeError as e: raise ValueError(f'...{e}')" to "raise
ValueError(f'...{e}') from e" so the original JSONDecodeError is attached (same
pattern as fix for compute_manifest_hash).

238-247: 🛠️ Refactor suggestion | 🟠 Major

Narrow exception type to avoid masking unexpected bugs.

Catching bare Exception can hide programming errors. Limit to expected failure modes for file I/O and JSON parsing.

Suggested fix
     try:
         with manifest_path.open("r", encoding="utf-8") as f:
             manifest_data = json.load(f)
-    except Exception as e:
+    except (OSError, json.JSONDecodeError) as e:
         return {
             "chain_valid": False,
             "chain_message": f"Failed to read manifest: {e}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/manifest_chain.py` around lines 238 - 247, The try/except
around reading/parsing the manifest currently catches all Exceptions; narrow it
to expected errors by catching file I/O and JSON parse errors (e.g.,
FileNotFoundError, PermissionError, OSError and json.JSONDecodeError) when
opening manifest_path and loading manifest_data, and return the same error dict
including the caught exception info; leave other exceptions to propagate so real
bugs aren't masked.

87-88: ⚠️ Potential issue | 🟡 Minor

Chain the original exception to preserve traceback context.

When re-raising as ValueError, chain from the original JSONDecodeError so the full traceback is preserved for debugging.

Suggested fix
             except json.JSONDecodeError as e:
-                raise ValueError(f"Malformed manifest JSON: {e}")
+                raise ValueError(f"Malformed manifest JSON: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/manifest_chain.py` around lines 87 - 88, In the except
block that catches json.JSONDecodeError as e (the manifest parsing code in
manifest_chain.py), re-raise the ValueError while chaining the original
exception by using "raise ValueError(f\"Malformed manifest JSON: {e}\") from e"
so the original traceback is preserved; keep the existing error message and
variable names (json.JSONDecodeError, e, ValueError) intact when applying this
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Line 29: Replace deprecated typing.Dict usage with built-in dict: remove Dict
from the import list in manifest_chain.py and change all type hints like
Dict[str, Any] to dict[str, Any] (e.g., in signatures/annotations for functions
or methods such as any occurrences around ManifestChain-related
functions/classes and variables that currently use Dict). Keep
typing.Any/Optional/Union as needed and ensure imports reflect only used typing
symbols.
- Around line 127-129: In the except block that catches Exception in the
manifest hashing logic (the block using "except Exception as e" and calling
logger.exception with the message "Could not compute parent manifest hash: %s"),
remove the redundant exception object from the logger.exception call and stop
passing e as a formatting argument; instead call logger.exception with a plain
descriptive message (e.g., "Could not compute parent manifest hash") so the
traceback is logged automatically, then re-raise as currently done.

---

Duplicate comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 174-175: The except block catching json.JSONDecodeError should
preserve the original traceback by chaining the exception; update the handler
that currently does "except json.JSONDecodeError as e: raise
ValueError(f'...{e}')" to "raise ValueError(f'...{e}') from e" so the original
JSONDecodeError is attached (same pattern as fix for compute_manifest_hash).
- Around line 238-247: The try/except around reading/parsing the manifest
currently catches all Exceptions; narrow it to expected errors by catching file
I/O and JSON parse errors (e.g., FileNotFoundError, PermissionError, OSError and
json.JSONDecodeError) when opening manifest_path and loading manifest_data, and
return the same error dict including the caught exception info; leave other
exceptions to propagate so real bugs aren't masked.
- Around line 87-88: In the except block that catches json.JSONDecodeError as e
(the manifest parsing code in manifest_chain.py), re-raise the ValueError while
chaining the original exception by using "raise ValueError(f\"Malformed manifest
JSON: {e}\") from e" so the original traceback is preserved; keep the existing
error message and variable names (json.JSONDecodeError, e, ValueError) intact
when applying this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5542dc06-9955-4c82-a284-bb003f5e02cb

📥 Commits

Reviewing files that changed from the base of the PR and between 93ba434 and 6ac7c7d.

📒 Files selected for processing (1)
  • openverifiablellm/manifest_chain.py

@Archit381
Copy link
Member

@aniket866 Pull from main, fix the conflicts and check again. We are using uv to package now so test if there's any issue

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 11, 2026
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 11, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
openverifiablellm/utils.py (3)

1-13: ⚠️ Potential issue | 🟡 Minor

Sort this import block to unblock CI.

Ruff is already failing on this file’s import ordering, so this needs to be normalized before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 1 - 13, The import block in utils.py
is not sorted per the project's style (Ruff/isort), causing CI failures; reorder
the imports into standard library first (bz2, hashlib, json, logging, platform,
re, sys, pathlib.Path), then third-party (defusedxml.ElementTree as ET), then
local application imports (from openverifiablellm.environment import
generate_environment_fingerprint and from openverifiablellm.manifest_chain
import get_parent_manifest_hash), and remove any duplicate/unused imports; you
can run isort or manually adjust the order and ensure typing imports (Union,
Optional, Dict, Any, List, Tuple) are grouped with standard library imports.

401-405: ⚠️ Potential issue | 🟠 Major

Remove the second extract_text_from_xml() call.

The module entrypoint preprocesses the dump twice. That doubles the runtime, and the second pass rewrites wiki_clean.txt after the manifest has already been recorded, so a later failure can leave the manifest describing different on-disk state than the final output.

Suggested fix
     extract_text_from_xml(
         sys.argv[1],
         write_manifest="--no-manifest" not in sys.argv[2:],
     )
-    extract_text_from_xml(sys.argv[1])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 401 - 405, The file calls
extract_text_from_xml twice (once with write_manifest set based on sys.argv and
immediately again without arguments), causing duplicated work and a manifest
mismatch; remove the second extract_text_from_xml(sys.argv[1]) invocation so
extract_text_from_xml is only called once with the intended write_manifest
argument (leave the existing extract_text_from_xml(sys.argv[1],
write_manifest="--no-manifest" not in sys.argv[2:]) call intact).

267-269: ⚠️ Potential issue | 🟠 Major

Preserve the old chunk_size default in export_merkle_proof().

This change turns a previously optional parameter into a required one, which is a backwards-incompatible runtime break for existing callers that relied on the default chunk size. Keep chunk_size defaulted and make the new output_path parameter keyword-only instead.

Suggested fix
 def export_merkle_proof(
-    proof: List[Tuple[str, bool]], chunk_index: int, chunk_size: int, output_path: Union[str, Path]
+    proof: List[Tuple[str, bool]],
+    chunk_index: int,
+    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES,
+    *,
+    output_path: Union[str, Path],
 ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 267 - 269, Restore the previous
default for chunk_size in export_merkle_proof by making chunk_size optional
again (e.g. chunk_size: int = DEFAULT_CHUNK_SIZE or = 1024 if no constant
exists) and make output_path a keyword-only parameter by adding a '*' before it
(signature: def export_merkle_proof(..., chunk_index: int, chunk_size: int =
DEFAULT_CHUNK_SIZE, *, output_path: Union[str, Path]) -> None). Ensure
DEFAULT_CHUNK_SIZE is defined or use the original numeric default and update any
docstring/tests if they reference the former behavior.
openverifiablellm/verify.py (1)

20-39: ⚠️ Potential issue | 🟡 Minor

Normalize the import block here as well.

CI is failing on Ruff import ordering for this file, so the new imports need to be re-sorted before this can merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/verify.py` around lines 20 - 39, Reorder the imports to
satisfy Ruff: group standard-library imports first (sorted alphabetically), then
third-party (none), then local package imports (sorted). In particular, sort the
stdlib imports so dataclasses and enum come before json/logging/os, with pathlib
and typing placed alphabetically (e.g., dataclasses, enum, json, logging, os,
pathlib, platform, shutil, subprocess, sys, tempfile, typing). Then place local
imports from openverifiablellm in alphabetical order by module and alphabetize
imported names — e.g., import utils first, then from
openverifiablellm.environment import generate_environment_fingerprint, and from
openverifiablellm.manifest_chain import verify_manifest_chain,
verify_manifest_chain_link (members sorted alphabetically).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 1-13: The import block in utils.py is not sorted per the project's
style (Ruff/isort), causing CI failures; reorder the imports into standard
library first (bz2, hashlib, json, logging, platform, re, sys, pathlib.Path),
then third-party (defusedxml.ElementTree as ET), then local application imports
(from openverifiablellm.environment import generate_environment_fingerprint and
from openverifiablellm.manifest_chain import get_parent_manifest_hash), and
remove any duplicate/unused imports; you can run isort or manually adjust the
order and ensure typing imports (Union, Optional, Dict, Any, List, Tuple) are
grouped with standard library imports.
- Around line 401-405: The file calls extract_text_from_xml twice (once with
write_manifest set based on sys.argv and immediately again without arguments),
causing duplicated work and a manifest mismatch; remove the second
extract_text_from_xml(sys.argv[1]) invocation so extract_text_from_xml is only
called once with the intended write_manifest argument (leave the existing
extract_text_from_xml(sys.argv[1], write_manifest="--no-manifest" not in
sys.argv[2:]) call intact).
- Around line 267-269: Restore the previous default for chunk_size in
export_merkle_proof by making chunk_size optional again (e.g. chunk_size: int =
DEFAULT_CHUNK_SIZE or = 1024 if no constant exists) and make output_path a
keyword-only parameter by adding a '*' before it (signature: def
export_merkle_proof(..., chunk_index: int, chunk_size: int = DEFAULT_CHUNK_SIZE,
*, output_path: Union[str, Path]) -> None). Ensure DEFAULT_CHUNK_SIZE is defined
or use the original numeric default and update any docstring/tests if they
reference the former behavior.

In `@openverifiablellm/verify.py`:
- Around line 20-39: Reorder the imports to satisfy Ruff: group standard-library
imports first (sorted alphabetically), then third-party (none), then local
package imports (sorted). In particular, sort the stdlib imports so dataclasses
and enum come before json/logging/os, with pathlib and typing placed
alphabetically (e.g., dataclasses, enum, json, logging, os, pathlib, platform,
shutil, subprocess, sys, tempfile, typing). Then place local imports from
openverifiablellm in alphabetical order by module and alphabetize imported names
— e.g., import utils first, then from openverifiablellm.environment import
generate_environment_fingerprint, and from openverifiablellm.manifest_chain
import verify_manifest_chain, verify_manifest_chain_link (members sorted
alphabetically).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b148a511-c1d7-46b7-bff5-43ad23dc2fc3

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac7c7d and c94d335.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • openverifiablellm/verify.py

@aniket866
Copy link
Contributor Author

Hi @Archit381 I have resolved the conflicts and also have updated PR descrition instructions with UV support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Provenance Chain Linking

2 participants