Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests.#53
aniket866 wants to merge 5 commits into AOSSIE-Org:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds a tamper-evident manifest chain.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Preprocessing
    participant ManifestChain
    participant FileSystem
    participant Verification
    User->>Preprocessing: run preprocessing
    Preprocessing->>FileSystem: read existing `dataset_manifest.json`?
    FileSystem-->>Preprocessing: manifest JSON or not found
    Preprocessing->>ManifestChain: compute_manifest_hash(previous_manifest)
    ManifestChain->>FileSystem: read previous manifest (if present)
    FileSystem-->>ManifestChain: previous manifest JSON
    ManifestChain->>ManifestChain: canonicalize JSON -> SHA-256
    ManifestChain-->>Preprocessing: parent_manifest_hash
    Preprocessing->>FileSystem: write new `dataset_manifest.json` (includes parent_manifest_hash)
    User->>Verification: run verification (--previous-manifest optional)
    Verification->>ManifestChain: verify_manifest_chain or verify_manifest_chain_link
    ManifestChain->>FileSystem: read current and previous manifests as needed
    FileSystem-->>ManifestChain: manifest JSONs
    ManifestChain->>ManifestChain: compute previous hash, compare to stored parent_manifest_hash
    ManifestChain-->>Verification: chain valid / chain tampered / report details
    Verification-->>User: verification report (includes chain check)
```
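The linking step in the diagram can be sketched in a few lines of Python. This is an illustrative sketch based on the walkthrough, not the module's actual code; `_canonical_json`, `compute_manifest_hash`, and the manifest fields shown are assumptions.

```python
import hashlib
import json


def _canonical_json(data: dict) -> str:
    # Canonical form: sorted keys, no insignificant whitespace, so
    # semantically equal manifests always hash to the same digest
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def compute_manifest_hash(manifest: dict) -> str:
    return hashlib.sha256(_canonical_json(manifest).encode("utf-8")).hexdigest()


# First preprocessing run: root manifest with an empty parent pointer
root = {"files": {"wiki_clean.txt": "ab12"}, "parent_manifest_hash": ""}

# Second run: the child stores the SHA-256 of the previous manifest
child = {"files": {"wiki_clean.txt": "cd34"},
         "parent_manifest_hash": compute_manifest_hash(root)}

# Verification recomputes the previous hash and compares it to the stored link
assert compute_manifest_hash(root) == child["parent_manifest_hash"]

# Any edit to the previous manifest breaks the link
tampered = dict(root, tampered=True)
assert compute_manifest_hash(tampered) != child["parent_manifest_hash"]
```

The canonicalization step matters: without sorted keys and fixed separators, two byte-different serializations of the same manifest would produce different hashes and spurious chain failures.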
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The hashing routine currently removes
"parent_manifest_hash" before canonicalizing and hashing; include the full
manifest (do not pop "parent_manifest_hash") so the manifest hash covers the
parent link (adjust the code around the variable hashable and the call to
_canonical_json to stop removing that key), and update the corresponding
assertion in tests/test_manifest_chain.py to expect that changing a manifest's
parent_manifest_hash will break/alter the child's verification (flip the test
assertion accordingly).
- Around line 253-266: The branch that handles an explicit
previous_manifest_path calls verify_manifest_chain_link(previous_manifest_path,
manifest_data) but lets exceptions bubble up; change verify_manifest_chain() to
catch exceptions from verify_manifest_chain_link and return a structured report
instead of raising: call verify_manifest_chain_link in a try/except, set
chain_valid to False on exception, set chain_message to a clear message
including the exception text (e.g., "Chain link broken — could not read or
verify previous manifest: <error>"), and still return has_parent_hash_field and
parent_hash_value alongside chain_valid and chain_message so this exit path
matches the other structured-report returns.
In `@openverifiablellm/verify.py`:
- Around line 267-300: The branch currently treats a provided but missing
previous_manifest_path as if no previous was requested; change the logic so that
when previous_manifest_path is not None but .exists() is False you immediately
add a CheckResult named "manifest_chain_link" with status CheckStatus.FAIL and a
clear detail like "previous manifest provided but cannot be opened/missing",
rather than falling through to the manifest_chain_awareness path; keep the
existing try/except around verify_manifest_chain_link(previous_manifest_path,
manifest) for the exists() True case, and leave
verify_manifest_chain(manifest_path) only for the explicit None case; add a
unit/regression test covering previous_manifest_path provided but missing that
expects manifest_chain_link to be FAIL.
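The dispatch the verify.py comment asks for can be sketched as follows. Names are simplified stand-ins: statuses are plain strings rather than the project's `CheckStatus` enum, and the real `CheckResult` API is an assumption.

```python
from pathlib import Path
from typing import Optional, Tuple


def chain_check(manifest_path: str,
                previous_manifest_path: Optional[str]) -> Tuple[str, str]:
    """Sketch of the requested branching, reduced to (check_name, status)."""
    if previous_manifest_path is None:
        # No previous manifest requested: chain-awareness check only
        return ("manifest_chain", "INFO")
    if not Path(previous_manifest_path).exists():
        # Explicitly provided but missing: hard FAIL, never a silent
        # fallback to the awareness path
        return ("manifest_chain_link", "FAIL")
    # exists() is True: would call verify_manifest_chain_link(...)
    return ("manifest_chain_link", "PASS")


assert chain_check("m.json", None) == ("manifest_chain", "INFO")
assert chain_check("m.json", "does_not_exist.json") == ("manifest_chain_link", "FAIL")
```

The key property is that an explicitly requested chain check can never quietly degrade into the weaker awareness check when the referenced file is absent.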
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dbe369c3-eae0-49d4-a310-c6231ca347b1
📒 Files selected for processing (4)
openverifiablellm/manifest_chain.py, openverifiablellm/utils.py, openverifiablellm/verify.py, tests/test_manifest_chain.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/manifest_chain.py (1)
90-96: ⚠️ Potential issue | 🟠 Major: Hash the full previous manifest, including its parent link.

`parent_manifest_hash` is not self-referential here; it authenticates the previous manifest. Dropping it means a child still verifies after someone edits only the previous manifest's own parent pointer, so ancestry tampering remains invisible.

Suggested fix:

```diff
-    # Create a copy and remove parent_manifest_hash before hashing
-    # (so the hash doesn't include itself)
-    hashable = manifest_data.copy()
-    hashable.pop("parent_manifest_hash", None)
-
-    canonical = _canonical_json(hashable)
+    canonical = _canonical_json(manifest_data)
     return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/manifest_chain.py` around lines 90 - 96, The code is removing parent_manifest_hash before computing the manifest hash, which omits the parent's link and allows ancestry tampering; stop popping "parent_manifest_hash" so the hash includes the full previous manifest link — i.e., compute the canonical JSON from manifest_data (or from the "hashable" copy without removing parent_manifest_hash) and then call _canonical_json(...) and hashlib.sha256(...) on that result so the resulting hexdigest incorporates parent_manifest_hash; locate the block that creates "hashable = manifest_data.copy()", the pop("parent_manifest_hash", None) call, and replace it by leaving parent_manifest_hash intact before calling _canonical_json and hashlib.sha256.
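The difference between the two hashing strategies can be demonstrated directly. Both helpers below are illustrative stand-ins for the flagged code and the suggested fix, not the module's actual functions.

```python
import hashlib
import json


def _sha256_of(data: dict) -> str:
    return hashlib.sha256(
        json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()


def hash_dropping_parent(manifest: dict) -> str:
    # The flagged behavior: ancestry pointer excluded from the hash
    hashable = dict(manifest)
    hashable.pop("parent_manifest_hash", None)
    return _sha256_of(hashable)


def hash_full(manifest: dict) -> str:
    # The suggested fix: hash the whole previous manifest
    return _sha256_of(manifest)


prev = {"files": {"a.txt": "1111"}, "parent_manifest_hash": "grandparent-hash"}
forged = dict(prev, parent_manifest_hash="forged-ancestor")

# Dropping the key makes an edit to the ancestry pointer invisible ...
assert hash_dropping_parent(prev) == hash_dropping_parent(forged)
# ... while hashing the full manifest detects it
assert hash_full(prev) != hash_full(forged)
```

The "doesn't include itself" reasoning in the removed comment only applies when a manifest hashes its own parent field; here the child hashes the *previous* manifest, so including that manifest's parent pointer is exactly what extends tamper evidence to the whole ancestry.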
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 272-282: The current branch in manifest_chain.py sets
chain_valid=True when parent_manifest_hash exists and is non-empty, which lets
unverified/non-root manifests pass; change the logic so that when has_field is
True and parent_hash_value is non-empty you do NOT mark chain_valid=True by
default—instead set chain_valid=False and a message like "parent_manifest_hash
present but previous_manifest_path not provided (cannot verify non-root
manifest)"; only set chain_valid=True for the empty parent_hash_value (root) or
after explicit verification of the previous manifest when previous_manifest_path
(and its verification routine) is present and succeeds. Reference the
variables/logic handling has_field, parent_hash_value, chain_valid and
previous_manifest_path or the verification function to locate and update the
code.
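Assuming the report fields match the names used in the prompt, the requested decision logic reduces to a small truth table:

```python
from typing import Tuple


def evaluate_chain(has_field: bool, parent_hash_value: str,
                   previous_provided: bool) -> Tuple[bool, str]:
    # Root manifest: no parent field at all, or an empty parent hash
    if not has_field or parent_hash_value == "":
        return True, "root manifest: nothing earlier to verify"
    if not previous_provided:
        # Non-root manifest with no way to check the link: must not
        # pass by default
        return False, ("parent_manifest_hash present but previous_manifest_path "
                       "not provided (cannot verify non-root manifest)")
    # Would run explicit verification against the previous manifest here
    return True, "verified against provided previous manifest"


assert evaluate_chain(False, "", False)[0] is True       # root, no field
assert evaluate_chain(True, "", False)[0] is True        # root, empty hash
assert evaluate_chain(True, "abc123", False)[0] is False  # unverifiable non-root
assert evaluate_chain(True, "abc123", True)[0] is True    # link actually checked
```

The third case is the one the review targets: a non-empty parent pointer is a claim about history, and an unchecked claim should surface as a failure, not a pass.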
---
Duplicate comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The code is removing parent_manifest_hash before computing
the manifest hash, which omits the parent's link and allows ancestry tampering;
stop popping "parent_manifest_hash" so the hash includes the full previous
manifest link — i.e., compute the canonical JSON from manifest_data (or from the
"hashable" copy without removing parent_manifest_hash) and then call
_canonical_json(...) and hashlib.sha256(...) on that result so the resulting
hexdigest incorporates parent_manifest_hash; locate the block that creates
"hashable = manifest_data.copy()", the pop("parent_manifest_hash", None) call,
and replace it by leaving parent_manifest_hash intact before calling
_canonical_json and hashlib.sha256.
📒 Files selected for processing (1)
openverifiablellm/manifest_chain.py
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 171-175: When loading current_data from current_path in
manifest_chain.py you raise a new ValueError on json.JSONDecodeError without
chaining the original exception; update the exception raise to chain the
original JSONDecodeError (use "from e") so the traceback preserves context—apply
the same pattern used/planned for compute_manifest_hash to the current_data
JSONDecodeError handling.
- Around line 123-129: The except block around compute_manifest_hash(path)
should log the full traceback before re-raising; replace logger.error("Could not
compute parent manifest hash: %s", e) with a logger.exception call (e.g.,
logger.exception("Could not compute parent manifest hash")) so the traceback for
the exception thrown by compute_manifest_hash is captured when parent_hash
computation fails and the exception is re-raised.
- Around line 84-88: The manifest JSON error handling currently raises a new
ValueError without chaining the original JSONDecodeError; modify the except
block where json.load(f) is caught (the manifest_data loading logic using
manifest_path) to re-raise the ValueError using exception chaining (i.e., raise
ValueError(...) from e) so the original JSONDecodeError is preserved in the
traceback.
- Around line 238-247: The except block around reading manifest_path and
json.load currently catches all Exceptions; narrow it to the expected errors
(e.g., FileNotFoundError, PermissionError, OSError for file I/O and
json.JSONDecodeError for parse errors) so unexpected bugs aren't masked; update
the try/except to catch those specific exceptions and return the same failure
payload (using the caught exception message) while letting other exceptions
propagate (or re-raise) so they can be surfaced and debugged.
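The logging fix requested above (`logger.exception` instead of `logger.error`) can be sketched as follows; the wrapper function and logger name are illustrative, not the module's exact code.

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("manifest_chain")


def get_parent_manifest_hash(compute):
    try:
        return compute()
    except (OSError, ValueError):
        # logger.exception logs at ERROR level and automatically appends
        # the active traceback, so no "%s", e formatting is needed
        logger.exception("Could not compute parent manifest hash")
        raise


def broken():
    raise ValueError("Malformed manifest JSON")


try:
    get_parent_manifest_hash(broken)
except ValueError as err:
    # The exception is still re-raised to the caller after being logged
    assert str(err) == "Malformed manifest JSON"
```

Note that `logger.exception(...)` must be called from inside an `except` block; outside one there is no active exception and no traceback to record.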
📒 Files selected for processing (2)
openverifiablellm/manifest_chain.py, openverifiablellm/verify.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
♻️ Duplicate comments (3)
openverifiablellm/manifest_chain.py (3)
174-175: ⚠️ Potential issue | 🟡 Minor: Chain the original exception to preserve traceback context.

Same issue as in `compute_manifest_hash`: chain from the original `JSONDecodeError`.

Suggested fix:

```diff
 except json.JSONDecodeError as e:
-    raise ValueError(f"Malformed current manifest JSON: {e}")
+    raise ValueError(f"Malformed current manifest JSON: {e}") from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/manifest_chain.py` around lines 174 - 175, The except block catching json.JSONDecodeError should preserve the original traceback by chaining the exception; update the handler that currently does "except json.JSONDecodeError as e: raise ValueError(f'...{e}')" to "raise ValueError(f'...{e}') from e" so the original JSONDecodeError is attached (same pattern as fix for compute_manifest_hash).
238-247: 🛠️ Refactor suggestion | 🟠 Major: Narrow exception type to avoid masking unexpected bugs.

Catching bare `Exception` can hide programming errors. Limit to expected failure modes for file I/O and JSON parsing.

Suggested fix:

```diff
     try:
         with manifest_path.open("r", encoding="utf-8") as f:
             manifest_data = json.load(f)
-    except Exception as e:
+    except (OSError, json.JSONDecodeError) as e:
         return {
             "chain_valid": False,
             "chain_message": f"Failed to read manifest: {e}",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/manifest_chain.py` around lines 238 - 247, The try/except around reading/parsing the manifest currently catches all Exceptions; narrow it to expected errors by catching file I/O and JSON parse errors (e.g., FileNotFoundError, PermissionError, OSError and json.JSONDecodeError) when opening manifest_path and loading manifest_data, and return the same error dict including the caught exception info; leave other exceptions to propagate so real bugs aren't masked.
87-88: ⚠️ Potential issue | 🟡 Minor: Chain the original exception to preserve traceback context.

When re-raising as `ValueError`, chain from the original `JSONDecodeError` so the full traceback is preserved for debugging.

Suggested fix:

```diff
 except json.JSONDecodeError as e:
-    raise ValueError(f"Malformed manifest JSON: {e}")
+    raise ValueError(f"Malformed manifest JSON: {e}") from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/manifest_chain.py` around lines 87 - 88, In the except block that catches json.JSONDecodeError as e (the manifest parsing code in manifest_chain.py), re-raise the ValueError while chaining the original exception by using "raise ValueError(f\"Malformed manifest JSON: {e}\") from e" so the original traceback is preserved; keep the existing error message and variable names (json.JSONDecodeError, e, ValueError) intact when applying this change.
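The effect of `from e` is that the original `JSONDecodeError` stays attached as `__cause__` on the re-raised error. The loader below is a simplified stand-in for the manifest-parsing code under review:

```python
import json


def load_manifest(text: str) -> dict:
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # "from e" preserves the original parse error in the traceback chain
        raise ValueError(f"Malformed manifest JSON: {e}") from e


try:
    load_manifest("{not valid json")
except ValueError as err:
    # The root cause is recoverable for debugging and testing
    assert isinstance(err.__cause__, json.JSONDecodeError)
```

Without `from e`, Python still records the parser error as `__context__`, but the traceback prints the misleading "During handling of the above exception, another exception occurred" framing instead of the intended "The above exception was the direct cause".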
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Line 29: Replace deprecated typing.Dict usage with built-in dict: remove Dict
from the import list in manifest_chain.py and change all type hints like
Dict[str, Any] to dict[str, Any] (e.g., in signatures/annotations for functions
or methods such as any occurrences around ManifestChain-related
functions/classes and variables that currently use Dict). Keep
typing.Any/Optional/Union as needed and ensure imports reflect only used typing
symbols.
- Around line 127-129: In the except block that catches Exception in the
manifest hashing logic (the block using "except Exception as e" and calling
logger.exception with the message "Could not compute parent manifest hash: %s"),
remove the redundant exception object from the logger.exception call and stop
passing e as a formatting argument; instead call logger.exception with a plain
descriptive message (e.g., "Could not compute parent manifest hash") so the
traceback is logged automatically, then re-raise as currently done.
---
Duplicate comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 174-175: The except block catching json.JSONDecodeError should
preserve the original traceback by chaining the exception; update the handler
that currently does "except json.JSONDecodeError as e: raise
ValueError(f'...{e}')" to "raise ValueError(f'...{e}') from e" so the original
JSONDecodeError is attached (same pattern as fix for compute_manifest_hash).
- Around line 238-247: The try/except around reading/parsing the manifest
currently catches all Exceptions; narrow it to expected errors by catching file
I/O and JSON parse errors (e.g., FileNotFoundError, PermissionError, OSError and
json.JSONDecodeError) when opening manifest_path and loading manifest_data, and
return the same error dict including the caught exception info; leave other
exceptions to propagate so real bugs aren't masked.
- Around line 87-88: In the except block that catches json.JSONDecodeError as e
(the manifest parsing code in manifest_chain.py), re-raise the ValueError while
chaining the original exception by using "raise ValueError(f\"Malformed manifest
JSON: {e}\") from e" so the original traceback is preserved; keep the existing
error message and variable names (json.JSONDecodeError, e, ValueError) intact
when applying this change.
📒 Files selected for processing (1)
openverifiablellm/manifest_chain.py
@aniket866 Pull from main, fix the conflicts and check again. We are using
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
openverifiablellm/utils.py (3)
1-13: ⚠️ Potential issue | 🟡 Minor: Sort this import block to unblock CI.
Ruff is already failing on this file’s import ordering, so this needs to be normalized before merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 1 - 13, The import block in utils.py is not sorted per the project's style (Ruff/isort), causing CI failures; reorder the imports into standard library first (bz2, hashlib, json, logging, platform, re, sys, pathlib.Path), then third-party (defusedxml.ElementTree as ET), then local application imports (from openverifiablellm.environment import generate_environment_fingerprint and from openverifiablellm.manifest_chain import get_parent_manifest_hash), and remove any duplicate/unused imports; you can run isort or manually adjust the order and ensure typing imports (Union, Optional, Dict, Any, List, Tuple) are grouped with standard library imports.
401-405: ⚠️ Potential issue | 🟠 Major: Remove the second `extract_text_from_xml()` call.

The module entrypoint preprocesses the dump twice. That doubles the runtime, and the second pass rewrites `wiki_clean.txt` after the manifest has already been recorded, so a later failure can leave the manifest describing different on-disk state than the final output.

Suggested fix:

```diff
     extract_text_from_xml(
         sys.argv[1],
         write_manifest="--no-manifest" not in sys.argv[2:],
     )
-    extract_text_from_xml(sys.argv[1])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 401 - 405, The file calls extract_text_from_xml twice (once with write_manifest set based on sys.argv and immediately again without arguments), causing duplicated work and a manifest mismatch; remove the second extract_text_from_xml(sys.argv[1]) invocation so extract_text_from_xml is only called once with the intended write_manifest argument (leave the existing extract_text_from_xml(sys.argv[1], write_manifest="--no-manifest" not in sys.argv[2:]) call intact).
267-269: ⚠️ Potential issue | 🟠 Major: Preserve the old `chunk_size` default in `export_merkle_proof()`.

This change turns a previously optional parameter into a required one, which is a backwards-incompatible runtime break for existing callers that relied on the default chunk size. Keep `chunk_size` defaulted and make the new `output_path` parameter keyword-only instead.

Suggested fix:

```diff
 def export_merkle_proof(
-    proof: List[Tuple[str, bool]], chunk_index: int, chunk_size: int, output_path: Union[str, Path]
+    proof: List[Tuple[str, bool]],
+    chunk_index: int,
+    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES,
+    *,
+    output_path: Union[str, Path],
 ) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 267 - 269, Restore the previous default for chunk_size in export_merkle_proof by making chunk_size optional again (e.g. chunk_size: int = DEFAULT_CHUNK_SIZE or = 1024 if no constant exists) and make output_path a keyword-only parameter by adding a '*' before it (signature: def export_merkle_proof(..., chunk_index: int, chunk_size: int = DEFAULT_CHUNK_SIZE, *, output_path: Union[str, Path]) -> None). Ensure DEFAULT_CHUNK_SIZE is defined or use the original numeric default and update any docstring/tests if they reference the former behavior.

openverifiablellm/verify.py (1)
20-39: ⚠️ Potential issue | 🟡 Minor: Normalize the import block here as well.
CI is failing on Ruff import ordering for this file, so the new imports need to be re-sorted before this can merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 20 - 39, Reorder the imports to satisfy Ruff: group standard-library imports first (sorted alphabetically), then third-party (none), then local package imports (sorted). In particular, sort the stdlib imports so dataclasses and enum come before json/logging/os, with pathlib and typing placed alphabetically (e.g., dataclasses, enum, json, logging, os, pathlib, platform, shutil, subprocess, sys, tempfile, typing). Then place local imports from openverifiablellm in alphabetical order by module and alphabetize imported names — e.g., import utils first, then from openverifiablellm.environment import generate_environment_fingerprint, and from openverifiablellm.manifest_chain import verify_manifest_chain, verify_manifest_chain_link (members sorted alphabetically).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 1-13: The import block in utils.py is not sorted per the project's
style (Ruff/isort), causing CI failures; reorder the imports into standard
library first (bz2, hashlib, json, logging, platform, re, sys, pathlib.Path),
then third-party (defusedxml.ElementTree as ET), then local application imports
(from openverifiablellm.environment import generate_environment_fingerprint and
from openverifiablellm.manifest_chain import get_parent_manifest_hash), and
remove any duplicate/unused imports; you can run isort or manually adjust the
order and ensure typing imports (Union, Optional, Dict, Any, List, Tuple) are
grouped with standard library imports.
- Around line 401-405: The file calls extract_text_from_xml twice (once with
write_manifest set based on sys.argv and immediately again without arguments),
causing duplicated work and a manifest mismatch; remove the second
extract_text_from_xml(sys.argv[1]) invocation so extract_text_from_xml is only
called once with the intended write_manifest argument (leave the existing
extract_text_from_xml(sys.argv[1], write_manifest="--no-manifest" not in
sys.argv[2:]) call intact).
- Around line 267-269: Restore the previous default for chunk_size in
export_merkle_proof by making chunk_size optional again (e.g. chunk_size: int =
DEFAULT_CHUNK_SIZE or = 1024 if no constant exists) and make output_path a
keyword-only parameter by adding a '*' before it (signature: def
export_merkle_proof(..., chunk_index: int, chunk_size: int = DEFAULT_CHUNK_SIZE,
*, output_path: Union[str, Path]) -> None). Ensure DEFAULT_CHUNK_SIZE is defined
or use the original numeric default and update any docstring/tests if they
reference the former behavior.
In `@openverifiablellm/verify.py`:
- Around line 20-39: Reorder the imports to satisfy Ruff: group standard-library
imports first (sorted alphabetically), then third-party (none), then local
package imports (sorted). In particular, sort the stdlib imports so dataclasses
and enum come before json/logging/os, with pathlib and typing placed
alphabetically (e.g., dataclasses, enum, json, logging, os, pathlib, platform,
shutil, subprocess, sys, tempfile, typing). Then place local imports from
openverifiablellm in alphabetical order by module and alphabetize imported names
— e.g., import utils first, then from openverifiablellm.environment import
generate_environment_fingerprint, and from openverifiablellm.manifest_chain
import verify_manifest_chain, verify_manifest_chain_link (members sorted
alphabetically).
📒 Files selected for processing (2)
openverifiablellm/utils.py, openverifiablellm/verify.py
Hi @Archit381, I have resolved the conflicts and have also updated the PR description instructions with UV support.
Addressed Issues:
Closes #35
Step 1: Download dump
Windows / Linux / Mac
Step 2: First preprocessing
Windows / Linux / Mac
Step 3: Backup manifest
Windows (CMD / PowerShell)
Linux / Mac
Step 4: Second preprocessing (creates chain)
Windows / Linux / Mac
Step 5: Check manifest has parent hash
Windows / Linux / Mac
```shell
uv run python -c "import json; m = json.load(open('data/dataset_manifest.json')); print('parent_manifest_hash:', m.get('parent_manifest_hash')[:16] + '...')"
```

Step 6: Verify without chain (should PASS)
Windows / Linux / Mac
Step 7: Verify WITH chain (should PASS - chain is valid)
Windows
Linux / Mac
```shell
uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json
```

Step 8: Tamper the manifest
Windows / Linux / Mac
```shell
uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); m['tampered'] = True; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print('Tampered')"
```

Step 9: Verify with tampered manifest (should FAIL - tampering detected!)
Windows
Linux / Mac
```shell
uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json
```

Step 10: Restore manifest
Windows / Linux / Mac
```shell
uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); del m['tampered']; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print('Restored')"
```

Step 11: Verify restored (should PASS again)
Windows
Linux / Mac
```shell
uv run python -m openverifiablellm.verify simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json
```

If all steps above succeed, the feature has been successfully validated.
Screenshots/Recordings:
Additional Notes:
Before tampering, verification passes:
After tampering, verification fails:
Purpose
This PR introduces cryptographically linked dataset manifests to detect tampering in preprocessing outputs.
Each new manifest stores the SHA-256 hash of the previous manifest in the `parent_manifest_hash` field.
If any earlier manifest is modified, the stored hash will no longer match and verification will fail.
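The detection rule described above reduces to one comparison between the stored link and a recomputed hash. The helper below is illustrative (field name per the PR; the hashing details are assumptions):

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    return hashlib.sha256(
        json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()


previous = {"files": {"wiki_clean.txt": "ab12"}, "parent_manifest_hash": ""}
current = {"files": {"wiki_clean.txt": "cd34"},
           "parent_manifest_hash": manifest_hash(previous)}


def chain_intact(current: dict, previous: dict) -> bool:
    # Recompute the previous manifest's hash; compare to the stored link
    return current["parent_manifest_hash"] == manifest_hash(previous)


assert chain_intact(current, previous)      # untouched: verification passes
previous["tampered"] = True                 # the Step 8 edit, in miniature
assert not chain_intact(current, previous)  # stored hash no longer matches
```

This mirrors the tamper/restore cycle in the validation steps: adding any key to the earlier manifest changes its canonical serialization, so its hash diverges from the child's stored pointer.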
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests