Add experiments for preprocessing determinism and Merkle verification#66
Add experiments for preprocessing determinism and Merkle verification#66tani-dubey wants to merge 3 commits intoAOSSIE-Org:mainfrom
Conversation
WalkthroughAdds an experiments/ directory with README and three CLI experiment scripts to validate preprocessing determinism, Merkle root tamper detection, and a tampered Wikipedia sample generator; scripts read/produce dataset manifests and compare hashes/roots for verification. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User (CLI)
participant Preproc as Preprocessor (extract_text_from_xml)
participant FS as Filesystem / Manifest (data/dataset_manifest.json)
participant Comparator as Comparator
User->>Preproc: provide dataset path
Preproc->>FS: write/read processed artifacts + manifest
FS-->>Preproc: return manifest with raw_merkle_root, processed_merkle_root
Preproc->>User: return manifest fields
User->>Comparator: provide two manifest results
Comparator->>Comparator: compare raw_merkle_root and processed_merkle_root
Comparator-->>User: report "tampering detected" or "identical"
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@experiments/data_subset/sample_wiki_generate.py`:
- Line 3: Fix the minor typo in the inline comment in sample_wiki_generate.py:
replace the current comment "# To make this tampered I deleted e of online" with
a clearer phrasing such as "# To make this tampered I deleted the 'e' in
'online'." to correct grammar and clarify intent; update the comment near the
tampering logic or the top of the file where this sentence appears.
- Around line 19-20: The file currently performs file I/O (the with
bz2.open(...) block that writes xml_content to
"experiments/data_subset/tampered_sample_wiki.xml.bz2") at import time; move
that write block into an if __name__ == "__main__": guard so the write only runs
when the script is executed directly, ensuring xml_content is still defined or
constructed above the guard (or by a helper function you call from the guard)
and preventing side effects on import.
In `@experiments/merkle_verification/test_merkle.py`:
- Line 22: Update the comment in experiments/merkle_verification/test_merkle.py
that currently reads "#read genertaed manifest" to correct the typo to "#read
generated manifest" so the inline comment in the test function (around the
manifest-read logic) accurately reflects the operation.
- Around line 1-5: The import block is unsorted causing CI failure; reorder the
imports in experiments/merkle_verification/test_merkle.py so standard library
imports are alphabetized (e.g., argparse, json, logging, pathlib.Path) followed
by third-party/local imports like openverifiablellm.utils and its
extract_text_from_xml, or simply run the linter auto-fix (ruff --fix) to apply
the correct sort automatically.
In `@experiments/preprocessing_determinism/test_preprocessing.py`:
- Around line 38-41: The call to logging.basicConfig has inconsistent
indentation for its arguments; align the keyword arguments (level=logging.INFO,
format="%(levelname)s - %(message)s") vertically under the function call so
indentation is consistent with surrounding code (e.g., indent the two argument
lines to match the opening parenthesis of logging.basicConfig). Locate the
logging.basicConfig call in test_preprocessing.py and adjust the argument
indentation so the block is consistently formatted.
- Line 21: Fix the typo in the inline comment "#read genertaed manifest" by
changing it to "# read generated manifest" (or similar correct phrasing) so the
comment referencing the manifest read is clear; locate the comment in
experiments/preprocessing_determinism/test_preprocessing.py (look for the exact
string "read genertaed manifest") and update it accordingly.
- Line 10: Update the descriptive comment in
experiments/preprocessing_determinism/test_preprocessing.py by fixing the typo:
replace "compairing" with "comparing" in the sentence "Experiment to test
Deterministic preprocessing, by compairing generated hash on 2 runs." so it
reads "Experiment to test Deterministic preprocessing, by comparing generated
hash on 2 runs." Ensure you edit the same comment or docstring where that phrase
appears.
- Around line 1-5: The import block in test_preprocessing.py is unsorted;
reorder the imports to follow canonical groups and alphabetical order (stdlib
first: import json, import logging, import sys; then third-party/other: from
pathlib import Path; then local: from openverifiablellm.utils import
extract_text_from_xml), or simply run the linter auto-fix (ruff --fix) to apply
the correct sorting; update the import statement order in the module (look for
the top-of-file imports: sys, json, logging, Path, extract_text_from_xml)
accordingly.
In `@experiments/README.md`:
- Line 1: Change the first line heading from a level-2 to a level-1 heading by
replacing "## Experiments" with "# Experiments" so the README starts with a
top-level h1; update the header text at the top of the file (the "##
Experiments" line) to "# Experiments".
- Around line 69-78: Convert the plain text items into a proper Markdown list by
prefixing each entry with a consistent list marker (e.g., "- "): update
"sample_wiki.xml.bz2 (original)", "tampered_sample_wiki.xml.bz2 (modified)", and
the two comparison keys "raw_merkle_root" and "processed_merkle_root" so they
appear as list items under the sentence "The experiment compares:". Ensure the
same marker is used for all four lines to match the document's Markdown style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34e326dd-7c10-40cb-a918-21f1ebd91fd8
📒 Files selected for processing (4)
experiments/README.mdexperiments/data_subset/sample_wiki_generate.pyexperiments/merkle_verification/test_merkle.pyexperiments/preprocessing_determinism/test_preprocessing.py
| @@ -0,0 +1,20 @@ | |||
| import bz2 | |||
|
|
|||
| # To make this tampered I deleted e of online | |||
There was a problem hiding this comment.
Minor typo in comment.
📝 Proposed fix
-# To make this tampered I deleted e of online
+# To make this tampered, I deleted 'e' from "online"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # To make this tampered I deleted e of online | |
| # To make this tampered, I deleted 'e' from "online" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/data_subset/sample_wiki_generate.py` at line 3, Fix the minor
typo in the inline comment in sample_wiki_generate.py: replace the current
comment "# To make this tampered I deleted e of online" with a clearer phrasing
such as "# To make this tampered I deleted the 'e' in 'online'." to correct
grammar and clarify intent; update the comment near the tampering logic or the
top of the file where this sentence appears.
| with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f: | ||
| f.write(xml_content) No newline at end of file |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Wrap file-writing logic in an if __name__ == "__main__" guard.
The script executes file I/O at module import time. This can cause unintended side effects if the module is imported elsewhere. Adding a main guard is a best practice for executable scripts.
♻️ Proposed fix
-with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
- f.write(xml_content)
+if __name__ == "__main__":
+ with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
+ f.write(xml_content)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f: | |
| f.write(xml_content) | |
| if __name__ == "__main__": | |
| with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f: | |
| f.write(xml_content) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/data_subset/sample_wiki_generate.py` around lines 19 - 20, The
file currently performs file I/O (the with bz2.open(...) block that writes
xml_content to "experiments/data_subset/tampered_sample_wiki.xml.bz2") at import
time; move that write block into an if __name__ == "__main__": guard so the
write only runs when the script is executed directly, ensuring xml_content is
still defined or constructed above the guard (or by a helper function you call
from the guard) and preventing side effects on import.
| """Run preprocessing and return processed Merkle root.""" | ||
| extract_text_from_xml(path1) | ||
|
|
||
| #read genertaed manifest |
There was a problem hiding this comment.
Fix typo: "genertaed" → "generated".
📝 Proposed fix
- `#read` genertaed manifest
+ # Read generated manifest📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #read genertaed manifest | |
| # Read generated manifest |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/merkle_verification/test_merkle.py` at line 22, Update the
comment in experiments/merkle_verification/test_merkle.py that currently reads
"#read genertaed manifest" to correct the typo to "#read generated manifest" so
the inline comment in the test function (around the manifest-read logic)
accurately reflects the operation.
| logger = logging.getLogger(__name__) | ||
|
|
||
| """ | ||
| Experiment to test Deterministic preprocessing, by compairing generated hash on 2 runs. |
There was a problem hiding this comment.
Fix typo: "compairing" → "comparing".
📝 Proposed fix
-Experiment to test Deterministic preprocessing, by compairing generated hash on 2 runs.
+Experiment to test deterministic preprocessing by comparing generated hash on 2 runs.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Experiment to test Deterministic preprocessing, by compairing generated hash on 2 runs. | |
| Experiment to test deterministic preprocessing by comparing generated hash on 2 runs. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/preprocessing_determinism/test_preprocessing.py` at line 10,
Update the descriptive comment in
experiments/preprocessing_determinism/test_preprocessing.py by fixing the typo:
replace "compairing" with "comparing" in the sentence "Experiment to test
Deterministic preprocessing, by compairing generated hash on 2 runs." so it
reads "Experiment to test Deterministic preprocessing, by comparing generated
hash on 2 runs." Ensure you edit the same comment or docstring where that phrase
appears.
| # Run preprocessing | ||
| extract_text_from_xml(input_path) | ||
|
|
||
| #read genertaed manifest |
There was a problem hiding this comment.
Fix typo: "genertaed" → "generated".
📝 Proposed fix
- `#read` genertaed manifest
+ # Read generated manifest📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #read genertaed manifest | |
| # Read generated manifest |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/preprocessing_determinism/test_preprocessing.py` at line 21, Fix
the typo in the inline comment "#read genertaed manifest" by changing it to "#
read generated manifest" (or similar correct phrasing) so the comment
referencing the manifest read is clear; locate the comment in
experiments/preprocessing_determinism/test_preprocessing.py (look for the exact
string "read genertaed manifest") and update it accordingly.
| logging.basicConfig( | ||
| level=logging.INFO, | ||
| format="%(levelname)s - %(message)s" | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Fix inconsistent indentation.
The arguments to logging.basicConfig have inconsistent indentation relative to the function call.
📝 Proposed fix
logging.basicConfig(
- level=logging.INFO,
- format="%(levelname)s - %(message)s"
+ level=logging.INFO,
+ format="%(levelname)s - %(message)s"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logging.basicConfig( | |
| level=logging.INFO, | |
| format="%(levelname)s - %(message)s" | |
| ) | |
| logging.basicConfig( | |
| level=logging.INFO, | |
| format="%(levelname)s - %(message)s" | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 38
- 41, The call to logging.basicConfig has inconsistent indentation for its
arguments; align the keyword arguments (level=logging.INFO,
format="%(levelname)s - %(message)s") vertically under the function call so
indentation is consistent with surrounding code (e.g., indent the two argument
lines to match the opening parenthesis of logging.basicConfig). Locate the
logging.basicConfig call in test_preprocessing.py and adjust the argument
indentation so the block is consistently formatted.
| @@ -0,0 +1,100 @@ | |||
| ## Experiments | |||
There was a problem hiding this comment.
Use a top-level heading (h1) as the first line.
The static analysis tool flagged that the first line should be a top-level heading. Change ## Experiments to # Experiments.
📝 Proposed fix
-## Experiments
+# Experiments📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## Experiments | |
| # Experiments |
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 1-1: First line in a file should be a top-level heading
(MD041, first-line-heading, first-line-h1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/README.md` at line 1, Change the first line heading from a
level-2 to a level-1 heading by replacing "## Experiments" with "# Experiments"
so the README starts with a top-level h1; update the header text at the top of
the file (the "## Experiments" line) to "# Experiments".
| sample_wiki.xml.bz2 (original) | ||
|
|
||
| tampered_sample_wiki.xml.bz2 (modified) | ||
|
|
||
| The experiment compares: | ||
|
|
||
| raw_merkle_root | ||
|
|
||
| processed_merkle_root | ||
|
|
There was a problem hiding this comment.
Format list items consistently using Markdown syntax.
Lines 69-78 are missing Markdown list formatting. They should use - or * prefixes for consistency with the rest of the document.
📝 Proposed fix
Two datasets are used:
-sample_wiki.xml.bz2 (original)
+- `sample_wiki.xml.bz2` (original)
-tampered_sample_wiki.xml.bz2 (modified)
+- `tampered_sample_wiki.xml.bz2` (modified)
The experiment compares:
-raw_merkle_root
+- `raw_merkle_root`
-processed_merkle_root
+- `processed_merkle_root`📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sample_wiki.xml.bz2 (original) | |
| tampered_sample_wiki.xml.bz2 (modified) | |
| The experiment compares: | |
| raw_merkle_root | |
| processed_merkle_root | |
| Two datasets are used: | |
| - `sample_wiki.xml.bz2` (original) | |
| - `tampered_sample_wiki.xml.bz2` (modified) | |
| The experiment compares: | |
| - `raw_merkle_root` | |
| - `processed_merkle_root` |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/README.md` around lines 69 - 78, Convert the plain text items
into a proper Markdown list by prefixing each entry with a consistent list
marker (e.g., "- "): update "sample_wiki.xml.bz2 (original)",
"tampered_sample_wiki.xml.bz2 (modified)", and the two comparison keys
"raw_merkle_root" and "processed_merkle_root" so they appear as list items under
the sentence "The experiment compares:". Ensure the same marker is used for all
four lines to match the document's Markdown style.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@experiments/merkle_verification/test_merkle.py`:
- Around line 34-46: The code has inconsistent spacing around assignment and
keyword-argument equals (e.g., "parser= argparse.ArgumentParser", "args=
parser.parse_args", and the dict-like kwargs in logging.basicConfig). Update
these lines so spacing is consistent: use a single space before and after
assignment operators (parser = argparse.ArgumentParser, args =
parser.parse_args) and standard spacing for keyword arguments in function calls
(description="...", level=logging.INFO, format="..."), touching the symbols
parser, argparse.ArgumentParser, parser.add_argument, args, parser.parse_args,
and logging.basicConfig to locate and fix them.
- Around line 19-30: The test fails because run() calls
extract_text_from_xml(path1) which defaults to write_manifest=False so no
manifest is written; update run in test_merkle.py to call
extract_text_from_xml(path1, write_manifest=True) (or otherwise ensure
extract_text_from_xml writes the manifest) so that the subsequent read of
MANIFEST_PATH and access to manifest["raw_merkle_root"] /
manifest["processed_merkle_root"] operates on a freshly written manifest.
In `@experiments/preprocessing_determinism/test_preprocessing.py`:
- Around line 44-45: The two assignment statements use no spaces around the "=";
update the assignments to follow PEP8 by adding spaces around the operator for
both variables (result1 and result2) when calling run(sys.argv[1]). Locate the
two lines that invoke run and assign to result1 and result2 and insert spaces
around the "=" so the assignments read with standard spacing.
- Around line 19-30: The test calls extract_text_from_xml(input_path) but that
function defaults to write_manifest=False so no manifest is produced; change the
call to pass write_manifest=True (i.e., extract_text_from_xml(input_path,
write_manifest=True)) so MANIFEST_PATH is created/updated before opening it,
ensuring the subsequent json.load uses the manifest just written (keep existing
MANIFEST_PATH usage and existing return keys).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f71a6cfd-5251-4ae5-821d-a3c91d6cc7d3
📒 Files selected for processing (2)
experiments/merkle_verification/test_merkle.pyexperiments/preprocessing_determinism/test_preprocessing.py
| parser= argparse.ArgumentParser( | ||
| description= "Test tamper detection using Merkle root" | ||
| ) | ||
|
|
||
| parser.add_argument("--path1",required=True,help="Original dataset") | ||
| parser.add_argument("--path2",required=True,help="Tampered dataset") | ||
|
|
||
| args= parser.parse_args() | ||
|
|
||
| logging.basicConfig( | ||
| level= logging.INFO, | ||
| format="%(levelname)s - %(message)s" | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Inconsistent spacing around = operators.
Several lines have inconsistent spacing around = which reduces readability.
✏️ Proposed fix
- parser= argparse.ArgumentParser(
- description= "Test tamper detection using Merkle root"
+ parser = argparse.ArgumentParser(
+ description="Test tamper detection using Merkle root"
)
- parser.add_argument("--path1",required=True,help="Original dataset")
- parser.add_argument("--path2",required=True,help="Tampered dataset")
+ parser.add_argument("--path1", required=True, help="Original dataset")
+ parser.add_argument("--path2", required=True, help="Tampered dataset")
- args= parser.parse_args()
+ args = parser.parse_args()
logging.basicConfig(
- level= logging.INFO,
+ level=logging.INFO,
format="%(levelname)s - %(message)s"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| parser= argparse.ArgumentParser( | |
| description= "Test tamper detection using Merkle root" | |
| ) | |
| parser.add_argument("--path1",required=True,help="Original dataset") | |
| parser.add_argument("--path2",required=True,help="Tampered dataset") | |
| args= parser.parse_args() | |
| logging.basicConfig( | |
| level= logging.INFO, | |
| format="%(levelname)s - %(message)s" | |
| ) | |
| parser = argparse.ArgumentParser( | |
| description="Test tamper detection using Merkle root" | |
| ) | |
| parser.add_argument("--path1", required=True, help="Original dataset") | |
| parser.add_argument("--path2", required=True, help="Tampered dataset") | |
| args = parser.parse_args() | |
| logging.basicConfig( | |
| level=logging.INFO, | |
| format="%(levelname)s - %(message)s" | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/merkle_verification/test_merkle.py` around lines 34 - 46, The
code has inconsistent spacing around assignment and keyword-argument equals
(e.g., "parser= argparse.ArgumentParser", "args= parser.parse_args", and the
dict-like kwargs in logging.basicConfig). Update these lines so spacing is
consistent: use a single space before and after assignment operators (parser =
argparse.ArgumentParser, args = parser.parse_args) and standard spacing for
keyword arguments in function calls (description="...", level=logging.INFO,
format="..."), touching the symbols parser, argparse.ArgumentParser,
parser.add_argument, args, parser.parse_args, and logging.basicConfig to locate
and fix them.
| # Run preprocessing | ||
| extract_text_from_xml(input_path) | ||
|
|
||
| #read genertaed manifest | ||
| with MANIFEST_PATH.open() as f: | ||
| manifest = json.load(f) | ||
|
|
||
| return { | ||
| "processed_sha256": manifest["processed_sha256"], | ||
| "processed_merkle_root": manifest["processed_merkle_root"], | ||
| "environment_hash": manifest["environment_hash"], | ||
| } |
There was a problem hiding this comment.
Critical: Manifest is never written—experiment will fail or use stale data.
extract_text_from_xml defaults to write_manifest=False (see openverifiablellm/utils.py:159). Since no write_manifest=True is passed, the manifest file is never created/updated, and the subsequent read on line 23-24 will either raise FileNotFoundError or read stale data from a previous unrelated run.
🐛 Proposed fix
def run(input_path):
# Run preprocessing
- extract_text_from_xml(input_path)
+ extract_text_from_xml(input_path, write_manifest=True)
- `#read` genertaed manifest
+ # Read generated manifest
with MANIFEST_PATH.open() as f:
manifest = json.load(f)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Run preprocessing | |
| extract_text_from_xml(input_path) | |
| #read genertaed manifest | |
| with MANIFEST_PATH.open() as f: | |
| manifest = json.load(f) | |
| return { | |
| "processed_sha256": manifest["processed_sha256"], | |
| "processed_merkle_root": manifest["processed_merkle_root"], | |
| "environment_hash": manifest["environment_hash"], | |
| } | |
| # Run preprocessing | |
| extract_text_from_xml(input_path, write_manifest=True) | |
| # Read generated manifest | |
| with MANIFEST_PATH.open() as f: | |
| manifest = json.load(f) | |
| return { | |
| "processed_sha256": manifest["processed_sha256"], | |
| "processed_merkle_root": manifest["processed_merkle_root"], | |
| "environment_hash": manifest["environment_hash"], | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 19
- 30, The test calls extract_text_from_xml(input_path) but that function
defaults to write_manifest=False so no manifest is produced; change the call to
pass write_manifest=True (i.e., extract_text_from_xml(input_path,
write_manifest=True)) so MANIFEST_PATH is created/updated before opening it,
ensuring the subsequent json.load uses the manifest just written (keep existing
MANIFEST_PATH usage and existing return keys).
| result1= run(sys.argv[1]) | ||
| result2= run(sys.argv[1]) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add spaces around assignment operators for consistency.
Minor style: PEP 8 recommends spaces around = in assignments.
✏️ Proposed fix
- result1= run(sys.argv[1])
- result2= run(sys.argv[1])
+ result1 = run(sys.argv[1])
+ result2 = run(sys.argv[1])📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result1= run(sys.argv[1]) | |
| result2= run(sys.argv[1]) | |
| result1 = run(sys.argv[1]) | |
| result2 = run(sys.argv[1]) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 44
- 45, The two assignment statements use no spaces around the "="; update the
assignments to follow PEP8 by adding spaces around the operator for both
variables (result1 and result2) when calling run(sys.argv[1]). Locate the two
lines that invoke run and assign to result1 and result2 and insert spaces around
the "=" so the assignments read with standard spacing.
Addressed Issues:
Fixes #37
This PR adds
experiments/directory containing small reproducible experiments to validate assumptions about the OpenVerifiableLLM pipeline.It has experiments :
preprocessing determinism experiment
Verifies that running the preprocessing pipeline multiple times on the same dataset produces identical artifacts (processed_sha256, processed_merkle_root, environment_hash).
Merkle root tamper detection experiment
Demonstrates how dataset modifications can be detected using Merkle roots.
small dataset subset for testing
Provides a lightweight Wikipedia XML example to allow experiments to run quickly without requiring large dumps.
experiments README with usage instructions
Additional Notes:
While implementing the tamper detection experiment, I observed that the processed merkle_root alone is unable to show tamper for small change ( like removing or adding a single alphabet from sample.bz2 goes undetected).
To address this, the experiment compares both:
raw_merkle_rootprocessed_merkle_rootChecklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
Documentation
New Features