[Feature]: Add Merkle proof export, ZKP-compatible dataset verification, and a template for ZKPoT and ZKPoI #44
Conversation
Walkthrough

Refactored Merkle proof generation to output ZKP-compatible JSON with structured `public_inputs` and `witness` fields. Added automatic proof generation during text extraction via a new bulk-export function.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Input as Data Input
    participant Extract as Text Extractor
    participant Tree as Merkle Tree Builder
    participant Export as Proof Exporter
    participant Verify as Proof Verifier
    Input->>Extract: XML content
    Extract->>Extract: Split into chunks
    Extract->>Tree: All chunks
    Tree->>Tree: Build tree level-by-level
    Tree->>Export: Proof data + merkle_root
    Export->>Export: Structure JSON<br/>(public_inputs + witness)
    Export->>Verify: Structured proof file
    Verify->>Verify: Validate public_inputs<br/>and witness fields
    Verify->>Verify: Check merkle_root
    Verify-->>Input: Verification result
```
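The walkthrough says only that the exported JSON separates `public_inputs` from `witness` data; the exact field layout below is an illustrative assumption, not the PR's schema. A minimal sketch of assembling such a proof file:

```python
import json

def build_zkp_proof_json(merkle_root, chunk_index, chunk_size, sibling_path):
    """Assemble a proof dict that separates public inputs from private witness data.

    Field names other than public_inputs/witness are hypothetical.
    """
    return {
        # Public inputs: values the verifier (or a ZK circuit) sees in the clear.
        "public_inputs": {
            "merkle_root": merkle_root,
            "chunk_index": chunk_index,
            "chunk_size": chunk_size,
        },
        # Witness: the private Merkle sibling path the prover supplies.
        "witness": {
            "sibling_hashes": [
                {"hash": h, "is_left": left} for h, left in sibling_path
            ],
        },
    }

proof = build_zkp_proof_json("ab" * 32, 3, 1024, [("cd" * 32, True)])
print(json.dumps(proof, indent=2))
```

Keeping the root, index, and size public while the sibling path stays in the witness is what makes the format circuit-friendly: the circuit can recompute the root from the witness and compare it against the public input.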
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_util.py (1)
247-279: 🧹 Nitpick | 🔵 Trivial

Suggest adding direct tests for `export_all_merkle_proofs`.

The test file covers `export_merkle_proof` well but lacks direct tests for the new `export_all_merkle_proofs` function. Consider adding tests for:
- Multiple chunks exported correctly
- Single chunk (edge case)
- Empty file (returns 0)
- Proof files are readable and verifiable
💡 Example test structure
```python
def test_export_all_merkle_proofs(tmp_path):
    file = tmp_path / "data.txt"
    content = b"hello world this is merkle proof test"
    file.write_bytes(content)
    output_dir = tmp_path / "proofs"
    num_proofs = utils.export_all_merkle_proofs(file, output_dir, chunk_size=8)
    assert num_proofs == 5  # 37 bytes / 8 bytes per chunk = 5 chunks
    assert len(list(output_dir.glob("*.json"))) == 5
    # Verify at least one proof is valid
    root = utils.compute_merkle_root(file, chunk_size=8)
    with file.open("rb") as f:
        chunk = f.read(8)
    proof_file = output_dir / "data.txt_chunk_0_proof.json"
    assert utils.verify_merkle_proof_from_file(proof_file, chunk, root)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 247 - 279, Add direct tests for export_all_merkle_proofs: create a tmp file with known content, call utils.export_all_merkle_proofs(file, output_dir, chunk_size=8) and assert the returned count equals expected chunks, assert the number of generated *.json files in output_dir matches that count, and for at least one (and edge cases: single-chunk and empty file returning 0) open the corresponding proof file and verify it using utils.verify_merkle_proof_from_file with utils.compute_merkle_root and the appropriate chunk bytes; reference the export_all_merkle_proofs, compute_merkle_root, and verify_merkle_proof_from_file helpers when locating code to test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 411-412: Change the exception raised when the sibling_hashes type
check fails from ValueError to TypeError: locate the check that does "if not
isinstance(proof, list): raise ValueError('Malformed proof: sibling_hashes must
be a list')" and replace the raised exception with TypeError while keeping the
same message (e.g., raise TypeError("Malformed proof: sibling_hashes must be a
list")). Ensure this change is made in the function or scope where the variable
proof and the sibling_hashes validation occur.
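The suggested switch follows the usual Python convention: `TypeError` when the argument is the wrong type entirely, `ValueError` when the type is right but the contents are not. A minimal sketch of the distinction (the helper name and entry-shape check are illustrative, not from the PR):

```python
def validate_sibling_hashes(proof):
    """Reject malformed proofs early, using the conventional exception types."""
    if not isinstance(proof, list):
        # Wrong *type* entirely -> TypeError, per the review suggestion.
        raise TypeError("Malformed proof: sibling_hashes must be a list")
    for entry in proof:
        if not isinstance(entry, (list, tuple)) or len(entry) != 2:
            # Right container type, wrong *shape* of contents -> ValueError.
            raise ValueError("Each proof entry must be a (hash, is_left) pair")

# A well-formed proof passes silently.
validate_sibling_hashes([("ab" * 32, True)])
```

Using `TypeError` here also matches what `isinstance`-style checks signal elsewhere in the standard library, so callers can catch type misuse separately from bad values.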
- Around line 296-370: Add direct unit tests for export_all_merkle_proofs that
create temporary files and an output directory (use pytest tmp_path), then
assert behavior for: an empty file returns 0 and produces no proof files; a
single-chunk file produces one proof whose exported JSON exists and whose proof
reconstructs the merkle root; and a file with an odd number of chunks exercises
padding (e.g., 3 chunks) and produces correct proofs. For verification, read
each exported proof JSON (from export_merkle_proof outputs), compute or
reconstruct the merkle root by iteratively hashing leaves using compute_sha256
and the sibling/is_left flags, and assert the reconstructed root equals the
merkle_root field in the JSON; also assert the function returns the expected
num_leaves. Reference export_all_merkle_proofs, export_merkle_proof, and
compute_sha256 to locate code under test and use tmp_path for isolated file IO.
---
Outside diff comments:
In `@tests/test_util.py`:
- Around line 247-279: Add direct tests for export_all_merkle_proofs: create a
tmp file with known content, call utils.export_all_merkle_proofs(file,
output_dir, chunk_size=8) and assert the returned count equals expected chunks,
assert the number of generated *.json files in output_dir matches that count,
and for at least one (and edge cases: single-chunk and empty file returning 0)
open the corresponding proof file and verify it using
utils.verify_merkle_proof_from_file with utils.compute_merkle_root and the
appropriate chunk bytes; reference the export_all_merkle_proofs,
compute_merkle_root, and verify_merkle_proof_from_file helpers when locating
code to test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d88cea98-7a03-4345-a6b8-1f053f2b904d
📒 Files selected for processing (2)
`openverifiablellm/utils.py`, `tests/test_util.py`
```python
def export_all_merkle_proofs(
    file_path: Union[str, Path],
    output_dir: Union[str, Path],
    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES
) -> int:
    """
    Efficiently generate and export Merkle proofs for all chunks of a file.
    Saves them as individual JSON files in the output directory.
    Returns the number of proofs generated.
    """
    path = Path(file_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    leaves = []
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            leaf_hex = compute_sha256(data=chunk)
            leaves.append(bytes.fromhex(leaf_hex))

    if not leaves:
        return 0

    num_leaves = len(leaves)

    # Build the Merkle tree level by level
    tree = []
    current_level = list(leaves)

    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])

        tree.append(list(current_level))

        next_level = []
        for i in range(0, len(current_level), 2):
            combined = current_level[i] + current_level[i + 1]
            parent_hex = compute_sha256(data=combined)
            next_level.append(bytes.fromhex(parent_hex))

        current_level = next_level

    tree.append(current_level)

    merkle_root = tree[-1][0].hex()
    prefix = path.name

    # Export a proof for each chunk
    for chunk_index in range(num_leaves):
        proof = []
        index = chunk_index

        for level in tree[:-1]:
            sibling_index = index ^ 1
            sibling = level[sibling_index]
            is_left = sibling_index < index
            proof.append((sibling.hex(), is_left))
            index //= 2

        output_path = output_dir / f"{prefix}_chunk_{chunk_index}_proof.json"

        # Reuse existing single-proof exporter
        export_merkle_proof(
            proof=proof,
            chunk_index=chunk_index,
            chunk_size=chunk_size,
            merkle_root=merkle_root,
            output_path=output_path
        )

    return num_leaves
```
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check for existing tests covering export_all_merkle_proofs
rg -n "export_all_merkle_proofs" --type py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 255
🏁 Script executed:

```bash
#!/bin/bash
# Find test files and check for tests of export_all_merkle_proofs
echo "=== Looking for test files ==="
find . -type f -name "*test*.py" | head -20
echo -e "\n=== Searching for export_all_merkle_proofs in test files ==="
rg -n "export_all_merkle_proofs" --type py tests/ 2>/dev/null || echo "No matches in tests/"
echo -e "\n=== Checking context around line 212 in utils.py ==="
sed -n '205,220p' openverifiablellm/utils.py
echo -e "\n=== Searching for extract_text_from_xml function ==="
rg -n "def extract_text_from_xml" --type py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1098
🏁 Script executed:

```bash
#!/bin/bash
# Find which function contains line 212
echo "=== Context showing function definition around line 212 ==="
sed -n '150,225p' openverifiablellm/utils.py | head -80
echo -e "\n=== Looking for function definitions near line 212 ==="
rg -n "^def " openverifiablellm/utils.py | awk -F: '$1 < 212 {prev=$0} $1 > 212 && !shown {print "Previous function:", prev; print "Next function at:", $0; shown=1}'
echo -e "\n=== Check test_util.py for test coverage ==="
wc -l ./tests/test_util.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 2942
🏁 Script executed:

```bash
#!/bin/bash
# Check for tests of extract_text_from_xml
echo "=== Searching for tests of extract_text_from_xml ==="
rg -n "extract_text_from_xml" ./tests/test_util.py
echo -e "\n=== Looking for test functions in test_util.py ==="
rg -n "^def test_" ./tests/test_util.py
echo -e "\n=== Show actual test file content to understand coverage ==="
head -50 ./tests/test_util.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 3058
🏁 Script executed:

```bash
#!/bin/bash
# Check the actual test_extract_text_from_xml tests
echo "=== Test implementation for extract_text_from_xml ==="
sed -n '101,155p' ./tests/test_util.py
echo -e "\n=== Checking if tests mock/patch export_all_merkle_proofs ==="
rg -n "export_all_merkle_proofs" ./tests/test_util.py
echo -e "\n=== Check for any monkeypatch or mock of export_all_merkle_proofs ==="
rg -n -B3 -A3 "monkeypatch" ./tests/test_util.py | head -60
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 3089
Add direct unit tests for `export_all_merkle_proofs` to verify proof generation and edge cases.

This public function currently lacks explicit test coverage. While it is indirectly exercised through the `extract_text_from_xml` tests, those tests only verify preprocessing output, not proof generation. Direct tests should cover:
- Empty files
- Single chunk files
- Odd chunk counts (tree padding logic)
- Proof verification against the computed root
The implementation itself is correct and efficient, but explicit tests would validate the merkle tree construction and proof path extraction.
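The "proof verification against the computed root" the reviewer asks for amounts to walking the exported sibling path back up to the root. A minimal sketch, assuming `compute_sha256` returns a hex digest (reimplemented here with `hashlib` so the example is self-contained; the project's helper may differ):

```python
import hashlib

def compute_sha256(data: bytes) -> str:
    # Stand-in for the project's helper; assumed to return a hex digest.
    return hashlib.sha256(data).hexdigest()

def reconstruct_root(chunk: bytes, sibling_path) -> str:
    """Walk the proof path from leaf to root, combining with each sibling.

    sibling_path is a list of (sibling_hex, is_left) pairs, matching the
    tuples export_all_merkle_proofs emits.
    """
    node = bytes.fromhex(compute_sha256(chunk))
    for sibling_hex, is_left in sibling_path:
        sibling = bytes.fromhex(sibling_hex)
        # is_left tells us whether the sibling sits to the left of our node.
        combined = sibling + node if is_left else node + sibling
        node = bytes.fromhex(compute_sha256(combined))
    return node.hex()
```

A test can then assert `reconstruct_root(chunk, proof) == merkle_root` for each exported proof file, which exercises both the tree construction and the proof-path extraction.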
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 311-311: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 296 - 370, Add direct unit tests for
export_all_merkle_proofs that create temporary files and an output directory
(use pytest tmp_path), then assert behavior for: an empty file returns 0 and
produces no proof files; a single-chunk file produces one proof whose exported
JSON exists and whose proof reconstructs the merkle root; and a file with an odd
number of chunks exercises padding (e.g., 3 chunks) and produces correct proofs.
For verification, read each exported proof JSON (from export_merkle_proof
outputs), compute or reconstruct the merkle root by iteratively hashing leaves
using compute_sha256 and the sibling/is_left flags, and assert the reconstructed
root equals the merkle_root field in the JSON; also assert the function returns
the expected num_leaves. Reference export_all_merkle_proofs,
export_merkle_proof, and compute_sha256 to locate code under test and use
tmp_path for isolated file IO.
Addressed Issues:
Fixes #36
Summary
Problem
Previous Output Format
What I have done

Implemented the ZKP-structured proof format by refactoring `export_merkle_proof()` to explicitly separate `public_inputs` and `witness` data (the `public_inputs` record which chunk is being proven).

Example output: `data/processed/proofs/wiki_clean.txt_chunk_3_proof.json`, generated from the ~350 MB `simplewiki-20260201-pages-articles-multistream.xml.bz2` file. Proofs were produced for every chunk (from X = 0 to 531) as `data/processed/proofs/wiki_clean.txt_chunk_X_proof.json`.

Implemented Fully Automated Chunk Proof Generation

Proof export is now wired into the `extract_text_from_xml()` pipeline.

Tree-Padding Bug Fix

During automation, I encountered and resolved an `IndexError` within the Merkle tree generation algorithm. I fixed the level-building logic to properly handle padded nodes (when the number of chunks is odd), ensuring trees of any size build correctly.

Additional Notes:
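The padding fix described above (duplicating the last node when a level has an odd count, so every node has a sibling) can be illustrated in isolation. This is a simplified sketch of the level-building idea, not the project's exact code:

```python
import hashlib

def next_level(level):
    """Collapse one Merkle level, duplicating the last node when the count is odd."""
    if len(level) % 2 == 1:
        level = level + [level[-1]]  # pad so every node has a sibling
    return [
        hashlib.sha256(level[i] + level[i + 1]).digest()
        for i in range(0, len(level), 2)
    ]

# Three leaves: without padding, leaf 2 would have no sibling at index 3,
# which is exactly the kind of IndexError the fix addresses.
leaves = [hashlib.sha256(bytes([i])).digest() for i in range(3)]
level = leaves
while len(level) > 1:
    level = next_level(level)
print(level[0].hex())  # the Merkle root
```

Padding each level independently (rather than padding the leaves once up front) keeps the sibling-lookup `index ^ 1` valid at every height, regardless of the original chunk count.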
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.