
[Feature] : Add Merkle proof export and ZKP-compatible dataset verification and Template for ZKPoT and ZKPoI #44

Open
Shubhamx404 wants to merge 1 commit into AOSSIE-Org:main from Shubhamx404:zkp-merkle-proofs-initial

Conversation


@Shubhamx404 Shubhamx404 commented Mar 6, 2026

Addressed Issues:

Fixes #36

Summary

Problem

  • Before this update, our preprocessing pipeline for Wikipedia data dumps generated Merkle proofs for data chunks, but output them in a flat, unstructured JSON format. While mathematically correct, this format was fundamentally incompatible with modern Zero-Knowledge Proof (ZKP) circuit systems.

Previous Output Format

{
  "chunk_index": 1,
  "chunk_size": 1048576,
  "proof": [
    ["e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", true],
    ["...", false]
  ]
}

What I have done

  • Workflow (screenshot: 2026-03-07 015029)

Implemented the ZKP-structured proof format by refactoring export_merkle_proof() to explicitly separate public_inputs from witness data.

  • What the JSON looks like
    (chunk proof: data/processed/proofs/wiki_clean.txt_chunk_3_proof.json,
    generated on the ~350 MB simplewiki-20260201-pages-articles-multistream.xml.bz2 file)
{
  "public_inputs": {
    "merkle_root": "20e6c499f3ee18e4e086efe32d3fb3efd6a321435f67b43ee2027e4474d113df",
    "chunk_index": 3,
    "chunk_size": 1048576
  },
  "witness": {
    "sibling_hashes": [
      [
        "ca43886e4dce3f098911b4204af08cfdebde3262a0bb9d6e8493423276cf91a5",
        true
      ],
      [
        "25be4424d5133d01f0a87bc06240e56c7d7f63879a0495e714f9046b09ff18ea",
        true
      ],
      [
        "ad6ad2d54fca3c1a664cb0f6eb11cc051ec2f6519d5cebe80f89a14076316a54",
        false
      ],
      [
        "6c6333b5857d491064caebf24647752cba0f24c3075390df702ee175113c50f7",
        false
      ],
      [
        "58926098b11d071575744ceee5e65ba5ab93fc68337bc596c15276256cec3764",
        false
      ],
      [
        "064311dc6bbc32d2911ed76bd2cfee384a28f8bf1a78ccf3ca2a21ea5a7a27e3",
        false
      ],
      [
        "861a03e63fae3f3d85f1bbb13b6ab828187ee45f8a1219bcb4d9893bc0ea4737",
        false
      ],
      [
        "3e65ac9847d1769becc1c9bfd3fac04f354818d87f867320b59a2e014c9fbf35",
        false
      ],
      [
        "6f120a370d3610aee6436726fc8f385c7396227be9d0d91474df54c5e1a18fe7",
        false
      ],
      [
        "aa05437fcb1e242916a65042e362d86bc2d94776a0e4d4fb61a9ce41ec081192",
        false
      ]
    ]
  }
}
  • Around 531 ZKP proofs generated (chunk X from 0 to 531): data/processed/proofs/wiki_clean.txt_chunk_X_proof.json
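For reference, the structured exporter can be sketched roughly as follows. The keyword arguments mirror the export_merkle_proof() signature used elsewhere in this PR, but the body below is a simplified illustration, not the actual implementation in openverifiablellm/utils.py:

```python
import json
from pathlib import Path
from typing import List, Tuple, Union

def export_merkle_proof(
    proof: List[Tuple[str, bool]],
    chunk_index: int,
    chunk_size: int,
    merkle_root: str,
    output_path: Union[str, Path],
) -> None:
    """Write one chunk proof in the ZKP-structured format:
    public_inputs feed the circuit's public signals, witness stays private."""
    document = {
        "public_inputs": {
            "merkle_root": merkle_root,
            "chunk_index": chunk_index,
            "chunk_size": chunk_size,
        },
        "witness": {
            # each entry: [sibling_hash_hex, is_left]
            "sibling_hashes": [[h, is_left] for h, is_left in proof],
        },
    }
    Path(output_path).write_text(json.dumps(document, indent=2))
```

Splitting the file this way lets a ZKP circuit treat merkle_root, chunk_index, and chunk_size as public signals while keeping the sibling path private.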

Implemented Fully Automated Chunk Proof Generation

  • The export function is now tightly integrated into the extract_text_from_xml() pipeline.
  • As soon as a Wikipedia dump is processed into clean text, the system automatically computes the entire Merkle tree and exports an individual, ZK-ready JSON proof for every 1 MB chunk of the dataset into a dedicated data/processed/proofs/ directory.

Tree-Padding Bug Fix

During automation, I encountered and resolved an IndexError in the Merkle tree generation algorithm. I fixed the level-building logic to properly handle padded nodes (when the number of chunks at a level is odd), ensuring trees of any size build correctly.
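The fix boils down to duplicating the last node whenever a level has an odd number of entries, so index i always has a sibling at i ^ 1. A self-contained sketch of the corrected level-building loop (hashlib stands in here for the project's compute_sha256 helper):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree_levels(leaves):
    """Build Merkle tree levels bottom-up; odd-sized levels are padded
    by duplicating their last node so every node can be paired."""
    levels = []
    current = list(leaves)
    while len(current) > 1:
        if len(current) % 2 == 1:
            current.append(current[-1])  # the padding fix
        levels.append(current)
        current = [sha256(current[i] + current[i + 1])
                   for i in range(0, len(current), 2)]
    levels.append(current)  # final single-element level holds the root
    return levels
```

Without the padding step, an odd level makes current[i + 1] run past the end of the list, which is exactly the IndexError described above.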

Additional Notes:

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

Release Notes

  • New Features

    • Automatic generation of cryptographic proofs for processed text, with results exported upon extraction completion.
    • Enhanced proof export functionality supporting batch processing and structured output format.
  • Tests

    • Extended verification test coverage to validate proof consistency and accuracy.

coderabbitai bot commented Mar 6, 2026

Walkthrough

Refactored Merkle proof generation to output ZKP-compatible JSON with structured public_inputs and witness fields. Added automatic proof generation during text extraction via new export_all_merkle_proofs() function. Updated verification to parse new structured format and validate root consistency.

Changes

  • Merkle Proof ZKP Integration (openverifiablellm/utils.py): Redesigned export_merkle_proof() to output structured JSON with public_inputs (merkle_root, chunk_index, chunk_size) and witness (sibling_hashes); added export_all_merkle_proofs() for batch proof generation; integrated proof generation into extract_text_from_xml() preprocessing workflow; updated verify_merkle_proof_from_file() to parse and validate new format with root consistency checking.
  • Test Coverage Updates (tests/test_util.py): Updated export_merkle_proof() calls to pass new merkle_root parameter; added test scenario to validate rejection of mismatched merkle roots during file-based verification.

Sequence Diagram

sequenceDiagram
    participant Input as Data Input
    participant Extract as Text Extractor
    participant Tree as Merkle Tree Builder
    participant Export as Proof Exporter
    participant Verify as Proof Verifier

    Input->>Extract: XML content
    Extract->>Extract: Split into chunks
    Extract->>Tree: All chunks
    Tree->>Tree: Build tree level-by-level
    Tree->>Export: Proof data + merkle_root
    Export->>Export: Structure JSON<br/>(public_inputs + witness)
    Export->>Verify: Structured proof file
    Verify->>Verify: Validate public_inputs<br/>and witness fields
    Verify->>Verify: Check merkle_root
    Verify-->>Input: Verification result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Poem

🐰 A hop through cryptographic art,
Merkle roots and proofs set apart,
Witness and inputs, clearly defined,
ZKP circuits? Perfectly aligned!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly identifies the main feature—adding Merkle proof export with ZKP-compatible dataset verification—which aligns with the core changes of restructuring Merkle proofs and automating proof generation.
  • Linked Issues check: ✅ Passed. The PR successfully implements all primary objectives from issue #36: restructuring export_merkle_proof() with public_inputs/witness separation, updating verify_merkle_proof_from_file() for the new format, and automating chunk-level proof generation for ZKP circuits.
  • Out of Scope Changes check: ✅ Passed. All changes are directly aligned with issue #36 requirements: Merkle proof restructuring, ZKP-compatible JSON format, automated proof generation in preprocessing, and verification updates; no unrelated functionality was introduced.


@github-actions github-actions bot added size/M and removed size/M labels Mar 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_util.py (1)

247-279: 🧹 Nitpick | 🔵 Trivial

Suggest adding direct tests for export_all_merkle_proofs.

The test file covers export_merkle_proof well but lacks direct tests for the new export_all_merkle_proofs function. Consider adding tests for:

  • Multiple chunks exported correctly
  • Single chunk (edge case)
  • Empty file (returns 0)
  • Proof files are readable and verifiable
💡 Example test structure
def test_export_all_merkle_proofs(tmp_path):
    file = tmp_path / "data.txt"
    content = b"hello world this is merkle proof test"
    file.write_bytes(content)
    
    output_dir = tmp_path / "proofs"
    num_proofs = utils.export_all_merkle_proofs(file, output_dir, chunk_size=8)
    
    assert num_proofs == 5  # 37 bytes / 8 bytes per chunk = 5 chunks
    assert len(list(output_dir.glob("*.json"))) == 5
    
    # Verify at least one proof is valid
    root = utils.compute_merkle_root(file, chunk_size=8)
    with file.open("rb") as f:
        chunk = f.read(8)
    proof_file = output_dir / "data.txt_chunk_0_proof.json"
    assert utils.verify_merkle_proof_from_file(proof_file, chunk, root)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 247 - 279, Add direct tests for
export_all_merkle_proofs: create a tmp file with known content, call
utils.export_all_merkle_proofs(file, output_dir, chunk_size=8) and assert the
returned count equals expected chunks, assert the number of generated *.json
files in output_dir matches that count, and for at least one (and edge cases:
single-chunk and empty file returning 0) open the corresponding proof file and
verify it using utils.verify_merkle_proof_from_file with
utils.compute_merkle_root and the appropriate chunk bytes; reference the
export_all_merkle_proofs, compute_merkle_root, and verify_merkle_proof_from_file
helpers when locating code to test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 411-412: Change the exception raised when the sibling_hashes type
check fails from ValueError to TypeError: locate the check that does "if not
isinstance(proof, list): raise ValueError('Malformed proof: sibling_hashes must
be a list')" and replace the raised exception with TypeError while keeping the
same message (e.g., raise TypeError("Malformed proof: sibling_hashes must be a
list")). Ensure this change is made in the function or scope where the variable
proof and the sibling_hashes validation occur.
- Around line 296-370: Add direct unit tests for export_all_merkle_proofs that
create temporary files and an output directory (use pytest tmp_path), then
assert behavior for: an empty file returns 0 and produces no proof files; a
single-chunk file produces one proof whose exported JSON exists and whose proof
reconstructs the merkle root; and a file with an odd number of chunks exercises
padding (e.g., 3 chunks) and produces correct proofs. For verification, read
each exported proof JSON (from export_merkle_proof outputs), compute or
reconstruct the merkle root by iteratively hashing leaves using compute_sha256
and the sibling/is_left flags, and assert the reconstructed root equals the
merkle_root field in the JSON; also assert the function returns the expected
num_leaves. Reference export_all_merkle_proofs, export_merkle_proof, and
compute_sha256 to locate code under test and use tmp_path for isolated file IO.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d88cea98-7a03-4345-a6b8-1f053f2b904d

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9fdd1 and 40f5748.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +296 to +370
def export_all_merkle_proofs(
    file_path: Union[str, Path],
    output_dir: Union[str, Path],
    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES
) -> int:
    """
    Efficiently generate and export Merkle proofs for all chunks of a file.
    Saves them as individual JSON files in the output directory.
    Returns the number of proofs generated.
    """
    path = Path(file_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    leaves = []
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            leaf_hex = compute_sha256(data=chunk)
            leaves.append(bytes.fromhex(leaf_hex))

    if not leaves:
        return 0

    num_leaves = len(leaves)

    # Build the Merkle tree level by level
    tree = []
    current_level = list(leaves)

    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])

        tree.append(list(current_level))

        next_level = []
        for i in range(0, len(current_level), 2):
            combined = current_level[i] + current_level[i + 1]
            parent_hex = compute_sha256(data=combined)
            next_level.append(bytes.fromhex(parent_hex))

        current_level = next_level

    tree.append(current_level)

    merkle_root = tree[-1][0].hex()
    prefix = path.name

    # Export a proof for each chunk
    for chunk_index in range(num_leaves):
        proof = []
        index = chunk_index

        for level in tree[:-1]:
            sibling_index = index ^ 1
            sibling = level[sibling_index]
            is_left = sibling_index < index
            proof.append((sibling.hex(), is_left))
            index //= 2

        output_path = output_dir / f"{prefix}_chunk_{chunk_index}_proof.json"

        # Reuse existing single-proof exporter
        export_merkle_proof(
            proof=proof,
            chunk_index=chunk_index,
            chunk_size=chunk_size,
            merkle_root=merkle_root,
            output_path=output_path
        )

    return num_leaves
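To see how the exported witness is consumed on the other side, verification folds each sibling hash back toward the root. Given the convention above (is_left = sibling_index < index, so the flag marks a sibling sitting to the left of the current node), a minimal self-contained sketch of that reconstruction follows; the project's actual verify_merkle_proof_from_file() additionally parses the proof JSON and checks root consistency:

```python
import hashlib

def reconstruct_root(chunk: bytes, sibling_hashes) -> str:
    """Fold a chunk hash up the tree using (sibling_hex, is_left) pairs."""
    node = hashlib.sha256(chunk).digest()
    for sibling_hex, is_left in sibling_hashes:
        sibling = bytes.fromhex(sibling_hex)
        # sibling goes on the left or right, matching the build order
        pair = sibling + node if is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node.hex()

def verify_chunk(chunk: bytes, sibling_hashes, merkle_root: str) -> bool:
    return reconstruct_root(chunk, sibling_hashes) == merkle_root
```

Note this sketch hashes the raw chunk directly with hashlib; the real pipeline routes hashing through compute_sha256, so byte-for-byte results depend on that helper matching plain SHA-256.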

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for existing tests covering export_all_merkle_proofs
rg -n "export_all_merkle_proofs" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 255


🏁 Script executed:

#!/bin/bash
# Find test files and check for tests of export_all_merkle_proofs
echo "=== Looking for test files ==="
find . -type f -name "*test*.py" | head -20

echo -e "\n=== Searching for export_all_merkle_proofs in test files ==="
rg -n "export_all_merkle_proofs" --type py tests/ 2>/dev/null || echo "No matches in tests/"

echo -e "\n=== Checking context around line 212 in utils.py ==="
sed -n '205,220p' openverifiablellm/utils.py

echo -e "\n=== Searching for extract_text_from_xml function ==="
rg -n "def extract_text_from_xml" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1098


🏁 Script executed:

#!/bin/bash
# Find which function contains line 212
echo "=== Context showing function definition around line 212 ==="
sed -n '150,225p' openverifiablellm/utils.py | head -80

echo -e "\n=== Looking for function definitions near line 212 ==="
rg -n "^def " openverifiablellm/utils.py | awk -F: '$1 < 212 {prev=$0} $1 > 212 && !shown {print "Previous function:", prev; print "Next function at:", $0; shown=1}'

echo -e "\n=== Check test_util.py for test coverage ==="
wc -l ./tests/test_util.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 2942


🏁 Script executed:

#!/bin/bash
# Check for tests of extract_text_from_xml
echo "=== Searching for tests of extract_text_from_xml ==="
rg -n "extract_text_from_xml" ./tests/test_util.py

echo -e "\n=== Looking for test functions in test_util.py ==="
rg -n "^def test_" ./tests/test_util.py

echo -e "\n=== Show actual test file content to understand coverage ==="
head -50 ./tests/test_util.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3058


🏁 Script executed:

#!/bin/bash
# Check the actual test_extract_text_from_xml tests
echo "=== Test implementation for extract_text_from_xml ==="
sed -n '101,155p' ./tests/test_util.py

echo -e "\n=== Checking if tests mock/patch export_all_merkle_proofs ==="
rg -n "export_all_merkle_proofs" ./tests/test_util.py

echo -e "\n=== Check for any monkeypatch or mock of export_all_merkle_proofs ==="
rg -n -B3 -A3 "monkeypatch" ./tests/test_util.py | head -60

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3089


Add direct unit tests for export_all_merkle_proofs to verify proof generation and edge cases.

This public function currently lacks explicit test coverage. While it's indirectly exercised through extract_text_from_xml tests, those tests only verify preprocessing output, not proof generation. Direct tests should cover:

  • Empty files
  • Single chunk files
  • Odd chunk counts (tree padding logic)
  • Proof verification against the computed root

The implementation itself is correct and efficient, but explicit tests would validate the merkle tree construction and proof path extraction.

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 311-311: Avoid specifying long messages outside the exception class

(TRY003)




Development

Successfully merging this pull request may close these issues.

[FEATURE]: ZKP Integration for Merkle Proofs

1 participant