
[Feature] : Add Merkle proof export and ZKP-compatible dataset verification and Template for ZKPoT and ZKPoI #44

Open
Shubhamx404 wants to merge 1 commit into AOSSIE-Org:main from Shubhamx404:zkp-merkle-proofs-initial

Conversation


@Shubhamx404 Shubhamx404 commented Mar 6, 2026

Addressed Issues:

Fixes #36

Summary

Problem

  • Before this update, our preprocessing pipeline for Wikipedia data dumps generated Merkle proofs for data chunks, but output them in a flat, unstructured JSON format. While mathematically correct, this format was fundamentally incompatible with modern Zero-Knowledge Proof (ZKP) circuit systems.

Previous Output Format

{
  "chunk_index": 1,
  "chunk_size": 1048576,
  "proof": [
    ["e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", true],
    ["...", false]
  ]
}

What I have done

  • Workflow (screenshot: 2026-03-07 015029)

Implemented the ZKP-structured proof format by refactoring export_merkle_proof() to explicitly separate public_inputs from witness data.

  • What the JSON looks like
    (chunk proof: data/processed/proofs/wiki_clean.txt_chunk_3_proof.json,
    generated on the ~350 MB simplewiki-20260201-pages-articles-multistream.xml.bz2 file)
{
  "public_inputs": {
    "merkle_root": "20e6c499f3ee18e4e086efe32d3fb3efd6a321435f67b43ee2027e4474d113df",
    "chunk_index": 3,
    "chunk_size": 1048576
  },
  "witness": {
    "sibling_hashes": [
      [
        "ca43886e4dce3f098911b4204af08cfdebde3262a0bb9d6e8493423276cf91a5",
        true
      ],
      [
        "25be4424d5133d01f0a87bc06240e56c7d7f63879a0495e714f9046b09ff18ea",
        true
      ],
      [
        "ad6ad2d54fca3c1a664cb0f6eb11cc051ec2f6519d5cebe80f89a14076316a54",
        false
      ],
      [
        "6c6333b5857d491064caebf24647752cba0f24c3075390df702ee175113c50f7",
        false
      ],
      [
        "58926098b11d071575744ceee5e65ba5ab93fc68337bc596c15276256cec3764",
        false
      ],
      [
        "064311dc6bbc32d2911ed76bd2cfee384a28f8bf1a78ccf3ca2a21ea5a7a27e3",
        false
      ],
      [
        "861a03e63fae3f3d85f1bbb13b6ab828187ee45f8a1219bcb4d9893bc0ea4737",
        false
      ],
      [
        "3e65ac9847d1769becc1c9bfd3fac04f354818d87f867320b59a2e014c9fbf35",
        false
      ],
      [
        "6f120a370d3610aee6436726fc8f385c7396227be9d0d91474df54c5e1a18fe7",
        false
      ],
      [
        "aa05437fcb1e242916a65042e362d86bc2d94776a0e4d4fb61a9ce41ec081192",
        false
      ]
    ]
  }
}
  • Around 531 ZKP proofs generated (chunk X from 0 to 531): data/processed/proofs/wiki_clean.txt_chunk_X_proof.json
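For reference, the structured exporter can be sketched roughly as follows. The keyword arguments mirror the export_merkle_proof() signature used elsewhere in this PR, but the body below is a simplified illustration, not the actual implementation in openverifiablellm/utils.py:

```python
import json
from pathlib import Path
from typing import List, Tuple, Union

def export_merkle_proof(
    proof: List[Tuple[str, bool]],
    chunk_index: int,
    chunk_size: int,
    merkle_root: str,
    output_path: Union[str, Path],
) -> None:
    """Write one chunk proof in the ZKP-structured format:
    public_inputs feed the circuit's public signals, witness stays private."""
    document = {
        "public_inputs": {
            "merkle_root": merkle_root,
            "chunk_index": chunk_index,
            "chunk_size": chunk_size,
        },
        "witness": {
            # each entry: [sibling_hash_hex, is_left]
            "sibling_hashes": [[h, is_left] for h, is_left in proof],
        },
    }
    Path(output_path).write_text(json.dumps(document, indent=2))
```

Splitting the file this way lets a ZKP circuit treat merkle_root, chunk_index, and chunk_size as public signals while keeping the sibling path private.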

Implemented Fully Automated Chunk Proof Generation

  • The export function is now tightly integrated into the extract_text_from_xml() pipeline.
  • As soon as a Wikipedia dump is processed into clean text, the system automatically computes the entire Merkle tree and exports an individual, ZK-ready JSON proof for every 1 MB chunk of the dataset into a dedicated data/processed/proofs/ directory.

Tree-Padding Bug Fix

During automation, I encountered and resolved an IndexError in the Merkle tree generation algorithm. I fixed the level-building logic to properly handle padded nodes (when the number of chunks at a level is odd), ensuring trees of any size build correctly.
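The fix boils down to duplicating the last node whenever a level has an odd number of entries, so index i always has a sibling at i ^ 1. A self-contained sketch of the corrected level-building loop (hashlib stands in here for the project's compute_sha256 helper):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree_levels(leaves):
    """Build Merkle tree levels bottom-up; odd-sized levels are padded
    by duplicating their last node so every node can be paired."""
    levels = []
    current = list(leaves)
    while len(current) > 1:
        if len(current) % 2 == 1:
            current.append(current[-1])  # the padding fix
        levels.append(current)
        current = [sha256(current[i] + current[i + 1])
                   for i in range(0, len(current), 2)]
    levels.append(current)  # final single-element level holds the root
    return levels
```

Without the padding step, an odd level makes current[i + 1] run past the end of the list, which is exactly the IndexError described above.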

Additional Notes:

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

Release Notes

  • New Features

    • Automatic generation of cryptographic proofs for processed text, with results exported upon extraction completion.
    • Enhanced proof export functionality supporting batch processing and structured output format.
  • Tests

    • Extended verification test coverage to validate proof consistency and accuracy.

coderabbitai bot commented Mar 6, 2026

Walkthrough

Refactored Merkle proof generation to output ZKP-compatible JSON with structured public_inputs and witness fields. Added automatic proof generation during text extraction via new export_all_merkle_proofs() function. Updated verification to parse new structured format and validate root consistency.

Changes

  • Merkle Proof ZKP Integration (openverifiablellm/utils.py): Redesigned export_merkle_proof() to output structured JSON with public_inputs (merkle_root, chunk_index, chunk_size) and witness (sibling_hashes); added export_all_merkle_proofs() for batch proof generation; integrated proof generation into extract_text_from_xml() preprocessing workflow; updated verify_merkle_proof_from_file() to parse and validate new format with root consistency checking.
  • Test Coverage Updates (tests/test_util.py): Updated export_merkle_proof() calls to pass new merkle_root parameter; added test scenario to validate rejection of mismatched merkle roots during file-based verification.

Sequence Diagram

sequenceDiagram
    participant Input as Data Input
    participant Extract as Text Extractor
    participant Tree as Merkle Tree Builder
    participant Export as Proof Exporter
    participant Verify as Proof Verifier

    Input->>Extract: XML content
    Extract->>Extract: Split into chunks
    Extract->>Tree: All chunks
    Tree->>Tree: Build tree level-by-level
    Tree->>Export: Proof data + merkle_root
    Export->>Export: Structure JSON<br/>(public_inputs + witness)
    Export->>Verify: Structured proof file
    Verify->>Verify: Validate public_inputs<br/>and witness fields
    Verify->>Verify: Check merkle_root
    Verify-->>Input: Verification result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Poem

🐰 A hop through cryptographic art,
Merkle roots and proofs set apart,
Witness and inputs, clearly defined,
ZKP circuits? Perfectly aligned!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly identifies the main feature—adding Merkle proof export with ZKP-compatible dataset verification—which aligns with the core changes of restructuring Merkle proofs and automating proof generation.
  • Linked Issues check: ✅ Passed. The PR successfully implements all primary objectives from issue #36: restructuring export_merkle_proof() with public_inputs/witness separation, updating verify_merkle_proof_from_file() for the new format, and automating chunk-level proof generation for ZKP circuits.
  • Out of Scope Changes check: ✅ Passed. All changes are directly aligned with issue #36 requirements: Merkle proof restructuring, ZKP-compatible JSON format, automated proof generation in preprocessing, and verification updates; no unrelated functionality was introduced.


@github-actions github-actions bot added size/M and removed size/M labels Mar 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_util.py (1)

247-279: 🧹 Nitpick | 🔵 Trivial

Suggest adding direct tests for export_all_merkle_proofs.

The test file covers export_merkle_proof well but lacks direct tests for the new export_all_merkle_proofs function. Consider adding tests for:

  • Multiple chunks exported correctly
  • Single chunk (edge case)
  • Empty file (returns 0)
  • Proof files are readable and verifiable
💡 Example test structure
def test_export_all_merkle_proofs(tmp_path):
    file = tmp_path / "data.txt"
    content = b"hello world this is merkle proof test"
    file.write_bytes(content)
    
    output_dir = tmp_path / "proofs"
    num_proofs = utils.export_all_merkle_proofs(file, output_dir, chunk_size=8)
    
    assert num_proofs == 5  # 37 bytes / 8 bytes per chunk = 5 chunks
    assert len(list(output_dir.glob("*.json"))) == 5
    
    # Verify at least one proof is valid
    root = utils.compute_merkle_root(file, chunk_size=8)
    with file.open("rb") as f:
        chunk = f.read(8)
    proof_file = output_dir / "data.txt_chunk_0_proof.json"
    assert utils.verify_merkle_proof_from_file(proof_file, chunk, root)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 247 - 279, Add direct tests for
export_all_merkle_proofs: create a tmp file with known content, call
utils.export_all_merkle_proofs(file, output_dir, chunk_size=8) and assert the
returned count equals expected chunks, assert the number of generated *.json
files in output_dir matches that count, and for at least one (and edge cases:
single-chunk and empty file returning 0) open the corresponding proof file and
verify it using utils.verify_merkle_proof_from_file with
utils.compute_merkle_root and the appropriate chunk bytes; reference the
export_all_merkle_proofs, compute_merkle_root, and verify_merkle_proof_from_file
helpers when locating code to test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 411-412: Change the exception raised when the sibling_hashes type
check fails from ValueError to TypeError: locate the check that does "if not
isinstance(proof, list): raise ValueError('Malformed proof: sibling_hashes must
be a list')" and replace the raised exception with TypeError while keeping the
same message (e.g., raise TypeError("Malformed proof: sibling_hashes must be a
list")). Ensure this change is made in the function or scope where the variable
proof and the sibling_hashes validation occur.
- Around line 296-370: Add direct unit tests for export_all_merkle_proofs that
create temporary files and an output directory (use pytest tmp_path), then
assert behavior for: an empty file returns 0 and produces no proof files; a
single-chunk file produces one proof whose exported JSON exists and whose proof
reconstructs the merkle root; and a file with an odd number of chunks exercises
padding (e.g., 3 chunks) and produces correct proofs. For verification, read
each exported proof JSON (from export_merkle_proof outputs), compute or
reconstruct the merkle root by iteratively hashing leaves using compute_sha256
and the sibling/is_left flags, and assert the reconstructed root equals the
merkle_root field in the JSON; also assert the function returns the expected
num_leaves. Reference export_all_merkle_proofs, export_merkle_proof, and
compute_sha256 to locate code under test and use tmp_path for isolated file IO.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d88cea98-7a03-4345-a6b8-1f053f2b904d

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9fdd1 and 40f5748.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +296 to +370
def export_all_merkle_proofs(
    file_path: Union[str, Path],
    output_dir: Union[str, Path],
    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES
) -> int:
    """
    Efficiently generate and export Merkle proofs for all chunks of a file.
    Saves them as individual JSON files in the output directory.
    Returns the number of proofs generated.
    """
    path = Path(file_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    leaves = []
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            leaf_hex = compute_sha256(data=chunk)
            leaves.append(bytes.fromhex(leaf_hex))

    if not leaves:
        return 0

    num_leaves = len(leaves)

    # Build the Merkle tree level by level
    tree = []
    current_level = list(leaves)

    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])

        tree.append(list(current_level))

        next_level = []
        for i in range(0, len(current_level), 2):
            combined = current_level[i] + current_level[i + 1]
            parent_hex = compute_sha256(data=combined)
            next_level.append(bytes.fromhex(parent_hex))

        current_level = next_level

    tree.append(current_level)

    merkle_root = tree[-1][0].hex()
    prefix = path.name

    # Export a proof for each chunk
    for chunk_index in range(num_leaves):
        proof = []
        index = chunk_index

        for level in tree[:-1]:
            sibling_index = index ^ 1
            sibling = level[sibling_index]
            is_left = sibling_index < index
            proof.append((sibling.hex(), is_left))
            index //= 2

        output_path = output_dir / f"{prefix}_chunk_{chunk_index}_proof.json"

        # Reuse existing single-proof exporter
        export_merkle_proof(
            proof=proof,
            chunk_index=chunk_index,
            chunk_size=chunk_size,
            merkle_root=merkle_root,
            output_path=output_path
        )

    return num_leaves
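To see how the exported witness is consumed on the other side, verification folds each sibling hash back toward the root. Given the convention above (is_left = sibling_index < index, so the flag marks a sibling sitting to the left of the current node), a minimal self-contained sketch of that reconstruction follows; the project's actual verify_merkle_proof_from_file() additionally parses the proof JSON and checks root consistency:

```python
import hashlib

def reconstruct_root(chunk: bytes, sibling_hashes) -> str:
    """Fold a chunk hash up the tree using (sibling_hex, is_left) pairs."""
    node = hashlib.sha256(chunk).digest()
    for sibling_hex, is_left in sibling_hashes:
        sibling = bytes.fromhex(sibling_hex)
        # sibling goes on the left or right, matching the build order
        pair = sibling + node if is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node.hex()

def verify_chunk(chunk: bytes, sibling_hashes, merkle_root: str) -> bool:
    return reconstruct_root(chunk, sibling_hashes) == merkle_root
```

Note this sketch hashes the raw chunk directly with hashlib; the real pipeline routes hashing through compute_sha256, so byte-for-byte results depend on that helper matching plain SHA-256.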

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for existing tests covering export_all_merkle_proofs
rg -n "export_all_merkle_proofs" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 255


🏁 Script executed:

#!/bin/bash
# Find test files and check for tests of export_all_merkle_proofs
echo "=== Looking for test files ==="
find . -type f -name "*test*.py" | head -20

echo -e "\n=== Searching for export_all_merkle_proofs in test files ==="
rg -n "export_all_merkle_proofs" --type py tests/ 2>/dev/null || echo "No matches in tests/"

echo -e "\n=== Checking context around line 212 in utils.py ==="
sed -n '205,220p' openverifiablellm/utils.py

echo -e "\n=== Searching for extract_text_from_xml function ==="
rg -n "def extract_text_from_xml" --type py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1098


🏁 Script executed:

#!/bin/bash
# Find which function contains line 212
echo "=== Context showing function definition around line 212 ==="
sed -n '150,225p' openverifiablellm/utils.py | head -80

echo -e "\n=== Looking for function definitions near line 212 ==="
rg -n "^def " openverifiablellm/utils.py | awk -F: '$1 < 212 {prev=$0} $1 > 212 && !shown {print "Previous function:", prev; print "Next function at:", $0; shown=1}'

echo -e "\n=== Check test_util.py for test coverage ==="
wc -l ./tests/test_util.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 2942


🏁 Script executed:

#!/bin/bash
# Check for tests of extract_text_from_xml
echo "=== Searching for tests of extract_text_from_xml ==="
rg -n "extract_text_from_xml" ./tests/test_util.py

echo -e "\n=== Looking for test functions in test_util.py ==="
rg -n "^def test_" ./tests/test_util.py

echo -e "\n=== Show actual test file content to understand coverage ==="
head -50 ./tests/test_util.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3058


🏁 Script executed:

#!/bin/bash
# Check the actual test_extract_text_from_xml tests
echo "=== Test implementation for extract_text_from_xml ==="
sed -n '101,155p' ./tests/test_util.py

echo -e "\n=== Checking if tests mock/patch export_all_merkle_proofs ==="
rg -n "export_all_merkle_proofs" ./tests/test_util.py

echo -e "\n=== Check for any monkeypatch or mock of export_all_merkle_proofs ==="
rg -n -B3 -A3 "monkeypatch" ./tests/test_util.py | head -60

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3089


Add direct unit tests for export_all_merkle_proofs to verify proof generation and edge cases.

This public function currently lacks explicit test coverage. While it's indirectly exercised through extract_text_from_xml tests, those tests only verify preprocessing output, not proof generation. Direct tests should cover:

  • Empty files
  • Single chunk files
  • Odd chunk counts (tree padding logic)
  • Proof verification against the computed root

The implementation itself is correct and efficient, but explicit tests would validate the merkle tree construction and proof path extraction.

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 311-311: Avoid specifying long messages outside the exception class

(TRY003)




Development

Successfully merging this pull request may close these issues.

[FEATURE]: ZKP Integration for Merkle Proofs

1 participant