
[Performance] : Optimize Merkle Tree Hashing by Using Byte-Level SHA256#33

Open
Shubhamx404 wants to merge 10 commits into AOSSIE-Org:main from Shubhamx404:chained-commitments-fix

Conversation


@Shubhamx404 Shubhamx404 commented Mar 5, 2026

Addressed Issues:

Fixes #26

This PR improves the performance and correctness of the Merkle tree implementation by removing repeated hex string conversions during hashing.

Summary

Previously, the functions `compute_merkle_root`, `generate_merkle_proof`, and `verify_merkle_proof` relied directly on `compute_sha256()`, which returns a hexadecimal string (`.hexdigest()`). However, `hashlib.sha256()` internally operates on raw bytes. Because of this mismatch, the implementation repeatedly converted between hex strings and raw bytes while constructing the Merkle tree.
For large files, this resulted in significant overhead.

Before

  • Each Merkle tree operation converted SHA256 outputs from hex strings to raw bytes and back, causing thousands of bytes.fromhex() operations for large datasets.
  • The frequent creation and conversion of hexadecimal strings introduced unnecessary memory allocations and data copying, increasing runtime overhead.

Solution

The Merkle tree implementation was optimized to operate on raw bytes instead of hexadecimal strings. A new helper function compute_sha256_bytes() returns the SHA256 digest using .digest(), allowing all intermediate hashes to remain in byte format. This removes repeated bytes.fromhex() conversions and reduces overhead, with hex conversion performed only once when returning the final Merkle root.
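Based on this description, the helper presumably looks something like the following sketch. The `(data, file_path)` signature matches the one listed later in the walkthrough, but the chunked file reading and the default chunk size here are assumptions:

```python
import hashlib


def compute_sha256_bytes(data=None, file_path=None, chunk_size=1024 * 1024):
    """Return the raw 32-byte SHA-256 digest of `data` or of a file's contents."""
    h = hashlib.sha256()
    if data is not None:
        h.update(data)
    elif file_path is not None:
        # Hash the file incrementally so large files never sit fully in memory
        with open(file_path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
    else:
        raise ValueError("Provide either data or file_path")
    return h.digest()  # raw bytes, not .hexdigest()
```

Callers that still need a hex string can do `compute_sha256_bytes(data=b"...").hex()` once at the boundary.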

After

  • To eliminate unnecessary conversions, a new helper function, `compute_sha256_bytes()`, was introduced. It returns the raw binary digest (`.digest()`) instead of a hexadecimal string.

Workflow


Raw File
   │
   ▼
Read File Chunk (bytes)
   │
   ▼
SHA256 (.digest()) → bytes
   │
   ▼
Combine Hashes (bytes + bytes)
   │
   ▼
SHA256 (.digest())
   │
   ▼
Repeat until root
   │
   ▼
Convert to HEX ONLY ONCE
(return leaves[0].hex())
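The workflow above can be sketched in Python. Note the odd-node duplication step is an assumption, since the PR text does not specify how an odd number of nodes at a level is paired:

```python
import hashlib


def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def compute_merkle_root(chunks):
    """Build a Merkle root over byte chunks, keeping every intermediate hash as raw bytes."""
    leaves = [sha256_bytes(c) for c in chunks]
    if not leaves:
        return None
    while len(leaves) > 1:
        if len(leaves) % 2 == 1:
            leaves.append(leaves[-1])  # assumed: duplicate last node on odd levels
        # Combine sibling digests with plain bytes concatenation -- no hex round-trip
        leaves = [sha256_bytes(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()  # hex conversion happens exactly once, at the root
```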

Implementation example

leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)

combined = left + right
parent_bytes = compute_sha256_bytes(data=combined)
image
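For completeness, proof verification under the same byte-level convention might look like this sketch. The `(sibling, sibling_is_left)` proof format is a hypothetical choice for illustration, not necessarily the project's actual serialization:

```python
import hashlib


def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify_merkle_proof(chunk_bytes, proof, merkle_root):
    """Recompute the root from a leaf and its sibling path; hex is used only for the final compare.

    `proof` is assumed to be a list of (sibling_digest, sibling_is_left) pairs.
    """
    node = sha256_bytes(chunk_bytes)
    for sibling, sibling_is_left in proof:
        # Order matters: the sibling goes on the side it occupied in the tree
        node = sha256_bytes(sibling + node if sibling_is_left else node + sibling)
    return node.hex() == merkle_root
```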

Benchmarks, Benchmark Workflow, and Timings

PC: 8 GB DDR4 RAM, 4 GB AMD GPU, Ryzen 5600H

Before changes

This was run in a local IDE
Screenshot 2026-03-05 214240

This was run in Windows PowerShell

Screenshot 2026-03-05 215158

After Changes

This was run in a local IDE

Screenshot 2026-03-05 220615

This was run in Windows PowerShell

Screenshot 2026-03-05 221018

I have also incorporated benchmarks for Merkle roots in benchmark_custom.py

Dataset File
   │
   ▼
benchmark_custom.py
   │
   ├── compute_merkle_root()
   │       │
   │       ▼
   │   optimized byte hashing
   │
   └── generate_merkle_proof()
           │
           ▼
     Merkle proof generation

Run locally using: python benchmark_custom.py "<file path>"

Screenshot 2026-03-05 232937

Additional Notes:

Final Architecture

               Dataset (e.g., simplewiki xml.bz2)
                              │
                              ▼
                    Chunking (1MB blocks)
                              │
                              ▼
                      Byte-Level SHA256
                              │
                              ▼
                  Merkle Tree Construction
                              │
                              ▼
                         Merkle Root
                              │
                              ▼
                      Dataset Manifest

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • CLI-accessible benchmark mode to measure Merkle computation speed and peak memory.
    • Added a SHA-256 helper that accepts raw data or file input.
  • Improvements

    • Merkle computation, proof generation, and verification now use byte-level hashing for consistency and efficiency.
    • Simplified Merkle proof loading and verification workflow.
  • Tests

    • Added tests covering the new hashing helper and Merkle root/proof workflows.


coderabbitai bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Added a bytes-returning SHA-256 helper and refactored Merkle routines to operate on raw bytes; introduced a benchmarking routine (tracemalloc + perf_counter) and CLI flag to run benchmarks; tests updated to use new helpers and proof-loading/verification flow.

Changes

Cohort / File(s) Summary
Core Utilities
openverifiablellm/utils.py
Added compute_sha256_bytes(data=None, file_path=None) -> bytes. Refactored compute_merkle_root, generate_merkle_proof, and verify_merkle_proof to use raw digest bytes internally and emit hex only for external outputs.
Benchmark & CLI
openverifiablellm/utils.py
Added run_benchmark(file_path, chunk_size=...) using tracemalloc and time.perf_counter; extended __main__ with --BENCHMARK_MODE and --chunk_size; imported os, time, tracemalloc, argparse.
Tests
tests/test_util.py
Added tests for compute_sha256_bytes (data and file path). Updated Merkle tests to use load_merkle_proof() and call verify_merkle_proof(chunk_bytes=..., proof=..., merkle_root=...).

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI (--BENCHMARK_MODE)
  participant Utils as utils.run_benchmark / Merkle routines
  participant FS as FileSystem
  participant Hash as SHA256
  participant Prof as Profiler

  CLI->>Utils: run_benchmark(file_path, chunk_size)
  Utils->>Prof: tracemalloc.start & perf_counter start
  Utils->>FS: read file in chunks
  loop per chunk
    Utils->>Hash: compute_sha256_bytes(chunk)
    Hash-->>Utils: raw digest bytes
    Utils->>Utils: assemble leaves & compute parents (raw bytes)
  end
  Utils->>Prof: stop & get peak/time
  Utils-->>CLI: output root, proof timings, peak memory

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibbled hex and cast it free,
Raw bytes now sing beneath the tree,
Leaves stack up fast, proofs hop in line,
Benchmarks hum and memory’s fine,
A happy rabbit pats the root with glee.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: optimizing Merkle tree hashing by using byte-level SHA256 instead of hex conversions, which directly addresses the performance objective.
Linked Issues check ✅ Passed The PR successfully implements all coding requirements from issue #26: introduces compute_sha256_bytes() returning raw bytes, refactors compute_merkle_root and generate_merkle_proof to use byte-level hashing, eliminates hex-to-bytes conversions, and adds benchmarking capabilities to measure performance improvements.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objective: compute_sha256_bytes helper, Merkle routine refactoring, CLI benchmarking integration, and test updates directly support eliminating hex conversions and measuring performance gains.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark_custom.py`:
- Around line 42-44: The current except block that catches "except Exception as
e" and only does print(f"An error occurred: {e}") swallows failures; change it
to either catch specific expected exceptions or, after logging the error (use
logging.exception or similar for stacktrace), re-raise the exception or exit
with a non-zero status so automation sees failure; locate the "except Exception
as e" block and replace the print-only behavior with structured logging plus
either "raise" or a sys.exit(1) to ensure failures are propagated.
- Line 33: The local variable "proof" is assigned from
generate_merkle_proof(file_path, chunk_index=chunk_index, chunk_size=chunk_size)
but never used, causing Ruff F841; remove the unused assignment by either
calling generate_merkle_proof(...) without capturing its return or assign it to
a throwaway "_" (e.g., _ = generate_merkle_proof(...)) in the same spot where
"proof = generate_merkle_proof(...)" appears to silence the lint error while
keeping the call semantics.
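The first comment can be addressed along these lines (a sketch; `compute_root_for_file` is a minimal stand-in for the real work in benchmark_custom.py):

```python
import hashlib
import logging
import sys

logger = logging.getLogger(__name__)


def compute_root_for_file(file_path):
    # Stand-in for the real Merkle computation in benchmark_custom.py
    with open(file_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def main(file_path):
    try:
        print(f"Merkle root: {compute_root_for_file(file_path)}")
    except (OSError, ValueError):
        # Log the full stack trace, then exit non-zero so automation sees the failure
        # instead of a swallowed print-and-continue
        logger.exception("Benchmark failed for %s", file_path)
        sys.exit(1)
```

Catching specific exception types and exiting with status 1 means CI treats a failed benchmark run as a failure rather than a silent success.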

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c44dedfc-9e6c-4647-aacd-c33adf46f5e9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9fdd1 and a949331.

📒 Files selected for processing (3)
  • benchmark_custom.py
  • openverifiablellm/utils.py
  • tests/test_util.py

@Shubhamx404 Shubhamx404 changed the title [Performance] : compute_sha256 usage and implement chained cryptographic commitments [Performance] : Optimize Merkle Tree Hashing by Using Byte-Level SHA256 Mar 5, 2026
@Shubhamx404
Contributor Author

Hey @Archit381, can you review this PR? It's linked to issue #26.

@Archit381 Archit381 self-requested a review March 7, 2026 06:38
@Archit381
Member

@Shubhamx404 Please make the following changes:

  • Instead of having additional python script for running benchmark, update utils so that when run it accepts a flag:

--BENCHMARK_MODE=TRUE

  • In your benchmark, reduce the logging to only keep minutes, seconds and ms
  • Also expand it further to include resource consumption metrics
  • Fix coderabbit requested changes

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 468-472: The current fragile parsing that sets input_dump =
sys.argv[1] can misassign the file when flags like "--BENCHMARK_MODE=TRUE" come
first; replace this with proper argparse usage to define a positional input_dump
and an optional --BENCHMARK_MODE flag (so run_benchmark(input_dump) or
extract_text_from_xml(input_dump) are called reliably), or at minimum validate
that input_dump is an existing file path before using it and scan sys.argv for
the benchmark flag instead of assuming position; update the main entry logic
around input_dump, run_benchmark, and extract_text_from_xml accordingly.
- Line 427: The unpacked variable current_mem from the
tracemalloc.get_traced_memory() call is unused; rename it to _current_mem (or
_current_mem) to indicate intentional unused status. Update the assignment where
tracemalloc.get_traced_memory() is called (the current_mem, peak_mem =
tracemalloc.get_traced_memory() expression) to use the prefixed name so linters
and reviewers recognize it as intentionally unused while leaving peak_mem
unchanged.
- Around line 440-441: The chunk_index selection can exceed available chunks and
cause IndexError; instead of using size_mb > 10 to pick chunk_index=10 in the
call to generate_merkle_proof(file_path, chunk_index=chunk_index,
chunk_size=chunk_size), compute the actual chunk count from file size and
chunk_size (e.g., chunk_count = ceil(file_size_bytes / chunk_size)) and set
chunk_index = min(10, chunk_count - 1) (or 0 if chunk_count == 0) before calling
generate_merkle_proof so you always pass a valid chunk index.
- Around line 76-78: Add a unit test that asserts compute_merkle_root produces
the expected hardcoded root for a multi-chunk input: create a deterministic list
of chunks (byte strings), compute the merkle root via
compute_merkle_root(chunks) and compare its hex (or bytes) to a precomputed
expected value (from an independent SHA-256 concatenation calculation) to catch
byte-order/concatenation issues; reference the compute_merkle_root function and
use the same byte-formatting convention as compute_sha256_bytes for comparison,
and include at least a 3+ chunk test vector so the tree has multiple levels.
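The last suggestion, a hardcoded multi-chunk root test, could be sketched like this. The odd-leaf duplication rule is an assumption about the implementation, and `compute_merkle_root` is restated inline so the test vector is self-contained:

```python
import hashlib


def sha256_bytes(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()


def compute_merkle_root(chunks):
    # Assumed byte-level implementation with odd-node duplication
    leaves = [sha256_bytes(c) for c in chunks]
    while len(leaves) > 1:
        if len(leaves) % 2:
            leaves.append(leaves[-1])
        leaves = [sha256_bytes(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()


def test_merkle_root_three_chunks():
    # 3 chunks -> 2 tree levels, exercising both pairing and odd-leaf handling
    chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
    h = [sha256_bytes(c) for c in chunks]
    left = sha256_bytes(h[0] + h[1])
    right = sha256_bytes(h[2] + h[2])  # odd leaf duplicated
    expected = sha256_bytes(left + right).hex()
    assert compute_merkle_root(chunks) == expected
```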

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0bea160f-0730-46ea-8594-8683af360b8b

📥 Commits

Reviewing files that changed from the base of the PR and between a949331 and c5186ae.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@Shubhamx404
Contributor Author

Hey @Archit381, I implemented what you requested.

image

@Archit381
Member

Add before-vs-after resource consumption benchmarks in this PR.

Also fix the CodeRabbit issues.

@Shubhamx404
Contributor Author

@Archit381 I've shown memory usage before and after:

before

image

After

img2

@Archit381
Member

@Shubhamx404 The results don't really show any significant upside to your change

@Shubhamx404
Contributor Author

Shubhamx404 commented Mar 7, 2026

@Archit381
chunk_size is already set to 1 MB (MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024), so the 355 MB file is broken down into ~355 chunks.

Also, generating the manifest after this takes less time, and memory usage is identical before and after. I implemented the `compute_merkle_root` and `compute_sha256_bytes()` helper functions; the latter relies entirely on `hashlib.sha256().digest()` instead of `.hexdigest()`.

Reducing chunk_size to very small values (e.g., 4 KB or 8 bytes) creates millions of Merkle tree nodes. The old implementation slows down or crashes due to repeated `bytes.fromhex()` conversions, while the new byte-level implementation (chained-commitments-fix) handles it efficiently and scales without stalling.

I checked the code and resolved the issue; new benchmark scores (before and after) are below.

@Archit381

@Shubhamx404
Contributor Author

Shubhamx404 commented Mar 8, 2026

Hey @Archit381, I resolved the CodeRabbit issues and re-ran the benchmark for confirmation.

before

image

After

image

Here are the observations:

  • Memory usage reduced from 10.05 MB to 2.13 MB (around a 78% reduction)
  • Execution time improved from 4.5 s to around 0.46 s (around 10x faster)
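The speedup comes from dropping the per-node hex round-trip. A minimal illustration of the two inner-loop styles (not the project's actual code):

```python
import hashlib


def parent_hex(left_hex: str, right_hex: str) -> str:
    # Old style: hex -> bytes -> hash -> hex on every internal node
    combined = bytes.fromhex(left_hex) + bytes.fromhex(right_hex)
    return hashlib.sha256(combined).hexdigest()


def parent_bytes(left: bytes, right: bytes) -> bytes:
    # New style: stay in raw bytes the whole way down the tree
    return hashlib.sha256(left + right).digest()


# Both conventions produce the same parent digest; only the conversions differ
a = hashlib.sha256(b"a").digest()
b = hashlib.sha256(b"b").digest()
assert parent_hex(a.hex(), b.hex()) == parent_bytes(a, b).hex()
```

For a tree with N leaves there are roughly N internal nodes, so the old style pays two `bytes.fromhex()` calls plus string allocations per node, which is the overhead the benchmarks above measure.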

CodeRabbit follow-up

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 457-462: The unconditional proof logging at the end of the block
(logging generate_merkle_proof, peak_mem_proof, proof_time, chunk_index) should
be removed and instead only logged inside the branch that actually generates a
proof (the branch that sets end_time, chunk_index and traces memory); update the
generate_merkle_proof related logging so it is emitted once per proof generation
and avoid referencing end_time or chunk_index when the function returned early
for an empty file—i.e., move the logger.info calls for proof_time and
peak_mem_proof into the proof-generation branch and delete the
duplicate/unconditional logging at the end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b9408b3-89bb-4909-9da4-ad5a6ec85640

📥 Commits

Reviewing files that changed from the base of the PR and between 50c1706 and f6dfae4.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@Shubhamx404
Contributor Author

@Archit381 I reviewed everything again and re-ran the benchmark.

image

after

image

@Shubhamx404
Contributor Author

@Archit381 Can you please review this?

@Archit381
Member

@Shubhamx404 Pull from main, fix the conflicts, and check again. We are using uv for packaging now, so test whether there are any issues.

@Shubhamx404
Contributor Author

@Archit381 Okay, working on it.

@Shubhamx404
Contributor Author

image

@Archit381 All tests passed.

@Archit381
Member

Fix Lint test

@Shubhamx404
Contributor Author

Hey @Archit381, fixed the lint issue.

@Shubhamx404
Contributor Author

image

After changes



Development

Successfully merging this pull request may close these issues.

[Performance] : Remove redundant hex-to-bytes conversions in Merkle tree hashing
