
[Performance] : Optimize Merkle Tree Hashing by Using Byte-Level SHA256#33

Open
Shubhamx404 wants to merge 10 commits into AOSSIE-Org:main from Shubhamx404:chained-commitments-fix

Conversation


@Shubhamx404 Shubhamx404 commented Mar 5, 2026

Addressed Issues:

Fixes #26

This PR improves the performance and correctness of the Merkle tree implementation by removing repeated hex string conversions during hashing.

Summary

Previously, the functions `compute_merkle_root`, `generate_merkle_proof`, and `verify_merkle_proof` relied directly on `compute_sha256()`, which returns a hexadecimal string (`.hexdigest()`). However, `hashlib.sha256()` internally operates on raw bytes. Because of this mismatch, the implementation repeatedly converted between hex strings and raw bytes while constructing the Merkle tree.
For large files, this resulted in significant overhead.

Before

  • Each Merkle tree operation converted SHA256 outputs from hex strings to raw bytes and back, causing thousands of bytes.fromhex() operations for large datasets.
  • The frequent creation and conversion of hexadecimal strings introduced unnecessary memory allocations and data copying, increasing runtime overhead.

Solution

The Merkle tree implementation was optimized to operate on raw bytes instead of hexadecimal strings. A new helper function compute_sha256_bytes() returns the SHA256 digest using .digest(), allowing all intermediate hashes to remain in byte format. This removes repeated bytes.fromhex() conversions and reduces overhead, with hex conversion performed only once when returning the final Merkle root.
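Based on this description, the helper presumably looks something like the following sketch. The `(data, file_path)` signature matches the one listed later in the walkthrough, but the chunked file reading and the default chunk size here are assumptions:

```python
import hashlib


def compute_sha256_bytes(data=None, file_path=None, chunk_size=1024 * 1024):
    """Return the raw 32-byte SHA-256 digest of `data` or of a file's contents."""
    h = hashlib.sha256()
    if data is not None:
        h.update(data)
    elif file_path is not None:
        # Hash the file incrementally so large files never sit fully in memory
        with open(file_path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
    else:
        raise ValueError("Provide either data or file_path")
    return h.digest()  # raw bytes, not .hexdigest()
```

Callers that still need a hex string can do `compute_sha256_bytes(data=b"...").hex()` once at the boundary.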

After

  • To eliminate unnecessary conversions, a new helper function, `compute_sha256_bytes()`, was introduced. It returns the raw binary digest (`.digest()`) instead of a hexadecimal string.

Workflow


Raw File
   │
   ▼
Read File Chunk (bytes)
   │
   ▼
SHA256 (.digest()) → bytes
   │
   ▼
Combine Hashes (bytes + bytes)
   │
   ▼
SHA256 (.digest())
   │
   ▼
Repeat until root
   │
   ▼
Convert to HEX ONLY ONCE
(return leaves[0].hex())
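The workflow above can be sketched in Python. Note the odd-node duplication step is an assumption, since the PR text does not specify how an odd number of nodes at a level is paired:

```python
import hashlib


def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def compute_merkle_root(chunks):
    """Build a Merkle root over byte chunks, keeping every intermediate hash as raw bytes."""
    leaves = [sha256_bytes(c) for c in chunks]
    if not leaves:
        return None
    while len(leaves) > 1:
        if len(leaves) % 2 == 1:
            leaves.append(leaves[-1])  # assumed: duplicate last node on odd levels
        # Combine sibling digests with plain bytes concatenation -- no hex round-trip
        leaves = [sha256_bytes(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()  # hex conversion happens exactly once, at the root
```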

Implementation example

leaf_bytes = compute_sha256_bytes(data=chunk)
leaves.append(leaf_bytes)

combined = left + right
parent_bytes = compute_sha256_bytes(data=combined)
image
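For completeness, proof verification under the same byte-level convention might look like this sketch. The `(sibling, sibling_is_left)` proof format is a hypothetical choice for illustration, not necessarily the project's actual serialization:

```python
import hashlib


def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify_merkle_proof(chunk_bytes, proof, merkle_root):
    """Recompute the root from a leaf and its sibling path; hex is used only for the final compare.

    `proof` is assumed to be a list of (sibling_digest, sibling_is_left) pairs.
    """
    node = sha256_bytes(chunk_bytes)
    for sibling, sibling_is_left in proof:
        # Order matters: the sibling goes on the side it occupied in the tree
        node = sha256_bytes(sibling + node if sibling_is_left else node + sibling)
    return node.hex() == merkle_root
```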

Benchmarks, Benchmark Workflow, and Timings

PC: 8 GB DDR4 RAM, 4 GB AMD GPU, Ryzen 5600H

Before changes

This was run in a local IDE
Screenshot 2026-03-05 214240

This was run in Windows PowerShell

Screenshot 2026-03-05 215158

After Changes

This was run in a local IDE

Screenshot 2026-03-05 220615

This was run in Windows PowerShell

Screenshot 2026-03-05 221018

I have also incorporated benchmarks for Merkle roots in benchmark_custom.py

Dataset File
   │
   ▼
benchmark_custom.py
   │
   ├── compute_merkle_root()
   │       │
   │       ▼
   │   optimized byte hashing
   │
   └── generate_merkle_proof()
           │
           ▼
     Merkle proof generation

Run locally using: python benchmark_custom.py "<file path>"

Screenshot 2026-03-05 232937

Additional Notes:

Final Architecture

               Dataset (e.g., simplewiki xml.bz2)
                              │
                              ▼
                    Chunking (1MB blocks)
                              │
                              ▼
                      Byte-Level SHA256
                              │
                              ▼
                  Merkle Tree Construction
                              │
                              ▼
                         Merkle Root
                              │
                              ▼
                      Dataset Manifest

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • CLI-accessible benchmark mode to measure Merkle computation speed and peak memory.
    • Added a SHA-256 helper that accepts raw data or file input.
  • Improvements

    • Merkle computation, proof generation, and verification now use byte-level hashing for consistency and efficiency.
    • Simplified Merkle proof loading and verification workflow.
  • Tests

    • Added tests covering the new hashing helper and Merkle root/proof workflows.


coderabbitai bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Added a bytes-returning SHA-256 helper and refactored Merkle routines to operate on raw bytes; introduced a benchmarking routine (tracemalloc + perf_counter) and CLI flag to run benchmarks; tests updated to use new helpers and proof-loading/verification flow.

Changes

Cohort / File(s) Summary
Core Utilities
openverifiablellm/utils.py
Added compute_sha256_bytes(data=None, file_path=None) -> bytes. Refactored compute_merkle_root, generate_merkle_proof, and verify_merkle_proof to use raw digest bytes internally and emit hex only for external outputs.
Benchmark & CLI
openverifiablellm/utils.py
Added run_benchmark(file_path, chunk_size=...) using tracemalloc and time.perf_counter; extended __main__ with --BENCHMARK_MODE and --chunk_size; imported os, time, tracemalloc, argparse.
Tests
tests/test_util.py
Added tests for compute_sha256_bytes (data and file path). Updated Merkle tests to use load_merkle_proof() and call verify_merkle_proof(chunk_bytes=..., proof=..., merkle_root=...).

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI (--BENCHMARK_MODE)
  participant Utils as utils.run_benchmark / Merkle routines
  participant FS as FileSystem
  participant Hash as SHA256
  participant Prof as Profiler

  CLI->>Utils: run_benchmark(file_path, chunk_size)
  Utils->>Prof: tracemalloc.start & perf_counter start
  Utils->>FS: read file in chunks
  loop per chunk
    Utils->>Hash: compute_sha256_bytes(chunk)
    Hash-->>Utils: raw digest bytes
    Utils->>Utils: assemble leaves & compute parents (raw bytes)
  end
  Utils->>Prof: stop & get peak/time
  Utils-->>CLI: output root, proof timings, peak memory

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibbled hex and cast it free,
Raw bytes now sing beneath the tree,
Leaves stack up fast, proofs hop in line,
Benchmarks hum and memory’s fine,
A happy rabbit pats the root with glee.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: optimizing Merkle tree hashing by using byte-level SHA256 instead of hex conversions, which directly addresses the performance objective.
Linked Issues check ✅ Passed The PR successfully implements all coding requirements from issue #26: introduces compute_sha256_bytes() returning raw bytes, refactors compute_merkle_root and generate_merkle_proof to use byte-level hashing, eliminates hex-to-bytes conversions, and adds benchmarking capabilities to measure performance improvements.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objective: compute_sha256_bytes helper, Merkle routine refactoring, CLI benchmarking integration, and test updates directly support eliminating hex conversions and measuring performance gains.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark_custom.py`:
- Around line 42-44: The current except block that catches "except Exception as
e" and only does print(f"An error occurred: {e}") swallows failures; change it
to either catch specific expected exceptions or, after logging the error (use
logging.exception or similar for stacktrace), re-raise the exception or exit
with a non-zero status so automation sees failure; locate the "except Exception
as e" block and replace the print-only behavior with structured logging plus
either "raise" or a sys.exit(1) to ensure failures are propagated.
- Line 33: The local variable "proof" is assigned from
generate_merkle_proof(file_path, chunk_index=chunk_index, chunk_size=chunk_size)
but never used, causing Ruff F841; remove the unused assignment by either
calling generate_merkle_proof(...) without capturing its return or assign it to
a throwaway "_" (e.g., _ = generate_merkle_proof(...)) in the same spot where
"proof = generate_merkle_proof(...)" appears to silence the lint error while
keeping the call semantics.
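The first comment can be addressed along these lines (a sketch; `compute_root_for_file` is a minimal stand-in for the real work in benchmark_custom.py):

```python
import hashlib
import logging
import sys

logger = logging.getLogger(__name__)


def compute_root_for_file(file_path):
    # Stand-in for the real Merkle computation in benchmark_custom.py
    with open(file_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def main(file_path):
    try:
        print(f"Merkle root: {compute_root_for_file(file_path)}")
    except (OSError, ValueError):
        # Log the full stack trace, then exit non-zero so automation sees the failure
        # instead of a swallowed print-and-continue
        logger.exception("Benchmark failed for %s", file_path)
        sys.exit(1)
```

Catching specific exception types and exiting with status 1 means CI treats a failed benchmark run as a failure rather than a silent success.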

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c44dedfc-9e6c-4647-aacd-c33adf46f5e9

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9fdd1 and a949331.

📒 Files selected for processing (3)
  • benchmark_custom.py
  • openverifiablellm/utils.py
  • tests/test_util.py

@Shubhamx404 Shubhamx404 changed the title [Performance] : compute_sha256 usage and implement chained cryptographic commitments [Performance] : Optimize Merkle Tree Hashing by Using Byte-Level SHA256 Mar 5, 2026
@Shubhamx404
Contributor Author

Hey @Archit381, can you review this PR? It's linked to issue #26.

@Archit381 Archit381 self-requested a review March 7, 2026 06:38
@Archit381
Member

@Shubhamx404 Please make the following changes:

  • Instead of having additional python script for running benchmark, update utils so that when run it accepts a flag:

--BENCHMARK_MODE=TRUE

  • In your benchmark, reduce the logging to only keep minutes, seconds and ms
  • Also expand it further to include resource consumption metrics
  • Fix coderabbit requested changes

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 468-472: The current fragile parsing that sets input_dump =
sys.argv[1] can misassign the file when flags like "--BENCHMARK_MODE=TRUE" come
first; replace this with proper argparse usage to define a positional input_dump
and an optional --BENCHMARK_MODE flag (so run_benchmark(input_dump) or
extract_text_from_xml(input_dump) are called reliably), or at minimum validate
that input_dump is an existing file path before using it and scan sys.argv for
the benchmark flag instead of assuming position; update the main entry logic
around input_dump, run_benchmark, and extract_text_from_xml accordingly.
- Line 427: The unpacked variable current_mem from the
tracemalloc.get_traced_memory() call is unused; rename it to _current_mem (or
_current_mem) to indicate intentional unused status. Update the assignment where
tracemalloc.get_traced_memory() is called (the current_mem, peak_mem =
tracemalloc.get_traced_memory() expression) to use the prefixed name so linters
and reviewers recognize it as intentionally unused while leaving peak_mem
unchanged.
- Around line 440-441: The chunk_index selection can exceed available chunks and
cause IndexError; instead of using size_mb > 10 to pick chunk_index=10 in the
call to generate_merkle_proof(file_path, chunk_index=chunk_index,
chunk_size=chunk_size), compute the actual chunk count from file size and
chunk_size (e.g., chunk_count = ceil(file_size_bytes / chunk_size)) and set
chunk_index = min(10, chunk_count - 1) (or 0 if chunk_count == 0) before calling
generate_merkle_proof so you always pass a valid chunk index.
- Around line 76-78: Add a unit test that asserts compute_merkle_root produces
the expected hardcoded root for a multi-chunk input: create a deterministic list
of chunks (byte strings), compute the merkle root via
compute_merkle_root(chunks) and compare its hex (or bytes) to a precomputed
expected value (from an independent SHA-256 concatenation calculation) to catch
byte-order/concatenation issues; reference the compute_merkle_root function and
use the same byte-formatting convention as compute_sha256_bytes for comparison,
and include at least a 3+ chunk test vector so the tree has multiple levels.
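The last suggestion, a hardcoded multi-chunk root test, could be sketched like this. The odd-leaf duplication rule is an assumption about the implementation, and `compute_merkle_root` is restated inline so the test vector is self-contained:

```python
import hashlib


def sha256_bytes(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()


def compute_merkle_root(chunks):
    # Assumed byte-level implementation with odd-node duplication
    leaves = [sha256_bytes(c) for c in chunks]
    while len(leaves) > 1:
        if len(leaves) % 2:
            leaves.append(leaves[-1])
        leaves = [sha256_bytes(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()


def test_merkle_root_three_chunks():
    # 3 chunks -> 2 tree levels, exercising both pairing and odd-leaf handling
    chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
    h = [sha256_bytes(c) for c in chunks]
    left = sha256_bytes(h[0] + h[1])
    right = sha256_bytes(h[2] + h[2])  # odd leaf duplicated
    expected = sha256_bytes(left + right).hex()
    assert compute_merkle_root(chunks) == expected
```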

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0bea160f-0730-46ea-8594-8683af360b8b

📥 Commits

Reviewing files that changed from the base of the PR and between a949331 and c5186ae.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@Shubhamx404
Contributor Author

Hey @Archit381, I implemented what you requested.

image

@Archit381
Member

Add before-vs-after resource consumption benchmarks in this PR.

Also fix the CodeRabbit issues.

@Shubhamx404
Contributor Author

@Archit381 I've shown memory usage before and after:

before

image

After

img2

@Archit381
Member

@Shubhamx404 The results don't really show any significant upside to your change

@Shubhamx404
Contributor Author

Shubhamx404 commented Mar 7, 2026

@Archit381
chunk_size is already set to 1 MB (MERKLE_CHUNK_SIZE_BYTES = 1024 * 1024), so the 355 MB file is broken down into ~355 chunks.

Also, generating the manifest after this takes less time, and memory usage is identical before and after. I implemented the `compute_merkle_root` and `compute_sha256_bytes()` helper functions; the latter relies entirely on `hashlib.sha256().digest()` instead of `.hexdigest()`.

Reducing chunk_size to very small values (e.g., 4 KB or 8 bytes) creates millions of Merkle tree nodes. The old implementation slows down or crashes due to repeated `bytes.fromhex()` conversions, while the new byte-level implementation (chained-commitments-fix) handles it efficiently and scales without stalling.

I checked the code and resolved the issue; new benchmark scores (before and after) are below.

@Archit381

@Shubhamx404
Contributor Author

Shubhamx404 commented Mar 8, 2026

Hey @Archit381, I resolved the CodeRabbit issues and re-ran the benchmark for confirmation.

before

image

After

image

Here are the observations:

  • Memory usage reduced from 10.05 MB to 2.13 MB (around a 78% reduction)
  • Execution time improved from 4.5 s to around 0.46 s (around 10x faster)
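The speedup comes from dropping the per-node hex round-trip. A minimal illustration of the two inner-loop styles (not the project's actual code):

```python
import hashlib


def parent_hex(left_hex: str, right_hex: str) -> str:
    # Old style: hex -> bytes -> hash -> hex on every internal node
    combined = bytes.fromhex(left_hex) + bytes.fromhex(right_hex)
    return hashlib.sha256(combined).hexdigest()


def parent_bytes(left: bytes, right: bytes) -> bytes:
    # New style: stay in raw bytes the whole way down the tree
    return hashlib.sha256(left + right).digest()


# Both conventions produce the same parent digest; only the conversions differ
a = hashlib.sha256(b"a").digest()
b = hashlib.sha256(b"b").digest()
assert parent_hex(a.hex(), b.hex()) == parent_bytes(a, b).hex()
```

For a tree with N leaves there are roughly N internal nodes, so the old style pays two `bytes.fromhex()` calls plus string allocations per node, which is the overhead the benchmarks above measure.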

CodeRabbit follow-up

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 457-462: The unconditional proof logging at the end of the block
(logging generate_merkle_proof, peak_mem_proof, proof_time, chunk_index) should
be removed and instead only logged inside the branch that actually generates a
proof (the branch that sets end_time, chunk_index and traces memory); update the
generate_merkle_proof related logging so it is emitted once per proof generation
and avoid referencing end_time or chunk_index when the function returned early
for an empty file—i.e., move the logger.info calls for proof_time and
peak_mem_proof into the proof-generation branch and delete the
duplicate/unconditional logging at the end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b9408b3-89bb-4909-9da4-ad5a6ec85640

📥 Commits

Reviewing files that changed from the base of the PR and between 50c1706 and f6dfae4.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@Shubhamx404
Contributor Author

@Archit381 I reviewed everything again and re-ran the benchmark.

image

after

image

@Shubhamx404
Contributor Author

@Archit381 Can you please review this?

@Archit381
Member

@Shubhamx404 Pull from main, fix the conflicts, and check again. We are using uv for packaging now, so test whether there are any issues.

@Shubhamx404
Contributor Author

@Archit381 Okay, working on it.

@Shubhamx404
Contributor Author

image

@Archit381 All tests passed.

@Archit381
Member

Fix Lint test

@Shubhamx404
Contributor Author

Hey @Archit381, fixed the lint issue.

@Shubhamx404
Contributor Author

image

After changes



Development

Successfully merging this pull request may close these issues.

[Performance] : Remove redundant hex-to-bytes conversions in Merkle tree hashing
