Skip to content

feat: wave-based batch uploads with configurable concurrency#6

Open
mickvandijke wants to merge 4 commits intomainfrom
feat-wave-batch-uploads
Open

feat: wave-based batch uploads with configurable concurrency#6
mickvandijke wants to merge 4 commits intomainfrom
feat-wave-batch-uploads

Conversation

@mickvandijke
Copy link
Contributor

@mickvandijke mickvandijke commented Mar 25, 2026

Summary

  • Wave-based batch uploads: chunks are grouped into waves of 64, each paid in a single EVM transaction. Storage from wave N is pipelined with quote collection for wave N+1
  • Concurrency throttling: quote collection, chunk storage, and merkle upload paths are all bounded by chunk_concurrency (defaults to half available CPU threads). Prevents overwhelming peer connections on small networks
  • Majority-only replication: chunks are sent to CLOSE_GROUP_MAJORITY (3) peers instead of all CLOSE_GROUP_SIZE (5), reducing per-chunk network load
  • Configurable via CLI: --chunk-concurrency <N> flag to override the auto-detected default

Test plan

  • cargo clippy --all-targets --all-features -- -D warnings passes
  • cargo fmt --all -- --check passes
  • All unit tests pass (97/97)
  • E2E chunk tests pass (9/9)
  • E2E data tests pass (6/6)
  • E2E file tests pass (4/4)
  • E2E merkle tests pass (2/2)
  • Manual 53MB file upload to local devnet succeeds

🤖 Generated with Claude Code

mickvandijke and others added 4 commits March 25, 2026 21:27
Replace per-chunk sequential EVM payments with batched wave-based
uploads. Chunks are grouped into waves (up to 64), quotes collected
concurrently, then paid in a single wallet.pay_for_quotes() call.
Stores from wave N pipeline with quote collection for wave N+1.

For a 3-chunk (4MB) upload this reduces payment from 3 separate
EVM transactions (~1s with nonce contention) to 1 transaction (~10ms).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uploads of large files to small networks (e.g. local devnets) were
failing because the client blasted all chunks to all peers simultaneously,
overwhelming QUIC connections and causing cascading disconnects.

Three changes:
- Only replicate chunks to CLOSE_GROUP_MAJORITY peers instead of all
  CLOSE_GROUP_SIZE, reducing per-chunk network load
- Throttle both quote collection and chunk storage with
  buffer_unordered(chunk_concurrency) instead of unbounded parallelism
- Make chunk_concurrency configurable via ClientConfig and --chunk-concurrency
  CLI flag, defaulting to half the available CPU threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke changed the title feat: wave-based batch uploads with single EVM payment per wave feat: wave-based batch uploads with configurable concurrency Mar 25, 2026
Copy link
Contributor

@jacderida jacderida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of wave-based batch uploads

Nice work overall — the wave-based pipelining design is clean and the code reads well. A few concerns before merging:

1. chunk_put_to_close_group now has zero fault tolerance (chunk.rs:73-74)

This is the main concern. Previously, the function sent PUTs to all peers (typically CLOSE_GROUP_SIZE = 5) and returned success as soon as CLOSE_GROUP_MAJORITY (3) responded. This gave tolerance for up to 2 peer failures.

The PR truncates the target list to exactly CLOSE_GROUP_MAJORITY peers:

let target_peers = &peers[..peers.len().min(CLOSE_GROUP_MAJORITY)];

Now if any single peer is slow or unavailable, the entire chunk store fails with InsufficientPeers. On a small testnet or under network churn, this could make uploads significantly more fragile.

This also affects the existing chunk_put code path (not just the new batch path), since chunk_put_to_close_group is shared.

Suggestion: Keep sending to all peers but return early once CLOSE_GROUP_MAJORITY confirm — i.e., revert this change. The network cost of 2 extra concurrent PUTs is marginal compared to the resilience gain.

2. Duplicate chunks are assumed stored without verification (batch.rs:222-226)

In batch_upload_chunks, when a content-address collision is found locally:

if seen_addresses.insert(address) {
    unique_chunks.push(chunk);
} else {
    debug!("Skipping duplicate chunk {}", hex::encode(address));
    all_addresses.push(address);
}

The duplicate's address is added to the result immediately, but the first copy of that chunk hasn't been uploaded yet — it might fail later. The address gets returned regardless. This is probably fine in practice since the caller just counts addresses.len(), but it could be misleading if the return value is ever used to verify what was actually stored.

3. Wave error semantics: one chunk failure aborts the whole upload

If any single chunk in a wave fails during store_paid_chunks, the ? in the collect propagates the error and discards all successfully stored chunks from that wave. The previous per-chunk buffer_unordered approach had the same fail-fast behavior, so this isn't a regression, but worth noting — partial progress from the current wave is lost.

For a future iteration, it might be worth returning a partial result (stored addresses + errors) so callers can retry only failed chunks.

4. Minor: payment_mode_used is always PaymentMode::Single for batch path

The batch path reports PaymentMode::Single even though it's doing batched EVM transactions. This is technically accurate (not using merkle proofs) but could be confusing when debugging. Consider a PaymentMode::Batch variant, or at least a comment explaining the naming.

5. Minor: doc comment on chunk_put_to_close_group is now stale

The existing doc comment (line 55-58 on main) says "Sends the PUT concurrently to all peers". The PR updates it to say "first CLOSE_GROUP_MAJORITY peers", but per point 1 above, if you revert the truncation change, the original doc should be restored.


Summary: The batch upload design and pipelining are solid. The main blocker is point 1 — reducing fault tolerance in chunk_put_to_close_group affects all upload paths and could cause reliability issues. I'd suggest reverting that change and keeping the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants