
Need to resolve MPI deduplicatable data generation #278

@russfellows

Description


Bug: Multi-process / MPI runs produce 100% deduplicatable data across workers

This is related to issue #272

Summary

When kv_cache_benchmark is run with multiple processes (MPI or otherwise) targeting shared storage, every worker generates byte-for-byte identical data for the same logical request IDs. On a dedup-capable storage system this is invisible, but it means the benchmark measures write throughput of deduplicated data — not unique data — rendering multi-host storage stress results meaningless.

Affected Files

  • kv_cache_benchmark/kv_cache/cache.py — KVCacheGenerator, _seed_from_key()
  • kv_cache_benchmark/kv_cache/benchmark.py — cache_key construction

Root Cause

Key generation is deterministic with no per-process identity

Cache keys are constructed in benchmark.py using only the request counter and a modulo of num_users:

```python
# benchmark.py line ~382
user_id = f"dataset_user_{req_id % self.num_users}"
cache_key = f"{user_id}_req_{req_id:06d}"
```

These strings are identical on every worker because each process independently starts its req_id counter at 0.
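A minimal sketch of this construction (the num_users and request count values are illustrative): two workers that each start req_id at 0 emit identical key strings.

```python
def make_keys(num_users: int, n_requests: int) -> list[str]:
    """Reproduce the benchmark.py key construction for one worker."""
    keys = []
    for req_id in range(n_requests):
        user_id = f"dataset_user_{req_id % num_users}"
        keys.append(f"{user_id}_req_{req_id:06d}")
    return keys

# Two "workers", each with its own req_id counter, produce the same keys.
worker_a = make_keys(num_users=4, n_requests=3)
worker_b = make_keys(num_users=4, n_requests=3)
assert worker_a == worker_b  # identical on every worker
```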

Data generation depends only on the key string and a fixed global seed

In cache.py, KVCacheGenerator uses a fixed global_seed (default 0) and derives a per-entry seed via SHA-256 of the key string:

```python
# cache.py line ~39
rng = np.random.default_rng(self.global_seed)  # same on every worker
self.precomputed_buffer = rng.uniform(...)     # same 256 MB buffer on every worker

# cache.py line ~49
return (key_hash64 ^ self.global_seed) & 0xFFFF_FFFF_FFFF_FFFF  # same XOR stamp for same key
```

Because both the key strings and the global seed are identical across workers, every worker produces bitwise-identical 4 KB blocks for every cache entry.
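A self-contained sketch of the collision, assuming the SHA-256 derivation described above (the function names and the rng.bytes fill are illustrative stand-ins for the real buffer logic):

```python
import hashlib
import numpy as np

GLOBAL_SEED = 0  # default, identical on every worker

def seed_from_key(key: str, global_seed: int = GLOBAL_SEED) -> int:
    """Derive a per-entry seed from the key string, as described above."""
    key_hash64 = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return (key_hash64 ^ global_seed) & 0xFFFF_FFFF_FFFF_FFFF

def block_for_key(key: str) -> bytes:
    """Fill a 4 KB block from the per-entry seed (stand-in for the real fill)."""
    rng = np.random.default_rng(seed_from_key(key))
    return rng.bytes(4096)

# Any two workers running this produce bitwise-identical blocks per key:
assert block_for_key("dataset_user_0_req_000000") == block_for_key("dataset_user_0_req_000000")
```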

Impact

| Workers (N) | Dedup ratio | Unique data written |
|------------:|------------:|--------------------:|
| 1           | 0%          | 100%                |
| 2           | 50%         | 50%                 |
| 8           | 87.5%       | 12.5%               |
| 16          | 93.75%      | 6.25%               |
| 64          | 98.4%       | 1.6%                |

A storage system with inline deduplication (e.g. many all-flash arrays and object stores) will absorb N× the logical write I/O while storing only 1× the data, appearing N× faster than it actually is for unique workloads. This makes the benchmark unreliable as a measure of raw write capacity in any multi-host scenario.
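The dedup ratios in the table follow directly from N identical copies of the data: only 1/N of the logical bytes written is unique.

```python
def cross_worker_dedup(n_workers: int) -> float:
    """With identical data on every worker, (N-1)/N of logical writes deduplicate away."""
    return (n_workers - 1) / n_workers

for n in (1, 2, 8, 16, 64):
    print(f"{n:>2} workers -> {cross_worker_dedup(n):.2%} dedup")
```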

Steps to Reproduce

  1. Run the benchmark on 2+ hosts targeting the same shared storage mount or object store endpoint.
  2. Compare effective storage capacity consumed vs. logical bytes written — consumption will not scale with host count.
  3. Alternatively, inspect the raw data: any two workers' output files for the same time window will be byte-for-byte identical.
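Step 3 can be checked with a streamed file hash (the paths in the comment are illustrative):

```python
import hashlib

def digest(path: str, chunk: int = 1 << 20) -> str:
    """SHA-256 of a file, streamed in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

# Two workers' output files for the same time window should differ;
# with this bug they hash identically, e.g.:
#   digest("/mnt/shared/worker0.out") == digest("/mnt/shared/worker1.out")
```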

Expected Behavior

Each worker (MPI rank, process, or host) should produce unique data so that N workers write N× unique bytes to storage, properly stressing storage capacity and ingestion throughput.

Proposed Fix

Embed a per-worker identity into either the global_seed or the cache key string. Two equivalent options:

Option A — Unique seed per worker (minimal change)

Pass the MPI rank (or os.getpid() / hostname hash as fallback) as global_seed when constructing KVCacheGenerator:

```python
import os, socket, hashlib

def _worker_seed() -> int:
    """Return a seed unique to this process on this host."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD.Get_rank()
    except ImportError:
        # Fallback: hash of hostname + PID
        ident = f"{socket.gethostname()}:{os.getpid()}"
        return int(hashlib.sha256(ident.encode()).hexdigest()[:16], 16)

# When constructing KVCacheGenerator:
generator = KVCacheGenerator(model_config, global_seed=_worker_seed())
```

This changes the 256 MB precomputed buffer and the XOR stamp for every worker, making all 4 KB blocks unique across workers while keeping them reproducible within a single worker run.

Option B — Unique key prefix per worker (more explicit)

Prefix every cache key with the worker identity:

```python
worker_prefix = f"rank{mpi_rank}_host{hostname_hash}"
cache_key = f"{worker_prefix}_{user_id}_req_{req_id:06d}"
```

This keeps the same precomputed buffer but changes the XOR stamp per worker, which is sufficient to eliminate cross-worker dedup.

Recommendation

Option A (unique global_seed) is preferred because:

  • It also diversifies the 256 MB precomputed noise buffer, giving true statistical independence between workers.
  • It requires changes in only one place (benchmark initialization).
  • It is transparent to all downstream key-derivation and stamping logic.

The --seed CLI argument (if added) should document that it sets the per-worker base seed, and that MPI rank is XOR'd in automatically so users can still get reproducible multi-worker runs by fixing --seed.
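A sketch of that combination (the function name is hypothetical): XOR'ing the base seed from --seed with the rank gives each worker a unique seed while keeping multi-worker runs reproducible for a fixed --seed.

```python
def effective_seed(base_seed: int, rank: int) -> int:
    """Combine the CLI base seed with the worker rank: unique per rank, reproducible."""
    return (base_seed ^ rank) & 0xFFFF_FFFF_FFFF_FFFF

# Illustrative base seed; every rank gets a distinct seed,
# and rerunning with the same --seed reproduces the same set.
seeds = [effective_seed(12345, r) for r in range(4)]
assert len(set(seeds)) == 4
assert seeds == [effective_seed(12345, r) for r in range(4)]
```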

Notes

  • Single-process runs are not affected: the current design is correct there, and the anti-dedup properties (no intra-entry or cross-entry block collisions) verified at 64 GB scale continue to hold for single-process use.
  • The fix does not change the on-disk format, the benchmark output schema, or any config file fields.
  • This issue is distinct from the previously fixed 96.7% intra-entry dedup bug (commit 0aa9aee) — that was a single-process issue; this is a multi-process issue.
