A simulation study examining when partitioned caches outperform shared LRU — evaluated against four Twitter production traces.
Harish Yerraguntla · Portfolio · GitHub
When does partitioned caching outperform shared LRU because of locality concentration — and can a simple workload metric predict the crossover without running the experiment?
1. Partitioned consistently beats LRU on P95 across all workload types.
The advantage comes from distributing backend queue pressure across shards, not from hit rate: LRU and Partitioned track near-identical hit rates across every cluster tested.
P95 latency (ms, ± 95% CI over 5 trials):
| Concurrency | NO_CACHE | LRU | Partitioned | Client Affinity |
|---|---|---|---|---|
| 1K | 638ms | 191 ± 5ms | 187 ± 4ms | 224 ± 6ms |
| 10K | 4617ms | 281 ± 7ms | 240 ± 4ms | 266 ± 6ms |
| 25K | 11328ms | 431 ± 12ms | 330 ± 3ms | 356 ± 7ms |
| 50K | 22517ms | 684 ± 23ms | 480 ± 7ms | 511 ± 13ms |
2. The real crossover is Partitioned vs Client Affinity — not vs LRU.
Tested across 4 real Twitter twemcache clusters spanning α = 0.09–1.59:
| Cluster | α | LRU hit rate | Part. P95 | Affinity P95 | Winner |
|---|---|---|---|---|---|
| cluster10 | 0.092 | 0.04% | 388ms | 375ms | Affinity (barely) |
| cluster19 | 0.735 | 9.0% | 379ms | 375ms | Affinity (barely) |
| cluster7 | 1.067 | 71.0% | 272ms | 375ms | Partitioned +103ms |
| cluster24 | 1.585 | 90.9% | 157ms | 374ms | Partitioned +217ms |
Client affinity wins at low α because routing by client ID spreads queue load across nodes (a routing benefit, not a caching one). It collapses above α ≈ 0.8, once clients share the same popular keys and each client's private cache slots become insufficient to hold them.
3. A single metric predicts the crossover with R²=0.977.
log(avg_requests_per_unique_object) linearly predicts the Partitioned vs Affinity P95 gap across all 4 clusters. No simulation needed — compute this from any access log.
Decision rule:
- avg_req/obj < 1.1 → streaming regime; architecture choice doesn't affect cache efficiency
- avg_req/obj ≥ 1.1 → reuse regime; use Partitioned over Client Affinity
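The metric is a one-pass count over any access log. A minimal sketch (function names are illustrative, not from the repo's scripts):

```python
from collections import Counter

def avg_requests_per_object(keys):
    """One-pass reuse-density metric: total requests / unique objects."""
    counts = Counter(keys)
    return len(keys) / len(counts)

def recommend(keys, threshold=1.1):
    """Apply the decision rule: reuse regime favors Partitioned;
    in the streaming regime architecture choice doesn't matter."""
    return "partitioned" if avg_requests_per_object(keys) >= threshold else "any (streaming)"
```

A trace where every key appears once sits at exactly 1.0 and lands in the streaming regime; any meaningful reuse pushes the ratio above the 1.1 threshold.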
4. avg_req/obj outperforms chi-square stationarity as an architectural predictor.
Reproduced Leo Sciortino's chi-square stationarity metric (normalized X²/dof, multinomial test across temporal segments) across all 4 clusters and correlated both metrics against the Partitioned vs Affinity P95 gap:
| Metric | R² vs Part-Affinity gap | Note |
|---|---|---|
| log(avg_req/obj) | 0.956 | Simpler, one-pass, directly predictive |
| log(chi-square X²/dof) | 0.872 | Measures temporal stationarity |
| LRU hit rate | 0.930 | Requires running the experiment |
Chi-square failed to predict LRU performance improvement, consistent with Sciortino et al.'s own finding. It partially predicts the Partitioned-Affinity gap (R²=0.872), but avg_req/obj is stronger (R²=0.956). The two metrics agree at the extremes (cluster10, cluster24) and diverge on mid-range clusters — they measure different dimensions: reuse density versus temporal stationarity.
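One way to compute a normalized X²/dof stationarity score of the kind described above — compare each temporal segment's key-frequency vector against the whole-trace distribution under a multinomial model (a sketch of the idea; the exact segmentation and normalization in reproduce_chi_square.py may differ):

```python
from collections import Counter

def chi_square_per_dof(keys, n_segments=4):
    """Temporal-stationarity score: X² of per-segment key counts against
    whole-trace proportions, normalized by degrees of freedom.
    0 means perfectly stationary; larger means stronger temporal drift."""
    n = len(keys)
    global_p = {k: c / n for k, c in Counter(keys).items()}
    seg_len = n // n_segments
    x2, cells = 0.0, 0
    for s in range(n_segments):
        seg = keys[s * seg_len:(s + 1) * seg_len]
        seg_counts = Counter(seg)
        for k, p in global_p.items():
            expected = p * len(seg)
            x2 += (seg_counts.get(k, 0) - expected) ** 2 / expected
            cells += 1
    dof = cells - n_segments  # each segment's counts sum to its length
    return x2 / dof
```

A perfectly alternating trace scores 0; a trace whose key population shifts halfway through scores well above it, which is the dimension avg_req/obj does not see.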
5. Non-stochastic oracle trace analysis confirms streaming workload diagnosis.
Analysis of cluster10 using the oracleGeneral binary format (next_access_vtime field):
| Metric | Value | Interpretation |
|---|---|---|
| One-hit wonder fraction | 50% | Half of all requests go to objects never seen again |
| Median reuse distance | 500,009 | Spans the full trace window — no temporal locality |
| Objects with ≥3 accesses | 0.06% | Essentially no object is accessed frequently enough to warm a cache |
| Fitted Zipf α (oracle) | 0.002 (R²=0.034) | Flat distribution — no Zipfian structure |
This directly quantifies IRM assumption violation, extending the qualitative finding ("streaming workloads are uncacheable") to a precise measurement.
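The first and third rows of the table are plain counting metrics, computable from any decoded key stream without the oracle fields (a sketch; parsing the oracleGeneral binary itself, which analyze_nonstochastic.py does via next_access_vtime, is omitted here):

```python
from collections import Counter

def streaming_metrics(keys):
    """Return (one-hit-wonder fraction, fraction of objects with >= 3
    accesses), both measured over unique objects."""
    counts = Counter(keys)
    n_obj = len(counts)
    one_hit = sum(1 for c in counts.values() if c == 1) / n_obj
    reused3 = sum(1 for c in counts.values() if c >= 3) / n_obj
    return one_hit, reused3
```

On a streaming workload like cluster10, one_hit dominates and reused3 is negligible, which is exactly why no cache architecture can help.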
Workload model. All four configurations are evaluated against the same merged multi-client request stream. Each client has a private hot key subset (5% of keys, Zipfian) plus a shared global tail. Streams are interleaved to simulate concurrent arrival. 5 trials per condition, 95% CI reported. 220 total trials.
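The workload model above can be sketched as follows. The 80/20 hot/tail mix and the Zipf exponent are illustrative assumptions, not the parameters simulate.py actually uses:

```python
import random

def make_stream(n_clients=4, keyspace=1000, hot_frac=0.05,
                reqs_per_client=100, alpha=1.0, seed=0):
    """Per-client Zipfian hot subset (hot_frac of the key space) plus a
    shared uniform tail, interleaved round-robin to simulate concurrency.
    Returns a merged list of (client_id, key) requests."""
    rng = random.Random(seed)
    hot_n = int(keyspace * hot_frac)
    weights = [1 / (r + 1) ** alpha for r in range(hot_n)]  # Zipf over ranks
    streams = []
    for c in range(n_clients):
        hot = rng.sample(range(keyspace), hot_n)  # private hot subset
        reqs = []
        for _ in range(reqs_per_client):
            if rng.random() < 0.8:  # assumed hot/tail split
                reqs.append((c, rng.choices(hot, weights)[0]))
            else:
                reqs.append((c, rng.randrange(keyspace)))  # shared tail
        streams.append(reqs)
    # round-robin interleave of all client streams
    return [r for group in zip(*streams) for r in group]
```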
Cache models.
| Config | Model | Routing |
|---|---|---|
| NO_CACHE | None | — |
| LRU_CACHE | Single shared LRU (10% of key space) | Key → global cache |
| PARTITIONED_CACHE | 8 shards, each LRU (total = 10%) | Key mod 8 → shard |
| CLIENT_AFFINITY | Per-client LRU (total = 10%) | Client ID → node |
Latency model. Backend latency: lognormal (mean ≈ 120ms) + exponential queue wait scaling with concurrency. Cache hit: lognormal (mean ≈ 2ms LRU, ≈ 8ms partitioned/affinity). Error threshold: >800ms.
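The latency model composes a lognormal service time with a concurrency-scaled exponential queue wait. A sketch with illustrative parameters (the σ and queue-scaling constant are assumptions, not the calibrated values):

```python
import math
import random

def mu_for_mean(mean_ms, sigma=0.5):
    """Lognormal mu that yields the target arithmetic mean for fixed sigma,
    since E[X] = exp(mu + sigma^2 / 2)."""
    return math.log(mean_ms) - sigma ** 2 / 2

def request_latency(hit, concurrency, rng):
    """Cache hit: fast lognormal draw. Miss: lognormal backend service time
    plus an exponential queue wait whose mean grows with concurrency.
    Parameters are illustrative, not the study's calibrated values."""
    if hit:
        return rng.lognormvariate(mu_for_mean(2.0), 0.5)
    backend = rng.lognormvariate(mu_for_mean(120.0), 0.5)
    queue = rng.expovariate(1.0 / (0.005 * concurrency))  # mean ∝ concurrency
    return backend + queue
```

The exponential queue term is what drives the NO_CACHE column's roughly linear growth with concurrency in finding 1.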
Trace validation. Traces streamed directly from CMU FTP — no full download required. First 500K GET/GETS operations loaded per cluster.
simulate.py — experiment engine (workload gen, cache models, MRC, CI)
run_crossover.py — multi-cluster crossover experiment (streams traces from URL)
predict_crossover.py — prediction model: log(avg_req/obj) → P95 gap, R²=0.977
analyze_nonstochastic.py — oracle trace analysis (next_access_vtime, one-hit wonders)
reproduce_chi_square.py — chi-square stationarity metric (reproduces Sciortino et al.)
index.html — interactive dashboard (7 tabs including Crossover Analysis)
results.json — synthetic experiment results (220 trials)
results_cluster24.json — cluster24 trace experiment results
crossover_results.json — 4-cluster crossover experiment results
prediction_model.json — fitted model output and decision rule
results_nonstochastic.json — oracle trace non-stochastic metrics
chi_square_results.json — chi-square stationarity values across 4 clusters
requirements.txt
- Latency model is synthetic. Lognormal parameters calibrated to typical Redis/DB values, not measured. Results are directionally valid, not quantitative benchmarks.
- Queue model is approximate. Exponential wait rather than M/M/c. Underestimates tail latency under sustained overload.
- Chi-square comparison uses partial traces. Our chi-square reproduction uses 800K requests vs Sciortino et al.'s full sample traces (up to 32M requests). Directional ordering is preserved but absolute values differ by ~19×.
- Prediction model has 4 data points. R²=0.977 is promising but needs validation on additional clusters to confirm generality.
- Validate prediction model on additional Twitter clusters across the α=0.1–2.0 range
- Understand why chi-square and avg_req/obj diverge on mid-range clusters (cluster19, cluster7)
- Extend to write-heavy and TTL workloads
pip install -r requirements.txt
# Synthetic experiments
python simulate.py
# Multi-cluster crossover (streams traces from CMU FTP — no download needed)
python run_crossover.py
# Prediction model
python predict_crossover.py
# Non-stochastic oracle analysis (requires oracleGeneral .zst trace)
python analyze_nonstochastic.py
# Dashboard
python3 -m http.server 8080
# visit http://localhost:8080
- Ding et al., "Continuous-Time Modeling of Zipfian Workload Locality," TOMPECS 2025
- Hu et al., "LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache," USENIX ATC 2015
- Yang et al., "A Large-scale Analysis of Hundreds of In-memory Key-value Cache Clusters at Twitter," OSDI 2020