A C++20 header-only library for hedged execution on multi-core CPUs. Run the same task on multiple physical cores simultaneously, take the first result, cooperatively cancel the rest.
Trades CPU resources for tail-latency reduction on high-priority compute workloads. Based on the technique described in Google's "The Tail at Scale" (Jeff Dean & Luiz André Barroso, CACM 2013).
```cpp
#include <janus/janus.hpp>
#include <stop_token>

// Race the same function on 2 cores (default)
auto result = janus::race([](std::stop_token token) -> long long {
    long long sum = 0;  // the full sum exceeds the range of a 32-bit int
    for (int i = 0; i < 1000000; ++i) {
        if (token.stop_requested()) return sum;  // cancelled: another runner won
        sum += i;
    }
    return sum;
});

// Race with explicit configuration
// (my_compute_fn is any callable that takes std::stop_token first)
janus::RaceConfig cfg{.num_runners = 4, .prefer_cross_numa = true};
auto result4 = janus::race(cfg, my_compute_fn);
```
Callables must accept `std::stop_token` as their first parameter and check `stop_requested()` periodically.
- C++20 (GCC 11+ or Clang 14+)
- Linux (uses `sched_setaffinity`, sysfs topology)
- pthreads
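The Linux requirement comes from the CPU affinity calls. The source does not show how janus pins its runners, so the following is only a minimal sketch of the general technique, using glibc's `pthread_setaffinity_np` (the per-thread counterpart of `sched_setaffinity`). The helper names `first_allowed_cpu` and `pin_current_thread` are hypothetical and not part of the library.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper (not from janus): returns the lowest logical CPU the
// calling thread is currently allowed to run on, or -1 on failure.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Hypothetical helper (not from janus): restricts the calling thread to a
// single logical CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

A runner thread would call `pin_current_thread(cpu)` as its first action. Pinning can fail inside containers with a restricted cpuset, so the return value is worth checking.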
```bash
cmake -B build -S .
cmake --build build -j$(nproc)
ctest --test-dir build
```

| Function | Description |
|---|---|
| `janus::race(fn)` | Race `fn` on 2 cores, return first result |
| `janus::race(cfg, fn)` | Race with explicit `RaceConfig` |
```cpp
struct RaceConfig {
    unsigned int num_runners = 2;            // redundant executions
    bool pin_to_cores = true;                // CPU affinity
    bool prefer_cross_numa = true;           // spread across NUMA nodes
    std::optional<std::vector<int>> cores;   // explicit core list
};
```

Environment: Intel Core Ultra 5 226V (8C/8T, 4.5 GHz), 8 MB L3, Linux 6.17, GCC 15.2, Release build (`-O3`)
Workload: ~1.3ms baseline compute with 5% probability of a 5-20ms stall (simulating interrupt storms, cache evictions, thermal throttling). 1000 iterations per benchmark.
| Percentile | Single | Hedged (2 runners) | Hedged (4 runners) |
|---|---|---|---|
| p50 | 1.3ms | 1.5ms | 1.5ms |
| p95 | 2.6ms | 1.8ms | 2.3ms |
| p99 | 20.4ms | 2.8ms (7.3x) | 2.6ms (7.8x) |
| p99.9 | 23.9ms | 3.2ms (7.5x) | 2.8ms (8.5x) |
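p50 rises by only about 0.2ms while p99/p99.9 collapse. Assuming stalls are independent across runners, the chance of both runners hitting a stall with 2 runners is 0.05² = 0.25%; with 4 runners, the chance of all four stalling is 0.05⁴ = 0.000625%.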
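The stall percentages follow from a simple product rule. As a small sketch, assuming each runner stalls independently with the same probability (an idealization, since runners on one machine can share a cause such as thermal throttling), the probability that every runner stalls is `p` raised to the number of runners. The function name `all_stall_probability` is made up for this illustration.

```cpp
#include <cmath>

// Probability that every one of `runners` executions stalls, given a
// per-execution stall probability `p`. Assumes stalls are independent.
double all_stall_probability(double p, unsigned runners) {
    return std::pow(p, static_cast<double>(runners));
}
```

With `p = 0.05`, this gives 0.0025 (0.25%) for 2 runners and 0.00000625 (0.000625%) for 4 runners.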
| Runners | Spawn + sync cost |
|---|---|
| 2 | 78 μs |
| 4 | 120 μs |
| 8 | 159 μs |
- Discovers physical CPU topology via Linux sysfs
- Selects N distinct physical cores (NUMA-aware round-robin)
- Spawns N `std::jthread` instances, each pinned to a core
- First runner to finish wins via atomic CAS, sets the result
- Winner signals `std::stop_source` -- losers check `stop_requested()` and exit
- All threads join via RAII; result returned to caller