
zungur/Janus


Janus

A C++20 header-only library for hedged execution on multi-core CPUs. Run the same task on multiple physical cores simultaneously, take the first result, cooperatively cancel the rest.

Trades CPU resources for tail-latency reduction on high-priority compute workloads. Based on the technique described in Google's The Tail at Scale (Jeff Dean & Luiz André Barroso, CACM 2013).

Quick start

#include <janus/janus.hpp>

// Race the same function on 2 cores (default)
auto result = janus::race([](std::stop_token token) -> int {
    int sum = 0;
    for (int i = 0; i < 1000000; ++i) {
        if (token.stop_requested()) return sum;
        sum += i;
    }
    return sum;
});

// Race with explicit configuration
// (my_compute_fn is any callable that takes std::stop_token as its first parameter)
janus::RaceConfig cfg{.num_runners = 4, .prefer_cross_numa = true};
auto result4 = janus::race(cfg, my_compute_fn);

Callables must accept std::stop_token as their first parameter and check stop_requested() periodically.

Requirements

  • C++20 (GCC 11+ or Clang 14+)
  • Linux (uses sched_setaffinity, sysfs topology)
  • pthreads

Building

cmake -B build -S .
cmake --build build -j$(nproc)
ctest --test-dir build

API

| Function | Description |
|---|---|
| janus::race(fn) | Race fn on 2 cores, return first result |
| janus::race(cfg, fn) | Race with explicit RaceConfig |

RaceConfig

struct RaceConfig {
    unsigned int num_runners = 2;          // redundant executions
    bool pin_to_cores = true;              // CPU affinity
    bool prefer_cross_numa = true;         // spread across NUMA nodes
    std::optional<std::vector<int>> cores; // explicit core list
};
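As an illustration of how these defaults combine with C++20 designated initializers, the sketch below copies the struct into a local namespace (named sketch here, an assumption) so it compiles without Janus. The helper function names are hypothetical, and how Janus treats an explicit cores list relative to num_runners is not stated in this README, so the second helper only shows the syntax.

```cpp
#include <optional>
#include <vector>

namespace sketch {
// Local copy of the struct listed above; not the Janus header itself.
struct RaceConfig {
    unsigned int num_runners = 2;
    bool pin_to_cores = true;
    bool prefer_cross_numa = true;
    std::optional<std::vector<int>> cores;
};

// Designated initializers must follow declaration order;
// any field left out keeps its default value.
inline RaceConfig three_unpinned() {
    return RaceConfig{.num_runners = 3, .pin_to_cores = false};
}

// Supplying an explicit core list (core ids 0 and 2 are placeholders).
inline RaceConfig on_cores_0_and_2() {
    return RaceConfig{.num_runners = 2, .cores = std::vector<int>{0, 2}};
}
}  // namespace sketch
```

Writing the fields out of declaration order, for example .cores before .num_runners, is a compile error in C++20, which is worth knowing when copying the Quick start configuration.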

Benchmark results

Environment: Intel Core Ultra 5 226V (8C/8T, 4.5 GHz), 8 MB L3, Linux 6.17, GCC 15.2, Release build (-O3)

Workload: ~1.3ms baseline compute with 5% probability of a 5-20ms stall (simulating interrupt storms, cache evictions, thermal throttling). 1000 iterations per benchmark.

Tail latency collapse

| Percentile | Single | Hedged (2 runners) | Hedged (4 runners) |
|---|---|---|---|
| p50 | 1.3ms | 1.5ms | 1.5ms |
| p95 | 2.6ms | 1.8ms | 2.3ms |
| p99 | 20.4ms | 2.8ms (7.3x) | 2.6ms (7.8x) |
| p99.9 | 23.9ms | 3.2ms (7.5x) | 2.8ms (8.5x) |

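p50 rises only slightly (1.3ms to 1.5ms) while p99 and p99.9 collapse. Assuming stalls are independent across runners, the chance that every runner hits a stall is 0.05^2 = 0.25% with 2 runners and 0.05^4 = 0.000625% with 4 runners.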
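The arithmetic behind those percentages can be checked with a small model. The sketch below is not the Janus benchmark: it assumes stalls are independent between runners, draws stall lengths uniformly from 5-20 ms, and ignores spawn and sync overhead. The function names, trial count, and seed are all illustrative choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Probability that all n independent runners stall, given per-run stall probability p.
double all_stall_probability(double p, unsigned n) {
    return std::pow(p, static_cast<double>(n));
}

// Simulated p99 latency (ms) of a race between n runners. Each run takes base_ms,
// plus a uniform 5-20 ms stall with probability p; the race ends with the fastest runner.
double simulated_p99_ms(unsigned n, double p = 0.05, double base_ms = 1.3,
                        std::size_t trials = 100000, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution stalls(p);
    std::uniform_real_distribution<double> stall_ms(5.0, 20.0);
    std::vector<double> latency(trials);
    for (double& best : latency) {
        best = std::numeric_limits<double>::infinity();
        for (unsigned r = 0; r < n; ++r) {
            const double t = base_ms + (stalls(rng) ? stall_ms(rng) : 0.0);
            best = std::min(best, t);  // the race finishes when the fastest runner does
        }
    }
    std::sort(latency.begin(), latency.end());
    return latency[static_cast<std::size_t>(0.99 * static_cast<double>(trials))];
}
```

With one runner, 5% of runs stall, so the 99th percentile falls inside the stalled region. With two runners only about 0.25% of races have both runners stalled, which is below the 1% cut-off, so the modelled p99 drops back to the baseline.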

Race overhead

| Runners | Spawn + sync cost |
|---|---|
| 2 | 78 μs |
| 4 | 120 μs |
| 8 | 159 μs |

How it works

  1. Discovers physical CPU topology via Linux sysfs
  2. Selects N distinct physical cores (NUMA-aware round-robin)
  3. Spawns N std::jthread instances, each pinned to a core
  4. First runner to finish wins an atomic CAS and sets the result
  5. Winner signals std::stop_source; losers see stop_requested() and exit
  6. All threads join via RAII; result returned to caller
