Oktoplus is a in-memory data store K:V where V is a container: std::list, std::map, boost::multi_index_container, std::set, you name it. Doing so the client can choose the best container for his own access data pattern.
RESP2 wire-compatible with Redis — point redis-cli, redis-benchmark, or any existing Redis client at port 6379 and it just works. Drop-in for the read/write path on the supported commands (lists 100%, sets 94%; strings on the roadmap).
Faster than Redis at the same workload on most cells benchmarked — single-client -P 16 runs ~6–26% above the Redis reference across LPUSH / LLEN / RPUSH / LPOP / RPOP / SADD; CPU-heavy multi-key workloads (e.g. LPOS scans) reach ~100× because command execution is multi-threaded and sharded per key, so N writers on N keys use N cores. See the benchmark tables below for the per-cell numbers.
Not a drop-in for everything yet. No persistence, no replication / clustering, no pub/sub / streams / scripting / transactions. If you need Redis as the system of record, stay on Redis; if you need it as a hot in-memory store and want the multi-core scaling and the richer container types (vector with O(1) INDEX, multi-set, multi-map, multi-index), Oktoplus is worth a try.
If this reminds you of Redis then you are right — Redis is the inspiration. Oktoplus differs along a few axes that come up repeatedly in the rest of this README:
- command execution is multi-threaded, sharded per-key
- the value type is the container itself, picked from a richer set (vector, deque, multi-set, multi-map, multi-index) so the access pattern dictates the data structure
- that lets, for example,
INDEXon a vector be O(1) where Redis'sLINDEXon a list is O(n) - it speaks RESP2 on the wire, so existing Redis client libraries and tooling work unchanged
Redis Commands Compatibility (RESP)
Oktoplus specific containers (already implemented, see specific documentation)
The server exposes the same data through two interfaces:
- RESP2 (default port
6379, always on) — primary wire protocol, wire-compatible with Redis using the RESP2 framing (+-:$*types,$-1\r\n/*-1\r\nnulls), so existing tooling likeredis-cliandredis-benchmarkworks out of the box. Override the bind address viaservice.resp_endpointin the JSON config. Includes the admin commandsFLUSHDB/FLUSHALL. RESP3 (HELLO-negotiated, native maps/sets/push, unified_\r\nnull) is on the roadmap — see TODO below. - gRPC (optional) — see
src/Libraries/Commands/commands.proto. Use it to generate a client in your favourite language. Includes admin RPCsflushAll/flushDbplus all the list / set / deque / vector commands. Disabled by default at build time to keep baseline RSS down to ~9 MiB — pass-DOKTOPLUS_WITH_GRPC=ONto cmake to compile it in, then enable at runtime by settingservice.endpointin the JSON config.
The per-family compatibility tables (LISTS, SETS, STRINGS) include a column showing which Redis commands are wired to gRPC and to RESP today.
- RESP3 protocol support: implement
HELLOfor protocol negotiation, gate the per-connection encoder on the negotiated version, swap to the unified_\r\nnull, and add the new type tags (#boolean,,double,(big number,=verbatim string,~set,%map,>push,|attribute). Today the server speaks RESP2 only; RESP3-capable clients (e.g.redis-cli -3) fall back to RESP2 becauseHELLOreturnsERR unknown command.
Memory-footprint techniques on the table for further reducing per-key overhead at small values (all are tricks Redis already applies — closing any of them lowers the bytes-per-key numbers in the memory section below):
- embstr encoding: when a value is ≤44 B, store the
redisObjectheader and the SDS in the same allocation. One alloc instead of two on the hot path for small values. - listpack encoding for small lists / hashes / sets: a single contiguous packed buffer of
[backlen, encoding, content]entries, no per-entry pointer or header overhead. A 1-element small-value list collapses to a single ~25 B allocation vs Oktoplus'sProtectedContainer + devector + okts::stor::stringchain. - Quicklist for large lists: linked list of listpack nodes; per-element overhead amortizes to a few bytes inside a node instead of one heap alloc per element.
- Shared integer objects: Redis interns the integers 0..9999 as global singletons.
RPUSH foo 42doesn't allocate; it stores a pointer to the shared "42" object. Cuts allocation pressure on numeric workloads. - Compact dict: replace
absl::flat_hash_map's control-byte + slot layout (fast but high per-bucket overhead) with a 24-bytedictEntry-style separate-chaining hash, possibly with incremental rehashing to bound worst-case stalls. - Lazy-free + activedefrag: offload deallocation to a background thread (
UNLINK,FLUSHDB ASYNC) and periodically walk the heap using jemalloc'sje_get_defrag_hintto migrate live objects out of fragmented slabs. Cuts steady-state retained metadata over time.
Server is multithread, two different clients working on different containers (type or name) have a minimal interaction. For example multiple clients performing a parallel batch insert on different keys can procede in parallel without blocking each other.
The benchmarks below run Oktoplus alongside Redis on the same machine as a known reference point — Redis is the system most readers will already have a feel for, so the Redis numbers are included as a yardstick rather than as a leaderboard. The script (benchmark_results/run_benchmark.sh) starts both servers itself, runs redis-benchmark at single-client -P 1/-P 16 and at varying concurrency -c 1..200, and dumps CSVs into benchmark_results/raw/.
Each redis-benchmark invocation runs N iterations (env var ITERATIONS, default 1; the published numbers below use N=5) and the published cell is the median rps across them. The harness flags any test whose max/min > 1.5× to separate signal from noise: single-run measurements understate random-key throughput because the first iteration pays cold-start costs.
Each per-cell CSV row also carries a trailing server_cpu_pct column — the average percent CPU the server consumed across the cell's wall-clock window (read from /proc/<pid>/stat utime+stime, divided by the cell duration). Values >100% mean multi-core utilisation: e.g. 1300% = 13 full cores saturated. The companion chart_parallelism_cpu.svg plots cores-saturated vs concurrency directly so per-key sharding shows up on a hardware-independent metric. Each CSV is also paired with a *.config sidecar describing the exact env-var values that produced it (defaults reproduce the published numbers).
Hardware: AMD EPYC Genoa devserver. Build: -O3 -march=native -mtune=native -ffast-math -fno-semantic-interposition -funroll-loops, linked against jemalloc (see OKTOPLUS_WITH_JEMALLOC in CMake). Workload: 100k ops/iteration, 100k key-space, single client unless stated otherwise.
Charts are generated from
benchmark_results/raw/*.csvbybenchmark_results/make_chart.py(no dependencies — pure-stdlib Python emitting SVG + HTML).An interactive Chart.js dashboard with the same data lives at
benchmark_results/report.html— view it rendered through htmlpreview.github.io.
| Test | Oktoplus rps | Redis rps | Okto / Redis |
|---|---|---|---|
| LPUSH | 29,542 | 30,423 | 97% |
| SADD | 32,268 | 29,412 | 110% |
| LRANGE_100 | 26,185 | 24,624 | 106% |
| LPOP (rand) | 30,864 | 28,927 | 107% |
| RPOP (rand) | 29,412 | 28,736 | 102% |
| LLEN (rand) | 30,998 | 28,818 | 108% |
| SCARD (rand) | 32,362 | 28,794 | 112% |
| Test | Oktoplus rps | Redis rps | Okto / Redis |
|---|---|---|---|
| LPUSH | 432,900 | 386,100 | 112% |
| SADD | 396,825 | 374,532 | 106% |
| LPUSH (LRANGE seed) | 438,596 | 363,636 | 121% |
| LRANGE_100 | 117,233 | 109,170 | 107% |
| RPUSH (rand) | 414,938 | 331,126 | 125% |
| LPOP (rand) | 371,747 | 341,297 | 109% |
| RPOP (rand) | 395,257 | 364,964 | 108% |
| LLEN | 487,805 | 387,597 | 126% |
| SCARD | 434,783 | 436,681 | 100% |
-P 1 with varying -c.
| Clients | Oktoplus rps | Redis rps | Okto / Redis |
|---|---|---|---|
| 1 | 31,496 | 30,331 | 104% |
| 10 | 74,239 | 77,760 | 95% |
| 50 | 76,278 | 98,232 | 78% |
| 100 | 82,919 | 83,752 | 99% |
| 200 | 81,367 | 90,827 | 90% |
-c N with -P 16 and __rand_int__ keys (different clients → different keys → different per-key mutexes). RPUSH at varying concurrency:
A slice from concurrent_random_*_p16.csv at -c 100:
| Test | Oktoplus rps | Redis rps | Okto / Redis |
|---|---|---|---|
| RPUSH (rand) | 1,020,408 | 925,926 | 110% |
| LPOP (rand) | 1,075,269 | 909,091 | 118% |
| RPOP (rand) | 1,111,111 | 1,190,476 | 93% |
| LLEN (rand) | 970,874 | 1,098,901 | 88% |
| SADD (rand) | 952,381 | 847,458 | 112% |
| SCARD (rand) | 1,041,667 | 1,265,823 | 82% |
Random-key push/pop workloads at -c 100 -P 16 saturate around ~1M rps for both servers because each command is short and command throughput is bounded by network and parsing, not by command CPU. The picture changes when the per-command CPU work dominates: with LPOS key:__rand_int__ <missing-value> against pre-populated lists, every call walks the whole list (10K elements ≈ ~100µs of CPU per call) while sending only ~5 bytes back over the wire. Per-key sharding lets the work parallelize across cores; the Redis line on the same chart is the natural reference point for a single-threaded execution model.
LPOS key:__rand_int__ NEVER_PRESENT against 1000 pre-populated keys, each holding 10,000 distinct elements (-P 16). The cores columns are server_cpu_pct / 100 averaged over the cell — i.e. how many full CPU cores the server saturated:
| Clients | Oktoplus rps | Okto cores | Redis rps | Redis cores | Okto / Redis |
|---|---|---|---|---|---|
| 1 | 65,445 | 0.9 | 8,217 | 1.0 | 8.0× |
| 4 | 260,417 | 3.4 | 8,408 | 1.0 | 31.0× |
| 16 | 862,069 | 10.3 | 8,537 | 1.0 | 101.0× |
| 64 | 943,396 | 11.5 | 8,342 | 1.0 | 113.1× |
| 128 | 925,926 | 6.5 | 8,392 | 1.0 | 110.3× |
The cores column expresses the same shape on a hardware-independent metric: Oktoplus's per-key sharding lets command execution scale with -c until the available cores are saturated; the Redis row sits at one core throughout because command execution is single-threaded by design. The cores-vs-clients curve is plotted separately at chart_parallelism_cpu.svg so the shape is visible without needing the absolute throughput axis:
Total rps depends on both how many cores the design can saturate and how much each core gets done. Dividing rps by the cores actually saturated isolates the per-core efficiency:
| Clients | Okto rps/core | Redis rps/core | Okto / Redis |
|---|---|---|---|
| 1 | 86,806 | 8,389 | 10.35× |
| 4 | 89,127 | 8,640 | 10.32× |
| 16 | 82,508 | 8,758 | 9.42× |
| 64 | 93,615 | 8,681 | 10.78× |
| 128 | 171,999 | 8,728 | 19.71× |
Two factors compose on this CPU-heavy workload: how many cores the workload uses (parallelism), and how much each core gets done per call (per-core efficiency). The per-core ratio sits around ~10× across -c, with no cross-core coordination penalty visible, and rises at -c 128 once each core has enough pipelined work to amortise the per-command parsing overhead. The per-core component reflects the data-structure choice: devector iterates with one cache line per ~four list elements (16-byte okts::stor::string slot) and pays no per-element decode, while a listpack-based representation walks node pointers and decodes each entry.
Bench script: benchmark_results/run_parallelism_advantage_bench.sh. At smaller N=1000 (10× shorter scans) the parallelism-driven ratio falls to ~14× at -c 128; at smaller -P 1 the network RTT eats most of the per-command CPU work and the ratio collapses to ~1.5×.
Same workload as the small-value -P 16 table above but with a 256-byte payload (-d 256 for built-ins, a 256-byte literal on the custom RPUSH).
| Test | Oktoplus rps | Redis rps | Okto / Redis |
|---|---|---|---|
| LPUSH | 384,615 | 358,423 | 107% |
| SADD | 414,938 | 369,004 | 112% |
| LPUSH (LRANGE seed) | 358,423 | 344,828 | 104% |
| LRANGE_100 | 50,176 | 53,191 | 94% |
| RPUSH (rand, 256B) | 313,480 | 290,698 | 108% |
| LPOP (rand) | 332,226 | 297,619 | 112% |
| RPOP (rand) | 392,157 | 350,877 | 112% |
| LLEN | 465,116 | 406,504 | 114% |
| SCARD | 465,116 | 386,100 | 120% |
Full per-test CSVs and the raw-results history are under benchmark_results/raw/.
Generated by benchmark_results/run_memory.sh — for each cell, start a fresh server, snapshot RSS, load N distinct keys via RPUSH key:i <value> over redis-cli --pipe, snapshot RSS again. bytes/key = (steady - baseline) * 1024 / N.
| N keys | value | Oktoplus bytes/key | Redis bytes/key | Okto / Redis |
|---|---|---|---|---|
| 100,000 | 3B | 133 | 71 | 1.9× |
| 100,000 | 64B | 199 | 135 | 1.5× |
| 100,000 | 256B | 397 | 374 | 1.1× |
| 100,000 | 1024B | 1,190 | 1,342 | 0.89× |
| 1,000,000 | 3B | 157 | 72 | 2.2× |
| 1,000,000 | 64B | 227 | 133 | 1.7× |
| 1,000,000 | 256B | 429 | 375 | 1.1× |
| 1,000,000 | 1024B | 1,224 | 1,345 | 0.91× |
Per-key fixed overhead extrapolated from the 3-byte rows (where the value cost is negligible) is ~133–157 B for Oktoplus and ~70 B for Redis. The ratio drops as the value grows — 1.9× at 3B, 1.5× at 64B, 1.1× at 256B, ~0.9× at 1 KB. The 1 KB inversion comes from the storage-path string type: okts::stor::string (16 B SSO-or-heap, vs std::string's 32 B in libstdc++) is used both for value slots inside the list/deque/vector containers and for the per-shard hash-map keys, and allocates exactly size bytes (no NUL terminator, no capacity slack), so jemalloc serves the heap block from a smaller size class than std::string's capacity+1-rounded allocation would land in. Full per-trial CSVs at benchmark_results/raw/memory.csv, full table at benchmark_results/memory_results.md.
FLUSHALL and FLUSHDB clear every container but do NOT ask the allocator to release pages back to the OS — that's exposed separately as MEMORY PURGE (Redis-compatible), which calls jemalloc's mallctl("arena.<all>.purge"). Decoupling the two means clients pay for what they ask for and the residual numbers below measure the same operation on both servers: the benchmark issues FLUSHALL + MEMORY PURGE on each.
| N keys | value | Oktoplus residual (KiB) | Redis residual (KiB) |
|---|---|---|---|
| 100,000 | 3B | 13,032 | 10,192 |
| 100,000 | 64B | 13,928 | 10,016 |
| 100,000 | 256B | 13,372 | 10,052 |
| 100,000 | 1024B | 16,820 | 11,020 |
| 1,000,000 | 3B | 13,652 | 11,612 |
| 1,000,000 | 64B | 16,008 | 11,804 |
| 1,000,000 | 256B | 22,416 | 14,196 |
| 1,000,000 | 1024B | 47,304 | 23,800 |
Baseline RSS is ~9.5 MiB for Oktoplus and ~9.3 MiB for Redis. gRPC is a build-time opt-in (-DOKTOPLUS_WITH_GRPC=ON); the default build links neither libprotobuf nor libgrpc++, which keeps both the static binary footprint and the shared-library mappings small. The Oktoplus binary also bakes in jemalloc tuning via a __malloc_conf weak symbol (narenas:1,muzzy_decay_ms:0,background_thread:true): one arena instead of 4 × CPU saves ~1.7 MiB of per-arena metadata fan-out at zero throughput cost (jemalloc's per-thread tcache absorbs almost every allocation before it touches the arena mutex), muzzy_decay_ms:0 skips the muzzy intermediate state so dirty pages go straight back to the OS, and background_thread:true runs one jemalloc maintenance thread that proactively purges dirty extents for clients that never call MEMORY PURGE. Delta over baseline (truly retained allocator memory) is ~3–17 MiB across the workload sweep, with the worst case at 1M × 1024B.
- Container choice matches access pattern. Native vectors give O(1)
INDEX. Multi-set and multi-map are first-class.boost::multi_index_containerwith up to 3 keys is on the roadmap. You pick the container; you don't reshape your data to fit a list or hash. - Concurrent writers on different keys run in parallel. The keyspace is split across 32 shards, each key has its own mutex. A workload of N writers touching N different keys uses N cores. The numbers in the parallelism section above measure this directly on a CPU-bound workload.
- Native gRPC alongside RESP. Generate a typed client in any language straight from
commands.proto— no need to (re)implement the wire protocol. Existing Redis tooling (redis-cli,redis-benchmark) works against the RESP port unchanged.
- No replication, clustering, or persistence — see the release plan below.
- No pub/sub, streams, scripting, or transactions.
- Command coverage: lists 76%, sets 94% on RESP / 18% on gRPC, strings 0% — see the per-family compatibility tables linked at the top.
- At hot-key high-concurrency without pipelining (
-P 1), the per-command wire path is slightly heavier than a hand-tuned single-threaded loop, so this cell sits ~3–24% below the Redis reference. With pipelining (-P 16) the same workload sits at ~1M rps for both servers. - Per-key fixed overhead is ~133–157 B at 3-byte values (vs ~70 B for Redis). The gap shrinks with value size and inverts at 1 KB (~0.89–0.91× Redis); see the TODO above for the small-value optimisations still on the table.
- Single-node, no production deployments.
- Support all REDIS commands (at least the one relative to data storage)
- Support the following containers: deque, list, map, multimap, multiset, set, unorderd_map, unordered_multimap, vector, boost::multi_index (up to at least 3 keys)
- Make it distributed using RAFT as consensus protocol
