Skip to content

kalman5/oktoplus

Repository files navigation

oktoplus

alt text

What is oktoplus

Oktoplus is a in-memory data store K:V where V is a container: std::list, std::map, boost::multi_index_container, std::set, you name it. Doing so the client can choose the best container for his own access data pattern.

Why try it

RESP2 wire-compatible with Redis — point redis-cli, redis-benchmark, or any existing Redis client at port 6379 and it just works. Drop-in for the read/write path on the supported commands (lists 100%, sets 94%; strings on the roadmap).

Faster than Redis at the same workload on most cells benchmarked — single-client -P 16 runs ~6–26% above the Redis reference across LPUSH / LLEN / RPUSH / LPOP / RPOP / SADD; CPU-heavy multi-key workloads (e.g. LPOS scans) reach ~100× because command execution is multi-threaded and sharded per key, so N writers on N keys use N cores. See the benchmark tables below for the per-cell numbers.

Not a drop-in for everything yet. No persistence, no replication / clustering, no pub/sub / streams / scripting / transactions. If you need Redis as the system of record, stay on Redis; if you need it as a hot in-memory store and want the multi-core scaling and the richer container types (vector with O(1) INDEX, multi-set, multi-map, multi-index), Oktoplus is worth a try.

If this reminds you of Redis then you are right — Redis is the inspiration. Oktoplus differs along a few axes that come up repeatedly in the rest of this README:

  • command execution is multi-threaded, sharded per-key
  • the value type is the container itself, picked from a richer set (vector, deque, multi-set, multi-map, multi-index) so the access pattern dictates the data structure
  • that lets, for example, INDEX on a vector be O(1) where Redis's LINDEX on a list is O(n)
  • it speaks RESP2 on the wire, so existing Redis client libraries and tooling work unchanged

Redis Commands Compatibility (RESP)

Oktoplus specific containers (already implemented, see specific documentation)

Wire protocols

The server exposes the same data through two interfaces:

  • RESP2 (default port 6379, always on) — primary wire protocol, wire-compatible with Redis using the RESP2 framing (+ - : $ * types, $-1\r\n / *-1\r\n nulls), so existing tooling like redis-cli and redis-benchmark works out of the box. Override the bind address via service.resp_endpoint in the JSON config. Includes the admin commands FLUSHDB / FLUSHALL. RESP3 (HELLO-negotiated, native maps/sets/push, unified _\r\n null) is on the roadmap — see TODO below.
  • gRPC (optional) — see src/Libraries/Commands/commands.proto. Use it to generate a client in your favourite language. Includes admin RPCs flushAll / flushDb plus all the list / set / deque / vector commands. Disabled by default at build time to keep baseline RSS down to ~9 MiB — pass -DOKTOPLUS_WITH_GRPC=ON to cmake to compile it in, then enable at runtime by setting service.endpoint in the JSON config.

The per-family compatibility tables (LISTS, SETS, STRINGS) include a column showing which Redis commands are wired to gRPC and to RESP today.

TODO

  • RESP3 protocol support: implement HELLO for protocol negotiation, gate the per-connection encoder on the negotiated version, swap to the unified _\r\n null, and add the new type tags (# boolean, , double, ( big number, = verbatim string, ~ set, % map, > push, | attribute). Today the server speaks RESP2 only; RESP3-capable clients (e.g. redis-cli -3) fall back to RESP2 because HELLO returns ERR unknown command.

Memory-footprint techniques on the table for further reducing per-key overhead at small values (all are tricks Redis already applies — closing any of them lowers the bytes-per-key numbers in the memory section below):

  • embstr encoding: when a value is ≤44 B, store the redisObject header and the SDS in the same allocation. One alloc instead of two on the hot path for small values.
  • listpack encoding for small lists / hashes / sets: a single contiguous packed buffer of [backlen, encoding, content] entries, no per-entry pointer or header overhead. A 1-element small-value list collapses to a single ~25 B allocation vs Oktoplus's ProtectedContainer + devector + okts::stor::string chain.
  • Quicklist for large lists: linked list of listpack nodes; per-element overhead amortizes to a few bytes inside a node instead of one heap alloc per element.
  • Shared integer objects: Redis interns the integers 0..9999 as global singletons. RPUSH foo 42 doesn't allocate; it stores a pointer to the shared "42" object. Cuts allocation pressure on numeric workloads.
  • Compact dict: replace absl::flat_hash_map's control-byte + slot layout (fast but high per-bucket overhead) with a 24-byte dictEntry-style separate-chaining hash, possibly with incremental rehashing to bound worst-case stalls.
  • Lazy-free + activedefrag: offload deallocation to a background thread (UNLINK, FLUSHDB ASYNC) and periodically walk the heap using jemalloc's je_get_defrag_hint to migrate live objects out of fragmented slabs. Cuts steady-state retained metadata over time.

Server is multithread, two different clients working on different containers (type or name) have a minimal interaction. For example multiple clients performing a parallel batch insert on different keys can procede in parallel without blocking each other.

Benchmarks

The benchmarks below run Oktoplus alongside Redis on the same machine as a known reference point — Redis is the system most readers will already have a feel for, so the Redis numbers are included as a yardstick rather than as a leaderboard. The script (benchmark_results/run_benchmark.sh) starts both servers itself, runs redis-benchmark at single-client -P 1/-P 16 and at varying concurrency -c 1..200, and dumps CSVs into benchmark_results/raw/.

Each redis-benchmark invocation runs N iterations (env var ITERATIONS, default 1; the published numbers below use N=5) and the published cell is the median rps across them. The harness flags any test whose max/min > 1.5× to separate signal from noise: single-run measurements understate random-key throughput because the first iteration pays cold-start costs.

Each per-cell CSV row also carries a trailing server_cpu_pct column — the average percent CPU the server consumed across the cell's wall-clock window (read from /proc/<pid>/stat utime+stime, divided by the cell duration). Values >100% mean multi-core utilisation: e.g. 1300% = 13 full cores saturated. The companion chart_parallelism_cpu.svg plots cores-saturated vs concurrency directly so per-key sharding shows up on a hardware-independent metric. Each CSV is also paired with a *.config sidecar describing the exact env-var values that produced it (defaults reproduce the published numbers).

Hardware: AMD EPYC Genoa devserver. Build: -O3 -march=native -mtune=native -ffast-math -fno-semantic-interposition -funroll-loops, linked against jemalloc (see OKTOPLUS_WITH_JEMALLOC in CMake). Workload: 100k ops/iteration, 100k key-space, single client unless stated otherwise.

Charts are generated from benchmark_results/raw/*.csv by benchmark_results/make_chart.py (no dependencies — pure-stdlib Python emitting SVG + HTML).

An interactive Chart.js dashboard with the same data lives at benchmark_results/report.html — view it rendered through htmlpreview.github.io.

Single client, no pipelining (-P 1)

Single client -P 1 throughput

Test Oktoplus rps Redis rps Okto / Redis
LPUSH 29,542 30,423 97%
SADD 32,268 29,412 110%
LRANGE_100 26,185 24,624 106%
LPOP (rand) 30,864 28,927 107%
RPOP (rand) 29,412 28,736 102%
LLEN (rand) 30,998 28,818 108%
SCARD (rand) 32,362 28,794 112%
Single client, pipelined (-P 16)

Single client -P 16 throughput, small values

Test Oktoplus rps Redis rps Okto / Redis
LPUSH 432,900 386,100 112%
SADD 396,825 374,532 106%
LPUSH (LRANGE seed) 438,596 363,636 121%
LRANGE_100 117,233 109,170 107%
RPUSH (rand) 414,938 331,126 125%
LPOP (rand) 371,747 341,297 109%
RPOP (rand) 395,257 364,964 108%
LLEN 487,805 387,597 126%
SCARD 434,783 436,681 100%
Many clients, no pipelining — LPUSH on a hot key

-P 1 with varying -c.

LPUSH on a hot key, varying clients

Clients Oktoplus rps Redis rps Okto / Redis
1 31,496 30,331 104%
10 74,239 77,760 95%
50 76,278 98,232 78%
100 82,919 83,752 99%
200 81,367 90,827 90%
Many clients, pipelined, random keys

-c N with -P 16 and __rand_int__ keys (different clients → different keys → different per-key mutexes). RPUSH at varying concurrency:

RPUSH random key, varying clients (-P 16)

A slice from concurrent_random_*_p16.csv at -c 100:

Test Oktoplus rps Redis rps Okto / Redis
RPUSH (rand) 1,020,408 925,926 110%
LPOP (rand) 1,075,269 909,091 118%
RPOP (rand) 1,111,111 1,190,476 93%
LLEN (rand) 970,874 1,098,901 88%
SADD (rand) 952,381 847,458 112%
SCARD (rand) 1,041,667 1,265,823 82%
Multi-key, CPU-heavy commands — per-key sharding

Random-key push/pop workloads at -c 100 -P 16 saturate around ~1M rps for both servers because each command is short and command throughput is bounded by network and parsing, not by command CPU. The picture changes when the per-command CPU work dominates: with LPOS key:__rand_int__ <missing-value> against pre-populated lists, every call walks the whole list (10K elements ≈ ~100µs of CPU per call) while sending only ~5 bytes back over the wire. Per-key sharding lets the work parallelize across cores; the Redis line on the same chart is the natural reference point for a single-threaded execution model.

LPOS scan on 10K-element lists, varying clients (-P 16)

LPOS key:__rand_int__ NEVER_PRESENT against 1000 pre-populated keys, each holding 10,000 distinct elements (-P 16). The cores columns are server_cpu_pct / 100 averaged over the cell — i.e. how many full CPU cores the server saturated:

Clients Oktoplus rps Okto cores Redis rps Redis cores Okto / Redis
1 65,445 0.9 8,217 1.0 8.0×
4 260,417 3.4 8,408 1.0 31.0×
16 862,069 10.3 8,537 1.0 101.0×
64 943,396 11.5 8,342 1.0 113.1×
128 925,926 6.5 8,392 1.0 110.3×

The cores column expresses the same shape on a hardware-independent metric: Oktoplus's per-key sharding lets command execution scale with -c until the available cores are saturated; the Redis row sits at one core throughout because command execution is single-threaded by design. The cores-vs-clients curve is plotted separately at chart_parallelism_cpu.svg so the shape is visible without needing the absolute throughput axis:

Server cores saturated during LPOS scan

Per-core efficiency (rps / cores saturated)

Total rps depends on both how many cores the design can saturate and how much each core gets done. Dividing rps by the cores actually saturated isolates the per-core efficiency:

Per-core efficiency during LPOS scan

Clients Okto rps/core Redis rps/core Okto / Redis
1 86,806 8,389 10.35×
4 89,127 8,640 10.32×
16 82,508 8,758 9.42×
64 93,615 8,681 10.78×
128 171,999 8,728 19.71×

Two factors compose on this CPU-heavy workload: how many cores the workload uses (parallelism), and how much each core gets done per call (per-core efficiency). The per-core ratio sits around ~10× across -c, with no cross-core coordination penalty visible, and rises at -c 128 once each core has enough pipelined work to amortise the per-command parsing overhead. The per-core component reflects the data-structure choice: devector iterates with one cache line per ~four list elements (16-byte okts::stor::string slot) and pays no per-element decode, while a listpack-based representation walks node pointers and decodes each entry.

Bench script: benchmark_results/run_parallelism_advantage_bench.sh. At smaller N=1000 (10× shorter scans) the parallelism-driven ratio falls to ~14× at -c 128; at smaller -P 1 the network RTT eats most of the per-command CPU work and the ratio collapses to ~1.5×.

Single client, pipelined (-P 16), 256-byte values

Same workload as the small-value -P 16 table above but with a 256-byte payload (-d 256 for built-ins, a 256-byte literal on the custom RPUSH).

Single client -P 16 throughput, 256-byte values

Test Oktoplus rps Redis rps Okto / Redis
LPUSH 384,615 358,423 107%
SADD 414,938 369,004 112%
LPUSH (LRANGE seed) 358,423 344,828 104%
LRANGE_100 50,176 53,191 94%
RPUSH (rand, 256B) 313,480 290,698 108%
LPOP (rand) 332,226 297,619 112%
RPOP (rand) 392,157 350,877 112%
LLEN 465,116 406,504 114%
SCARD 465,116 386,100 120%

Full per-test CSVs and the raw-results history are under benchmark_results/raw/.

Memory footprint

Generated by benchmark_results/run_memory.sh — for each cell, start a fresh server, snapshot RSS, load N distinct keys via RPUSH key:i <value> over redis-cli --pipe, snapshot RSS again. bytes/key = (steady - baseline) * 1024 / N.

Memory footprint, bytes per key

N keys value Oktoplus bytes/key Redis bytes/key Okto / Redis
100,000 3B 133 71 1.9×
100,000 64B 199 135 1.5×
100,000 256B 397 374 1.1×
100,000 1024B 1,190 1,342 0.89×
1,000,000 3B 157 72 2.2×
1,000,000 64B 227 133 1.7×
1,000,000 256B 429 375 1.1×
1,000,000 1024B 1,224 1,345 0.91×

Per-key fixed overhead extrapolated from the 3-byte rows (where the value cost is negligible) is ~133–157 B for Oktoplus and ~70 B for Redis. The ratio drops as the value grows — 1.9× at 3B, 1.5× at 64B, 1.1× at 256B, ~0.9× at 1 KB. The 1 KB inversion comes from the storage-path string type: okts::stor::string (16 B SSO-or-heap, vs std::string's 32 B in libstdc++) is used both for value slots inside the list/deque/vector containers and for the per-shard hash-map keys, and allocates exactly size bytes (no NUL terminator, no capacity slack), so jemalloc serves the heap block from a smaller size class than std::string's capacity+1-rounded allocation would land in. Full per-trial CSVs at benchmark_results/raw/memory.csv, full table at benchmark_results/memory_results.md.

Residual memory after FLUSHALL

FLUSHALL and FLUSHDB clear every container but do NOT ask the allocator to release pages back to the OS — that's exposed separately as MEMORY PURGE (Redis-compatible), which calls jemalloc's mallctl("arena.<all>.purge"). Decoupling the two means clients pay for what they ask for and the residual numbers below measure the same operation on both servers: the benchmark issues FLUSHALL + MEMORY PURGE on each.

Residual RSS after FLUSHALL

N keys value Oktoplus residual (KiB) Redis residual (KiB)
100,000 3B 13,032 10,192
100,000 64B 13,928 10,016
100,000 256B 13,372 10,052
100,000 1024B 16,820 11,020
1,000,000 3B 13,652 11,612
1,000,000 64B 16,008 11,804
1,000,000 256B 22,416 14,196
1,000,000 1024B 47,304 23,800

Baseline RSS is ~9.5 MiB for Oktoplus and ~9.3 MiB for Redis. gRPC is a build-time opt-in (-DOKTOPLUS_WITH_GRPC=ON); the default build links neither libprotobuf nor libgrpc++, which keeps both the static binary footprint and the shared-library mappings small. The Oktoplus binary also bakes in jemalloc tuning via a __malloc_conf weak symbol (narenas:1,muzzy_decay_ms:0,background_thread:true): one arena instead of 4 × CPU saves ~1.7 MiB of per-arena metadata fan-out at zero throughput cost (jemalloc's per-thread tcache absorbs almost every allocation before it touches the arena mutex), muzzy_decay_ms:0 skips the muzzy intermediate state so dirty pages go straight back to the OS, and background_thread:true runs one jemalloc maintenance thread that proactively purges dirty extents for clients that never call MEMORY PURGE. Delta over baseline (truly retained allocator memory) is ~3–17 MiB across the workload sweep, with the worst case at 1M × 1024B.

Design properties

  • Container choice matches access pattern. Native vectors give O(1) INDEX. Multi-set and multi-map are first-class. boost::multi_index_container with up to 3 keys is on the roadmap. You pick the container; you don't reshape your data to fit a list or hash.
  • Concurrent writers on different keys run in parallel. The keyspace is split across 32 shards, each key has its own mutex. A workload of N writers touching N different keys uses N cores. The numbers in the parallelism section above measure this directly on a CPU-bound workload.
  • Native gRPC alongside RESP. Generate a typed client in any language straight from commands.proto — no need to (re)implement the wire protocol. Existing Redis tooling (redis-cli, redis-benchmark) works against the RESP port unchanged.

What it doesn't do (yet)

  • No replication, clustering, or persistence — see the release plan below.
  • No pub/sub, streams, scripting, or transactions.
  • Command coverage: lists 76%, sets 94% on RESP / 18% on gRPC, strings 0% — see the per-family compatibility tables linked at the top.
  • At hot-key high-concurrency without pipelining (-P 1), the per-command wire path is slightly heavier than a hand-tuned single-threaded loop, so this cell sits ~3–24% below the Redis reference. With pipelining (-P 16) the same workload sits at ~1M rps for both servers.
  • Per-key fixed overhead is ~133–157 B at 3-byte values (vs ~70 B for Redis). The gap shrinks with value size and inverts at 1 KB (~0.89–0.91× Redis); see the TODO above for the small-value optimisations still on the table.
  • Single-node, no production deployments.

Release plan

  • Support all REDIS commands (at least the one relative to data storage)
  • Support the following containers: deque, list, map, multimap, multiset, set, unorderd_map, unordered_multimap, vector, boost::multi_index (up to at least 3 keys)
  • Make it distributed using RAFT as consensus protocol

How To Build


About

In memory K/V database

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors