VortexDB

A Recursive WiscKey Storage Engine

🔗 Quick Links

⚡ The Performance Gap - ⚙️ Architecture - 📦 Getting Started
📐 Usage Example - ✨ Features - 📊 Benchmarks - 📂 Project Structure

What Problem Does VortexDB Solve?

In traditional LSM-Trees (like LevelDB), both keys and values are moved during compaction. When values are large, this causes massive Write Amplification, slowing down the entire system.

VortexDB solves this by implementing a Separation of Concerns:

Keys remain in the LSM-Tree (SSTables) for fast indexing.
Values are stored in an append-only Value Log (VLog).

This ensures that during compaction, the engine only reshuffles small pointers, leaving the heavy data untouched.

(back to top)

⚡ The Performance Gap

Metric	Standard LSM-Tree	VortexDB (WiscKey)
Compaction Cost	High (Keys + Values moved)	Low (Only Keys moved)
Write Amplification	O(depth of tree)	Near-Constant O(1)
Large Value Handling	Slows down merging	Native Efficiency
Organization	Flat Key-Space	Recursive Sub-Vortices

(back to top)

⚙️ Architecture: The "Vortex Hook"

VortexDB's standout feature is its ability to spawn child databases within parent databases, creating a tree-like storage hierarchy similar to a file system.

VortexDB uses a custom routing logic that resolves paths (e.g., root/users/configs/theme). Each "Sub-Vortex" is a fully independent VortexDB instance with its own Memtable, SSTables, and VLog.

sequenceDiagram
    participant Client
    participant Server
    participant Memtable
    participant VLog
    participant SSTable

    Client->>Server: SET "user_1" "data_payload"
    Server->>Server: Infer Type (JSON/String/Int)
    Server->>VLog: Append Entry + CRC32
    VLog-->>Server: Return ValuePointer {file_id, offset, size}
    Server->>Memtable: Insert Key + ValuePointer
    Note over Memtable: If size > 100k
    Memtable->>SSTable: Flush (Write Sparse Index)
    SSTable->>Server: Register in Manifest

(back to top)

✨Technical Features

🏗️ Storage & Indexing

Sparse Indexing:
SSTables store a key-offset every 100 keys. This allows O(logN) lookups with minimal memory footprint.
Cuckoo Membership Filter:
A 4-slot bucket Cuckoo Filter intercepts "ghost reads" before they hit the disk, significantly reducing I/O for non-existent keys.
Type Inference Engine:
VortexDB automatically detects and tags data types (INT, BOOL, JSON, STRING) at the point of entry, optimizing storage and retrieval logic.

🚀 Runtime Optimizations

LRU Value Cache:
The VLogReader maintains a 2,000-entry Least Recently Used (LRU) cache to eliminate disk seek latency for "hot" data.
Zero-Copy Buffering: The VLogWriter utilizes a 64KB internal buffer to batch disk writes, protecting the SSD from excessive small-write cycles.
Multi-Threaded Winsock Server:
A high-concurrency TCP server using std::shared_mutex (Single-Writer/Multiple-Reader) for thread-safe network access.

(back to top)

---

📐 Usage & Interactive Shell

VortexDB includes a powerful CLI that lets you navigate your database like a directory.

💻 Interactive Shell Example

Vortex [root] > cd users
Vortex [root/users] > put ameetesh {"role": "admin"}
Vortex [root/users] > get ameetesh
{"role": "admin"}
Vortex [root/users] > stats
--- [root_users] HEALTH REPORT ---
   Live Keys     : 1
   Total Entries : 1
   Waste Ratio   : 0.00%

🌐 Network Protocol (TCP Port 8080)

You can communicate with the engine via any TCP client:

-SET <key> <value> - Writes data and returns OK.

-GET <key> - Retrieves value or ERROR.

-GET RANGE <start> <end> - Performs a lexicographical range scan.

-GET <prefix>* - Prefix-based search (e.g., GET user_*).

-SAVE <state_name> - Creates a snapshot of the current DB state.

(back to top)

---

🛡️ Data Integrity & Recovery

VortexDB is built for persistence. Every build and run cycle is protected by:

CRC32 Checksums:
Every VLog entry is hashed. If data is corrupted on disk, the reader detects the mismatch and prevents invalid data from reaching the application.
The Manifest System: A binary registry that tracks all active SSTables. This prevents "Orphaned Files" if the system crashes during a flush.
Automatic VLog Recovery:
On startup, VortexDB scans the Value Log to reconstruct the Memtable if the index file is missing or out of sync.

📊 Benchmarking & Performance Analysis

To validate VortexDB's architecture, I benchmarked it against standard embedded databases: LevelDB and SQLite (in both individual and batched transaction modes). The tests expose the highly asymmetric performance profile of the WiscKey architecture.

System Specifications & Baseline Methodology

CPU: Intel Core i5-12450HX | RAM: 12GB @ 4800 MT/s | Storage: 512GB NVMe SSD | OS: Windows 11 Home (64-bit)
Versions: SQLite v3.51.3 (Amalgamation), LevelDB v1.23
Payload Profile: The following baseline test uses small payloads (keys ~8 bytes, values ~10 bytes). SQLite was run with PRAGMA synchronous = OFF and journal_mode = MEMORY to strip away ACID overhead and test its absolute maximum theoretical memory-throughput.

Comprehensive Results (Memtable Size: 10,000)

Operation	Metric	VortexDB	LevelDB	SQLite (Indiv)	SQLite (Batch)
1. Write (New)	ops/sec	550,221	188,175	11,222	641,786
2. Update (Half)	ops/sec	49,465	188,215	11,932	645,161
3. Read Hit	ms/op (Cold)	0.063	0.0008	0.055	0.055
	ms/op (Hot)	0.059	0.0007	0.054	0.056
4. Read Miss	ms/op	0.001	0.0004	0.051	0.055
5. Range Scan	ms (Cold)	6.58	0.17	0.36	0.33
	ms (Hot)	6.44	0.15	0.28	0.27

📑 Deep Dive: I have aldo benchmarked VortexDB at different Memtable as well as payload size(s). For a detailed breakdown of these numbers, including how VortexDB crushes, matches or lags on certain metrics and the exact architectural reasons behind every one of those, please see benchmark.md.

(back to top)

---

Architectural Breakdown: Why VortexDB Behaves This Way

The numbers above perfectly illustrate the theoretical trade-offs of separating keys from values. Here is the engineering reality behind each metric:

1. Write (New) — The Buffer Advantage (Extremely Fast) VortexDB writes new data significantly faster than LevelDB. Because values are appended sequentially to the Value Log (VLog) and keys are simply inserted into an in-memory std::map, there is zero write-amplification during ingestion. If the Memtable buffer is large enough (10,000+), the SSD is saturated with highly efficient sequential writes.

2. Update (Overwrites) — The Sub-Vortex Penalty (Slow) Notice the massive drop from 550k ops/sec on new writes to 49k ops/sec on updates. This is an intentional architectural trade-off. In VortexDB, the put() function checks if a key already exists to prevent accidentally overwriting a recursive Sub-Database (SUB_VORTEX). This safety check forces a random disk-read to the VLog during the write operation, severely throttling the throughput compared to raw appends.

3. Read Hit — The Disconnected Seek (Slower) VortexDB takes ~0.06 ms for a Read Hit, which is noticeably slower than LevelDB (~0.0008 ms). This is the primary WiscKey trade-off: finding a value requires a two-step process. First, it looks up the key in the SSTable/Memtable to get the pointer, and then it performs a random disk seek into the .vlog file to fetch the actual string. The 2,000-entry LRU Cache helps slightly on "Hot" reads, but cannot beat LevelDB's contiguous memory layouts.

4. Read Miss — The Cuckoo Shield (Extremely Fast) VortexDB resolves non-existent keys in near-zero time (~0.001 ms). Before the engine even attempts to search the Memtable or read a Sparse Index from disk, the in-memory 4-slot Cuckoo Filter mathematically proves the key does not exist and immediately drops the operation.

5. Range Scans — The Random I/O Trap (Slow) This is the most famous limitation of the WiscKey design. Because values are separated from keys, a range scan means iterating sequentially through the SSTable keys, but then firing off hundreds of random, non-sequential disk reads to the VLog to gather the payloads. LevelDB wins here because standard LSM-Trees store keys and values together, turning a range scan into a single, blazing-fast sequential disk read.

🛠️ Tech Stack Decisions

Why C++17: Chosen for deterministic memory management and raw pointer precision. This is strictly necessary for managing the recursive sub-database tree, orchestrating multi-threaded read/write locks, and handling high-performance byte-level buffer I/O.
Why WiscKey over pure LSM: Traditional LSM-trees suffer from severe write amplification when handling larger payloads (like JSON objects or serialized game states). By isolating keys from values, VortexDB trades away sequential Range Scan speed in exchange for drastically reduced compaction overhead and high-velocity continuous ingestion.
Why Winsock2: The networking layer utilizes Windows Sockets 2 directly to maintain absolute, low-level control over TCP_NODELAY flag settings and client thread detachment, avoiding the overhead of heavy third-party networking libraries.

⚠️Technical Trade-offs (Known & Intentional)

VortexDB is a Systems Exploration project. To achieve its specific performance goals, certain trade-offs were made:

Platform Lock-in:
The networking layer utilizes Winsock2, making the server component specific to Windows environments.
Blocking Compaction: To ensure 100% consistency, the compact() operation is a "Stop-The-World" event that locks the database during the merge.
Storage Overhead:
Because it is optimized for speed, VortexDB creates multiple files (VLogs, SSTs, Idx, Manifest). Use compact regularly to prune stale data.
Raw Pointer Management:
The engine uses raw pointers for the recursive sub-db tree to maximize control over the destruction order, requiring careful manual memory management in the VortexDB destructor.

(back to top)

---

📂 Project Structure

/
├── vortex.h / .cpp       # 🧠 The Core Engine & Path Router
├── vlog.h / .cpp         # 📜 WiscKey Value Log (Buffered I/O + LRU)
├── sstable.h / .cpp      # 📑 Sorted String Tables (Sparse Index)
├── cuckoo.h / .cpp       # 🐦 Probabilistic Membership Filter
├── manifest.h / .cpp     # 🗃️ DB State & File Registry
├── server.h / .cpp       # 🌐 Winsock2 Multi-threaded Server
└── main.cpp              # 🐚 Interactive Shell & Entry Point

(back to top)

---

🚀 Getting Started

Prerequisites

Compiler: MSVC (Visual Studio 2019+) or MinGW with C++17 support.
Library: ws2_32.lib (Linked automatically via #pragma).

Build & Run

Clone the repository.
Compile using your preferred C++ compiler: g++ -std=c++17 main.cpp vortex.cpp vlog.cpp sstable.cpp manifest.cpp cuckoo.cpp server.cpp -lws2_32 -o vortexdb.exe
Launch vortexdb.exe.

👨‍💻Author

Built by Ameetesh
B.Tech Undergraduate (South Asian University)
Focused on Distributed Systems, Database Internals, and Performance Engineering.

License

MIT.

(back to top)

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
docs		docs
src		src
LICENSE		LICENSE
README.md		README.md
benchmark.md		benchmark.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VortexDB

A Recursive WiscKey Storage Engine

🔗 Quick Links

What Problem Does VortexDB Solve?

⚡ The Performance Gap

⚙️ Architecture: The "Vortex Hook"

✨Technical Features

🏗️ Storage & Indexing

🚀 Runtime Optimizations

📐 Usage & Interactive Shell

💻 Interactive Shell Example

🌐 Network Protocol (TCP Port 8080)

🛡️ Data Integrity & Recovery

📊 Benchmarking & Performance Analysis

System Specifications & Baseline Methodology

Comprehensive Results (Memtable Size: 10,000)

Architectural Breakdown: Why VortexDB Behaves This Way

🛠️ Tech Stack Decisions

⚠️Technical Trade-offs (Known & Intentional)

📂 Project Structure

🚀 Getting Started

👨‍💻Author

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VortexDB

A Recursive WiscKey Storage Engine

🔗 Quick Links

What Problem Does VortexDB Solve?

⚡ The Performance Gap

⚙️ Architecture: The "Vortex Hook"

✨Technical Features

🏗️ Storage & Indexing

🚀 Runtime Optimizations

📐 Usage & Interactive Shell

💻 Interactive Shell Example

🌐 Network Protocol (TCP Port 8080)

🛡️ Data Integrity & Recovery

📊 Benchmarking & Performance Analysis

System Specifications & Baseline Methodology

Comprehensive Results (Memtable Size: 10,000)

Architectural Breakdown: Why VortexDB Behaves This Way

🛠️ Tech Stack Decisions

⚠️Technical Trade-offs (Known & Intentional)

📂 Project Structure

🚀 Getting Started

👨‍💻Author

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages