Skip to content

AmeeteSh-A/VortexDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VortexDB

A Recursive WiscKey Storage Engine

Language Platform License


🔗 Quick Links


What Problem Does VortexDB Solve?

In traditional LSM-Trees (like LevelDB), both keys and values are moved during compaction. When values are large, this causes massive Write Amplification, slowing down the entire system.

VortexDB solves this by implementing a Separation of Concerns:

  • Keys remain in the LSM-Tree (SSTables) for fast indexing.
  • Values are stored in an append-only Value Log (VLog).

This ensures that during compaction, the engine only reshuffles small pointers, leaving the heavy data untouched.

(back to top)


⚡ The Performance Gap

Metric Standard LSM-Tree VortexDB (WiscKey)
Compaction Cost High (Keys + Values moved) Low (Only Keys moved)
Write Amplification O(depth of tree) Near-Constant O(1)
Large Value Handling Slows down merging Native Efficiency
Organization Flat Key-Space Recursive Sub-Vortices

(back to top)


⚙️ Architecture: The "Vortex Hook"

VortexDB's standout feature is its ability to spawn child databases within parent databases, creating a tree-like storage hierarchy similar to a file system.

VortexDB uses a custom routing logic that resolves paths (e.g., root/users/configs/theme). Each "Sub-Vortex" is a fully independent VortexDB instance with its own Memtable, SSTables, and VLog.


sequenceDiagram
    participant Client
    participant Server
    participant Memtable
    participant VLog
    participant SSTable

    Client->>Server: SET "user_1" "data_payload"
    Server->>Server: Infer Type (JSON/String/Int)
    Server->>VLog: Append Entry + CRC32
    VLog-->>Server: Return ValuePointer {file_id, offset, size}
    Server->>Memtable: Insert Key + ValuePointer
    Note over Memtable: If size > 100k
    Memtable->>SSTable: Flush (Write Sparse Index)
    SSTable->>Server: Register in Manifest
Loading

(back to top)


✨Technical Features

🏗️ Storage & Indexing

  • Sparse Indexing:
    SSTables store a key-offset every 100 keys. This allows O(logN) lookups with minimal memory footprint.
  • Cuckoo Membership Filter:
    A 4-slot bucket Cuckoo Filter intercepts "ghost reads" before they hit the disk, significantly reducing I/O for non-existent keys.
  • Type Inference Engine:
    VortexDB automatically detects and tags data types (INT, BOOL, JSON, STRING) at the point of entry, optimizing storage and retrieval logic.

🚀 Runtime Optimizations

  • LRU Value Cache:
    The VLogReader maintains a 2,000-entry Least Recently Used (LRU) cache to eliminate disk seek latency for "hot" data.
  • Zero-Copy Buffering: The VLogWriter utilizes a 64KB internal buffer to batch disk writes, protecting the SSD from excessive small-write cycles.
  • Multi-Threaded Winsock Server:
    A high-concurrency TCP server using std::shared_mutex (Single-Writer/Multiple-Reader) for thread-safe network access.

(back to top)

---

📐 Usage & Interactive Shell

VortexDB includes a powerful CLI that lets you navigate your database like a directory.

💻 Interactive Shell Example

Vortex [root] > cd users
Vortex [root/users] > put ameetesh {"role": "admin"}
Vortex [root/users] > get ameetesh
{"role": "admin"}
Vortex [root/users] > stats
--- [root_users] HEALTH REPORT ---
   Live Keys     : 1
   Total Entries : 1
   Waste Ratio   : 0.00%

🌐 Network Protocol (TCP Port 8080)

You can communicate with the engine via any TCP client:

-SET <key> <value> - Writes data and returns OK.

-GET <key> - Retrieves value or ERROR.

-GET RANGE <start> <end> - Performs a lexicographical range scan.

-GET <prefix>* - Prefix-based search (e.g., GET user_*).

-SAVE <state_name> - Creates a snapshot of the current DB state.

(back to top)

---

🛡️ Data Integrity & Recovery

VortexDB is built for persistence. Every build and run cycle is protected by:

  • CRC32 Checksums:
    Every VLog entry is hashed. If data is corrupted on disk, the reader detects the mismatch and prevents invalid data from reaching the application.
  • The Manifest System: A binary registry that tracks all active SSTables. This prevents "Orphaned Files" if the system crashes during a flush.
  • Automatic VLog Recovery:
    On startup, VortexDB scans the Value Log to reconstruct the Memtable if the index file is missing or out of sync.

📊 Benchmarking & Performance Analysis

To validate VortexDB's architecture, I benchmarked it against standard embedded databases: LevelDB and SQLite (in both individual and batched transaction modes). The tests expose the highly asymmetric performance profile of the WiscKey architecture.

System Specifications & Baseline Methodology

  • CPU: Intel Core i5-12450HX | RAM: 12GB @ 4800 MT/s | Storage: 512GB NVMe SSD | OS: Windows 11 Home (64-bit)
  • Versions: SQLite v3.51.3 (Amalgamation), LevelDB v1.23
  • Payload Profile: The following baseline test uses small payloads (keys ~8 bytes, values ~10 bytes). SQLite was run with PRAGMA synchronous = OFF and journal_mode = MEMORY to strip away ACID overhead and test its absolute maximum theoretical memory-throughput.

Comprehensive Results (Memtable Size: 10,000)

Operation Metric VortexDB LevelDB SQLite (Indiv) SQLite (Batch)
1. Write (New) ops/sec 550,221 188,175 11,222 641,786
2. Update (Half) ops/sec 49,465 188,215 11,932 645,161
3. Read Hit ms/op (Cold) 0.063 0.0008 0.055 0.055
ms/op (Hot) 0.059 0.0007 0.054 0.056
4. Read Miss ms/op 0.001 0.0004 0.051 0.055
5. Range Scan ms (Cold) 6.58 0.17 0.36 0.33
ms (Hot) 6.44 0.15 0.28 0.27

📑 Deep Dive: I have aldo benchmarked VortexDB at different Memtable as well as payload size(s). For a detailed breakdown of these numbers, including how VortexDB crushes, matches or lags on certain metrics and the exact architectural reasons behind every one of those, please see benchmark.md.

(back to top)

---

Architectural Breakdown: Why VortexDB Behaves This Way

The numbers above perfectly illustrate the theoretical trade-offs of separating keys from values. Here is the engineering reality behind each metric:

1. Write (New) — The Buffer Advantage (Extremely Fast) VortexDB writes new data significantly faster than LevelDB. Because values are appended sequentially to the Value Log (VLog) and keys are simply inserted into an in-memory std::map, there is zero write-amplification during ingestion. If the Memtable buffer is large enough (10,000+), the SSD is saturated with highly efficient sequential writes.

2. Update (Overwrites) — The Sub-Vortex Penalty (Slow) Notice the massive drop from 550k ops/sec on new writes to 49k ops/sec on updates. This is an intentional architectural trade-off. In VortexDB, the put() function checks if a key already exists to prevent accidentally overwriting a recursive Sub-Database (SUB_VORTEX). This safety check forces a random disk-read to the VLog during the write operation, severely throttling the throughput compared to raw appends.

3. Read Hit — The Disconnected Seek (Slower) VortexDB takes ~0.06 ms for a Read Hit, which is noticeably slower than LevelDB (~0.0008 ms). This is the primary WiscKey trade-off: finding a value requires a two-step process. First, it looks up the key in the SSTable/Memtable to get the pointer, and then it performs a random disk seek into the .vlog file to fetch the actual string. The 2,000-entry LRU Cache helps slightly on "Hot" reads, but cannot beat LevelDB's contiguous memory layouts.

4. Read Miss — The Cuckoo Shield (Extremely Fast) VortexDB resolves non-existent keys in near-zero time (~0.001 ms). Before the engine even attempts to search the Memtable or read a Sparse Index from disk, the in-memory 4-slot Cuckoo Filter mathematically proves the key does not exist and immediately drops the operation.

5. Range Scans — The Random I/O Trap (Slow) This is the most famous limitation of the WiscKey design. Because values are separated from keys, a range scan means iterating sequentially through the SSTable keys, but then firing off hundreds of random, non-sequential disk reads to the VLog to gather the payloads. LevelDB wins here because standard LSM-Trees store keys and values together, turning a range scan into a single, blazing-fast sequential disk read.


🛠️ Tech Stack Decisions

  • Why C++17: Chosen for deterministic memory management and raw pointer precision. This is strictly necessary for managing the recursive sub-database tree, orchestrating multi-threaded read/write locks, and handling high-performance byte-level buffer I/O.
  • Why WiscKey over pure LSM: Traditional LSM-trees suffer from severe write amplification when handling larger payloads (like JSON objects or serialized game states). By isolating keys from values, VortexDB trades away sequential Range Scan speed in exchange for drastically reduced compaction overhead and high-velocity continuous ingestion.
  • Why Winsock2: The networking layer utilizes Windows Sockets 2 directly to maintain absolute, low-level control over TCP_NODELAY flag settings and client thread detachment, avoiding the overhead of heavy third-party networking libraries.

⚠️Technical Trade-offs (Known & Intentional)

VortexDB is a Systems Exploration project. To achieve its specific performance goals, certain trade-offs were made:

  • Platform Lock-in:
    The networking layer utilizes Winsock2, making the server component specific to Windows environments.
  • Blocking Compaction: To ensure 100% consistency, the compact() operation is a "Stop-The-World" event that locks the database during the merge.
  • Storage Overhead:
    Because it is optimized for speed, VortexDB creates multiple files (VLogs, SSTs, Idx, Manifest). Use compact regularly to prune stale data.
  • Raw Pointer Management:
    The engine uses raw pointers for the recursive sub-db tree to maximize control over the destruction order, requiring careful manual memory management in the VortexDB destructor.

(back to top)

---

📂 Project Structure

/
├── vortex.h / .cpp       # 🧠 The Core Engine & Path Router
├── vlog.h / .cpp         # 📜 WiscKey Value Log (Buffered I/O + LRU)
├── sstable.h / .cpp      # 📑 Sorted String Tables (Sparse Index)
├── cuckoo.h / .cpp       # 🐦 Probabilistic Membership Filter
├── manifest.h / .cpp     # 🗃️ DB State & File Registry
├── server.h / .cpp       # 🌐 Winsock2 Multi-threaded Server
└── main.cpp              # 🐚 Interactive Shell & Entry Point

(back to top)

---

🚀 Getting Started

Prerequisites

  • Compiler: MSVC (Visual Studio 2019+) or MinGW with C++17 support.
  • Library: ws2_32.lib (Linked automatically via #pragma).

Build & Run

  • Clone the repository.
  • Compile using your preferred C++ compiler: g++ -std=c++17 main.cpp vortex.cpp vlog.cpp sstable.cpp manifest.cpp cuckoo.cpp server.cpp -lws2_32 -o vortexdb.exe
  • Launch vortexdb.exe.

👨‍💻Author

Built by Ameetesh
B.Tech Undergraduate (South Asian University)
Focused on Distributed Systems, Database Internals, and Performance Engineering.


License

MIT.

(back to top)

About

A high-performance, recursive key-value storage engine built in C++17. Implements a WiscKey-style architecture (separating keys from values) to eliminate write-amplification, featuring a custom TCP networking layer, in-memory Cuckoo filtering, and hierarchical sub-databases.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages