Vulkan Symbiote Engine

An extreme memory-efficient LLM inference engine designed to run large unquantized models on consumer hardware with minimal VRAM (4-8 GB) and RAM (8-32 GB).

Architecture Overview

Core Components

┌─────────────────────────────────────────────────────────────────┐
│                    VulkanSymbioteEngine                         │
│  ┌─────────────────┐    ┌─────────────────┐                    │
│  │  VitalityOracle │◄──►│ PackScheduler   │                    │
│  │ (Relevance+HW)  │    │ (Prefetch/Unload)│                   │
│  └─────────────────┘    └─────────────────┘                    │
│           │                       │                             │
│           ▼                       ▼                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              NomadPackManager                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │  │
│  │  │ WeightPack  │  │ WeightPack  │  │ WeightPack  │ ...   │  │
│  │  │ (Attention)│  │ (MLP Block) │  │ (Norm/Layr) │       │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘       │  │
│  └───────────────────────────────────────────────────────────┘  │
│           │                       │                             │
│           ▼                       ▼                             │
│  ┌─────────────────┐    ┌─────────────────────────────────────┐ │ │
│  │ FractalMemory   │    │      VulkanComputePools             │ │
│  │ Allocator       │    │  ┌─────────┐ ┌─────────┐           │ │
│  │ (Suballoc)      │    │  │ Shader  │ │ Buffer  │           │ │
│  │                 │    │  │ Cache   │ │ Pool    │           │ │
│  └─────────────────┘    │  └─────────┘ └─────────┘           │ │
│                         └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
            │
            ▼
┌─────────────────────────────────────────────────────────┐
│                    Storage Backend                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │ GGUFLoader  │    │ ZFP/Blosc   │    │ AsyncPrefetcher     │   │
│  │ (Safetens) │    │ Decompressor│    │ (Background I/O)    │   │
│  └─────────────┘    └─────────────┘    └─────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Key Features

Pack-based Memory Management: Loads weights in small, granular "packs" (1-50MB each)
Extreme Memory Efficiency: Never loads entire model into memory
Pure Vulkan Compute: No CUDA/ROCm dependencies
Predictive Prefetching: Smart anticipation of needed weight packs
Fractal Memory Allocation: Zero-fragmentation suballocation
Hot-swappable Shaders: Runtime SPIR-V compilation
Lossless Compression: ZFP/Blosc2 for 2-3x reduction
GPU-Accelerated: All major operations use compute shaders

Quick Start

Installation

# Clone repository
git clone <repository-url>
cd vk_symbiote

# Install dependencies
# Ubuntu/Debian
sudo apt-get install vulkan-dev cmake build-essential

# Build
mkdir build && cd build
cmake ..
make -j$(nproc)

# Run examples
./vk_symbiote_example
./vk_symbiote_benchmark

Basic Usage

#include "vk_symbiote/VulkanSymbioteEngine.h"
#include "vk_symbiote/ConfigManager.h"

using namespace vk_symbiote;

int main(int argc, char* argv[]) {
    // Load configuration from command line or file
    ConfigManager::instance().load_from_args(argc, argv);
    
    // Initialize engine with model path
    VulkanSymbioteEngine engine("/path/to/model.gguf");
    
    // Generate text
    std::string result = engine.generate(
        "Once upon a time in a digital forest", 
        100,  // max tokens
        0.7f   // temperature
    );
    
    std::cout << "Generated: " << result << std::endl;
    
    return 0;
}

Command Line Options

Usage: vk_symbiote_example [OPTIONS]

Model Configuration:
  --model <path>           Path to GGUF model file
  --config <path>          Configuration file path

Memory Configuration:
  --vram-gb <size>        VRAM budget in GB (default: 4)
  --ram-gb <size>         RAM budget in GB (default: 16)
  --prefetch-lookahead <n>  Prefetch lookahead (default: 3)
  --eviction-aggression <f>  Eviction aggression 0.0-1.0 (default: 0.7)

Performance:
  --workgroup-size <x> <y>   Workgroup size (default: 16 16)
  --disable-gpu             Use CPU fallback
  --fp32                    Use FP32 instead of FP16

Logging:
  --verbose, -v             Enable verbose logging
  --log-file <path>        Log to file
  --log-level <level>       Log level: DEBUG, INFO, WARNING, ERROR

Configuration

Configuration File Format

Configuration files use INI format:

[memory]
vram_budget_gb = 4
ram_budget_gb = 16
prefetch_lookahead = 3
eviction_aggression = 0.7
enable_compression = true
compression_algorithm = blosc2
max_packs_in_memory = 64

[performance]
enable_gpu = true
enable_profiling = false
workgroup_size_x = 16
workgroup_size_y = 16
use_subgroup_ops = true
use_fp16_math = true
scale_factor = 1.0

[logging]
log_level = INFO
log_to_file = false
log_file_path = vk_symbiote.log
log_performance = true
log_memory_usage = true
max_log_file_size_mb = 100

[model]
model_path = /path/to/model.gguf

Environment Variables

export VK_SYMBIOTE_VRAM_GB=8
export VK_SYMBIOTE_RAM_GB=32
export VK_SYMBIOTE_PREFETCH=5
export VK_SYMBIOTE_CONFIG=/path/to/config.ini

API Reference

Core Classes

VulkanSymbioteEngine

class VulkanSymbioteEngine {
public:
    VulkanSymbioteEngine(const Path& model_path);
    ~VulkanSymbioteEngine();
    
    // Text generation
    std::string generate(const std::string& prompt, 
                      uint32_t max_tokens = 256, 
                      float temperature = 0.7f);
    
    // Token operations
    std::vector<uint32_t> encode(const std::string& text) const;
    std::string decode(const std::vector<uint32_t>& tokens) const;
    
    // Performance monitoring
    const PerformanceMetrics& get_performance_metrics() const;
    void reset_performance_metrics();
};

PackManager

class PackManager {
public:
    Expected<std::shared_ptr<NomadPack>> get_or_load_pack(uint64 pack_id);
    ExpectedVoid prefetch_packs(const std::vector<uint64>& pack_ids);
    ExpectedVoid evict_until(uint64 bytes_needed);
    
    // Memory statistics
    float vram_utilization() const;
    float ram_utilization() const;
    std::vector<uint64> get_loaded_pack_ids() const;
};

VitalityOracle

class VitalityOracle {
public:
    VitalityScore score_pack(const PackMetadata& pack, 
                          const std::vector<uint32_t>& tokens,
                          uint32_t current_layer, 
                          const HardwareTelemetry& telemetry);
    
    std::vector<uint64> predict_next_packs(const std::vector<uint32_t>& tokens,
                                           uint32_t current_layer, 
                                           uint32 lookahead = 3, 
                                           uint32 count = 8);
    
    // Performance feedback
    void record_access(uint64 pack_id, bool was_used, float predicted_score);
    void update_model();
};

ShaderRuntime

class ShaderRuntime {
public:
    struct ShaderSpecialization {
        uint32_t workgroup_size_x = 16;
        uint32_t workgroup_size_y = 16;
        bool use_subgroup_ops = true;
        bool use_fp16_math = true;
    };
    
    VkPipeline get_fused_matmul_rope_pipeline(const ShaderSpecialization& spec);
    VkPipeline get_attention_pipeline(const ShaderSpecialization& spec);
    VkPipeline get_feedforward_pipeline(const ShaderSpecialization& spec);
    VkPipeline get_rms_norm_pipeline(const ShaderSpecialization& spec);
    
    void dispatch_compute(VkPipeline pipeline, 
                        VkDescriptorSet descriptor_set,
                        uint32_t group_count_x,
                        uint32_t group_count_y = 1,
                        uint32_t group_count_z = 1);
};

Performance Optimization

Memory Optimization

Pack Configuration: Adjust pack sizes based on model and hardware
Prefetch Strategy: Optimize lookahead for your usage pattern
Eviction Aggression: Balance memory pressure vs. cache hits
Compression: Enable lossless compression for larger models

GPU Optimization

Workgroup Sizing: Match your GPU's compute units
FP16 Math: Enable for better performance on supported hardware
Subgroup Operations: Use for parallel reduction
Batch Sizing: Process multiple tokens when possible

Scheduling Optimization

// Example: Optimize for throughput
ConfigManager::instance().performance().prefetch_lookahead = 5;
ConfigManager::instance().performance().workgroup_size_x = 32;
ConfigManager::instance().memory().eviction_aggression = 0.5f;

Performance Monitoring

// Enable performance profiling
ConfigManager::instance().performance().enable_profiling = true;

// Monitor in real-time
const auto& metrics = engine.get_performance_metrics();
std::cout << "Tokens/sec: " << metrics.average_tokens_per_second << std::endl;
std::cout << "GPU utilization: " << metrics.gpu_utilization_percent << "%" << std::endl;
std::cout << "Memory usage: " << (metrics.total_gpu_time_ns / 1e9f) << " MB/s" << std::endl;

Troubleshooting

Common Issues

Out of Memory

# Reduce memory budgets
--vram-gb 2
--ram-gb 8

# Increase eviction aggression
--eviction-aggression 0.9

# Disable compression
--no-compression

Slow Performance

# Disable GPU (use CPU fallback)
--disable-gpu

# Increase workgroup size
--workgroup-size 32 32

# Disable FP16
--fp32

# Disable subgroups
--no-subgroups

Vulkan Errors

# Enable verbose logging
--verbose

# Check GPU support
vulkaninfo --summary

# Force integrated GPU
--force-integrated-gpu

Model Loading Issues

# Check model format
file model.gguf

# Verify model path
ls -la /path/to/model.gguf

# Check available memory
free -h

# Enable debug logging
--log-level DEBUG

Developer Guide

Adding New Compute Kernels

Create GLSL shader in shaders/ directory
Use specialization constants for runtime parameters
Implement proper memory barriers
Add subgroup optimizations where possible

Extending Pack Types

Update PackType enum in Common.h
Add metadata to PackMetadata struct
Update pack generation logic in GGUFLoader.cpp
Implement specialized loading/unloading

Custom Compression

Implement CompressionBackend interface
Add to build system in CMakeLists.txt
Register in Compression.cpp
Update pack manager to use new algorithm

Performance Profiling

// Enable profiling in code
ConfigManager::instance().performance().enable_profiling = true;

// Access timing data
const auto& metrics = engine.get_performance_metrics();
uint64_t avg_inference_time = metrics.total_inference_time_ns / metrics.total_inferences;

Memory Debugging

// Enable memory usage logging
ConfigManager::instance().logging().log_memory_usage = true;

// Monitor pack statistics
const auto& vram_stats = pack_manager.get_vram_stats();
const auto& ram_stats = pack_manager.get_ram_stats();

// Force memory pressure
ConfigManager::instance().memory().vram_budget_gb = 1;

License

[License details to be added]

Contributing

This is a research-grade prototype. Key areas for improvement:

Additional model architectures (Mistral, Qwen, etc.)
Multi-GPU support
Advanced compression algorithms
CPU fallback implementations
Production-ready error handling

Acknowledgments

Built with:

Vulkan 1.3+ compute shaders
VMA memory allocation
GGUF model format support
Modern C++20 features
Blosc2 compression library

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
build		build
compression		compression
examples		examples
gui		gui
vk_symbiote		vk_symbiote
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
IMPLEMENTATION_AUDIT.md		IMPLEMENTATION_AUDIT.md
LICENSE		LICENSE
PRIORITY2_SUMMARY.md		PRIORITY2_SUMMARY.md
PRIORITY3_SUMMARY.md		PRIORITY3_SUMMARY.md
README.md		README.md
architecture_design.md		architecture_design.md
build.sh		build.sh
build_output.log		build_output.log
build_output2.log		build_output2.log
build_output3.log		build_output3.log
config.example.toml		config.example.toml

Folders and files

Latest commit

History

Repository files navigation

Vulkan Symbiote Engine

Table of Contents

Architecture Overview

Core Components

Key Features

Quick Start

Installation

Basic Usage

Command Line Options

Configuration

Configuration File Format

Environment Variables

API Reference

Core Classes

VulkanSymbioteEngine

PackManager

VitalityOracle

ShaderRuntime

Performance Optimization

Memory Optimization

GPU Optimization

Scheduling Optimization

Performance Monitoring

Troubleshooting

Common Issues

Out of Memory

Slow Performance

Vulkan Errors

Model Loading Issues

Developer Guide

Adding New Compute Kernels

Extending Pack Types

Custom Compression

Performance Profiling

Memory Debugging

License

Contributing

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages