πŸš€ LowMemoryLLM is a lightweight C-based LLM inference engine optimized for memory-constrained environments. ✨ Features: πŸ“Š multiple quantization levels (INT8/INT4/INT2) - πŸ’Ύ smart memory management - πŸ”„ efficient KV-cache. Built for edge devices and systems with limited RAM. Pure C, fast and lightweight!


LowMemoryLLM

A framework for running large language models on resource-constrained devices with full precision and high speed.

δΈ­ζ–‡ζ–‡ζ‘£ (Chinese documentation)


Overview

LowMemoryLLM aims to enable large language models (70B+ parameters) to run on low-end hardware (old smartphones, outdated GPUs, limited memory devices) while maintaining:

  • Full precision: Bit-exact output matching FP32/FP16 inference
  • High speed: Near-native inference performance
  • Minimal memory: Run models that exceed available RAM/VRAM

This requires developing novel algorithms that go beyond existing optimization techniques.


Problem Statement

Constraint   Current Solutions          Limitation
Memory       Quantization (INT4/INT8)   Precision loss
Memory       Model pruning              Accuracy degradation
Speed        Smaller models             Reduced capability
Speed        Distillation               Does not run the original model

Our goal: Achieve all three (speed, precision, low memory) simultaneously through new algorithmic approaches.


Research Directions

1. Neural Subspace Projection

Discover low-dimensional structure in model weights and dynamically synthesize weights on-demand rather than storing them entirely.

  • Hypothesis: Model knowledge exists in a compressed manifold
  • Approach: Decompose weights into compact basis + input-dependent coefficients
  • Target: 10x memory reduction with zero precision loss

2. Input-Adaptive Dynamic Computation

Not all inputs require full model capacity. The framework automatically scales computation to input difficulty.

  • Simple queries β†’ Minimal computation path
  • Complex queries β†’ Full model engagement
  • No retraining required for existing models

3. Predictive Asynchronous Pipeline

Predict future computation paths from attention patterns and fully overlap I/O with computation.

  • Analyze attention patterns to predict weight access
  • Prefetch weights before they are needed
  • Goal: Zero effective memory latency

4. Cross-Request Computation Reuse

Build semantic computation cache to identify similarity between inputs and reuse intermediate computations.

  • Cache KV states and intermediate activations
  • Semantic similarity matching across requests
  • Amortize computation cost over similar queries

5. Lossless Adaptive Precision

Dynamic precision selection per token and per layer with mathematical guarantees for precision preservation.

  • Not quantization, but intelligent computation scheduling
  • Higher precision for critical computations
  • Lower precision where mathematically safe

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    LowMemoryLLM Framework                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Inference  β”‚  β”‚  Training   β”‚  β”‚  Model Adaptation   β”‚  β”‚
β”‚  β”‚   Engine    β”‚  β”‚   Engine    β”‚  β”‚      Engine         β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚         β”‚                β”‚                     β”‚             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚              Core Algorithm Layer                       β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚  β”‚  β”‚ Subspace   β”‚ β”‚  Dynamic   β”‚ β”‚    Predictive      β”‚  β”‚  β”‚
β”‚  β”‚  β”‚ Projection β”‚ β”‚ Computationβ”‚ β”‚    Pipeline        β”‚  β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                            β”‚                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚           Hardware Abstraction Layer (HAL)              β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
β”‚  β”‚  β”‚   CPU   β”‚ β”‚   GPU   β”‚ β”‚   NPU   β”‚ β”‚   Storage   β”‚   β”‚  β”‚
β”‚  β”‚  β”‚(x86/ARM)β”‚ β”‚(OpenCL) β”‚ β”‚(Future) β”‚ β”‚  (Disk I/O) β”‚   β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚              Model Integration Layer                     β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
β”‚  β”‚  β”‚ HuggingFace  β”‚  β”‚   Custom     β”‚  β”‚    GGUF      β”‚   β”‚  β”‚
β”‚  β”‚  β”‚   Loader     β”‚  β”‚   Formats    β”‚  β”‚   Support    β”‚   β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Target Performance

Model          Device               Active Memory   Target Speed   Precision
LLaMA-2 7B     Phone (4GB RAM)      <2GB            10+ tokens/s   Lossless
LLaMA-2 70B    Laptop (16GB RAM)    <8GB            2+ tokens/s    Lossless
Mixtral 8x7B   PC (8GB RAM)         <4GB            5+ tokens/s    Lossless
100B+ Models   Desktop (32GB RAM)   <16GB           1+ tokens/s    Lossless

Implementation Status

Completed

  • Hardware Abstraction Layer (HAL)

    • CPU support with AVX2/NEON optimization
    • GPU support via OpenCL
    • Unified memory management interface
  • Quantization Infrastructure

    • INT8/INT4/FP16/FP8 quantization
    • Per-channel quantization
    • Dynamic quantization framework
    • Quantization-Aware Training (QAT)
  • KV-Cache Management

    • Disk offloading
    • Sliding window rotation
    • Cache compression
  • Mixed Precision Training

    • Dynamic loss scaling
    • FP32 weight backup
    • Overflow detection
  • Model Integration

    • HuggingFace model loader
    • Tokenizers (BPE, WordPiece, Unigram, SentencePiece)

In Development

  • Core Algorithms

    • Neural Subspace Projection
    • Input-Adaptive Dynamic Computation
    • Predictive Asynchronous Pipeline
    • Cross-Request Computation Reuse
  • Transformer Implementation

    • Self-Attention mechanism
    • Feed-Forward Networks
    • Layer Normalization
    • Complete inference pipeline
  • Memory Management

    • Layer-wise weight streaming
    • Predictive prefetching
    • Memory pool with LRU eviction

Publication Targets

  • Algorithm design: NeurIPS, ICML, ICLR
  • Systems optimization: MLSys, OSDI, SOSP

Installation

git clone https://github.com/2404589803/LowMemoryLLM.git
cd LowMemoryLLM
mkdir build && cd build
cmake ..
make

Requirements

  • C compiler with C11 support
  • CMake 3.10+
  • (Optional) OpenCL for GPU support
  • (Optional) AVX2/NEON capable CPU

Contributing

Contributions welcome in the following areas:

  • Algorithm research and implementation
  • Performance optimization
  • Hardware backend support
  • Theoretical analysis

License

MIT License - see LICENSE file.
