LeonardoGPU: High-Performance Local LLM Inference Server

Project Overview

LeonardoGPU is a high-performance inference server designed to run Large Language Models (LLMs) locally on GPU hardware. The project implements two distinct endpoints for different model formats, optimizing for both speed and memory efficiency.

Technical Architecture

1. SafeTensors Endpoint (Port 8000)

  • Uses Hugging Face Transformers for native PyTorch model loading
  • Utilizes CUDA acceleration with FP16 precision
  • Features tuned generation parameters (sketched below):
    • Low temperature (0.3) for near-deterministic outputs
    • Top-p (0.7) and top-k (50) sampling for focused responses
    • Repetition penalty (1.2) to prevent loops
    • N-gram repetition prevention
    • Output length limits with early stopping
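
A minimal sketch of how these parameters could map onto a Transformers generate() call. The model ID, prompt, token budget, and n-gram size here are illustrative assumptions, not the repository's actual code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # illustrative model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # FP16 precision on GPU
    device_map="auto",          # automatic device mapping
    low_cpu_mem_usage=True,     # keep CPU RAM usage low while loading
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,         # length limit
    do_sample=True,
    temperature=0.3,            # near-deterministic sampling
    top_p=0.7,
    top_k=50,
    repetition_penalty=1.2,     # discourage loops
    no_repeat_ngram_size=3,     # n-gram repetition prevention (size assumed)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))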

2. GGUF Endpoint (Port 8001)

  • Leverages llama.cpp for quantized model inference
  • Implements GPU acceleration with configurable layer offloading
  • Optimized batch processing for prompt evaluation
  • Memory-efficient with 4-bit quantization support
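
A sketch of the equivalent llama-cpp-python setup; the model path, context size, and layer count are placeholders, not the project's actual configuration:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/deepseek-r1-distill-qwen-14b-q4.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU; lower this to fit VRAM
    n_batch=512,      # batch size for prompt evaluation
    n_ctx=4096,       # context window
)

result = llm("Hello, how are you?", max_tokens=100, temperature=0.3)
print(result["choices"][0]["text"])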

Performance Optimizations

  • Model preloading at server startup
  • Warm-up inference to prime GPU memory
  • Automatic device mapping for optimal GPU utilization
  • Low CPU memory usage configurations
  • Response post-processing for coherent output
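
One plausible shape for the preload-and-warm-up pattern using FastAPI's lifespan hook; load_model() is a hypothetical stand-in for whichever backend the server uses:

from contextlib import asynccontextmanager
from fastapi import FastAPI

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup instead of per request.
    state["llm"] = load_model()  # hypothetical loader for either backend
    # A short warm-up inference primes CUDA kernels and GPU memory pools.
    state["llm"]("warm-up", max_tokens=1)
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)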

Models Supported

  1. DeepSeek-R1-Distill-Qwen-7B (SafeTensors)
    • 7B parameter model optimized for inference
    • Full FP16 precision on GPU
    • ~14GB VRAM usage
  2. DeepSeek-R1-Distill-Qwen-14B (GGUF)
    • 14B parameter model with 4-bit quantization
    • Reduced memory footprint
    • ~8GB VRAM usage
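
These figures follow from the per-parameter storage cost: FP16 uses 2 bytes per weight, so 7B × 2 B ≈ 14 GB, while 4-bit quantization uses roughly 0.5 bytes per weight, so 14B × 0.5 B ≈ 7 GB, with the remainder going to the KV cache and runtime overhead.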

Technical Requirements

  • NVIDIA GPU with CUDA support
  • 16GB+ VRAM recommended
  • Python 3.8+
  • PyTorch with CUDA
  • FastAPI for API endpoints
  • Transformers library
  • llama-cpp-python for GGUF support
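
All of the Python dependencies are installable from PyPI; an indicative install line, assuming uvicorn as the ASGI server (a GPU-enabled llama-cpp-python build may need to be compiled with its CUDA flags; see that project's docs):

pip install torch transformers fastapi uvicorn llama-cpp-python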

API Endpoints

POST /infer

Request body:

{
  "prompt": "Hello, how are you?"
}
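
A minimal sketch of how such a route could look in FastAPI; the schema and helper names are illustrative assumptions:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferRequest(BaseModel):  # hypothetical request schema
    prompt: str

@app.post("/infer")
def infer(req: InferRequest):
    # generate() stands in for whichever backend serves this port.
    return {"response": generate(req.prompt)}

Each backend would run as its own server process, e.g. with uvicorn on port 8000 (SafeTensors) and port 8001 (GGUF).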

Performance Metrics

  • Average inference time: 2-4 seconds for 100 tokens
  • Memory efficiency: Optimized for consumer GPUs
  • Response coherence: Enhanced with post-processing
  • Token generation speed: ~25-30 tokens/second

Future Improvements

  • Flash Attention 2 support
  • Multi-GPU support
  • Streaming responses
  • Model quantization options
  • Response caching

Development Notes

This project demonstrates expertise in:

  • GPU optimization for ML inference
  • API design and implementation
  • Model loading and memory management
  • Error handling and logging
  • Performance monitoring and metrics

By Adrian Vargas
