Autonomous LLM Inference Runtime

Treating model deployment as a hardware-aware optimization problem.

This project is an automated inference orchestration engine. Instead of requiring a manually selected model configuration, the runtime profiles the host hardware at startup, queries a learned policy to predict latency, and dynamically compiles the best model variant (pruning + quantization) for the current resource budget.

🚀 The Architecture

This is a Self-Optimizing Runtime:

Cold-Start Calibration: Profiles CPU/memory bandwidth at startup to characterize the performance of the specific host environment.
Learned Policy Engine: Uses a Random Forest Regressor trained on historical benchmark data to predict inference latency across a multi-dimensional search space (Pruning Ratios, Quantization, Model Size).
Multi-Objective Orchestrator: Balances Latency and Perplexity using an Optimization Efficiency Score (OES) to navigate the Pareto frontier, rather than relying on brittle hardcoded rules.
RESTful Inference Gateway: Exposes the compiler/runtime via FastAPI for integration into larger production systems.
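To make the "Pareto frontier" idea concrete, here is a minimal sketch of how dominated configurations could be filtered out before any scoring takes place. It is not taken from the repository: the `Candidate` class, the `dominates` and `pareto_frontier` helpers, and the four sample points are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One point in the search space (hypothetical, illustrative values only)."""
    name: str
    latency_s: float   # predicted latency in seconds, lower is better
    perplexity: float  # perplexity, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.latency_s <= b.latency_s and a.perplexity <= b.perplexity
    strictly_better = a.latency_s < b.latency_s or a.perplexity < b.perplexity
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up sample points, not benchmark results from this project.
candidates = [
    Candidate("A", 0.030, 23.0),
    Candidate("B", 0.020, 30.0),
    Candidate("C", 0.025, 35.0),  # slower and less accurate than B
    Candidate("D", 0.005, 90.0),
]
frontier = pareto_frontier(candidates)
print([c.name for c in frontier])
```

Only the surviving candidates represent genuine trade-offs between speed and quality; a scalar score such as OES can then rank them.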

📊 Experimental Results

We evaluated 20+ configurations across GPT-2 and DistilGPT-2 architectures. The intelligent compiler consistently selected configurations that maximized the Optimization Efficiency Score (OES).

Configuration	Latency (s)	Perplexity	OES
Baseline (GPT-2, No Opt)	0.0320	22.79	1.37
Optimized (DistilGPT-2, P40, Q)	0.0035	100.04	2.85

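Efficiency Improvement: ~108% increase in OES over the GPT-2 baseline (1.37 to 2.85). The optimized configuration is roughly 9x faster, at the cost of higher perplexity (22.79 to 100.04).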
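The README does not state how OES is computed. One formula that matches the table to within rounding of the printed values is the reciprocal of latency times perplexity. The sketch below checks that assumption against the two rows above; the `oes` function is a hypothetical reconstruction, not the project's confirmed definition.

```python
def oes(latency_s: float, perplexity: float) -> float:
    """Assumed score (not confirmed by the source): 1 / (latency * perplexity)."""
    return 1.0 / (latency_s * perplexity)


# Latency and perplexity values copied from the results table.
baseline = oes(0.0320, 22.79)    # GPT-2, no optimization
optimized = oes(0.0035, 100.04)  # DistilGPT-2, P40, Q

# Relative improvement of the optimized configuration over the baseline.
improvement_pct = (optimized - baseline) / baseline * 100

print(f"baseline OES:  {baseline:.3f}")
print(f"optimized OES: {optimized:.3f}")
print(f"improvement:   {improvement_pct:.0f}%")
```

Under this assumed formula, lower latency and lower perplexity both raise the score, so a large latency reduction can outweigh a sizeable perplexity increase.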

🛠 Tech Stack

Intelligence: Scikit-Learn (Random Forest Regression)
Runtime: PyTorch, ONNX Runtime
Deployment: FastAPI (Inference Gateway)
Profiling: psutil (Hardware Telemetry)

⚡ Quick Start

1. The Lab (Benchmarking)

Generate the performance landscape by running the full sweep:

python main.py

2. The Engine (Inference)

Launch the intelligent runtime, which calibrates to your machine and selects the best model:

python runtime.py
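As a rough illustration of what a cold-start calibration step can involve, the sketch below times a large buffer copy to estimate memory bandwidth and records the CPU count. It uses only the standard library, whereas the project lists psutil for telemetry. The function names `probe_memory_bandwidth` and `calibrate` are hypothetical, and the actual logic in runtime.py may differ.

```python
import os
import time


def probe_memory_bandwidth(size_mb: int = 32, repeats: int = 3) -> float:
    """Estimate copy bandwidth in GB/s by timing a full copy of a large buffer."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _copy = bytes(src)  # reads all of src and writes a new buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    best = max(best, 1e-9)
    return (size_mb / 1024) / best


def calibrate() -> dict:
    """Collect a small hardware profile (hypothetical structure)."""
    return {
        "logical_cpus": os.cpu_count(),  # may be None on some platforms
        "copy_bandwidth_gb_s": probe_memory_bandwidth(),
    }


profile = calibrate()
print(profile)
```

Taking the best of several repeats reduces noise from other processes. A profile like this could then be passed as features to the latency predictor.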

3. The Gateway (API)

Spin up the Inference API to serve recommendations dynamically:

uvicorn api:app --reload
# Access docs at http://127.0.0.1:8000/docs

💡 Why this is different

Instead of merely applying optimizations, this system builds a decision engine over them. By separating the Benchmarking Engine from the Runtime Policy, the project enables deployment that adapts to the specific hardware, from low-end edge devices to powerful workstations, without changing a single line of code.
