Treating model deployment as a hardware-aware optimization problem.
This project is an automated inference orchestration engine. Instead of requiring you to select model configurations by hand, the runtime profiles the host hardware at startup, queries a learned policy to predict latency, and dynamically compiles the best model variant (pruning + quantization) for the current resource budget.
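As a rough illustration of what a startup hardware profile might look like, here is a minimal sketch using only the Python standard library. It is a stand-in, not the project's calibration code (the project lists psutil for telemetry); the function names, buffer size, and returned keys are all assumptions made for this example.

```python
import os
import time


def measure_copy_bandwidth(size_mb: int = 16, repeats: int = 3) -> float:
    """Estimate memory copy bandwidth in GB/s by timing large buffer copies."""
    buf = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = bytes(buf)  # forces a full copy of the buffer
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return (size_mb / 1024) / max(best, 1e-9)


def measure_cpu_score(iterations: int = 200_000) -> float:
    """Crude single-thread throughput: Python loop iterations per second."""
    start = time.perf_counter()
    acc = 0
    for i in range(iterations):
        acc += i * i
    elapsed = time.perf_counter() - start
    return iterations / max(elapsed, 1e-9)


def calibrate() -> dict:
    """Collect a small hardware profile (hypothetical schema)."""
    return {
        "logical_cpus": os.cpu_count() or 1,
        "copy_bandwidth_gb_per_s": measure_copy_bandwidth(),
        "cpu_iters_per_sec": measure_cpu_score(),
    }


profile = calibrate()
print(profile)
```

Numbers like these could then serve as input features for the latency predictor, alongside the model configuration itself.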
This is a Self-Optimizing Runtime:
- Cold-Start Calibration: Profiles CPU/Memory bandwidth at runtime to understand the specific environment's performance profile.
- Learned Policy Engine: Uses a Random Forest Regressor trained on historical benchmark data to predict inference latency across a multi-dimensional search space (Pruning Ratios, Quantization, Model Size).
- Multi-Objective Orchestrator: Balances Latency and Perplexity using an Optimization Efficiency Score (OES) to navigate the Pareto frontier, rather than relying on brittle hardcoded rules.
- RESTful Inference Gateway: Exposes the compiler/runtime via FastAPI for integration into larger production systems.
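To make the Pareto-frontier idea concrete, the sketch below filters candidate configurations down to those that are not dominated on both latency and perplexity (lower is better for each). The `Candidate` type and function names are hypothetical, and the third candidate is a made-up value added only to show a dominated point; the first two use the figures from the results table.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    latency_s: float
    perplexity: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.latency_s <= b.latency_s and a.perplexity <= b.perplexity
    better = a.latency_s < b.latency_s or a.perplexity < b.perplexity
    return no_worse and better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


candidates = [
    Candidate("baseline", 0.0320, 22.79),
    Candidate("optimized", 0.0035, 100.04),
    Candidate("illustrative-dominated", 0.0400, 30.0),  # made-up example point
]

frontier = pareto_frontier(candidates)
print([c.name for c in frontier])  # ['baseline', 'optimized']
```

A scalar score such as OES can then be used to pick a single point from the frontier, which is where the trade-off between speed and quality gets decided.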
We evaluated 20+ configurations across GPT-2 and DistilGPT-2 architectures. The intelligent compiler consistently selected configurations that maximized the Optimization Efficiency Score (OES).
| Configuration | Latency (s) | Perplexity | OES |
|---|---|---|---|
| Baseline (GPT-2, No Opt) | 0.0320 | 22.79 | 1.37 |
| Optimized (DistilGPT-2, P40, Q) | 0.0035 | 100.04 | 2.85 |
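Efficiency Improvement: the optimized configuration's OES is about 108% higher than the baseline's (2.85 vs. 1.37). The gain comes from a roughly 9x lower latency, at the cost of perplexity rising from 22.79 to 100.04.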
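The improvement figure can be reproduced directly from the table. The sketch below also checks one assumption: this README does not define OES, but the reported values are close to `1 / (latency × perplexity)`, so that formula is used here purely as a guess that happens to fit the two rows (the small gap on the optimized row is within rounding of the reported latency). The function name `oes_guess` is hypothetical.

```python
def oes_guess(latency_s: float, perplexity: float) -> float:
    """ASSUMED formula, inferred from the table; not confirmed by the project."""
    return 1.0 / (latency_s * perplexity)


# Values copied from the results table.
baseline_oes, optimized_oes = 1.37, 2.85

# Relative improvement computed from the reported OES values.
improvement_pct = (optimized_oes - baseline_oes) / baseline_oes * 100
print(f"OES improvement: {improvement_pct:.0f}%")

# Compare the assumed formula against the reported scores.
baseline_guess = oes_guess(0.0320, 22.79)
optimized_guess = oes_guess(0.0035, 100.04)
print(f"baseline:  reported {baseline_oes}, assumed formula {baseline_guess:.3f}")
print(f"optimized: reported {optimized_oes}, assumed formula {optimized_guess:.3f}")
```

If the real definition differs (for example, by weighting latency and perplexity differently), only `oes_guess` would need to change; the percentage calculation depends solely on the reported scores.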
- Intelligence: Scikit-Learn (Random Forest Regression)
- Runtime: PyTorch, ONNX Runtime
- Deployment: FastAPI (Inference Gateway)
- Profiling: psutil (Hardware Telemetry)
Generate the performance landscape by running the full sweep:

    python main.py

Launch the intelligent runtime, which calibrates to your machine and selects the best model:

    python runtime.py

Spin up the Inference API to serve recommendations dynamically:

    uvicorn api:app --reload
    # Access docs at http://127.0.0.1:8000/docs

Instead of applying optimizations directly, this system builds a decision engine over them. By separating the Benchmarking Engine from the Runtime Policy, the project enables deployment that adapts to the specific hardware, from low-end edge devices to powerful workstations, without changing a single line of code.