Feature Description
Introduce a clean separation between the LLM export workflow (which relies on heavy ML frameworks like transformers and torch) and the runtime inference workflow (which should remain lightweight and depend only on onnxruntime and tokenizers).
The goal is to export the Llama model once during development, then run it in production via an ONNX-based provider without requiring any large ML dependencies.
Motivation / Use Case
- The runtime environment should remain minimal and fast to install. Users should not be forced to install large frameworks like PyTorch or Transformers just to use a pre-exported model.
- The project should support a clean architecture:
Developers export the model; production only loads ONNX.
- Reducing dependency size improves installation speed, distribution size, and deployment simplicity.
Proposed Implementation
-
Add an llm-export (or similar) optional dependency group that includes:
transformers
torch
optimum (optional)
-
Provide a CLI or script to export:
meta-llama/Llama-3.2-3B-Instruct → llama32-3b.onnx
- Tokenizer →
tokenizer.json
-
Create a new ONNX-based provider (e.g. OnnxGPUProvider) that:
- Loads the exported model and tokenizer
- Implements
ILLMProvider using ONNX Runtime + tokenizers
-
Keep runtime dependencies in pyproject.toml minimal:
onnxruntime-gpu (or CPU version)
tokenizers
- No
torch, transformers, or other ML frameworks
-
Retain the original GPUProvider (torch-based) as an optional developer-only component.
Additional Context
This feature formalizes the architecture shift toward a two-stage workflow:
-
Development / Export Stage
Requires heavy ML dependencies
→ Export ONNX model + tokenizer files
-
Runtime / Deployment Stage
Requires only lightweight inference libraries
→ Load ONNX model and run inference through a clean provider
This ensures a fast, minimal, dependency-light production environment while preserving full flexibility for contributors working with LLMs.
Related Work
#99
Feature Description
Introduce a clean separation between the LLM export workflow (which relies on heavy ML frameworks like
transformersandtorch) and the runtime inference workflow (which should remain lightweight and depend only ononnxruntimeandtokenizers).The goal is to export the Llama model once during development, then run it in production via an ONNX-based provider without requiring any large ML dependencies.
Motivation / Use Case
Developers export the model; production only loads ONNX.
Proposed Implementation
Add an
llm-export(or similar) optional dependency group that includes:transformerstorchoptimum(optional)Provide a CLI or script to export:
meta-llama/Llama-3.2-3B-Instruct→llama32-3b.onnxtokenizer.jsonCreate a new ONNX-based provider (e.g.
OnnxGPUProvider) that:ILLMProviderusing ONNX Runtime +tokenizersKeep runtime dependencies in
pyproject.tomlminimal:onnxruntime-gpu(or CPU version)tokenizerstorch,transformers, or other ML frameworksRetain the original
GPUProvider(torch-based) as an optional developer-only component.Additional Context
This feature formalizes the architecture shift toward a two-stage workflow:
Development / Export Stage
Requires heavy ML dependencies
→ Export ONNX model + tokenizer files
Runtime / Deployment Stage
Requires only lightweight inference libraries
→ Load ONNX model and run inference through a clean provider
This ensures a fast, minimal, dependency-light production environment while preserving full flexibility for contributors working with LLMs.
Related Work
#99