Feature: Decouple LLM export workflow from ONNX runtime inference

## Feature Description

Introduce a clean separation between the **LLM export workflow** (which relies on heavy ML frameworks like `transformers` and `torch`) and the **runtime inference workflow** (which should remain lightweight and depend only on `onnxruntime` and `tokenizers`).
The goal is to export the Llama model once during development, then run it in production via an ONNX-based provider without requiring any large ML dependencies.


## Motivation / Use Case

- The runtime environment should remain minimal and fast to install. Users should not be forced to install large frameworks like PyTorch or Transformers just to use a pre-exported model.
- The project should support a clean architecture:
  **Developers export the model; production only loads ONNX.**
- Reducing dependency size improves installation speed, distribution size, and deployment simplicity.


## Proposed Implementation

- Add an `llm-export` (or similar) optional dependency group that includes:

  - `transformers`
  - `torch`
  - `optimum` (optional)

- Provide a CLI or script to export:

  - `meta-llama/Llama-3.2-3B-Instruct` → `llama32-3b.onnx`
  - Tokenizer → `tokenizer.json`

- Create a new ONNX-based provider (e.g. `OnnxGPUProvider`) that:

  - Loads the exported model and tokenizer
  - Implements `ILLMProvider` using ONNX Runtime + `tokenizers`

- Keep runtime dependencies in `pyproject.toml` minimal:

  - `onnxruntime-gpu` (or CPU version)
  - `tokenizers`
  - No `torch`, `transformers`, or other ML frameworks

- Retain the original `GPUProvider` (torch-based) as an optional developer-only component.


## Additional Context

This feature formalizes the architecture shift toward a two-stage workflow:

1. **Development / Export Stage**
   Requires heavy ML dependencies
   → Export ONNX model + tokenizer files

2. **Runtime / Deployment Stage**
   Requires only lightweight inference libraries
   → Load ONNX model and run inference through a clean provider

This ensures a fast, minimal, dependency-light production environment while preserving full flexibility for contributors working with LLMs.


## Related Work
#99

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Decouple LLM export workflow from ONNX runtime inference #100

Feature Description

Motivation / Use Case

Proposed Implementation

Additional Context

Related Work

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Feature: Decouple LLM export workflow from ONNX runtime inference #100

Description

Feature Description

Motivation / Use Case

Proposed Implementation

Additional Context

Related Work

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions