Feature: Decouple LLM export workflow from ONNX runtime inference #100

@jieyao-MilestoneHub

Feature Description

Introduce a clean separation between the LLM export workflow (which relies on heavy ML frameworks like transformers and torch) and the runtime inference workflow (which should remain lightweight and depend only on onnxruntime and tokenizers).
The goal is to export the Llama model once during development, then run it in production via an ONNX-based provider without requiring any large ML dependencies.

Motivation / Use Case

  • The runtime environment should remain minimal and fast to install. Users should not be forced to install large frameworks like PyTorch or Transformers just to use a pre-exported model.
  • The project should support a clean architecture: developers export the model, and production loads only the ONNX artifacts.
  • Reducing dependency size improves installation speed, distribution size, and deployment simplicity.

Proposed Implementation

  • Add an llm-export (or similar) optional dependency group that includes:

    • transformers
    • torch
    • optimum (optional)
  • Provide a CLI or script to export:

    • meta-llama/Llama-3.2-3B-Instruct → llama32-3b.onnx
    • Tokenizer → tokenizer.json
  • Create a new ONNX-based provider (e.g. OnnxGPUProvider) that:

    • Loads the exported model and tokenizer
    • Implements ILLMProvider using ONNX Runtime + tokenizers
  • Keep runtime dependencies in pyproject.toml minimal:

    • onnxruntime-gpu (or the CPU-only onnxruntime package)
    • tokenizers
    • No torch, transformers, or other ML frameworks
  • Retain the original GPUProvider (torch-based) as an optional developer-only component.
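
As a rough illustration of the export step proposed above, here is a minimal sketch of an export script. It is not the project's actual CLI: the function names, the `--model-id` and `--output-dir` flags, and the `exported` default directory are hypothetical, and it assumes the optional group installs `optimum` with its ONNX Runtime support. The heavy imports are deferred into `export()` so that the module itself can be imported in an environment that lacks torch and transformers.

```python
"""Hypothetical export script; needs the optional export dependencies to run."""
import argparse
import importlib.util
from pathlib import Path

# Assumption: these packages live only in the optional export group.
EXPORT_ONLY_PACKAGES = ("torch", "transformers", "optimum")


def missing_export_packages(packages=EXPORT_ONLY_PACKAGES):
    """Return the names in `packages` that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def build_parser():
    """Build the argument parser (flag names are illustrative only)."""
    parser = argparse.ArgumentParser(description="Export an LLM to ONNX (sketch).")
    parser.add_argument("--model-id", default="meta-llama/Llama-3.2-3B-Instruct")
    parser.add_argument("--output-dir", type=Path, default=Path("exported"))
    return parser


def export(model_id, output_dir):
    """Export the model and tokenizer files into `output_dir`."""
    missing = missing_export_packages()
    if missing:
        raise RuntimeError(
            "Export needs the optional dependencies: " + ", ".join(missing)
        )
    # Heavy imports stay inside this function so importing the module is cheap.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    output_dir.mkdir(parents=True, exist_ok=True)
    # export=True converts the checkpoint to ONNX while loading it.
    model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
    model.save_pretrained(output_dir)
    # Fast tokenizers write a tokenizer.json file alongside their config.
    AutoTokenizer.from_pretrained(model_id).save_pretrained(output_dir)


def main(argv=None):
    """Parse arguments and run the export."""
    args = build_parser().parse_args(argv)
    export(args.model_id, args.output_dir)
```

Calling `main(["--output-dir", "artifacts"])` would run the export into a hypothetical `artifacts` directory. Optimum chooses its own ONNX file name, so producing `llama32-3b.onnx` exactly would need an extra step that this sketch leaves out. The checkpoint is gated on the Hugging Face Hub, so downloading it presumably also requires authentication.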

Additional Context

This feature formalizes the architecture shift toward a two-stage workflow:

  1. Development / Export Stage
    Requires heavy ML dependencies
    → Export ONNX model + tokenizer files

  2. Runtime / Deployment Stage
    Requires only lightweight inference libraries
    → Load ONNX model and run inference through a clean provider

This ensures a fast, minimal, dependency-light production environment while preserving full flexibility for contributors working with LLMs.
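
To make the runtime stage more concrete, the following is a minimal sketch of what an ONNX-backed provider could look like. Several things here are assumptions rather than part of the proposal: the class name `OnnxProviderSketch`, the `generate` method (the real `ILLMProvider` interface is not shown in this issue), greedy decoding, and an export that takes only `input_ids` and `attention_mask` and returns logits as its first output. An export with KV-cache inputs would need a different feed, which can be checked with `session.get_inputs()`.

```python
from typing import Callable, List, Optional, Sequence


class OnnxProviderSketch:
    """Greedy text generation on top of a tokenizer and a logits function."""

    def __init__(
        self,
        tokenizer,
        next_token_logits: Callable[[List[int]], Sequence[float]],
        eos_token_id: Optional[int] = None,
    ):
        # tokenizer must offer encode(text) -> list of ids and decode(ids) -> str.
        self._tokenizer = tokenizer
        self._next_token_logits = next_token_logits
        self._eos_token_id = eos_token_id

    def generate(self, prompt: str, max_new_tokens: int = 32) -> str:
        ids = list(self._tokenizer.encode(prompt))
        prompt_len = len(ids)
        for _ in range(max_new_tokens):
            logits = self._next_token_logits(ids)
            # Greedy choice: index of the largest logit.
            next_id = max(range(len(logits)), key=logits.__getitem__)
            if next_id == self._eos_token_id:
                break
            ids.append(next_id)
        # Return only the newly generated part.
        return self._tokenizer.decode(ids[prompt_len:])


def load_onnx_provider(model_path, tokenizer_path, eos_token_id=None):
    """Wire the sketch to onnxruntime and tokenizers (imported lazily)."""
    import numpy as np
    import onnxruntime as ort
    from tokenizers import Tokenizer

    session = ort.InferenceSession(
        model_path, providers=ort.get_available_providers()
    )
    tok = Tokenizer.from_file(tokenizer_path)

    class _TokenizerAdapter:
        def encode(self, text):
            return tok.encode(text).ids

        def decode(self, ids):
            return tok.decode(ids)

    def next_token_logits(ids):
        input_ids = np.asarray([ids], dtype=np.int64)
        # Assumed input names; no KV cache, so the full sequence is re-run each step.
        feed = {"input_ids": input_ids, "attention_mask": np.ones_like(input_ids)}
        logits = session.run(None, feed)[0]  # assumed shape: (1, seq_len, vocab)
        return logits[0, -1]

    return OnnxProviderSketch(_TokenizerAdapter(), next_token_logits, eos_token_id)
```

Keeping the decoding loop independent of `onnxruntime` means it can be unit-tested with a fake tokenizer and a fake logits function, while `load_onnx_provider` is the only place that touches the runtime libraries. Re-running the whole sequence at every step is simple but slow; a production provider would likely rely on a KV-cache export instead.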

Related Work

#99

Labels

enhancement: New feature or request