
feat: Add local LLM support with ONNX models#12

Open
orfeomorello wants to merge 1 commit into sachaa:master from orfeomorello:feature/local-llm-support

PR: Add Local LLM Support (ONNX)

Overview

This PR adds support for local LLM inference using ONNX models that run entirely in the browser. The implementation uses @huggingface/transformers for WebGPU-accelerated inference.

Key Changes

1. New Provider System (src/providers/)

  • types.ts: Common interface for all LLM providers
  • onnx.ts: Local LLM provider using Gemma-3-1B-IT or Qwen-3.5-0.6B
  • factory.ts: Provider factory for creating provider instances
  • claude.ts: Claude API provider (unchanged)
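
The provider system described above can be sketched as follows. This is an illustrative sketch only: the interface shape, class names, and `supportsTools` flag are assumptions, not the actual code in src/providers/.

```typescript
// Illustrative sketch of the provider abstraction; names are assumptions.

type ProviderType = "claude" | "onnx";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LLMProvider {
  readonly type: ProviderType;
  // Claude keeps full tool support; local ONNX models are chat-only.
  readonly supportsTools: boolean;
  initialize(): Promise<void>;
  chat(messages: ChatMessage[]): Promise<string>;
}

class ClaudeProvider implements LLMProvider {
  readonly type = "claude" as const;
  readonly supportsTools = true;
  async initialize(): Promise<void> {}
  async chat(_messages: ChatMessage[]): Promise<string> {
    throw new Error("stub: call the Claude API here");
  }
}

class OnnxProvider implements LLMProvider {
  readonly type = "onnx" as const;
  readonly supportsTools = false;
  async initialize(): Promise<void> {}
  async chat(_messages: ChatMessage[]): Promise<string> {
    throw new Error("stub: run in-browser ONNX inference here");
  }
}

// factory.ts: single place the rest of the app asks for a provider.
function createProvider(type: ProviderType): LLMProvider {
  return type === "claude" ? new ClaudeProvider() : new OnnxProvider();
}
```

A capability flag like `supportsTools` lets the orchestrator and UI branch on one property instead of checking provider types everywhere.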

2. Orchestrator Updates (src/orchestrator.ts)

  • Added provider type switching (Claude vs ONNX)
  • Local model uses simple chat mode (no tool calling)
  • Claude continues to work with full tool support
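
The routing decision above can be reduced to one small function. This is a hedged sketch with illustrative names; the key behavior from this PR is only that tool calls are dispatched solely when Claude is the active provider.

```typescript
// Hypothetical sketch of the orchestrator's routing decision.

type ProviderType = "claude" | "onnx";

function routeRequest(
  provider: ProviderType,
  wantsTools: boolean
): "tools" | "chat" {
  // Small local models cannot reliably call tools, so force plain chat.
  if (provider === "onnx") return "chat";
  return wantsTools ? "tools" : "chat";
}
```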

3. UI Updates (src/components/)

  • ChatPage.tsx: Different prompt starters for Claude vs local model
  • SettingsPage.tsx: Provider selection and model initialization
  • ProviderSettings.tsx: Model selection dropdown (Gemma vs Qwen)

4. Configuration (src/config.ts)

  • Added LLM_PROVIDER_TYPE config key
  • Added LOCAL_MODEL_ID config key

Architecture Decision: No Tool Calling for Local Models

After extensive testing, we determined that small local models (~1B parameters) cannot reliably perform tool calling. The implementation therefore behaves differently per provider:

  • Claude mode: Full tool support (bash, file operations, fetch_url, etc.)
  • Local model mode: Chat-only - can have conversations but cannot execute commands

This is clearly communicated to users in the UI with different prompt examples and a message explaining the limitations.

Supported Local Models

Model            Size
Gemma 3 1B IT    ~1GB
Qwen 3.5 0.8B    ~600MB

Requirements

  • Modern browser with WebGPU support (Chrome 113+, Edge 113+)
  • Sufficient RAM (2GB+ recommended)
  • GPU recommended for acceptable performance
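
A minimal capability check can decide whether to warn the user up front. `navigator.gpu` is the standard WebGPU entry point; how this PR actually detects support is not shown here, so treat this as a sketch.

```typescript
// Minimal WebGPU capability check (a sketch).
function hasWebGPU(): boolean {
  // navigator is absent outside the browser; navigator.gpu is absent
  // in browsers without WebGPU support.
  return typeof navigator !== "undefined" && "gpu" in navigator;
}

// Example: warn before initializing the local model.
if (!hasWebGPU()) {
  console.warn("WebGPU unavailable: local inference will be very slow on CPU.");
}
```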

Installation

npm install @huggingface/transformers

Usage

  1. Open Settings
  2. Select "Local Model (ONNX)" as provider
  3. Choose a model (Gemma or Qwen)
  4. Click "Initialize Model"
  5. Wait for the model to download (first time only; it is cached afterwards)

Testing

  1. Test Claude mode: All tools should work as before
  2. Test local model: Should only chat, no tool execution
  3. Test switching between modes
  4. Test model re-initialization

Files Changed

src/providers/onnx.ts        (new)
src/providers/types.ts       (modified)
src/providers/factory.ts     (modified)
src/providers/index.ts       (modified)
src/orchestrator.ts          (modified)
src/config.ts                (modified)
src/components/chat/ChatPage.tsx          (modified)
src/components/settings/SettingsPage.tsx  (modified)
src/components/settings/ProviderSettings.tsx (new)
package.json                 (add @huggingface/transformers)

Known Limitations

  1. WebGPU effectively required - inference falls back to CPU (very slow) if it is unavailable
  2. First load downloads ~1GB of model files
  3. Local models cannot use tools - chat only
  4. Context window limited to ~4000 tokens for local models
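
Limitation 4 implies the provider must trim chat history before each turn. The sketch below shows one way to do that; the 4-characters-per-token estimate and the budget handling are rough assumptions, not the PR's actual logic.

```typescript
// Sketch of keeping history inside the local model's ~4000-token window.

interface Msg {
  role: string;
  content: string;
}

// Rough heuristic: ~4 characters per token for English text.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the most recent messages that fit within the token budget.
function trimToBudget(messages: Msg[], budget = 4000): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = approxTokens(messages[i].content);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

Dropping the oldest turns first preserves the most recent context, which matters most for a chat-only model.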

Commit Summary

- Add ONNX provider with WebGPU acceleration
- Add streaming response for local models
- Add stop button to abort generation
- Add model selection in settings
- Support Gemma 3 1B and Qwen 3.5 0.8B
- Chat-only mode for local models