[GPU] Add MoE expert offload-to-disk (OTD) for large MoE models by zaixing-wang · Pull Request #36202 · openvinotoolkit/openvino

zaixing-wang · 2026-06-03T01:08:10Z

Details:

a. Add the MOE_OFFLOAD_RATIO GPU plugin property (0–100), which controls the percentage of MoE experts kept resident in device memory. The remaining experts are fetched on demand from host memory via an LRU cache at inference time.
b. Implement compile-time partial constant allocation: only MOE_OFFLOAD_RATIO% of the expert weight tensors are allocated on the GPU; the rest skip the transfer to device memory.
c. Add runtime LRU-based expert weight management in the fused MoE 3GEMM+SwiGLU kernel, streaming in non-resident experts on a cache miss.
d. Enable running large MoE models on GPUs with limited VRAM by accepting higher inference latency in exchange for memory savings.
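
The LRU behaviour described in items a to c can be illustrated with a small Python sketch. This is not the plugin's implementation: the names `resident_expert_count` and `ExpertLRUCache` are hypothetical, rounding the ratio down is an assumption, and so is treating the resident budget as the cache capacity (the PR may pin resident experts and cache the others separately).

```python
from collections import OrderedDict
from typing import Callable


def resident_expert_count(num_experts: int, ratio_percent: int) -> int:
    """Experts kept on the device for a 0-100 ratio (rounding down is an assumption)."""
    if not 0 <= ratio_percent <= 100:
        raise ValueError("ratio_percent must be in [0, 100]")
    return num_experts * ratio_percent // 100


class ExpertLRUCache:
    """Hypothetical fixed-capacity cache mapping expert id -> weight buffer."""

    def __init__(self, capacity: int, load_expert: Callable[[int], bytes]):
        self.capacity = capacity
        self.load_expert = load_expert  # invoked only on a cache miss
        self.slots: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.slots[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)  # stream the expert in
        if self.capacity > 0:
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict the least recently used
            self.slots[expert_id] = weights
        return weights


if __name__ == "__main__":
    # 8 experts at MOE_OFFLOAD_RATIO=37 leaves room for 2 resident experts here.
    cache = ExpertLRUCache(resident_expert_count(8, 37), lambda i: bytes([i]))
    for expert in (0, 1, 0, 2, 1):
        cache.get(expert)
    print(cache.hits, cache.misses, list(cache.slots))
```

With a small ratio and a router that spreads tokens across many experts, most lookups in a sketch like this become misses, which is the latency cost that item d refers to.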

Property Configuration:

# Python (openvino_genai LLMPipeline)
import openvino_genai as ov_genai
# Keep 37% of the experts resident on the GPU
pipe = ov_genai.LLMPipeline(model_path, "GPU", MOE_OFFLOAD_RATIO=37)

# llm_bench (openvino.genai/tools/llm_bench)
# Create config.json containing: {"MOE_OFFLOAD_RATIO": 37}
python benchmark.py -m <model_path> -d GPU -f ov -t text_gen -ic 128 -lc config.json
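
The config.json used by the llm_bench command can also be generated from a script. The following is a minimal sketch; the helper name `moe_offload_config_json` is hypothetical, and the range check simply mirrors the 0–100 range stated for the property.

```python
import json
from pathlib import Path


def moe_offload_config_json(ratio: int) -> str:
    """Return the JSON text for a config that sets MOE_OFFLOAD_RATIO (hypothetical helper)."""
    if not isinstance(ratio, int) or not 0 <= ratio <= 100:
        raise ValueError("MOE_OFFLOAD_RATIO must be an integer in [0, 100]")
    return json.dumps({"MOE_OFFLOAD_RATIO": ratio}, indent=2)


if __name__ == "__main__":
    # Writes config.json to the current directory for use with -lc config.json.
    Path("config.json").write_text(moe_offload_config_json(37))
```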

Tickets:

CVS-184115

AI Assistance:

AI assistance used: yes
GitHub Copilot used for: merge conflict resolution, property rename refactoring (MOE_OFFLOAD_MAX_EXPERTS → MOE_OFFLOAD_RATIO), code refactoring of the MoE OTD runtime path, evaluating and selecting file I/O strategies for weight streaming (mmap vs ReadFile vs buffered read), PR branch preparation (squash commit), and drafting this PR description. All changes validated by full rebuild + end-to-end benchmark.

[GPU] Add MoE expert offload-to-disk (OTD) for large MoE models

b7b6eb4

zaixing-wang requested review from a team as code owners June 3, 2026 01:08

github-actions Bot added the category: GPU OpenVINO GPU plugin label Jun 3, 2026

zaixing-wang wants to merge 1 commit into openvinotoolkit:master from zaixing-wang:wzx_moe_otd_pr
