
[GPU] Add MoE expert offload-to-disk (OTD) for large MoE models#36202

Open
zaixing-wang wants to merge 1 commit into
openvinotoolkit:masterfrom
zaixing-wang:wzx_moe_otd_pr

Conversation

@zaixing-wang (Contributor) commented Jun 3, 2026

Details:

a. Add the MOE_OFFLOAD_RATIO GPU plugin property (0–100), which sets the percentage of MoE experts kept resident in device memory. The remaining experts are fetched on demand from host memory through an LRU cache at inference time.
b. Implement compile-time partial constant allocation: only that percentage of the expert weight tensors is allocated on the GPU; the rest skip the transfer to device memory.
c. Add runtime LRU-based expert weight management to the fused MoE 3GEMM+SwiGLU kernel, streaming non-resident experts in on a cache miss.
d. Enable running large MoE models on GPUs with limited VRAM, trading added inference latency for memory savings.
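
The ratio and LRU behavior described in items a to c can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the plugin's implementation: the names `resident_expert_count`, `ExpertLRUCache`, and `load_fn` are hypothetical, rounding the percentage down is an assumption, and the real cache lives inside the GPU plugin rather than in Python.

```python
from collections import OrderedDict
from typing import Any, Callable


def resident_expert_count(num_experts: int, ratio_percent: int) -> int:
    """Map a 0-100 ratio to a number of device-resident experts.

    Rounding down is an assumption; the PR does not state the rounding rule.
    """
    if not 0 <= ratio_percent <= 100:
        raise ValueError("ratio_percent must be in [0, 100]")
    return num_experts * ratio_percent // 100


class ExpertLRUCache:
    """Toy model of device slots holding expert weights with LRU eviction."""

    def __init__(self, capacity: int, load_fn: Callable[[int], Any]) -> None:
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader for non-resident weights
        self._slots: "OrderedDict[int, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> Any:
        if expert_id in self._slots:
            # Cache hit: mark this expert as most recently used.
            self._slots.move_to_end(expert_id)
            self.hits += 1
            return self._slots[expert_id]
        # Cache miss: stream the weights in, evicting the LRU expert if full.
        self.misses += 1
        weights = self.load_fn(expert_id)
        if self.capacity > 0:
            if len(self._slots) >= self.capacity:
                self._slots.popitem(last=False)
            self._slots[expert_id] = weights
        return weights

    def resident_ids(self) -> list:
        """Expert ids currently held, least recently used first."""
        return list(self._slots)
```

With 128 experts and a ratio of 37, this model keeps 47 experts resident. Every miss pays the cost of `load_fn`, which is where the latency trade-off in item d comes from.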

Property Configuration:

# Python (openvino_genai LLMPipeline)
import openvino_genai as ov_genai
# Keep 37% of experts on GPU
pipe = ov_genai.LLMPipeline(model_path, "GPU", MOE_OFFLOAD_RATIO=37)

# llm_bench (openvino.genai/tools/llm_bench)
# Create config.json: {"MOE_OFFLOAD_RATIO": 37}
python benchmark.py -m <model_path> -d GPU -f ov -t text_gen -ic 128 -lc config.json
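
The config.json used by the llm_bench command can also be generated from a script, which makes it easy to sweep several ratios. The sketch below uses only the Python standard library; the helper names `render_llm_bench_config` and `write_llm_bench_config` are hypothetical, and the range check simply mirrors the 0–100 bound stated for the property.

```python
import json
from pathlib import Path


def render_llm_bench_config(ratio_percent: int) -> str:
    """Return the JSON text for a config that sets MOE_OFFLOAD_RATIO."""
    if not 0 <= ratio_percent <= 100:
        raise ValueError("MOE_OFFLOAD_RATIO must be in [0, 100]")
    return json.dumps({"MOE_OFFLOAD_RATIO": ratio_percent}, indent=2)


def write_llm_bench_config(path: str, ratio_percent: int) -> Path:
    """Write the config to `path` so it can be passed to benchmark.py via -lc."""
    target = Path(path)
    target.write_text(render_llm_bench_config(ratio_percent) + "\n", encoding="utf-8")
    return target
```

Calling `write_llm_bench_config("config.json", 37)` produces the file referenced by `-lc config.json` in the command above.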

Tickets:

CVS-184115

AI Assistance:

AI assistance used: yes
GitHub Copilot used for: merge conflict resolution, property rename refactoring (MOE_OFFLOAD_MAX_EXPERTS → MOE_OFFLOAD_RATIO), code refactoring of the MoE OTD runtime path, evaluating and selecting file I/O strategies for weight streaming (mmap vs ReadFile vs buffered read), PR branch preparation (squash commit), and drafting this PR description. All changes validated by full rebuild + end-to-end benchmark.

@zaixing-wang zaixing-wang requested review from a team as code owners June 3, 2026 01:08
@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label Jun 3, 2026