[GPU] Add MoE expert offload-to-disk (OTD) for large MoE models #36202
Open
zaixing-wang wants to merge 1 commit into
Details:
a. Add the MOE_OFFLOAD_RATIO GPU plugin property (0–100), which controls the percentage of MoE experts kept resident in device memory. The remaining experts are fetched on demand from host memory via an LRU cache at inference time.
b. Implement compile-time partial constant allocation: only ratio% of the expert weight tensors are allocated on the GPU; the rest skip the transfer to device memory.
c. Add runtime LRU-based expert weight management to the fused MoE 3GEMM+SwiGLU kernel, streaming in non-resident experts on a cache miss.
d. Enable running large MoE models on GPUs with limited VRAM by trading compute latency for memory savings.
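The resident/non-resident split and the LRU behavior described in (a) through (c) can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the plugin's implementation: `resident_expert_count`, `ExpertCache`, the `slots` parameter, and `load_fn` are hypothetical names, and both the round-down rule and the choice of the lowest-indexed experts as residents are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict


def resident_expert_count(num_experts: int, ratio_percent: int) -> int:
    """Experts pinned on device for a 0-100 ratio (rounding down is an assumption)."""
    if not 0 <= ratio_percent <= 100:
        raise ValueError("ratio_percent must be in [0, 100]")
    return num_experts * ratio_percent // 100


class ExpertCache:
    """Toy model: pinned experts never move; the rest share an LRU cache of `slots` entries."""

    def __init__(self, num_experts: int, ratio_percent: int, slots: int,
                 load_fn: Callable[[int], bytes]) -> None:
        if slots < 1:
            raise ValueError("slots must be at least 1")
        self.load_fn = load_fn  # hypothetical loader standing in for weight streaming
        self.slots = slots
        pinned = resident_expert_count(num_experts, ratio_percent)
        # Assumption: the lowest-indexed experts are the resident ones.
        self.resident: Dict[int, bytes] = {e: load_fn(e) for e in range(pinned)}
        self.lru: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.resident:
            return self.resident[expert_id]
        if expert_id in self.lru:
            self.lru.move_to_end(expert_id)  # mark as most recently used
            return self.lru[expert_id]
        # Cache miss: evict the least recently used entry if full, then load.
        self.misses += 1
        if len(self.lru) >= self.slots:
            self.lru.popitem(last=False)
        weights = self.load_fn(expert_id)
        self.lru[expert_id] = weights
        return weights
```

In this model a lower ratio shrinks the pinned set and pushes more lookups through the LRU path, which is where the latency-for-memory trade-off in (d) comes from.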
Property Configuration:
# Python (openvino_genai LLMPipeline)
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "GPU", MOE_OFFLOAD_RATIO=37)  # keep 37% of experts on GPU
# llm_bench (openvino.genai/tools/llm_bench)
# Create config.json: {"MOE_OFFLOAD_RATIO": 37}
python benchmark.py -m <model_path> -d GPU -f ov -t text_gen -ic 128 -lc config.json
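For the llm_bench path, the config file can be generated rather than written by hand. The helper below is a minimal sketch: `write_moe_config` is a hypothetical name, and the range check simply mirrors the 0–100 range stated for the property rather than any validation the plugin itself is known to perform.

```python
import json
from pathlib import Path


def write_moe_config(path: str, ratio_percent: int) -> dict:
    """Write a config file containing only MOE_OFFLOAD_RATIO and return its content."""
    if not 0 <= ratio_percent <= 100:
        raise ValueError("MOE_OFFLOAD_RATIO must be in [0, 100]")
    config = {"MOE_OFFLOAD_RATIO": ratio_percent}
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `write_moe_config("config.json", 37)` produces the file passed to `-lc` in the command above.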
Tickets:
CVS-184115
AI Assistance:
AI assistance used: yes
GitHub Copilot used for: merge conflict resolution, property rename refactoring (MOE_OFFLOAD_MAX_EXPERTS → MOE_OFFLOAD_RATIO), code refactoring of the MoE OTD runtime path, evaluating and selecting file I/O strategies for weight streaming (mmap vs ReadFile vs buffered read), PR branch preparation (squash commit), and drafting this PR description. All changes validated by full rebuild + end-to-end benchmark.