selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch by hanbitmyths · Pull Request #2477 · microsoft/Olive

hanbitmyths · 2026-05-27T22:02:56Z

This PR hardens SelectiveMixedPrecision (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

- QKV-aware quant config overrides (olive/passes/pytorch/quant_utils.py): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.
- AUTO kld_memory_mode (olive/passes/pytorch/selective_mixed_precision.py): A new auto setting selects among full, multi_gpu, low_memory, and offload based on visible GPU memory and estimated model footprint, and logs the decision (e.g. KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)).
- New multi_gpu mode: Uses accelerate.dispatch_model + infer_auto_device_map with _no_split_modules honored. After infer_auto_device_map, every model.layers.N.* entry is coalesced to the first device assigned for that layer, and a defensive check falls back to low_memory if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.
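
For reviewers, the coalescing step described for multi_gpu can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the helper names (`coalesce_layer_devices`, `layers_spanning_devices`, `per_device_layer_counts`) are hypothetical, and it assumes the device map is a plain dict from module name to device whose insertion order follows module order, so "first device assigned" means the first entry seen for a layer.

```python
import re
from collections import Counter

# Matches "model.layers.N" and anything nested below it.
LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match is None:
            coalesced[name] = device  # embeddings, final norm, lm_head, ...
            continue
        layer = match.group(1)
        first_device.setdefault(layer, device)
        coalesced[name] = first_device[layer]
    return coalesced


def layers_spanning_devices(device_map):
    """Defensive check: list decoder layers that still sit on more than one device."""
    devices = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            devices.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devs in devices.items() if len(devs) > 1)


def per_device_layer_counts(device_map):
    """Diagnostic: number of decoder layers placed on each device."""
    layer_device = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            layer_device.setdefault(match.group(1), device)
    return dict(Counter(layer_device.values()))


# Illustrative (made-up) device map in which layer 1 was split across GPUs 0 and 1.
raw = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "model.norm": 1,
    "lm_head": 1,
}
fixed = coalesce_layer_devices(raw)
```

In a flow like the one described above, a non-empty result from `layers_spanning_devices(fixed)` would be the signal to fall back to low_memory, and `per_device_layer_counts(fixed)` would feed the info log.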

Validation (A100 VM)

- Qwen3-0.6B old vs new export: tokens identical (124 vs 116 overrides, new_missing_qkv_partners=[]), same 657 MB output, ~301 vs 309 tok/s.
- Qwen2.5-1.5B-Instruct export + ort-genai: 1.34 GB int4, 290 tok/s.
- Qwen2.5-14B-Instruct AUTO → MULTI_GPU (3×A100), 9.44 GB int4, 95 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)

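| Model | N | PyTorch (HF fp16) | ort-genai (int4) | Δ (ort-genai − PyTorch) |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 500 | 36.6% | 28.6% | −8.0 pp |
| Qwen2.5-1.5B-Instruct | 500 | 60.2% | 54.2% | −6.0 pp |
| Qwen2.5-14B-Instruct | 250 | 74.8% | 77.2% | +2.4 pp (within ±5.5 pp CI) |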

The 14B delta (+2.4 pp) falls inside the ±5.5 pp confidence interval, so no accuracy loss is measurable at N=250. The small-model deltas are inherent to int4 SMP on sub-2B-parameter models, not regressions introduced by this PR.
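
The PR does not say how the ±5.5 pp interval was computed. As a rough sanity check only, the sketch below uses the standard normal-approximation (Wald) half-width for a single accuracy estimate; the function name `ci_half_width_pp` and the choice of a 95% interval (z = 1.96) are assumptions, not taken from the PR. For the 14B accuracies at N=250 it gives a half-width a little over 5 pp, the same order as the quoted figure.

```python
import math


def ci_half_width_pp(accuracy, n, z=1.96):
    """Normal-approximation CI half-width for one accuracy estimate, in percentage points."""
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies and sample size from the Qwen2.5-14B-Instruct row of the table.
half_width_pytorch = ci_half_width_pp(0.748, 250)
half_width_ort_genai = ci_half_width_pp(0.772, 250)
```

Because the half-width scales with 1/sqrt(N), the N=500 rows have a tighter interval than the N=250 row.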

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass. (24 passed, 1 skipped in test_selective_mixed_precision.py)
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: SelectiveMixedPrecision now supports an auto setting for kld_memory_mode and a new multi_gpu mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
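
To make the Q/K/V normalization in the release note concrete, here is a minimal sketch under stated assumptions. It is not the code in olive/passes/pytorch/quant_utils.py: the override format (module name mapped to a dict with a `bits` key), the helper names, and the tie-break rule (siblings adopt the highest bit width found in the block) are all hypothetical, and the PR may resolve conflicts differently.

```python
import re

# Assumes Hugging Face style projection names such as "...self_attn.q_proj".
QKV_RE = re.compile(r"^(?P<block>.+)\.(?P<proj>[qkv]_proj)$")
QKV_NAMES = ("q_proj", "k_proj", "v_proj")


def normalize_qkv_overrides(overrides):
    """Return a copy in which q/k/v projections of one attention block share a config."""
    result = dict(overrides)
    groups = {}
    for name, cfg in overrides.items():
        match = QKV_RE.match(name)
        if match:
            groups.setdefault(match.group("block"), []).append(cfg)
    for block, cfgs in groups.items():
        # Hypothetical policy: the highest bit width in the block wins.
        chosen = max(cfgs, key=lambda cfg: cfg["bits"])
        for proj in QKV_NAMES:
            result[f"{block}.{proj}"] = dict(chosen)
    return result


def missing_qkv_partners(overrides):
    """List q/k/v module names whose siblings are overridden but which are not."""
    blocks = set()
    for name in overrides:
        match = QKV_RE.match(name)
        if match:
            blocks.add(match.group("block"))
    return sorted(
        f"{block}.{proj}"
        for block in blocks
        for proj in QKV_NAMES
        if f"{block}.{proj}" not in overrides
    )


# Illustrative partial override: only q_proj of layer 0 is kept at 8 bits.
partial = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.mlp.down_proj": {"bits": 8},
}
normalized = normalize_qkv_overrides(partial)
```

A check like `missing_qkv_partners(normalized) == []` corresponds to the `new_missing_qkv_partners=[]` condition reported in the validation notes.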

…TI_GPU dispatch

- Normalize per-layer quant config overrides so Q/K/V projections in the same attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU, LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with _no_split_modules honored, plus a coalescing pass that pins every model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and the MULTI_GPU device-map coalescing path.
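
The AUTO selection named in the commit message can be sketched as a simple threshold cascade. This is an illustration under assumptions, not the logic in this PR: the `MemoryEstimate` fields, the function name `select_kld_memory_mode`, and the exact comparison order are hypothetical. The sketch takes the per-GPU budgets and footprint estimates as inputs so it runs without a GPU.

```python
from dataclasses import dataclass


@dataclass
class MemoryEstimate:
    full_gb: float        # assumed: estimated footprint when everything stays resident
    low_memory_gb: float  # assumed: estimated footprint of the reduced-memory path


def select_kld_memory_mode(gpu_budgets_gb, estimate):
    """Pick a mode name from per-GPU budgets (GB) and footprint estimates (GB)."""
    if not gpu_budgets_gb:
        return "offload"
    single_budget = max(gpu_budgets_gb)
    multi_budget = sum(gpu_budgets_gb)
    if estimate.full_gb <= single_budget:
        return "full"
    if len(gpu_budgets_gb) > 1 and estimate.full_gb <= multi_budget:
        return "multi_gpu"
    if estimate.low_memory_gb <= single_budget:
        return "low_memory"
    return "offload"


# Numbers loosely modeled on the logged example (gpus=3, full=145.14GB);
# the per-GPU budget and the low_memory estimate are made up for illustration.
mode = select_kld_memory_mode([71.95, 71.95, 71.95], MemoryEstimate(145.14, 40.0))
```

With these inputs the full estimate exceeds any single GPU budget but fits in the summed budget, so the sketch lands on multi_gpu, matching the decision shown in the example log line.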

hanbitmyths and others added 4 commits May 22, 2026 16:44

docs: surface KLD memory modes and QKV grouping in pass docstring

cc52a7e

Merge branch 'main' into smp-qkv-aware-multi-gpu

a4a2b2a

Address SMP review feedback

8e98a92

hanbitmyths mentioned this pull request May 27, 2026

selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2473

Closed

5 tasks

Merge branch 'main' into smp-qkv-aware-multi-gpu

3294108

hanbitmyths marked this pull request as ready for review May 27, 2026 22:14

Copilot AI review requested due to automatic review settings May 27, 2026 22:14

Copilot started reviewing on behalf of hanbitmyths May 27, 2026 22:14

hanbitmyths closed this May 27, 2026

hanbitmyths review requested due to automatic review settings May 27, 2026 22:36

hanbitmyths wants to merge 5 commits into main from smp-qkv-aware-multi-gpu