| Flag | Decision | Rationale |
|---|---|---|
| `VLLM_USE_FLASHINFER_MOE_FP4` | Remove | Should be a vLLM heuristic: auto-select when variant precision is fp4/nvfp4 on Blackwell. High perf impact: removing the explicit flag only costs performance if vLLM does not pick it up automatically. |
| `VLLM_USE_FLASHINFER_MOE_FP8` | Remove | Same family as `VLLM_USE_FLASHINFER_MOE_FP4`. The `=0` workaround cases need an upstream fix; the `=1` cases should be auto-detected. |
| `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8` | Investigate | Only appears in gpt-oss. Unclear whether this is a stable upstream flag or a one-time experiment. |
| `VLLM_FLASHINFER_MOE_BACKEND` | Keep | Meaningful latency-vs-throughput kernel choice that users should own. Needs better documentation on when to pick `latency` vs `throughput`. |
| `VLLM_USE_DEEP_GEMM` | Keep | Real throughput lift for FP8 matmul. Requires a DeepGEMM install and has a warmup cost, so it should stay explicit. Worth surfacing in a performance-mode guide. |
| `VLLM_DEEP_GEMM_WARMUP` | Keep | Intentional escape hatch to skip slow JIT warmup at startup. No throughput effect at runtime. |
| `VLLM_FLOAT32_MATMUL_PRECISION=high` | Keep | Enables the TF32 Tensor Core path on Blackwell, a medium perf gain. Currently only in guide prose; should be promoted to a structured `hardware_overrides.blackwell` entry across all Blackwell-capable recipes. |
| `SAFETENSORS_FAST_GPU` | Remove | Should be a vLLM default; there is no reason it isn't always on. Speeds up model loading from disk; no inference throughput effect. |
| `VLLM_ATTENTION_BACKEND=FLASH_ATTN` | Keep | Model-specific correctness requirement (Qwen3-Next), not a perf choice. |
| `VLLM_USE_TRTLLM_ATTENTION=0` | Keep | Model-specific workaround to disable TRT-LLM attention where it breaks correctness. |
| `VLLM_USE_TRITON_FLASH_ATTN=0` | Remove | Appears to be a stale bug workaround. Verify whether the underlying issue is fixed in recent vLLM; if so, drop it. |
| `VLLM_USE_NCCL_SYMM_MEM` | Keep | Part of the NVLink perf bundle for pd_cluster (see the three `NCCL_*` rows below). Medium bandwidth impact on NVLink clusters. Needs grouped documentation. |
| `NCCL_CUMEM_ENABLE` | Keep | Required for the NVLink SHARP / symmetric memory path. Bundle with `VLLM_USE_NCCL_SYMM_MEM`. |
| `NCCL_MNNVL_ENABLE` | Keep | Enables multi-node NVLink. Bundle with `VLLM_USE_NCCL_SYMM_MEM`. |
| `NCCL_NVLS_ENABLE` | Keep | Enables NVLink SHARP (NVLS). Bundle with `VLLM_USE_NCCL_SYMM_MEM`. |
| `VLLM_USE_V1` | Remove | Stale: V1 is the default engine in recent vLLM, so the flag is a no-op. |
| `VLLM_V1_USE_PREFILL_DECODE_ATTENTION` | Remove | Likely deprecated alongside the V1 migration. Verify and drop. |
| `VLLM_VIDEO_LOADER_BACKEND=opencv` | Keep | Correctness requirement for video input models (Nemotron VL). Not a perf choice. |
| `VLLM_RPC_TIMEOUT=18000000` | Investigate | ~5 h timeout (18,000,000 ms) set for DeepSeek-V3.2-Exp. Verify whether it is still needed; if yes, document why this model requires it. |
| `VLLM_ALLOW_LONG_MAX_MODEL_LEN` | Keep | Intentional user opt-in beyond the safety gate for context length. Should stay explicit. |
| `VLLM_COMMIT` | Remove | Not a runtime env var; it is an install-time wheel pin for MiniMax-M2.7. Should live in the dependencies block or be dropped once mainline vLLM catches up. |
| `VLLM_ROCM_USE_AITER` | Remove | Should be the AMD default in vLLM. Large perf uplift on MI300X+, so removing the explicit flag is only safe if vLLM enables it automatically. |
| `VLLM_ROCM_USE_AITER_MOE` | Remove (=1) / Keep (=0) | The `=1` cases should be a vLLM heuristic. The `=0` cases are correctness workarounds for FP8 checkpoints on specific MI hardware; keep those. |
| `VLLM_ROCM_USE_AITER_RMSNORM` | Investigate | Disables AITER RMSNorm on AMD. Unclear whether this is a workaround for a specific bug or intentional tuning. |
| `VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4` | Keep | AMD-specific INT4 quantized all-reduce. Medium bandwidth savings on large MoE models. Not auto-detectable. |
| `HSA_NO_SCRATCH_RECLAIM` | Investigate | Only in gpt-oss. Unclear whether it generalizes to other AMD recipes or is model-specific. |
| `AMDGCN_USE_BUFFER_OPS` | Investigate | Only in gpt-oss. Same question as `HSA_NO_SCRATCH_RECLAIM`. |
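
Two of the rows above ask for structure rather than a verdict: the TF32 setting should become a `hardware_overrides.blackwell` entry, and the four NVLink flags should be documented as one group. The Python sketch below shows one way a recipe might express both. It is illustrative only. The recipe layout, the `env_bundles` key, the `resolve_env` helper, and the `"1"` values are assumptions; only the flag names, `base_env`, `hardware_overrides.blackwell`, and the `high` value come from this issue.

```python
# Hypothetical recipe layout. Flag names come from the table; values marked
# "assumed" are placeholders, since the table lists names only.
NVLINK_BUNDLE = {
    "VLLM_USE_NCCL_SYMM_MEM": "1",  # assumed value
    "NCCL_CUMEM_ENABLE": "1",       # assumed value
    "NCCL_MNNVL_ENABLE": "1",       # assumed value
    "NCCL_NVLS_ENABLE": "1",        # assumed value
}

recipe = {
    "base_env": {"VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1"},  # assumed value
    "env_bundles": {"nvlink": NVLINK_BUNDLE},            # hypothetical key
    "hardware_overrides": {
        "blackwell": {"env": {"VLLM_FLOAT32_MATMUL_PRECISION": "high"}},
    },
}


def resolve_env(recipe, hardware=None, bundles=()):
    """Merge base_env, then the requested bundles, then the hardware override.

    Later sources win on conflicts. An unknown bundle name raises KeyError so
    that typos are caught instead of silently ignored.
    """
    env = dict(recipe.get("base_env", {}))
    for name in bundles:
        env.update(recipe.get("env_bundles", {})[name])
    override = recipe.get("hardware_overrides", {}).get(hardware, {})
    env.update(override.get("env", {}))
    return env
```

Calling `resolve_env(recipe, hardware="blackwell", bundles=["nvlink"])` would yield the base entry, the four NVLink flags, and the TF32 setting in one dict, while other hardware would not receive the TF32 entry. Keeping the NVLink flags in a single named bundle also gives the grouped documentation one obvious place to live.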
We have accumulated a number of `extra_env` / `base_env` entries across recipes. This issue tracks a pass to decide which to keep, which to remove (stale, or better handled by a vLLM heuristic), which to investigate further, and which need better documentation.
Goal: confirm each decision with an engineer who knows the flag, then file follow-up PRs.
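
To help scope the follow-up PRs, a small audit helper could group each recipe's env entries by the decision recorded in the table. The sketch below is a rough starting point, not an existing tool. The `audit_env` function, the `Untriaged` bucket, and the idea that a recipe's env is available as a plain dict are all assumptions, and only a few table rows are copied in.

```python
# Decisions copied from a few rows of the table; extend with the remaining rows.
DECISIONS = {
    "VLLM_USE_V1": "Remove",
    "SAFETENSORS_FAST_GPU": "Remove",
    "VLLM_USE_DEEP_GEMM": "Keep",
    "VLLM_RPC_TIMEOUT": "Investigate",
}

# Flags whose decision depends on the value they are set to.
VALUE_DECISIONS = {
    "VLLM_ROCM_USE_AITER_MOE": {"1": "Remove", "0": "Keep"},
}


def audit_env(env):
    """Group env entries by triage decision.

    Flags missing from both maps land under the hypothetical 'Untriaged'
    bucket; a value-dependent flag with an unexpected value is marked
    'Investigate'.
    """
    report = {}
    for flag, value in sorted(env.items()):
        if flag in VALUE_DECISIONS:
            decision = VALUE_DECISIONS[flag].get(str(value), "Investigate")
        else:
            decision = DECISIONS.get(flag, "Untriaged")
        report.setdefault(decision, []).append(f"{flag}={value}")
    return report
```

For a recipe env such as `{"VLLM_USE_V1": "1", "VLLM_ROCM_USE_AITER_MOE": "0"}`, the helper would place the first flag under `Remove` and the second under `Keep`, which matches the value-dependent row for `VLLM_ROCM_USE_AITER_MOE`.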
Next steps: