Audit: env var cleanup pass across all recipes #366

@faradawn

We have accumulated a number of `extra_env` / `base_env` entries across recipes. This issue tracks a pass to decide which to keep, which to remove (stale or should become a vLLM heuristic), and which need better documentation.

Goal: confirm each decision with an engineer who knows the flag, then file follow-up PRs.


| Env Var | Decision | Notes |
| --- | --- | --- |
| `VLLM_USE_FLASHINFER_MOE_FP4` | Remove | Should be a vLLM heuristic: auto-select when variant precision is fp4/nvfp4 on Blackwell. High perf impact; removing the explicit flag only matters if vLLM doesn't pick it up automatically. |
| `VLLM_USE_FLASHINFER_MOE_FP8` | Remove | Same family as above. The =0 workaround cases need an upstream fix; the =1 cases should be auto-detected. |
| `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8` | Investigate | Only appears in gpt-oss. Unclear if this is a stable upstream flag or a one-time experiment. |
| `VLLM_FLASHINFER_MOE_BACKEND` | Keep | Meaningful latency-vs-throughput kernel choice that users should own. Needs better documentation on when to pick `latency` vs `throughput`. |
| `VLLM_USE_DEEP_GEMM` | Keep | Real throughput lift for FP8 matmul. Requires a DeepGEMM install and has a warmup cost, so it should stay explicit. Worth surfacing in a performance-mode guide. |
| `VLLM_DEEP_GEMM_WARMUP` | Keep | Intentional escape hatch to skip slow JIT warmup at startup. No throughput effect at runtime. |
| `VLLM_FLOAT32_MATMUL_PRECISION=high` | Keep | Enables the TF32 TensorCore path on Blackwell (medium perf gain). Currently only in guide prose; should be promoted to a structured `hardware_overrides.blackwell` entry across all Blackwell-capable recipes. |
| `SAFETENSORS_FAST_GPU` | Remove | Should be a vLLM default; there is no reason it isn't always-on. Speeds up model loading from disk; no inference throughput effect. |
| `VLLM_ATTENTION_BACKEND=FLASH_ATTN` | Keep | Model-specific correctness requirement (Qwen3-Next), not a perf choice. |
| `VLLM_USE_TRTLLM_ATTENTION=0` | Keep | Model-specific workaround to disable TRT-LLM attention where it breaks correctness. |
| `VLLM_USE_TRITON_FLASH_ATTN=0` | Remove | Appears to be a stale bug workaround. Verify whether the underlying issue is fixed in recent vLLM; if so, drop it. |
| `VLLM_USE_NCCL_SYMM_MEM` | Keep | Part of the NVLink perf bundle for pd_cluster (see the three `NCCL_*` rows below). Medium bandwidth impact on NVLink clusters. Needs grouped documentation. |
| `NCCL_CUMEM_ENABLE` | Keep | Required for the NVLink SHARP / symmetric memory path. Bundle with `VLLM_USE_NCCL_SYMM_MEM`. |
| `NCCL_MNNVL_ENABLE` | Keep | Enables multi-node NVLink. Bundle with above. |
| `NCCL_NVLS_ENABLE` | Keep | Enables NVLink SHARP (NVLS). Bundle with above. |
| `VLLM_USE_V1` | Remove | Stale: V1 is the default engine in recent vLLM, so the flag is a no-op. |
| `VLLM_V1_USE_PREFILL_DECODE_ATTENTION` | Remove | Likely deprecated alongside the V1 migration. Verify and drop. |
| `VLLM_VIDEO_LOADER_BACKEND=opencv` | Keep | Correctness requirement for video input models (Nemotron VL). Not a perf choice. |
| `VLLM_RPC_TIMEOUT=18000000` | Investigate | ~5 hr timeout (18,000,000 ms) set for DeepSeek-V3.2-Exp. Verify whether it is still needed; if yes, document why this model requires it. |
| `VLLM_ALLOW_LONG_MAX_MODEL_LEN` | Keep | Intentional user opt-in beyond the safety gate for context length. Should stay explicit. |
| `VLLM_COMMIT` | Remove | Not a runtime env var; it's an install-time wheel pin for MiniMax-M2.7. Should live in the dependencies block or be dropped once mainline vLLM catches up. |
| `VLLM_ROCM_USE_AITER` | Remove | Should be the AMD default in vLLM. Large perf uplift on MI300X+; removing the explicit flag only matters if vLLM doesn't enable it automatically. |
| `VLLM_ROCM_USE_AITER_MOE` | Remove (=1) / Keep (=0) | The =1 cases should be a vLLM heuristic. The =0 cases are correctness workarounds for FP8 checkpoints on specific MI hardware; keep those. |
| `VLLM_ROCM_USE_AITER_RMSNORM` | Investigate | Disables AITER RMSNorm on AMD. Unclear if this is a workaround for a specific bug or intentional tuning. |
| `VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4` | Keep | AMD-specific INT4 quantized all-reduce. Medium bandwidth savings on large MoE models. Not auto-detectable. |
| `HSA_NO_SCRATCH_RECLAIM` | Investigate | Only in gpt-oss. Unclear if it generalizes to other AMD recipes or is model-specific. |
| `AMDGCN_USE_BUFFER_OPS` | Investigate | Only in gpt-oss. Same question as above. |
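
To make the pass repeatable, the decision rows above could be encoded and checked against each recipe. The following is only a rough sketch: the `DECISIONS` mapping, the `audit_recipe` helper, and the assumption that a recipe loads into a plain dict with `base_env` / `extra_env` keys are all hypothetical, not existing tooling in this repo. Only a few rows are shown.

```python
# Hypothetical audit helper. Decision labels mirror rows from this issue;
# the recipe layout (dict with `base_env` / `extra_env`) is an assumption.
from typing import Callable, Dict, List, Tuple, Union

Decision = Union[str, Callable[[str], str]]

DECISIONS: Dict[str, Decision] = {
    "VLLM_USE_V1": "remove",
    "SAFETENSORS_FAST_GPU": "remove",
    "VLLM_USE_DEEP_GEMM": "keep",
    "VLLM_RPC_TIMEOUT": "investigate",
    # Value-dependent row: =0 is a correctness workaround, =1 should
    # become a vLLM heuristic.
    "VLLM_ROCM_USE_AITER_MOE": lambda value: "keep" if value == "0" else "remove",
}


def audit_recipe(recipe: dict) -> List[Tuple[str, str, str]]:
    """Return (name, value, decision) for every env var the recipe sets."""
    findings = []
    for section in ("base_env", "extra_env"):
        for name, value in recipe.get(section, {}).items():
            rule = DECISIONS.get(name, "unlisted")
            decision = rule(str(value)) if callable(rule) else rule
            findings.append((name, str(value), decision))
    return findings


if __name__ == "__main__":
    example = {
        "base_env": {"VLLM_USE_V1": "1"},
        "extra_env": {"VLLM_ROCM_USE_AITER_MOE": "0", "MY_CUSTOM_FLAG": "1"},
    }
    for name, value, decision in audit_recipe(example):
        print(f"{name}={value}: {decision}")
```

Variables that are not in the mapping come back as `unlisted`, which would flag entries this audit has not yet covered.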

Next steps:

  • Engineers confirm / override each row above
  • File removal PRs for confirmed stale flags
  • File vLLM upstream issues for flags that should become heuristics
  • Promote `VLLM_FLOAT32_MATMUL_PRECISION` to structured `hardware_overrides.blackwell` across Blackwell-capable recipes
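
For the last item, one possible shape for a structured override is sketched below. The schema is an assumption: the nesting of `extra_env` under `hardware_overrides.<hardware>` and the `resolve_env` helper are illustrative names, not the recipes' actual format or loader.

```python
# Hypothetical resolution of a recipe's environment for a given hardware
# target. The `hardware_overrides.<hardware>.extra_env` layout is assumed.
from typing import Dict


def resolve_env(recipe: dict, hardware: str) -> Dict[str, str]:
    """Merge base_env, extra_env, then hardware-specific overrides."""
    env = dict(recipe.get("base_env", {}))
    env.update(recipe.get("extra_env", {}))
    overrides = recipe.get("hardware_overrides", {}).get(hardware, {})
    env.update(overrides.get("extra_env", {}))  # hardware entries win
    return env


if __name__ == "__main__":
    recipe = {
        "base_env": {"VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1"},
        "hardware_overrides": {
            "blackwell": {
                "extra_env": {"VLLM_FLOAT32_MATMUL_PRECISION": "high"},
            },
        },
    }
    print(resolve_env(recipe, "blackwell"))
    print(resolve_env(recipe, "hopper"))
```

With this layout the TF32 setting applies only when the Blackwell target is selected, and other targets see the unmodified base environment.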
