Audit: env var cleanup pass across all recipes

We have accumulated a number of `extra_env` / `base_env` entries across recipes. This issue tracks a pass to decide which to keep, which to remove (stale or should become a vLLM heuristic), and which need better documentation.

**Goal:** confirm each decision with an engineer who knows the flag, then file follow-up PRs.

---

| Env Var | Decision | Notes |
|---------|----------|-------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | Remove | Should be a vLLM heuristic: auto-select when variant precision is fp4/nvfp4 on Blackwell. High perf impact, so removing the explicit flag is safe only once vLLM selects it automatically. |
| `VLLM_USE_FLASHINFER_MOE_FP8` | Remove | Same family as above. The =0 workaround cases need an upstream fix; the =1 cases should be auto-detected. |
| `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8` | Investigate | Only appears in gpt-oss. Unclear if this is a stable upstream flag or a one-time experiment. |
| `VLLM_FLASHINFER_MOE_BACKEND` | Keep | Meaningful latency-vs-throughput kernel choice that users should own. Needs better documentation on when to pick `latency` vs `throughput`. |
| `VLLM_USE_DEEP_GEMM` | Keep | Real throughput lift for FP8 matmul. Requires DeepGEMM install and has a warmup cost, so it should stay explicit. Worth surfacing in a performance-mode guide. |
| `VLLM_DEEP_GEMM_WARMUP` | Keep | Intentional escape hatch to skip slow JIT warmup at startup. No throughput effect at runtime. |
| `VLLM_FLOAT32_MATMUL_PRECISION=high` | Keep | Enables TF32 TensorCore path on Blackwell (medium perf gain). Currently only in guide prose; should be promoted to a structured `hardware_overrides.blackwell` entry across all Blackwell-capable recipes. |
| `SAFETENSORS_FAST_GPU` | Remove | Should be a vLLM default; there is no reason it isn't always on. Speeds up model loading from disk; no inference throughput effect. |
| `VLLM_ATTENTION_BACKEND=FLASH_ATTN` | Keep | Model-specific correctness requirement (Qwen3-Next), not a perf choice. |
| `VLLM_USE_TRTLLM_ATTENTION=0` | Keep | Model-specific workaround to disable TRT-LLM attention where it breaks correctness. |
| `VLLM_USE_TRITON_FLASH_ATTN=0` | Remove | Appears to be a stale bug workaround. Verify whether the underlying issue is fixed in recent vLLM; if so, drop it. |
| `VLLM_USE_NCCL_SYMM_MEM` | Keep | Part of the NVLink perf bundle for pd_cluster (see the three `NCCL_*` rows below). Medium bandwidth impact on NVLink clusters. Needs grouped documentation. |
| `NCCL_CUMEM_ENABLE` | Keep | Required for the NVLink SHARP / symmetric memory path. Bundle with `VLLM_USE_NCCL_SYMM_MEM`. |
| `NCCL_MNNVL_ENABLE` | Keep | Enables multi-node NVLink. Bundle with above. |
| `NCCL_NVLS_ENABLE` | Keep | Enables NVLink SHARP (NVLS). Bundle with above. |
| `VLLM_USE_V1` | Remove | Stale: V1 is the default engine in recent vLLM. Flag is a no-op. |
| `VLLM_V1_USE_PREFILL_DECODE_ATTENTION` | Remove | Likely deprecated alongside the V1 migration. Verify and drop. |
| `VLLM_VIDEO_LOADER_BACKEND=opencv` | Keep | Correctness requirement for video input models (Nemotron VL). Not a perf choice. |
| `VLLM_RPC_TIMEOUT=18000000` | Investigate | ~5 hr timeout set for DeepSeek-V3.2-Exp. Verify whether still needed; if yes, document why this model requires it. |
| `VLLM_ALLOW_LONG_MAX_MODEL_LEN` | Keep | Intentional user opt-in beyond the safety gate for context length. Should stay explicit. |
| `VLLM_COMMIT` | Remove | Not a runtime env var; it is an install-time wheel pin for MiniMax-M2.7. Should live in the dependencies block or be dropped once mainline vLLM catches up. |
| `VLLM_ROCM_USE_AITER` | Remove | Should be the AMD default in vLLM. Large perf uplift on MI300X+, so removing the explicit flag is safe only once vLLM enables it automatically. |
| `VLLM_ROCM_USE_AITER_MOE` | Remove (=1) / Keep (=0) | The =1 cases should be a vLLM heuristic. The =0 cases are correctness workarounds for FP8 checkpoints on specific MI hardware; keep those. |
| `VLLM_ROCM_USE_AITER_RMSNORM` | Investigate | Disables AITER RMSNorm on AMD. Unclear if this is a workaround for a specific bug or intentional tuning. |
| `VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4` | Keep | AMD-specific INT4 quantized all-reduce. Medium bandwidth savings on large MoE models. Not auto-detectable. |
| `HSA_NO_SCRATCH_RECLAIM` | Investigate | Only in gpt-oss. Unclear if it generalizes to other AMD recipes or is model-specific. |
| `AMDGCN_USE_BUFFER_OPS` | Investigate | Only in gpt-oss. Same question as above. |

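To make the table actionable, the decisions could be encoded once and checked against each recipe. The following is a minimal sketch, not the repository's tooling: it assumes a recipe can be loaded as a dict whose `base_env` and `extra_env` keys map variable names to string values, and `DECISIONS`, `audit_recipe`, and `example_recipe` are hypothetical names introduced only for illustration. Only a few rows are encoded; the rest would follow the same pattern, and any variable missing from the table is reported as `Unlisted`.

```python
from typing import Dict, List, Tuple, Union

# A decision is either a fixed string or a mapping from value to decision
# (for rows such as VLLM_ROCM_USE_AITER_MOE, where =1 and =0 differ).
Decision = Union[str, Dict[str, str]]

# Subset of the decision table above; extend with the remaining rows.
DECISIONS: Dict[str, Decision] = {
    "VLLM_USE_V1": "Remove",
    "SAFETENSORS_FAST_GPU": "Remove",
    "VLLM_USE_DEEP_GEMM": "Keep",
    "VLLM_RPC_TIMEOUT": "Investigate",
    "VLLM_ROCM_USE_AITER_MOE": {"1": "Remove", "0": "Keep"},
}


def audit_recipe(recipe: dict) -> List[Tuple[str, str, str]]:
    """Return (name, value, decision) for every env entry in a recipe."""
    findings = []
    for section in ("base_env", "extra_env"):  # assumed recipe keys
        for name, value in recipe.get(section, {}).items():
            rule = DECISIONS.get(name, "Unlisted")
            if isinstance(rule, dict):
                # Value-dependent row; unknown values need a human look.
                decision = rule.get(str(value), "Investigate")
            else:
                decision = rule
            findings.append((name, str(value), decision))
    return findings


# Hypothetical recipe used only to exercise the function.
example_recipe = {
    "base_env": {"VLLM_USE_V1": "1"},
    "extra_env": {"VLLM_ROCM_USE_AITER_MOE": "0", "MY_CUSTOM_FLAG": "1"},
}

for finding in audit_recipe(example_recipe):
    print(finding)
```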
---

**Next steps:**
- [ ] Engineers confirm / override each row above
- [ ] File removal PRs for confirmed stale flags
- [ ] File vLLM upstream issues for flags that should become heuristics
- [ ] Promote `VLLM_FLOAT32_MATMUL_PRECISION` to structured `hardware_overrides.blackwell` across Blackwell-capable recipes

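The promotion of `VLLM_FLOAT32_MATMUL_PRECISION` could look roughly like the sketch below. The recipe schema here is an assumption: this issue names `hardware_overrides.blackwell` but does not define its shape, so the nested `extra_env` key and the helpers `promote_blackwell_override` and `effective_env` are hypothetical. The sketch adds the entry without overwriting a value a recipe already sets, and shows how the override would apply only when the target hardware is Blackwell.

```python
import copy

# Value taken from the table row for VLLM_FLOAT32_MATMUL_PRECISION.
BLACKWELL_ENV = {"VLLM_FLOAT32_MATMUL_PRECISION": "high"}


def promote_blackwell_override(recipe: dict) -> dict:
    """Return a copy of the recipe with the Blackwell env override added."""
    updated = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    overrides = updated.setdefault("hardware_overrides", {})
    blackwell = overrides.setdefault("blackwell", {})
    env = blackwell.setdefault("extra_env", {})  # assumed nesting
    for name, value in BLACKWELL_ENV.items():
        env.setdefault(name, value)  # keep any value already present
    return updated


def effective_env(recipe: dict, hardware: str) -> dict:
    """Merge base_env, extra_env, then the override for one hardware key."""
    env = dict(recipe.get("base_env", {}))
    env.update(recipe.get("extra_env", {}))
    override = recipe.get("hardware_overrides", {}).get(hardware, {})
    env.update(override.get("extra_env", {}))
    return env


# Hypothetical recipe used only to exercise the helpers.
recipe = {"extra_env": {"VLLM_USE_DEEP_GEMM": "1"}}
promoted = promote_blackwell_override(recipe)
print(effective_env(promoted, "blackwell"))
print(effective_env(promoted, "hopper"))
```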

Audit: env var cleanup pass across all recipes #366
