When c is None, the function _allocate_output uses torch.empty to allocate the output tensor:
return torch.empty(*shape, device=a.device, dtype=a.dtype)
However, if any entry in batch_sizes (e.g., batch_sizes[i]) is zero, the corresponding GEMM computation for that expert is skipped, and that region of the output tensor is never written to.
Since torch.empty does not initialize memory, these unwritten regions may contain:
- Arbitrary garbage values
- NaNs or infinities
- Values that differ from run to run, making results non-deterministic
This can lead to silent correctness issues in MoE (Mixture of Experts) models, especially when some experts receive zero tokens during routing.
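One possible mitigation is to zero-initialize the output whenever the caller does not supply c. The sketch below is illustrative only and is not the library's implementation: allocate_output_zeroed and grouped_gemm_reference are hypothetical names, batch_sizes is taken to be a plain list of ints, and the output is assumed to hold one (k, n) slice per expert, so an expert with zero tokens corresponds to a slice that the loop never writes.

```python
import torch


def allocate_output_zeroed(shape, like):
    # Hypothetical stand-in for _allocate_output when c is None:
    # torch.zeros guarantees that slices skipped later read as 0.
    return torch.zeros(*shape, device=like.device, dtype=like.dtype)


def grouped_gemm_reference(a, b, batch_sizes):
    # Assumed layout (not confirmed by the source):
    #   a: (tokens, k), b: (tokens, n), output: (num_experts, k, n)
    # Expert i owns batch_sizes[i] consecutive rows of a and b.
    num_experts = len(batch_sizes)
    out = allocate_output_zeroed((num_experts, a.shape[1], b.shape[1]), a)
    start = 0
    for i, size in enumerate(batch_sizes):
        if size == 0:
            # The GEMM is skipped; out[i] keeps its zero initialization.
            continue
        rows = slice(start, start + size)
        out[i] = a[rows].t() @ b[rows]
        start += size
    return out


if __name__ == "__main__":
    a = torch.ones(3, 2)
    b = torch.ones(3, 4)
    # The second expert receives no tokens.
    out = grouped_gemm_reference(a, b, [3, 0])
    print(out[1])
```

If zeroing the whole tensor is too costly, an alternative is to keep torch.empty and call zero_() only on the slices whose batch_sizes entry is zero.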