Add deterministic topk for MoE routing by sanketpurandare · Pull Request #3600 · pytorch/torchtitan

sanketpurandare · 2026-06-10T01:42:18Z

Route MoE expert selection through a TorchTitan custom op that enables PyTorch's deterministic topk implementation locally and restores the caller's deterministic-algorithm state afterward. This gives activation-checkpoint recompute the same stable top-k tie-breaking behavior without saving raw aten.topk outputs in selective activation checkpointing.

The wrapper follows the deterministic_scatter_add pattern, includes fake tensor and autograd registrations, and relies on the PyTorch deterministic topk implementation when available.
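
The save-and-restore pattern described above can be sketched in a few lines. This is an illustrative sketch only, not the code in torchtitan/ops/topk.py: the function name `deterministic_topk_sketch` is hypothetical, and the custom-op, fake tensor, and autograd registrations mentioned in the description are omitted. It assumes `torch.topk` honors the global deterministic-algorithms flag in the installed PyTorch build; where it does not, the call should still run, but tie-breaking is whatever the default kernel provides.

```python
import torch


def deterministic_topk_sketch(
    x: torch.Tensor, k: int, dim: int = -1
) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: run torch.topk with deterministic mode forced on."""
    # Remember the caller's global deterministic-algorithm settings.
    prev = torch.are_deterministic_algorithms_enabled()
    prev_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
    torch.use_deterministic_algorithms(True, warn_only=False)
    try:
        values, indices = torch.topk(x, k, dim=dim)
    finally:
        # Restore the caller's settings even if topk raises.
        torch.use_deterministic_algorithms(prev, warn_only=prev_warn_only)
    return values, indices


# Toy router scores: 2 tokens, 4 experts, pick the top 2 experts per token.
scores = torch.tensor([[0.1, 0.7, 0.2, 0.9], [0.5, 0.4, 0.8, 0.3]])
top_scores, top_experts = deterministic_topk_sketch(scores, k=2)
```

The `try`/`finally` is a design choice of this sketch: it keeps the global flag from leaking into the caller if `torch.topk` raises.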

Test Plan:

python -m py_compile torchtitan/ops/topk.py torchtitan/models/common/moe.py torchtitan/distributed/activation_checkpoint.py tests/unit_tests/test_deterministic_ops.py
pytest tests/unit_tests/test_deterministic_ops.py -q
pytest tests/unit_tests/test_activation_checkpoint.py -q
pytest tests/unit_tests/test_compile_moe.py -q
pre-commit run --files torchtitan/ops/topk.py torchtitan/models/common/moe.py torchtitan/distributed/activation_checkpoint.py tests/unit_tests/test_deterministic_ops.py
pre-commit run --all-files

stack-info: PR: #3600, branch: sanketpurandare/stack/20

tianyu-l · 2026-06-10T02:01:43Z

        torch.ops.aten.linear.default,
-        # topk can be non-deterministic; save to keep MoE expert assignments
-        # stable between forward and recompute.
-        torch.ops.aten.topk.default,


just curious -- what makes "always saving topk" in SAC policy bad?

sanketpurandare · Jun 10, 2026

Not bad, but unnecessary.

tianyu-l · 2026-06-10T02:02:48Z

+) -> tuple[torch.Tensor, torch.Tensor]:
+    prev = torch.are_deterministic_algorithms_enabled()
+    prev_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
+    torch.use_deterministic_algorithms(True, warn_only=False)


@songhappy does it break your use case?

pytorch-bot Bot added the ciflow/8gpu label Jun 10, 2026

meta-cla Bot added the CLA Signed label Jun 10, 2026

sanketpurandare force-pushed the sanketpurandare/stack/20 branch from c1033f9 to bc36ecf June 10, 2026 01:42

tianyu-l reviewed Jun 10, 2026

sanketpurandare added the ciflow/h100.8 label Jun 10, 2026

sanketpurandare marked this pull request as ready for review June 10, 2026 15:16

sanketpurandare requested review from fegin, wconstab and wwwjn as code owners June 10, 2026 15:16

sanketpurandare wants to merge 1 commit into main from sanketpurandare/stack/20