Wire batched MCTS into training self-play #17

Merged
cweill merged 1 commit into main from feature/batched-mcts-training
May 26, 2026

Conversation

@cweill cweill commented May 26, 2026

Contributor

Summary

Training self-play built its MCTS config as {num_simulations, dirichlet_eps}, so MCTS fell back to batch_size=1 (sequential) and the leaf-parallel batched MCTS from #9 — measured at ~2.8–4.2× in the benchmark — never reached real training. (The 5 Connect Four robustness runs provably ran sequential: their wandb configs logged no batch size.) This threads a self-play MCTS batch-size knob through both training entry points.
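A minimal sketch of the fallback described above, assuming the search reads its batch size from the config dict with a default of 1. The helper name `resolve_batch_size` and the example values for `num_simulations` and `dirichlet_eps` are hypothetical, not taken from the repository.

```python
def resolve_batch_size(mcts_cfg: dict) -> int:
    """Hypothetical consumer-side lookup: a missing key means sequential search."""
    return int(mcts_cfg.get("batch_size", 1))


# Config shape before this PR: no batch_size key at all (values are illustrative).
cfg_before = {"num_simulations": 100, "dirichlet_eps": 0.25}

# Config shape after this PR: the knob is threaded through explicitly.
cfg_after = {**cfg_before, "batch_size": 16}

print(resolve_batch_size(cfg_before))  # falls back to the sequential default
print(resolve_batch_size(cfg_after))   # uses the threaded value
```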

Changes

  • arena.py training CLI: new --mcts-batch-size (default 16), added to self_play_mcts_cfg.
  • modal_app.py: new mcts_batch_size param (default 16) on train_remote/main, added to self_play_mcts_cfg and logged in the wandb run config.
  • Eval is untouched: MCTSPlayer (gating/ladder) has no batch_size parameter, so evaluation stays on exact sequential search even though it reads the same cfg.
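The CLI side of the change could look roughly like the sketch below. Only the `--mcts-batch-size` flag, its default of 16, and the `batch_size` config key come from this PR; the other flags, their defaults, and the function names `build_parser` and `make_self_play_mcts_cfg` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical training parser; only --mcts-batch-size mirrors this PR."""
    parser = argparse.ArgumentParser(description="training CLI (sketch)")
    parser.add_argument("--num-simulations", type=int, default=100)  # assumed default
    parser.add_argument("--dirichlet-eps", type=float, default=0.25)  # assumed default
    parser.add_argument(
        "--mcts-batch-size",
        type=int,
        default=16,
        help="Leaves evaluated per network call in self-play; 1 means sequential.",
    )
    return parser


def make_self_play_mcts_cfg(args: argparse.Namespace) -> dict:
    """Build the self-play MCTS config, now including the batch size."""
    if args.mcts_batch_size < 1:
        raise ValueError("--mcts-batch-size must be >= 1")
    return {
        "num_simulations": args.num_simulations,
        "dirichlet_eps": args.dirichlet_eps,
        "batch_size": args.mcts_batch_size,
    }


default_cfg = make_self_play_mcts_cfg(build_parser().parse_args([]))
sequential_cfg = make_self_play_mcts_cfg(
    build_parser().parse_args(["--mcts-batch-size", "1"])
)
```

Rejecting values below 1 is a choice made in this sketch; the PR does not say how the real CLI validates the flag.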

Design note

Leaf-parallel MCTS with virtual loss approximates sequential search — a speed/fidelity trade. Default 16 captures most of the speedup for data generation; --mcts-batch-size 1 restores fully sequential self-play. Keeping eval sequential preserves gating/Elo fidelity, where accuracy matters more than throughput.
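To illustrate why this is an approximation, here is a toy sketch of virtual loss at a single node. It is not the implementation from #9: the `Edge` class, the PUCT constant, and the function names are all assumptions. Each pick within a batch temporarily counts as a visit with a loss, which pushes later picks in the same batch toward other children; the real network values replace the virtual losses at backup.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float
    visits: int = 0
    value_sum: float = 0.0


def puct(edge: Edge, parent_visits: int, c_puct: float = 1.5) -> float:
    """Mean value plus a prior-weighted exploration bonus."""
    q = edge.value_sum / edge.visits if edge.visits else 0.0
    u = c_puct * edge.prior * math.sqrt(parent_visits + 1) / (1 + edge.visits)
    return q + u


def select_batch(edges: list, batch_size: int, virtual_loss: float = 1.0) -> list:
    """Pick batch_size children, applying a virtual loss after each pick."""
    picked = []
    for _ in range(batch_size):
        parent_visits = sum(e.visits for e in edges)
        best = max(range(len(edges)), key=lambda i: puct(edges[i], parent_visits))
        edges[best].visits += 1                   # count the in-flight visit
        edges[best].value_sum -= virtual_loss     # pretend it was a loss for now
        picked.append(best)
    return picked


def backup(edges: list, picked: list, values: list, virtual_loss: float = 1.0) -> None:
    """Undo each virtual loss and add the value from the batched evaluation."""
    for i, v in zip(picked, values):
        edges[i].value_sum += virtual_loss + v


edges = [Edge(0.5), Edge(0.3), Edge(0.2)]
picked = select_batch(edges, batch_size=3)
backup(edges, picked, values=[0.1, 0.2, 0.3])
```

Because the second and third picks are made before the first evaluation returns, they can differ from what sequential search would have chosen with full information, which is the fidelity cost the design note refers to.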

Testing

  • Full suite: 143 passed.
  • New: CLI threads --mcts-batch-size into self_play_mcts_cfg (and defaults to 16); _self_play_cfg passes batch_size through; Modal train_remote/entrypoint forward and log it.
  • End-to-end: a real 1-iteration tic-tac-toe training run with --mcts-batch-size 8 completes (exit 0), generates examples, and reduces loss. The batched MCTS mechanics themselves are covered by test_mcts.py, added in #9 ("Batched / leaf-parallel MCTS inference (~3x net-eval throughput)").

Review focus

  • Default batch size 16 for self-play — reasonable, or prefer a more conservative default?
  • The self-play-batched / eval-sequential split.

Closes alphago-0oz.

Self-play built its MCTS config with only num_simulations + dirichlet_eps,
so MCTS fell back to batch_size=1 (sequential) and #9's leaf-parallel
speedup never reached real training (the 5 Connect Four robustness runs
provably ran sequential). Add a --mcts-batch-size knob (default 16) to the
arena training CLI and an mcts_batch_size param to the Modal entrypoint,
threaded into self_play_mcts_cfg and logged to wandb.

Only self-play is batched: the evaluation player (MCTSPlayer) ignores
batch_size, so gating/Elo/ladder stay on exact sequential search. Set
--mcts-batch-size 1 to restore fully sequential self-play.

alphago-0oz
@cweill cweill merged commit eeed986 into main May 26, 2026
2 checks passed