An empirical study of Mixture of Experts routing strategies, directly motivated by Grok-1's architecture.
- What is the optimal auxiliary loss coefficient? How does the load-balancing loss coefficient affect expert collapse, specialization, and final perplexity?
- How does routing strategy affect expert specialization? Do top-k, expert choice, and noisy routing lead to different degrees of weight divergence between experts?
- What is the perplexity cost of guaranteed load balance? Expert choice routing eliminates collapse but drops tokens. Is the tradeoff worthwhile?
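
The auxiliary loss behind the first question can be sketched as follows. This is a minimal, dependency-free version of the Switch Transformers formulation (Fedus et al. 2022) for top-1 routing, `aux_coeff * N * sum_i f_i * P_i`. The function names and the plain-list representation are illustrative assumptions, not code from this repository, and the study's own loss may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, aux_coeff):
    """Switch-style auxiliary loss for top-1 routing (illustrative sketch).

    router_logits: per-token logit lists, shape [num_tokens][num_experts].
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return aux_coeff * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Each token strongly prefers a different expert (balanced).
balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Every token strongly prefers expert 0 (collapsed).
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(load_balancing_loss(balanced, aux_coeff=0.01))   # close to aux_coeff
print(load_balancing_loss(collapsed, aux_coeff=0.01))  # just under 4 * aux_coeff
```

With balanced routing the loss sits at roughly `aux_coeff`; when all tokens collapse onto one expert it approaches `aux_coeff * N`, which is the pressure the coefficient sweep in this study varies.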
┌────────────────────────────────────────────────────┐
│                SmallMoETransformer                 │
│                                                    │
│  Input IDs ──► Token Emb + Pos Emb                 │
│                     │                              │
│  ┌──────────────────▼───────────────────────────┐  │
│  │ Layer 0: Attention ──► Dense FFN             │  │
│  │ Layer 1: Attention ──► MoE FFN ◄── 8 experts │  │
│  │ Layer 2: Attention ──► Dense FFN             │  │
│  │ Layer 3: Attention ──► MoE FFN ◄── 8 experts │  │
│  │ Layer 4: Attention ──► Dense FFN             │  │
│  │ Layer 5: Attention ──► MoE FFN ◄── 8 experts │  │
│  └──────────────────┬───────────────────────────┘  │
│                     │                              │
│                 LayerNorm ──► Linear ──► Logits    │
└────────────────────────────────────────────────────┘
MoE FFN Detail:
tokens ──► Router ──► Top-K selection
                         │
┌────┬────┬────┬────┬────┼────┬────┬────┐
│ E0 │ E1 │ E2 │ E3 │ E4 │ E5 │ E6 │ E7 │  (8 independent FFNs)
└────┴────┴────┴────┴────┼────┴────┴────┘
                         │
                   weighted sum ──► output
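
The data flow in the diagram (router, top-k selection, weighted sum) can be sketched in a few lines. This is a dependency-free illustration, not the repository's `moe.py`. Renormalizing the gate weights with a softmax over only the selected logits is an assumption (Mixtral does this), and the toy experts are simple scaling functions standing in for FFNs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, gate_weight) pairs; the weights are a
    softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    selected = order[:k]
    weights = softmax([router_logits[i] for i in selected])
    return list(zip(selected, weights))


def moe_forward(token, router_logits, experts, k):
    """Weighted sum of the selected experts' outputs for a single token."""
    out = [0.0] * len(token)
    for expert_index, weight in top_k_route(router_logits, k):
        expert_out = experts[expert_index](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


# Toy experts (hypothetical): expert i scales its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Experts 1 and 2 tie for the highest logit, so each gets weight 0.5.
print(top_k_route([0.0, 2.0, 2.0, -1.0], k=2))
print(moe_forward([1.0, 2.0], [0.0, 2.0, 2.0, -1.0], experts, k=2))
```

Only `k` of the experts run per token, which is where the compute saving of a sparse MoE layer comes from.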
| Config | Router | top_k | aux_coeff | Purpose |
|---|---|---|---|---|
| A | TopK | 1 | 0.0 | Baseline — no load balancing |
| B | TopK | 1 | 0.01 | Light balancing |
| C | TopK | 1 | 0.1 | Heavy balancing |
| D | TopK | 2 | 0.01 | Mixtral-style |
| E | Expert Choice | 2 | 0.0 | Guaranteed balance |
| F | Noisy TopK | 2 | 0.01 | Exploration routing |
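
Config E differs from the other rows in who does the selecting: in expert choice routing each expert picks a fixed number of tokens, so load is balanced by construction, but some tokens may be picked by no expert. A minimal sketch, assuming a precomputed score matrix and a hypothetical `capacity` parameter (neither name comes from this repository):

```python
def expert_choice_route(scores, capacity):
    """Expert choice routing sketch: experts select tokens, not vice versa.

    scores[t][e] is the router affinity of token t for expert e.
    Each expert takes its `capacity` highest-scoring tokens.
    Returns (assignments, dropped), where assignments maps each expert to
    its chosen token indices and dropped lists tokens no expert selected.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])

    assignments = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens),
                        key=lambda t: scores[t][e], reverse=True)
        assignments[e] = ranked[:capacity]

    covered = {t for tokens in assignments.values() for t in tokens}
    dropped = [t for t in range(num_tokens) if t not in covered]
    return assignments, dropped


# Four tokens, two experts, two slots per expert.
scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.7, 0.6],  # token 1: attractive to both experts
    [0.2, 0.1],  # token 2: low affinity everywhere
    [0.3, 0.4],  # token 3: low affinity everywhere
]
assignments, dropped = expert_choice_route(scores, capacity=2)
print(assignments)  # every expert holds exactly `capacity` tokens
print(dropped)      # tokens that no expert selected
```

In this toy input both experts choose tokens 0 and 1, so tokens 2 and 3 receive no expert output at all. That is the token-dropping cost weighed against guaranteed balance.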
# Install dependencies
pip install -r requirements.txt
# Run tests
pytest tests/ -v
# Quick study (200 steps per config, ~5 min)
python run_study.py --quick --plot
# Full study (2000 steps per config, ~30-60 min)
python run_study.py --plot
# Single config
python run_study.py --config config_D --steps 500
# Generate plots from existing results
python run_study.py --plot-only
Run `python run_study.py --plot` to populate the results.
See FINDINGS.md for the full research writeup.
Grok-1 uses 8 experts with top-2 routing across 314B parameters. The Grok-1 release notes that its MoE layer implementation is not efficient and was chosen to avoid custom kernels while validating the model's correctness. This study investigates what that design space looks like:
- Config D mirrors Grok-1's routing (top-2 with moderate load balancing)
- Config E tests the alternative: guaranteed balance via expert choice
- Config F tests whether exploration noise improves on simple top-k
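
The exploration noise tested by Config F can be sketched as below, following the noisy gating idea in Shazeer et al. 2017: Gaussian noise, scaled by a softplus of a second set of logits, is added to the router logits before top-k selection. All names here are illustrative, and whether the study's Noisy TopK router uses exactly this form is an assumption.

```python
import math
import random


def softplus(x):
    """Smooth, always-positive scale for the noise term."""
    return math.log1p(math.exp(x))


def noisy_top_k(clean_logits, noise_logits, k, rng, training=True):
    """Select k experts after perturbing the router logits.

    noisy_i = clean_i + N(0, 1) * softplus(noise_i) during training;
    at evaluation time the clean logits are used unchanged.
    Returns the selected expert indices in ascending order.
    """
    if training:
        scored = [c + rng.gauss(0.0, 1.0) * softplus(n)
                  for c, n in zip(clean_logits, noise_logits)]
    else:
        scored = list(clean_logits)
    order = sorted(range(len(scored)), key=lambda i: scored[i], reverse=True)
    return sorted(order[:k])


rng = random.Random(0)  # seeded so repeated runs give the same draws
clean = [0.1, 3.0, 2.0, -1.0]

# Without noise the two largest logits always win.
print(noisy_top_k(clean, [0.0] * 4, k=2, rng=rng, training=False))

# With noise, lower-scoring experts are occasionally selected as well.
picks = [tuple(noisy_top_k(clean, [3.0] * 4, k=2, rng=rng)) for _ in range(5)]
print(picks)
```

The noise gives experts with low logits a chance to receive tokens and gradient signal, which is the exploration effect this configuration compares against plain top-k.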
The ExpertBank.benchmark() method quantifies the efficiency gap between naive (Grok-1-style) and batched expert computation.
├── model/
│ ├── moe.py # Routers, MoE layer, transformer
│ └── experts.py # Expert FFN bank (naive + batched)
├── training/
│ └── train.py # Training pipeline with research logging
├── analysis/
│ ├── metrics.py # Load imbalance, specialization, entropy
│ └── visualize.py # Publication-quality plots
├── tests/
│ └── test_moe.py # Mathematical correctness tests
├── run_study.py # Full study runner
├── FINDINGS.md # Research paper (results filled after run)
└── requirements.txt
- Shazeer et al. 2017, "Outrageously Large Neural Networks" (original MoE)
- Fedus et al. 2022, "Switch Transformers" (top-1 routing + aux loss)
- Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing" (experts select tokens)
- Jiang et al. 2024, "Mixtral of Experts" (top-2 routing in practice)
- xAI 2024, "Grok-1" (8 experts, top-2, 314B parameters)