MoE Efficiency Study

An empirical study of Mixture of Experts routing strategies, directly motivated by Grok-1's architecture.

Research Questions

What is the optimal auxiliary loss coefficient? How does the load-balancing loss coefficient affect expert collapse, specialization, and final perplexity?
How does routing strategy affect expert specialization? Do top-k, expert choice, and noisy routing lead to different degrees of weight divergence between experts?
What is the perplexity cost of guaranteed load balance? Expert choice routing eliminates collapse but drops tokens — is the tradeoff worthwhile?
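
The quantities these questions refer to (collapse, balance, weight divergence) can be made concrete with a few small metrics. The sketch below assumes plausible definitions: max-over-mean load imbalance, normalized entropy of expert usage, and mean pairwise cosine distance between expert weights. The function names are hypothetical and NumPy is used only to keep the example self-contained; the definitions in analysis/metrics.py may differ.

```python
import numpy as np

def load_imbalance(assignments, num_experts):
    """Max expert load divided by mean load; 1.0 means perfectly balanced."""
    counts = np.bincount(assignments, minlength=num_experts)
    return float(counts.max() / counts.mean())

def routing_entropy(assignments, num_experts):
    """Shannon entropy of expert usage, normalized to [0, 1] by log(num_experts)."""
    counts = np.bincount(assignments, minlength=num_experts)
    usage = counts / counts.sum()
    nonzero = usage[usage > 0]          # skip unused experts (0 * log 0 := 0)
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(num_experts))

def weight_divergence(expert_weights):
    """Mean pairwise cosine distance between flattened expert weights.

    expert_weights: (num_experts, ...) array, one slice per expert.
    """
    flat = expert_weights.reshape(expert_weights.shape[0], -1)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    cosine = unit @ unit.T
    off_diagonal = cosine[~np.eye(len(flat), dtype=bool)]
    return float(1.0 - off_diagonal.mean())

# Two extreme routing outcomes for 16 tokens and 8 experts.
balanced = np.arange(16) % 8            # every expert gets 2 tokens
collapsed = np.zeros(16, dtype=int)     # every token goes to expert 0

for name, assignments in [("balanced", balanced), ("collapsed", collapsed)]:
    print(name,
          load_imbalance(assignments, 8),
          routing_entropy(assignments, 8))
```

Under these definitions a collapsed router has an imbalance equal to the number of experts and zero usage entropy, while experts that never specialize have a weight divergence near zero.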

Architecture

┌────────────────────────────────────────────────────┐
│                SmallMoETransformer                 │
│                                                    │
│  Input IDs ──► Token Emb + Pos Emb                 │
│                     │                              │
│  ┌──────────────────▼───────────────────────────┐  │
│  │ Layer 0: Attention ──► Dense FFN             │  │
│  │ Layer 1: Attention ──► MoE FFN ◄── 8 experts │  │
│  │ Layer 2: Attention ──► Dense FFN             │  │
│  │ Layer 3: Attention ──► MoE FFN ◄── 8 experts │  │
│  │ Layer 4: Attention ──► Dense FFN             │  │
│  │ Layer 5: Attention ──► MoE FFN ◄── 8 experts │  │
│  └──────────────────┬───────────────────────────┘  │
│                     │                              │
│              LayerNorm ──► Linear ──► Logits       │
└────────────────────────────────────────────────────┘

MoE FFN Detail:
  tokens ──► Router ──► Top-K selection
                │
        ┌───┬───┼───┬───┬───┬───┬───┬───┐
        │E0 │E1 │E2 │E3 │E4 │E5 │E6 │E7 │  (8 independent FFNs)
        └───┴───┼───┴───┴───┴───┴───┴───┘
                │
        weighted sum ──► output
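
The data flow in the MoE FFN diagram can be sketched in a few lines. This is a minimal illustration, not the code in model/moe.py. It assumes ReLU FFN experts, gates obtained by a softmax over the selected top-k logits, and NumPy arrays in place of framework tensors. `topk_moe_forward` and its parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def topk_moe_forward(tokens, router_w, w1, w2, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:   (num_tokens, d_model)
    router_w: (d_model, num_experts)
    w1, w2:   (num_experts, d_model, d_hidden), (num_experts, d_hidden, d_model)
    """
    num_experts = router_w.shape[1]
    logits = tokens @ router_w                               # (T, E)
    topk_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # (T, k)
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = softmax(topk_logits, axis=-1)                    # renormalized over selected experts

    output = np.zeros_like(tokens)
    for expert in range(num_experts):
        # Tokens that picked this expert, and the top-k slot in which they picked it.
        token_ids, slots = np.nonzero(topk_idx == expert)
        if token_ids.size == 0:
            continue
        hidden = np.maximum(tokens[token_ids] @ w1[expert], 0.0)   # ReLU FFN
        output[token_ids] += gates[token_ids, slots][:, None] * (hidden @ w2[expert])
    return output, topk_idx, gates

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))
router_w = rng.normal(size=(32, 8))
w1 = rng.normal(size=(8, 32, 64)) * 0.1
w2 = rng.normal(size=(8, 64, 32)) * 0.1
output, topk_idx, gates = topk_moe_forward(tokens, router_w, w1, w2, top_k=2)
print(output.shape, topk_idx.shape)
```

Note that with `top_k=1` this gating choice always yields a gate of 1.0. Switch-style top-1 routers instead scale the expert output by the full-softmax probability so that the router still receives a gradient.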

Configurations

Config	Router	top_k	aux_coeff	Purpose
A	TopK	1	0.0	Baseline — no load balancing
B	TopK	1	0.01	Light balancing
C	TopK	1	0.1	Heavy balancing
D	TopK	2	0.01	Mixtral-style
E	Expert Choice	2	0.0	Guaranteed balance
F	Noisy TopK	2	0.01	Exploration routing
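
For readers unfamiliar with the aux_coeff column, the sketch below shows the load-balancing loss in the form used by Switch Transformers: aux_coeff * N * sum_i(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. Whether this repository uses exactly this form is an assumption, and `load_balancing_loss` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def load_balancing_loss(router_logits, aux_coeff):
    """Switch-style auxiliary loss for router logits of shape (num_tokens, num_experts)."""
    num_tokens, num_experts = router_logits.shape
    probs = softmax(router_logits, axis=-1)
    top1 = probs.argmax(axis=-1)
    # f_i: fraction of tokens dispatched to each expert (not differentiable).
    token_fraction = np.bincount(top1, minlength=num_experts) / num_tokens
    # P_i: mean router probability per expert (carries the gradient in training).
    mean_prob = probs.mean(axis=0)
    return aux_coeff * num_experts * float(np.sum(token_fraction * mean_prob))

# 8 tokens, 8 experts: each token prefers a different expert vs. all prefer expert 0.
balanced_logits = 10.0 * np.eye(8)
collapsed_logits = np.zeros((8, 8))
collapsed_logits[:, 0] = 10.0

for coeff in (0.0, 0.01, 0.1):
    print(coeff,
          load_balancing_loss(balanced_logits, coeff),
          load_balancing_loss(collapsed_logits, coeff))
```

Before scaling by aux_coeff, the loss is 1.0 for perfectly uniform routing and approaches the number of experts when every token is routed to one expert, so the coefficient sets how strongly collapse is penalized relative to the language-modeling loss.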

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/ -v

# Quick study (200 steps per config, ~5 min)
python run_study.py --quick --plot

# Full study (2000 steps per config, ~30-60 min)
python run_study.py --plot

# Single config
python run_study.py --config config_D --steps 500

# Generate plots from existing results
python run_study.py --plot-only

Results

Run python run_study.py --plot to populate.

See FINDINGS.md for the full research writeup.

Connection to Grok-1

Grok-1 uses 8 experts with top-2 routing across 314B parameters. The xAI release notes that its MoE layer implementation is not efficient and was chosen to avoid custom kernels while validating the model's correctness. This study investigates what that design space looks like:

Config D mirrors Grok-1's top-2 routing, here with light load balancing (aux_coeff 0.01)
Config E tests the alternative: guaranteed balance via expert choice
Config F tests whether exploration noise improves on simple top-k
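
To make the Config E alternative concrete, here is a minimal sketch of expert choice routing in the sense of Zhou et al.: instead of tokens picking experts, each expert picks a fixed number of tokens, so load is balanced by construction but some tokens may be picked by no expert. Treating Config E's top_k as the average number of experts per token (the capacity factor) is an assumption, and `expert_choice_route` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def expert_choice_route(router_logits, capacity):
    """Each expert selects its `capacity` highest-affinity tokens.

    router_logits: (num_tokens, num_experts)
    Returns token indices and gate values, both of shape (capacity, num_experts).
    """
    affinity = softmax(router_logits, axis=-1)              # normalized per token
    chosen = np.argsort(-affinity, axis=0)[:capacity, :]    # top tokens per expert column
    gates = np.take_along_axis(affinity, chosen, axis=0)
    return chosen, gates

rng = np.random.default_rng(0)
num_tokens, num_experts, avg_experts_per_token = 64, 8, 2
capacity = num_tokens * avg_experts_per_token // num_experts   # 16 tokens per expert
logits = rng.normal(size=(num_tokens, num_experts))

chosen, gates = expert_choice_route(logits, capacity)
experts_per_token = np.bincount(chosen.ravel(), minlength=num_tokens)
dropped = int((experts_per_token == 0).sum())
print(f"tokens per expert: {capacity}, dropped tokens: {dropped}")
```

Every expert processes exactly `capacity` tokens regardless of the router's preferences. The cost shows up in `experts_per_token`, which can be zero for some tokens and greater than the average for others.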

The ExpertBank.benchmark() method quantifies the efficiency gap between naive (Grok-1-style) and batched expert computation.
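
The kind of comparison such a benchmark makes can be sketched as follows. This is not the repository's ExpertBank code: `naive_experts`, `batched_experts`, and `make_top2_gates` are hypothetical names, the experts are assumed to be ReLU FFNs, and "batched" here means one stacked contraction that evaluates every expert on every token. Timings depend on hardware and sizes, so none are shown.

```python
import time
import numpy as np

def make_top2_gates(num_tokens, num_experts, rng):
    """Random dense gate matrix with exactly two nonzero gates per token."""
    gates = np.zeros((num_tokens, num_experts))
    for t in range(num_tokens):
        picked = rng.choice(num_experts, size=2, replace=False)
        weight = rng.uniform(0.2, 0.8)
        gates[t, picked] = [weight, 1.0 - weight]
    return gates

def naive_experts(tokens, w1, w2, gate_matrix):
    """Python loop over experts; each expert sees only the tokens routed to it."""
    out = np.zeros_like(tokens)
    for e in range(w1.shape[0]):
        routed = np.nonzero(gate_matrix[:, e])[0]
        if routed.size == 0:
            continue
        hidden = np.maximum(tokens[routed] @ w1[e], 0.0)
        out[routed] += gate_matrix[routed, e][:, None] * (hidden @ w2[e])
    return out

def batched_experts(tokens, w1, w2, gate_matrix):
    """Stacked contraction over all experts, then a gate-weighted combine.

    Evaluates every expert on every token: more FLOPs, but no Python loop.
    """
    hidden = np.maximum(np.einsum("td,edh->eth", tokens, w1), 0.0)
    expert_out = np.einsum("eth,ehd->etd", hidden, w2)
    return np.einsum("te,etd->td", gate_matrix, expert_out)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))
w1 = rng.normal(size=(8, 64, 128)) * 0.1
w2 = rng.normal(size=(8, 128, 64)) * 0.1
gates = make_top2_gates(256, 8, rng)

for name, fn in [("naive", naive_experts), ("batched", batched_experts)]:
    start = time.perf_counter()
    result = fn(tokens, w1, w2, gates)
    print(f"{name}: {time.perf_counter() - start:.4f}s")
```

Both functions produce the same output, which is the property a correctness-first implementation needs. Which one is faster depends on how much the avoided loop overhead outweighs the extra computation on unrouted tokens.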

Project Structure

├── model/
│   ├── moe.py          # Routers, MoE layer, transformer
│   └── experts.py      # Expert FFN bank (naive + batched)
├── training/
│   └── train.py        # Training pipeline with research logging
├── analysis/
│   ├── metrics.py      # Load imbalance, specialization, entropy
│   └── visualize.py    # Publication-quality plots
├── tests/
│   └── test_moe.py     # Mathematical correctness tests
├── run_study.py         # Full study runner
├── FINDINGS.md          # Research paper (results filled after run)
└── requirements.txt

References

Shazeer et al. 2017, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (sparsely gated MoE layer)
Fedus et al. 2022, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (top-1 routing + aux loss)
Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing" (experts select tokens)
Jiang et al. 2024, "Mixtral of Experts" (top-2 routing in practice)
xAI 2024, "Grok-1" (8 experts, top-2, 314B parameters)
