originaonxi/moe-efficiency-study

MoE Efficiency Study

An empirical study of Mixture of Experts routing strategies, directly motivated by Grok-1's architecture.

Research Questions

  1. What is the optimal auxiliary loss coefficient? How does the load-balancing loss coefficient affect expert collapse, specialization, and final perplexity?

  2. How does routing strategy affect expert specialization? Do top-k, expert choice, and noisy routing lead to different degrees of weight divergence between experts?

  3. What is the perplexity cost of guaranteed load balance? Expert choice routing eliminates collapse but drops tokens — is the tradeoff worthwhile?
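The auxiliary loss in the first question can be sketched as follows. This is a minimal NumPy version of the Switch-Transformer-style load-balancing loss for top-1 routing, not necessarily the exact form used in this repository; the function name `load_balancing_loss` and its arguments are assumptions made for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def load_balancing_loss(router_logits, aux_coeff):
    """Switch-style auxiliary loss: aux_coeff * N * sum_i(f_i * P_i).

    router_logits: array of shape (num_tokens, num_experts).
    """
    num_tokens, num_experts = router_logits.shape
    probs = softmax(router_logits)
    assignment = probs.argmax(axis=-1)  # top-1 expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(assignment, minlength=num_experts) / num_tokens
    # P_i: mean router probability assigned to expert i
    p = probs.mean(axis=0)
    return aux_coeff * num_experts * float(np.sum(f * p))


# Two extreme routing patterns over 16 tokens and 8 experts.
balanced = np.zeros((16, 8))
balanced[np.arange(16), np.arange(16) % 8] = 20.0  # tokens spread evenly
collapsed = np.zeros((16, 8))
collapsed[:, 0] = 20.0                             # every token picks expert 0

print(load_balancing_loss(balanced, aux_coeff=0.01))
print(load_balancing_loss(collapsed, aux_coeff=0.01))
```

Under this formulation the unscaled loss is 1 for perfectly uniform routing and approaches the number of experts under full collapse, so `aux_coeff` sets how strongly collapse is penalized relative to the language-modeling loss.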

Architecture

┌─────────────────────────────────────────────────┐
│              SmallMoETransformer                 │
│                                                  │
│  Input IDs ──► Token Emb + Pos Emb               │
│                     │                             │
│  ┌──────────────────▼──────────────────────┐     │
│  │ Layer 0: Attention ──► Dense FFN        │     │
│  │ Layer 1: Attention ──► MoE FFN ◄── 8 experts │
│  │ Layer 2: Attention ──► Dense FFN        │     │
│  │ Layer 3: Attention ──► MoE FFN ◄── 8 experts │
│  │ Layer 4: Attention ──► Dense FFN        │     │
│  │ Layer 5: Attention ──► MoE FFN ◄── 8 experts │
│  └──────────────────┬──────────────────────┘     │
│                     │                             │
│              LayerNorm ──► Linear ──► Logits      │
└─────────────────────────────────────────────────┘

MoE FFN Detail:
  tokens ──► Router ──► Top-K selection
                        │
        ┌───┬───┬───┬───┼───┬───┬───┬───┐
        │E0 │E1 │E2 │E3 │E4 │E5 │E6 │E7 │  (8 independent FFNs)
        └───┴───┴───┴───┼───┴───┴───┴───┘
                        │
        weighted sum ──► output
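The data flow in the diagram above (router, top-k selection, weighted sum) can be sketched in NumPy. This is an illustrative forward pass only: `moe_forward` and `make_expert` are hypothetical names, and renormalizing the gate weights over the selected experts is an assumption rather than something this README specifies.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def make_expert(rng, d_model, d_hidden):
    """One independent two-layer ReLU FFN (stand-in for a real expert)."""
    w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_hidden))
    w2 = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_model))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2


def moe_forward(tokens, router_w, experts, top_k=2):
    """tokens: (T, d), router_w: (d, E), experts: list of E callables."""
    logits = tokens @ router_w                           # (T, E)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]    # (T, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits)                          # weights sum to 1 per token
    out = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        # Tokens that selected expert e, and the top-k slot in which they did so.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size:
            weights = gates[token_ids, slot][:, None]
            out[token_ids] += weights * expert(tokens[token_ids])
    return out, top_idx


rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [make_expert(rng, d_model, 32) for _ in range(n_experts)]
tokens = rng.normal(size=(10, d_model))
router_w = rng.normal(size=(d_model, n_experts))
out, top_idx = moe_forward(tokens, router_w, experts, top_k=2)
print(out.shape, top_idx.shape)  # (10, 16) (10, 2)
```

Looping over experts and gathering only the tokens routed to each one keeps the sketch readable; it corresponds to the naive end of the naive-versus-batched comparison rather than an optimized kernel.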

Configurations

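| Config | Router        | top_k | aux_coeff | Purpose                     |
|--------|---------------|-------|-----------|-----------------------------|
| A      | TopK          | 1     | 0.0       | Baseline, no load balancing |
| B      | TopK          | 1     | 0.01      | Light balancing             |
| C      | TopK          | 1     | 0.1       | Heavy balancing             |
| D      | TopK          | 2     | 0.01      | Mixtral-style               |
| E      | Expert Choice | 2     | 0.0       | Guaranteed balance          |
| F      | Noisy TopK    | 2     | 0.01      | Exploration routing         |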
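One way to hold these six configurations in code is a small frozen dataclass. This is a sketch and not the repository's actual config format: only the name `config_D` appears in this README (in the Quick Start commands), so the other `config_*` names, the `StudyConfig` class, and the router identifier strings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyConfig:
    name: str
    router: str       # hypothetical identifiers: "topk", "expert_choice", "noisy_topk"
    top_k: int
    aux_coeff: float
    purpose: str


CONFIGS = {
    cfg.name: cfg
    for cfg in (
        StudyConfig("config_A", "topk", 1, 0.0, "Baseline, no load balancing"),
        StudyConfig("config_B", "topk", 1, 0.01, "Light balancing"),
        StudyConfig("config_C", "topk", 1, 0.1, "Heavy balancing"),
        StudyConfig("config_D", "topk", 2, 0.01, "Mixtral-style"),
        StudyConfig("config_E", "expert_choice", 2, 0.0, "Guaranteed balance"),
        StudyConfig("config_F", "noisy_topk", 2, 0.01, "Exploration routing"),
    )
}

# Configs A-C differ only in aux_coeff, so they isolate the first research question.
aux_sweep = [c.name for c in CONFIGS.values() if c.router == "topk" and c.top_k == 1]
print(aux_sweep)  # ['config_A', 'config_B', 'config_C']
```

Freezing the dataclass keeps each configuration immutable during a run, which makes logged results easier to attribute to a single row of the table.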

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/ -v

# Quick study (200 steps per config, ~5 min)
python run_study.py --quick --plot

# Full study (2000 steps per config, ~30-60 min)
python run_study.py --plot

# Single config
python run_study.py --config config_D --steps 500

# Generate plots from existing results
python run_study.py --plot-only

Results

Run python run_study.py --plot to populate.

See FINDINGS.md for the full research writeup.

Connection to Grok-1

Grok-1 uses 8 experts with top-2 routing across 314B parameters. The open-source release states that its MoE layer implementation is not efficient and was chosen to avoid custom kernels while validating the model's correctness. This study investigates what that design space looks like:

  • Config D mirrors Grok-1's routing (top-2 of 8 experts), here with a light load-balancing coefficient of 0.01
  • Config E tests the alternative: guaranteed balance via expert choice
  • Config F tests whether exploration noise improves on simple top-k
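The expert choice alternative tested by Config E can be sketched as below, following the general idea of Zhou et al.: each expert selects a fixed number of tokens, so load is balanced by construction while some tokens may be selected by no expert. The function name `expert_choice_route`, the `capacity_factor` argument, and its default of 2 are assumptions for illustration and may not match how this repository interprets the `top_k` column for Config E.

```python
import numpy as np


def expert_choice_route(router_logits, capacity_factor=2.0):
    """Each expert picks its own top tokens.

    router_logits: (num_tokens, num_experts).
    Returns (chosen, gates), both of shape (num_experts, capacity).
    """
    num_tokens, num_experts = router_logits.shape
    capacity = int(num_tokens * capacity_factor / num_experts)
    capacity = max(1, min(capacity, num_tokens))
    # Token-to-expert affinity: softmax over experts for each token.
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    scores = np.exp(shifted)
    scores /= scores.sum(axis=-1, keepdims=True)
    # Sort tokens per expert (column-wise) and keep the `capacity` best.
    chosen = np.argsort(-scores, axis=0)[:capacity].T      # (E, capacity)
    gates = np.take_along_axis(scores.T, chosen, axis=1)   # (E, capacity)
    return chosen, gates


rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 8))
logits[:, 0] += 3.0  # every token prefers expert 0; plain top-k would pile up there

chosen, gates = expert_choice_route(logits)
tokens_per_expert = [len(row) for row in chosen]
experts_per_token = np.bincount(chosen.ravel(), minlength=logits.shape[0])
dropped = int(np.sum(experts_per_token == 0))
print("tokens per expert:", tokens_per_expert)
print("tokens processed by no expert:", dropped)
```

Every expert processes exactly `capacity` tokens regardless of how skewed the router is, and the count of unselected tokens is the quantity to weigh against perplexity in the third research question.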

The ExpertBank.benchmark() method quantifies the efficiency gap between naive (Grok-1-style) and batched expert computation.
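The kind of comparison such a benchmark makes can be illustrated with a small NumPy sketch. It does not reproduce `ExpertBank.benchmark()`: the functions `naive_experts`, `batched_experts`, and `time_it` are hypothetical, tokens are assumed to be already grouped per expert, and on CPU with NumPy the measured gap may be small or inconsistent compared with a GPU implementation.

```python
import time

import numpy as np


def naive_experts(x, w1, w2):
    """Loop over experts one at a time. x: (E, C, d), w1: (E, d, h), w2: (E, h, d)."""
    outputs = []
    for e in range(x.shape[0]):
        hidden = np.maximum(x[e] @ w1[e], 0.0)  # (C, h)
        outputs.append(hidden @ w2[e])          # (C, d)
    return np.stack(outputs)


def batched_experts(x, w1, w2):
    """Same computation as two batched matmuls over the expert axis."""
    hidden = np.maximum(x @ w1, 0.0)  # (E, C, h)
    return hidden @ w2                # (E, C, d)


def time_it(fn, *args, repeats=20):
    """Mean wall-clock seconds per call, after one warm-up call."""
    fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


rng = np.random.default_rng(0)
n_experts, capacity, d_model, d_hidden = 8, 64, 64, 256
x = rng.normal(size=(n_experts, capacity, d_model))
w1 = rng.normal(size=(n_experts, d_model, d_hidden))
w2 = rng.normal(size=(n_experts, d_hidden, d_model))

# Both paths must agree before their timings are worth comparing.
assert np.allclose(naive_experts(x, w1, w2), batched_experts(x, w1, w2))
print(f"naive:   {time_it(naive_experts, x, w1, w2) * 1e3:.3f} ms")
print(f"batched: {time_it(batched_experts, x, w1, w2) * 1e3:.3f} ms")
```

Checking numerical agreement first keeps the timing comparison honest: a faster path that computes something different is not an efficiency gain.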

Project Structure

├── model/
│   ├── moe.py          # Routers, MoE layer, transformer
│   └── experts.py      # Expert FFN bank (naive + batched)
├── training/
│   └── train.py        # Training pipeline with research logging
├── analysis/
│   ├── metrics.py      # Load imbalance, specialization, entropy
│   └── visualize.py    # Publication-quality plots
├── tests/
│   └── test_moe.py     # Mathematical correctness tests
├── run_study.py         # Full study runner
├── FINDINGS.md          # Research paper (results filled after run)
└── requirements.txt
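The quantities listed for `analysis/metrics.py` can be illustrated with common definitions. The sketch below is an assumption about what such metrics might look like, not a copy of the repository's code: the function names and the choice of max-over-mean for imbalance are hypothetical, and the specialization metric is not covered.

```python
import numpy as np


def expert_load(assignments, num_experts):
    """Fraction of routed tokens received by each expert."""
    counts = np.bincount(np.asarray(assignments).ravel(), minlength=num_experts)
    return counts / counts.sum()


def load_imbalance(load):
    """Max load over mean load: 1.0 when balanced, num_experts under full collapse."""
    return float(load.max() / load.mean())


def routing_entropy(load):
    """Shannon entropy (nats) of the expert load distribution."""
    nonzero = load[load > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


def perplexity(token_nll):
    """exp of the mean per-token negative log-likelihood (in nats)."""
    return float(np.exp(np.mean(token_nll)))


balanced = expert_load(np.arange(64) % 8, num_experts=8)
collapsed = expert_load(np.zeros(64, dtype=int), num_experts=8)

print(load_imbalance(balanced), load_imbalance(collapsed))  # 1.0 8.0
print(routing_entropy(balanced), routing_entropy(collapsed))
```

Imbalance and entropy move in opposite directions: a collapsed router has maximal imbalance and zero entropy, while uniform routing has imbalance 1.0 and entropy equal to the log of the number of experts.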

References

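  • Shazeer et al. 2017, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (original MoE)
  • Fedus et al. 2022, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (top-1 routing + aux loss)
  • Zhou et al. 2022, "Mixture-of-Experts with Expert Choice Routing" (experts select tokens)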
  • Jiang et al. 2024, "Mixtral of Experts" (top-2 routing in practice)
  • xAI, 2024, "Grok-1" (8 experts, top-2, 314B parameters)

About

Research — Mixture-of-Experts efficiency analysis. Benchmarking sparse activation vs dense models on cost-per-token and quality.
