A neural network learns to play Pac-Man. Then a second neural network learns to dream the game -- and a third agent learns to play entirely inside those dreams.
79 tests | 128 parallel envs | 4.2M param PPO agent | 28M param world model
This project explores two fundamental approaches to training game-playing AI:
A convolutional neural network learns to play Pac-Man from scratch through trial and error. It plays 128 games simultaneously, gradually improving its strategy over 8,000 training updates. The agent progresses through a curriculum -- starting against easy ghosts and scaling up to the full arcade difficulty with all four ghost personalities (Blinky, Pinky, Inky, Clyde) using their authentic 1980 targeting strategies. The result is an agent that clears levels with 95%+ success rate.
Instead of learning from the real game, we train a Recurrent State-Space Model (RSSM) that watches the trained agent play and learns to simulate Pac-Man entirely in its latent space -- a compressed, learned representation of the game. Once this "dream engine" is trained, we train a new agent that has never seen the real game. It learns to play Pac-Man purely from imagined gameplay generated by the world model. This is the same paradigm behind DeepMind's Dreamer and recent breakthroughs like DIAMOND (NeurIPS 2024).
World models are one of the most promising directions toward sample-efficient AI. Instead of requiring millions of real interactions, an agent can generate unlimited training data from a learned simulator. This has implications beyond games -- robotics, autonomous driving, and scientific simulation all face the same problem of expensive real-world data.
The entire stack is custom-built: a vectorized NumPy game engine running 128 parallel games, a PyTorch training pipeline, and the full RSSM world model architecture. No OpenAI Gym, no pre-built environments -- just NumPy, PyTorch, and the original arcade rules.
Input: 32×31×28 grid (8ch × 4 frame stack) + 5 scalars
|
+-- Conv2d(32->64, 3x3, stride=1) -> ReLU
+-- Conv2d(64->128, 3x3, stride=2) -> ReLU
+-- Conv2d(128->128, 3x3, stride=2) -> ReLU
+-- Flatten
|
+-- Concat with 5 scalars
+-- Linear(... -> 512) -> ReLU <- shared backbone
|
+---> Linear(512 -> 256 -> 4) <- policy head
+---> Linear(512 -> 256 -> 1) <- value head
Total: ~4.2M parameters
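
A compact PyTorch sketch of this network: layer widths follow the diagram above, while the padding choices and the dummy-forward trick for the flatten size are assumptions (the real module lives in `pacman/agents/networks.py`).

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch of the shared-backbone actor-critic diagrammed above.
    Layer widths follow the diagram; padding and flatten size are assumptions."""

    def __init__(self, in_channels: int = 32, n_scalars: int = 5, n_actions: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy grid
            flat = self.cnn(torch.zeros(1, in_channels, 31, 28)).shape[1]
        self.backbone = nn.Sequential(nn.Linear(flat + n_scalars, 512), nn.ReLU())
        self.policy_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, n_actions))
        self.value_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, grid: torch.Tensor, scalars: torch.Tensor):
        features = torch.cat([self.cnn(grid), scalars], dim=-1)
        hidden = self.backbone(features)
        return self.policy_head(hidden), self.value_head(hidden)
```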
Observation (8x31x28) --> CNN Encoder [64,128,256,256] --> Posterior z
|
+-----------------------------------+
v
GRU Dynamics (h=512) --> Prior z (for imagination)
|
+---> Transposed CNN Decoder --> Reconstructed obs
+---> Reward Head (MLP) --> Predicted reward
+---> Continue Head (MLP) --> Termination probability
Latent state: h (512) + z (32 classes x 64 categoricals = 2048) = 2560 dims
Total: ~28M parameters
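
One RSSM transition under the sizes above (h=512, 64 categoricals of 32 classes giving a 2048-dim z) can be sketched as follows; the embedding width and layer shapes are assumptions, not the repo's `rssm.py`.

```python
import torch
import torch.nn as nn

class RSSMCore(nn.Module):
    """Sketch of one RSSM transition (h=512, z=2048); sizes other than these
    are assumptions."""

    def __init__(self, embed_dim: int = 1024, action_dim: int = 4,
                 h_dim: int = 512, z_dim: int = 2048):
        super().__init__()
        self.gru = nn.GRUCell(z_dim + action_dim, h_dim)      # deterministic path
        self.prior_net = nn.Linear(h_dim, z_dim)              # p(z | h): imagination
        self.post_net = nn.Linear(h_dim + embed_dim, z_dim)   # q(z | h, obs): training

    def forward(self, h, z, action, embed=None):
        h = self.gru(torch.cat([z, action], dim=-1), h)       # advance hidden state
        prior_logits = self.prior_net(h)
        if embed is None:                                     # imagination: no real obs
            return h, prior_logits, None
        post_logits = self.post_net(torch.cat([h, embed], dim=-1))
        return h, prior_logits, post_logits
```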
Latent state (2560) --> MLP (512 -> 256 -> 4 actions) <- actor
--> MLP (512 -> 256 -> 1 value) <- critic
The agent learns to play Pac-Man through three curriculum stages:
| Phase | Updates | Difficulty | What the Agent Learns |
|---|---|---|---|
| 1 | 0 - 800 | Scatter-only ghosts | Navigate maze, eat pellets |
| 2 | 800 - 3000 | Full ghost AI | Evade ghosts, use power pellets |
| 3 | 3000 - 8000 | Full AI + Cruise Elroy | Master ghost chains, optimize routes |
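
In code, the stage switch amounts to a lookup on the current update index; a hypothetical helper mirroring the table (the real scheduling lives in `pacman/training/trainer.py`):

```python
def curriculum_phase(update: int) -> dict:
    """Hypothetical helper mirroring the curriculum table above."""
    if update < 800:
        return {"phase": 1, "ghosts": "scatter_only", "cruise_elroy": False}
    if update < 3000:
        return {"phase": 2, "ghosts": "full_ai", "cruise_elroy": False}
    return {"phase": 3, "ghosts": "full_ai", "cruise_elroy": True}
```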
Key hyperparameters:
| Parameter | Value |
|---|---|
| Parallel environments | 128 |
| Rollout steps | 128 (= 16,384 transitions/rollout) |
| PPO epochs | 4 |
| Minibatch size | 2,048 |
| Learning rate | 2.5e-4 (annealed) |
| Entropy coefficient | 0.15 -> 0.01 (annealed) |
| Frame stacking | 4 frames |
| RND curiosity | Enabled (intrinsic_coef=0.1) |
| Total updates | 8,000 |
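
The two annealed quantities follow a schedule like the linear ramp below; this is a sketch assuming linear annealing over all 8,000 updates and a final learning rate of 0, which may differ from `config/default.yaml`.

```python
def linear_anneal(start: float, end: float, update: int, total: int = 8000) -> float:
    """Linearly interpolate from start to end over training (assumed schedule shape)."""
    frac = min(update / total, 1.0)
    return start + frac * (end - start)

# Entropy coefficient 0.15 -> 0.01; learning rate 2.5e-4 annealed
# (a final LR of 0 is an assumption, not the repo's exact setting).
for u in (0, 4000, 8000):
    print(u, linear_anneal(2.5e-4, 0.0, u), linear_anneal(0.15, 0.01, u))
```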
The trained PPO agent collects 1000 episodes of gameplay. The RSSM world model learns to simulate the game by minimizing:
Loss = reconstruction + reward_prediction + continue_prediction + KL(posterior || prior)
- Reconstruction: MSE between decoded latent state and real observation
- Reward: MSE on symlog-transformed rewards
- Continue: Binary cross-entropy on episode termination
- KL: Keeps the prior (imagination) close to the posterior (reality), with free nats = 1.0
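
A minimal sketch of that combined objective, assuming unit loss weights and a plain clipped KL (the repo's `wm_trainer.py` may weight or KL-balance these terms differently):

```python
import torch
import torch.nn.functional as F
import torch.distributions as D

def world_model_loss(recon, obs, reward_pred, reward,
                     cont_logits, cont_target,
                     post_logits, prior_logits, free_nats: float = 1.0):
    """Sketch of the combined world-model objective with unit weights."""
    symlog = lambda x: torch.sign(x) * torch.log1p(torch.abs(x))
    recon_loss = F.mse_loss(recon, obs)                        # reconstruction
    reward_loss = F.mse_loss(reward_pred, symlog(reward))      # symlog-transformed reward
    cont_loss = F.binary_cross_entropy_with_logits(cont_logits, cont_target)
    # KL over the 64 categoricals (32 classes each), clipped with free nats
    post = D.Categorical(logits=post_logits.view(-1, 64, 32))
    prior = D.Categorical(logits=prior_logits.view(-1, 64, 32))
    kl = D.kl_divergence(post, prior).sum(-1)
    kl_loss = torch.clamp(kl, min=free_nats).mean()            # no penalty below 1 nat
    return recon_loss + reward_loss + cont_loss + kl_loss
```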
The dream agent trains entirely inside the world model's latent space:
- Encode diverse starting observations into latent states
- Roll out 512 parallel imagined trajectories for 15 steps each
- Compute GAE advantages using predicted rewards and continuation probabilities
- PPO update on the imagined data
- Periodically evaluate in the real game to measure transfer
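
The imagination loop looks roughly like the sketch below; `encode()`, `step()`, `predict_reward()`, and `predict_continue()` are hypothetical method names standing in for the interface used by `pacman/training/dream_trainer.py`.

```python
import torch

@torch.no_grad()
def imagine_rollout(world_model, actor, start_obs, horizon: int = 15):
    """Sketch of collecting imagined trajectories entirely in latent space."""
    h, z = world_model.encode(start_obs)              # posterior latents for the starts
    latents, actions, rewards, continues = [], [], [], []
    for _ in range(horizon):
        latent = torch.cat([h, z], dim=-1)            # 2560-dim latent state
        action = actor(latent).sample()               # stochastic policy, no real env
        h, z = world_model.step(h, z, action)         # prior dynamics only
        latents.append(latent)
        actions.append(action)
        rewards.append(world_model.predict_reward(h, z))
        continues.append(world_model.predict_continue(h, z))
    return latents, actions, rewards, continues       # feed into GAE + PPO update
```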
The agent sees the game as a multi-channel image:
| Channel | Content | Purpose |
|---|---|---|
| 0 | Walls | Navigate the maze |
| 1 | Pac-Man position | Self-localization |
| 2 | Pellets | Primary objective |
| 3 | Power pellets | Strategic power-ups |
| 4 | Dangerous ghosts | Threats to avoid |
| 5 | Edible ghosts | Targets to chase |
| 6 | Ghost house | Restricted zone |
| 7 | Fruit | Bonus points |
Plus 5 scalar features: power timer, lives, ghosts eaten, progress, and current direction.
The PPO agent stacks 4 consecutive frames (32 channels) for temporal context. The world model uses single frames (8 channels) -- the GRU hidden state replaces frame stacking.
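
A minimal NumPy sketch of the stacking itself (the real logic lives in `pacman/env/vec_env.py`): each observation is an 8×31×28 grid, and concatenating 4 of them along the channel axis yields the 32×31×28 input the PPO agent sees.

```python
import numpy as np
from collections import deque

stack = deque(maxlen=4)
for _ in range(4):
    stack.append(np.zeros((8, 31, 28), dtype=np.float32))  # placeholder frames

stacked = np.concatenate(list(stack), axis=0)               # channels-first stack
assert stacked.shape == (32, 31, 28)
```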
Each ghost follows authentic 1980 arcade behavior:
| Ghost | Name | Targeting Strategy |
|---|---|---|
| Blinky | Red | Directly targets Pac-Man's tile |
| Pinky | Pink | Targets 4 tiles ahead of Pac-Man |
| Inky | Cyan | Flanking maneuver using Blinky's position |
| Clyde | Orange | Chases when far (>8 tiles), retreats when close |
Ghosts cycle through scatter (patrol corners) and chase (hunt Pac-Man) modes, with frightened mode when Pac-Man eats a power pellet.
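
The chase-mode targeting rules can be sketched as below (hypothetical signature; the authentic logic, including the original game's quirks, is in `pacman/engine/ghost_ai.py`):

```python
def chase_target(ghost, ghost_pos, pacman_pos, pacman_dir, blinky_pos, clyde_corner):
    """Sketch of the 1980 chase-mode targeting rules from the table above."""
    px, py = pacman_pos
    dx, dy = pacman_dir                                  # unit vector Pac-Man is facing
    if ghost == "blinky":                                # red: target Pac-Man's tile
        return (px, py)
    if ghost == "pinky":                                 # pink: 4 tiles ahead
        return (px + 4 * dx, py + 4 * dy)
    if ghost == "inky":                                  # cyan: mirror Blinky through a pivot
        pivot = (px + 2 * dx, py + 2 * dy)               # 2 tiles ahead of Pac-Man
        return (2 * pivot[0] - blinky_pos[0], 2 * pivot[1] - blinky_pos[1])
    # clyde (orange): chase when more than 8 tiles away, otherwise retreat to his corner
    dist = ((ghost_pos[0] - px) ** 2 + (ghost_pos[1] - py) ** 2) ** 0.5
    return (px, py) if dist > 8 else clyde_corner
```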
| Event | Reward | Purpose |
|---|---|---|
| Eat pellet | +1.0 | Core objective |
| Eat power pellet | +2.0 | Strategic value |
| Eat ghost (1st-4th) | +5 / +10 / +15 / +20 | Ghost chain bonus |
| Eat fruit | +3.0 | Bonus target |
| Clear level | +50.0 | Ultimate goal |
| Death | -10.0 | Avoid ghosts |
| Game over | -25.0 | Strong survival signal |
| Ghost proximity | -0.3 | Spatial awareness |
| Time step | -0.01 | Encourage efficiency |
# Clone
git clone https://github.com/vikranthreddimasu/pacman-ai.git
cd pacman-ai
# Install
pip install -e ".[dev]"
# Train PPO agent (128 parallel games, auto-detects MPS/CUDA/CPU)
python scripts/train.py --total-updates 8000
# Watch the agent play
python scripts/watch.py runs/<run-dir>/checkpoints/latest.pt
# Run tests
pytest tests/ -v

# Collect gameplay data from trained agent (~35 min)
python scripts/collect_data.py --checkpoint runs/<run-dir>/checkpoints/latest.pt --episodes 1000
# Train world model (~7 hours on MPS)
python scripts/train_world_model.py --data runs/<run-dir>/replay_buffer.pt
# Train dream agent (~2 hours)
python scripts/train_dreamer.py --world-model runs/<run-dir>/world_model/world_model_latest.pt
# Watch real game vs model's dream side-by-side
python scripts/watch_dreams.py --world-model runs/<run-dir>/world_model/world_model_latest.pt

# Live training dashboard (Dash + Plotly)
python scripts/dashboard.py --log-dir runs/<run-dir>/tensorboard/
# TensorBoard
tensorboard --logdir runs/pacman-ai/
├── pacman/
│ ├── agents/
│ │ ├── networks.py # Actor-Critic CNN (~4.2M params)
│ │ ├── ppo.py # PPO algorithm (clipped surrogate + GAE)
│ │ ├── rnd.py # Random Network Distillation curiosity
│ │ └── rollout.py # Experience buffer with batch generator
│ ├── config/
│ │ └── default.yaml # All hyperparameters
│ ├── engine/
│ │ ├── constants.py # Enums, directions, ghost properties
│ │ ├── entities.py # GameState dataclass
│ │ ├── game.py # Step logic, collision, rewards
│ │ ├── ghost_ai.py # Blinky/Pinky/Inky/Clyde targeting
│ │ ├── maze.py # Grid operations, BFS pathfinding
│ │ └── maze_data.py # Classic arcade maze layout
│ ├── env/
│ │ ├── pacman_env.py # Single-game Gymnasium interface
│ │ └── vec_env.py # N parallel games with auto-reset + frame stacking
│ ├── training/
│ │ ├── trainer.py # PPO training loop + curriculum
│ │ ├── evaluator.py # Greedy policy evaluation
│ │ ├── checkpoint.py # Save/load model state
│ │ ├── wm_trainer.py # World model training loop
│ │ └── dream_trainer.py # Imagination PPO training loop
│ ├── world_model/
│ │ ├── rssm.py # GRU dynamics + categorical stochastic state
│ │ ├── encoder.py # 4-layer CNN encoder (obs -> latent)
│ │ ├── decoder.py # Transposed CNN decoder (latent -> obs)
│ │ ├── heads.py # Reward and continue prediction heads
│ │ ├── world_model.py # Integrated model: train_step() + imagine()
│ │ └── replay_buffer.py # Sequential episode storage
│ ├── utils/
│ │ └── config.py # YAML config loader
│ └── viz/
│ ├── renderer.py # Pygame game renderer
│ ├── sprites.py # Pac-Man and ghost sprites
│ ├── dashboard.py # Live metrics dashboard (Dash + Plotly)
│ └── dream_viewer.py # Side-by-side real vs dream visualization
├── scripts/
│ ├── train.py # PPO training entry point
│ ├── evaluate.py # Evaluate a checkpoint
│ ├── watch.py # Watch agent play in real-time
│ ├── dashboard.py # Launch metrics dashboard
│ ├── collect_data.py # Collect gameplay data for world model
│ ├── train_world_model.py # Train RSSM from collected data
│ ├── train_dreamer.py # Train dream agent in imagination
│ └── watch_dreams.py # Launch dream viewer
├── tests/ # 79 tests
├── Dockerfile
└── pyproject.toml
Why PPO over DQN? PPO's stochastic policy gives built-in exploration (no epsilon-greedy schedule to tune), it stays stable with the large batches produced by 128 parallel environments, and its policy-gradient updates handle the high-dimensional grid observations well.
Why a custom engine? A pure NumPy engine runs 128 games in parallel without process overhead, achieving 2,800+ environment steps per second with full control over authentic ghost AI.
Why curriculum learning? Without it, ghosts kill the agent before it learns to eat pellets. Scatter-only ghosts let the agent learn navigation first.
Why frame stacking for PPO but not the world model? The PPO agent needs 4 stacked frames to perceive motion (ghost direction, speed). The world model's GRU hidden state serves the same purpose -- it maintains temporal memory across steps, so single frames suffice.
Why categorical stochastic state? Pac-Man is a discrete game -- positions are on a grid, ghosts are in specific modes. Categorical distributions (32 classes x 64 categoricals) model this better than Gaussian alternatives, matching the Dreamer V3 approach.
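
A sketch of the straight-through categorical sampling this implies (Dreamer-style; not copied from `rssm.py`): hard one-hot samples on the forward pass, softmax gradients on the backward pass.

```python
import torch
import torch.nn.functional as F

def sample_latent(logits, n_cats: int = 64, n_classes: int = 32):
    """Straight-through sample of the categorical stochastic state."""
    logits = logits.view(-1, n_cats, n_classes)
    probs = F.softmax(logits, dim=-1)
    idx = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(idx, n_classes).float()
    z = one_hot + probs - probs.detach()          # straight-through estimator
    return z.view(-1, n_cats * n_classes)         # 2048-dim stochastic state
```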
The world model implementation draws from:
- Dreamer V3 (Hafner et al., 2023) -- RSSM with categorical stochastic state
- DIAMOND (NeurIPS 2024) -- world models for game simulation
- Genie 3 (DeepMind) -- imagination-based agent training
| Component | Technology |
|---|---|
| Game Engine | NumPy (vectorized, 128 parallel games) |
| PPO Agent | PyTorch CNN Actor-Critic (~4.2M params) |
| Curiosity | Random Network Distillation (RND) |
| World Model | RSSM with categorical latent state (~28M params) |
| Acceleration | Apple MPS / CUDA / CPU |
| Visualization | Pygame + Dash/Plotly |
| Tracking | TensorBoard |
| Testing | pytest (79 tests) |
MIT