
Pac-Man AI: From Reinforcement Learning to World Models

A neural network learns to play Pac-Man. Then a second neural network learns to dream the game -- and a third agent learns to play entirely inside those dreams.

79 tests | 128 parallel envs | 4.2M param PPO agent | 28M param world model


What is this project?

This project explores two fundamental approaches to training game-playing AI:

1. Learning by Playing (PPO Agent)

A convolutional neural network learns to play Pac-Man from scratch through trial and error. It plays 128 games simultaneously, gradually improving its strategy over 8,000 training updates. The agent progresses through a curriculum -- starting against easy ghosts and scaling up to the full arcade difficulty with all four ghost personalities (Blinky, Pinky, Inky, Clyde) using their authentic 1980 targeting strategies. The result is an agent that clears levels with 95%+ success rate.
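The trial-and-error updates use PPO's clipped surrogate objective. A minimal sketch (the 0.2 clip range is a common default, not confirmed from this repository's config):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) of the two bounds,
    # which removes the incentive to move the ratio outside [1-eps, 1+eps].
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```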

2. Learning by Dreaming (World Model)

Instead of learning from the real game, we train a Recurrent State-Space Model (RSSM) that watches the trained agent play and learns to simulate Pac-Man entirely in its latent space -- a compressed, learned representation of the game. Once this "dream engine" is trained, we train a new agent that has never seen the real game. It learns to play Pac-Man purely from imagined gameplay generated by the world model. This is the same paradigm behind DeepMind's Dreamer and recent breakthroughs like DIAMOND (NeurIPS 2024).

Why does this matter?

World models are one of the most promising directions toward sample-efficient AI. Instead of requiring millions of real interactions, an agent can generate unlimited training data from a learned simulator. This has implications beyond games -- robotics, autonomous driving, and scientific simulation all face the same problem of expensive real-world data.

Built from scratch

The entire stack is custom-built: a vectorized NumPy game engine running 128 parallel games, a PyTorch training pipeline, and the full RSSM world model architecture. No OpenAI Gym, no pre-built environments -- just NumPy, PyTorch, and the original arcade rules.


Architecture

PPO Agent

Input: 32×31×28 grid (8ch × 4 frame stack) + 5 scalars
  |
  +-- Conv2d(32->64, 3x3, stride=1)  -> ReLU
  +-- Conv2d(64->128, 3x3, stride=2) -> ReLU
  +-- Conv2d(128->128, 3x3, stride=2) -> ReLU
  +-- Flatten
  |
  +-- Concat with 5 scalars
  +-- Linear(... -> 512) -> ReLU           <- shared backbone
  |
  +---> Linear(512 -> 256 -> 4)            <- policy head
  +---> Linear(512 -> 256 -> 1)            <- value head

Total: ~4.2M parameters
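The diagram above can be sketched in PyTorch. Padding choices are assumptions (the README only gives channel counts and strides), picked so the flattened size lands near the stated ~4.2M parameters:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch of the shared-backbone actor-critic described above.
    padding=1 on each conv is an assumption, not confirmed by the repo."""
    def __init__(self, in_ch=32, n_scalars=5, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size from a dummy 31x28 grid.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, in_ch, 31, 28)).shape[1]
        self.backbone = nn.Sequential(nn.Linear(flat + n_scalars, 512), nn.ReLU())
        self.policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, n_actions))
        self.value = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, grid, scalars):
        h = self.backbone(torch.cat([self.conv(grid), scalars], dim=-1))
        return self.policy(h), self.value(h)
```

With these assumptions the parameter count comes out around 4.18M, consistent with the ~4.2M figure.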

World Model (RSSM)

Observation (8x31x28) --> CNN Encoder [64,128,256,256] --> Posterior z
                                                              |
                          +-----------------------------------+
                          v
                    GRU Dynamics (h=512) --> Prior z (for imagination)
                          |
                          +---> Transposed CNN Decoder --> Reconstructed obs
                          +---> Reward Head (MLP)     --> Predicted reward
                          +---> Continue Head (MLP)   --> Termination probability

Latent state: h (512) + z (32 classes x 64 categoricals = 2048) = 2560 dims
Total: ~28M parameters
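Sampling the categorical stochastic state z is typically done with a straight-through estimator so gradients flow through the discrete sample. A sketch, assuming a (64 variables x 32 classes) logit layout to match the 2048-dim figure above:

```python
import torch
import torch.nn.functional as F

def sample_categorical_latent(logits):
    """Straight-through sample of the stochastic state z.
    logits: (batch, 64, 32) -- 64 categorical variables, 32 classes each
    (layout is an assumption; the README gives 32 classes x 64 categoricals)."""
    probs = F.softmax(logits, dim=-1)
    # One-hot sample per categorical variable.
    idx = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(idx, num_classes=logits.shape[-1]).float()
    # Straight-through: forward pass uses the sample, backward uses the probs.
    z = one_hot + probs - probs.detach()
    return z.flatten(1)  # (batch, 2048)
```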

Dream Agent

Latent state (2560) --> MLP (512 -> 256 -> 4 actions)   <- actor
                    --> MLP (512 -> 256 -> 1 value)      <- critic

Training Pipeline

Phase 1: PPO Training (Real Environment)

The agent learns to play Pac-Man through three curriculum stages:

| Phase | Updates | Difficulty | What the Agent Learns |
| --- | --- | --- | --- |
| 1 | 0 - 800 | Scatter-only ghosts | Navigate maze, eat pellets |
| 2 | 800 - 3000 | Full ghost AI | Evade ghosts, use power pellets |
| 3 | 3000 - 8000 | Full AI + Cruise Elroy | Master ghost chains, optimize routes |

Key hyperparameters:

| Parameter | Value |
| --- | --- |
| Parallel environments | 128 |
| Rollout steps | 128 (= 16,384 transitions/rollout) |
| PPO epochs | 4 |
| Minibatch size | 2,048 |
| Learning rate | 2.5e-4 (annealed) |
| Entropy coefficient | 0.15 -> 0.01 (annealed) |
| Frame stacking | 4 frames |
| RND curiosity | Enabled (intrinsic_coef=0.1) |
| Total updates | 8,000 |

Phase 2: World Model Training

The trained PPO agent collects 1000 episodes of gameplay. The RSSM world model learns to simulate the game by minimizing:

Loss = reconstruction + reward_prediction + continue_prediction + KL(posterior || prior)
  • Reconstruction: MSE between decoded latent state and real observation
  • Reward: MSE on symlog-transformed rewards
  • Continue: Binary cross-entropy on episode termination
  • KL: Keeps the prior (imagination) close to the posterior (reality), with free nats = 1.0
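The four terms above can be sketched as a single loss function. Per-term weights (all 1.0 here) are assumptions, as is the exact free-nats clamping:

```python
import torch
import torch.nn.functional as F

def symlog(x):
    # Symmetric log transform applied to reward targets.
    return torch.sign(x) * torch.log1p(torch.abs(x))

def world_model_loss(recon, obs, pred_reward, reward, cont_logit, cont, kl,
                     free_nats=1.0):
    """Sketch of the combined RSSM loss; term weights are assumed to be 1.0."""
    recon_loss = F.mse_loss(recon, obs)                      # reconstruction
    reward_loss = F.mse_loss(pred_reward, symlog(reward))    # symlog reward
    cont_loss = F.binary_cross_entropy_with_logits(cont_logit, cont)
    # Free nats: KL below the threshold incurs a constant, gradient-free cost.
    kl_loss = torch.clamp(kl, min=free_nats).mean()
    return recon_loss + reward_loss + cont_loss + kl_loss
```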

Phase 3: Imagination Training

The dream agent trains entirely inside the world model's latent space:

  1. Encode diverse starting observations into latent states
  2. Roll out 512 parallel imagined trajectories for 15 steps each
  3. Compute GAE advantages using predicted rewards and continuation probabilities
  4. PPO update on the imagined data
  5. Periodically evaluate in the real game to measure transfer
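Step 3 above can be sketched as a GAE recursion over the imagined rollout, with the predicted continuation probabilities gating the bootstrap (gamma/lambda values are typical defaults, not confirmed by this repo):

```python
import torch

def gae_from_imagination(rewards, values, conts, gamma=0.99, lam=0.95):
    """GAE over an imagined rollout.
    rewards, conts: (T, B); values: (T+1, B) with a bootstrap row at the end.
    conts[t] is the model's predicted probability the episode continues."""
    T = rewards.shape[0]
    adv = torch.zeros_like(rewards)
    last = torch.zeros_like(rewards[0])
    for t in reversed(range(T)):
        # Predicted termination zeroes out the bootstrap and the GAE carry-over.
        delta = rewards[t] + gamma * conts[t] * values[t + 1] - values[t]
        last = delta + gamma * lam * conts[t] * last
        adv[t] = last
    returns = adv + values[:-1]
    return adv, returns
```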

Observation Space

The agent sees the game as a multi-channel image:

| Channel | Content | Purpose |
| --- | --- | --- |
| 0 | Walls | Navigate the maze |
| 1 | Pac-Man position | Self-localization |
| 2 | Pellets | Primary objective |
| 3 | Power pellets | Strategic power-ups |
| 4 | Dangerous ghosts | Threats to avoid |
| 5 | Edible ghosts | Targets to chase |
| 6 | Ghost house | Restricted zone |
| 7 | Fruit | Bonus points |

Plus 5 scalar features: power timer, lives, ghosts eaten, progress, and current direction.

The PPO agent stacks 4 consecutive frames (32 channels) for temporal context. The world model uses single frames (8 channels) -- the GRU hidden state replaces frame stacking.
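The 8-channel-to-32-channel stacking can be sketched as a ring buffer over the channel axis (the real vec_env interface in `pacman/env/vec_env.py` may differ):

```python
import numpy as np
from collections import deque

class FrameStack:
    """Sketch of 4-frame stacking: each 8-channel frame is concatenated
    along the channel axis, giving the agent short-term motion cues."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, obs):  # obs: (8, 31, 28)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)  # pad the stack with the first frame
        return np.concatenate(self.frames, axis=0)  # (32, 31, 28)

    def step(self, obs):
        self.frames.append(obs)  # oldest frame drops off automatically
        return np.concatenate(self.frames, axis=0)
```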


Ghost AI

Each ghost follows authentic 1980 arcade behavior:

| Ghost | Color | Targeting Strategy |
| --- | --- | --- |
| Blinky | Red | Directly targets Pac-Man's tile |
| Pinky | Pink | Targets 4 tiles ahead of Pac-Man |
| Inky | Cyan | Flanking maneuver using Blinky's position |
| Clyde | Orange | Chases when far (>8 tiles), retreats when close |

Ghosts cycle through scatter (patrol corners) and chase (hunt Pac-Man) modes, with frightened mode when Pac-Man eats a power pellet.
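Two of the targeting rules above are simple enough to sketch directly. Grid conventions and Clyde's scatter corner are assumptions; the repo's `ghost_ai.py` is the authoritative version:

```python
import math

# Direction vectors as (dx, dy); the engine's convention is assumed here.
DIRS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def pinky_target(pac_pos, pac_dir):
    """Pinky ambushes: aim 4 tiles ahead of Pac-Man's current direction."""
    dx, dy = DIRS[pac_dir]
    return (pac_pos[0] + 4 * dx, pac_pos[1] + 4 * dy)

def clyde_target(ghost_pos, pac_pos, scatter_corner=(0, 30)):
    """Clyde chases when more than 8 tiles away, else retreats to his corner
    (corner coordinates here are a placeholder, not the repo's values)."""
    return pac_pos if math.dist(ghost_pos, pac_pos) > 8 else scatter_corner
```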


Reward Design

| Event | Reward | Purpose |
| --- | --- | --- |
| Eat pellet | +1.0 | Core objective |
| Eat power pellet | +2.0 | Strategic value |
| Eat ghost (1st-4th) | +5 / +10 / +15 / +20 | Ghost chain bonus |
| Eat fruit | +3.0 | Bonus target |
| Clear level | +50.0 | Ultimate goal |
| Death | -10.0 | Avoid ghosts |
| Game over | -25.0 | Strong survival signal |
| Ghost proximity | -0.3 | Spatial awareness |
| Time step | -0.01 | Encourage efficiency |
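The table above reduces to a constant lookup plus one linear rule for the ghost chain; a sketch (the repo's `game.py` holds the actual implementation):

```python
# Reward constants from the table above.
REWARDS = {
    "pellet": 1.0, "power_pellet": 2.0, "fruit": 3.0, "level_clear": 50.0,
    "death": -10.0, "game_over": -25.0, "ghost_proximity": -0.3,
    "time_step": -0.01,
}

def ghost_chain_reward(chain_index):
    """Reward for the n-th ghost (1-indexed) eaten during one power pellet:
    +5 / +10 / +15 / +20, i.e. 5 * n."""
    return 5.0 * chain_index
```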

Quick Start

# Clone
git clone https://github.com/vikranthreddimasu/pacman-ai.git
cd pacman-ai

# Install
pip install -e ".[dev]"

# Train PPO agent (128 parallel games, auto-detects MPS/CUDA/CPU)
python scripts/train.py --total-updates 8000

# Watch the agent play
python scripts/watch.py runs/<run-dir>/checkpoints/latest.pt

# Run tests
pytest tests/ -v

World Model Pipeline

# Collect gameplay data from trained agent (~35 min)
python scripts/collect_data.py --checkpoint runs/<run-dir>/checkpoints/latest.pt --episodes 1000

# Train world model (~7 hours on MPS)
python scripts/train_world_model.py --data runs/<run-dir>/replay_buffer.pt

# Train dream agent (~2 hours)
python scripts/train_dreamer.py --world-model runs/<run-dir>/world_model/world_model_latest.pt

# Watch real game vs model's dream side-by-side
python scripts/watch_dreams.py --world-model runs/<run-dir>/world_model/world_model_latest.pt

Monitoring

# Live training dashboard (Dash + Plotly)
python scripts/dashboard.py --log-dir runs/<run-dir>/tensorboard/

# TensorBoard
tensorboard --logdir runs/

Project Structure

pacman-ai/
├── pacman/
│   ├── agents/
│   │   ├── networks.py        # Actor-Critic CNN (~4.2M params)
│   │   ├── ppo.py             # PPO algorithm (clipped surrogate + GAE)
│   │   ├── rnd.py             # Random Network Distillation curiosity
│   │   └── rollout.py         # Experience buffer with batch generator
│   ├── config/
│   │   └── default.yaml       # All hyperparameters
│   ├── engine/
│   │   ├── constants.py       # Enums, directions, ghost properties
│   │   ├── entities.py        # GameState dataclass
│   │   ├── game.py            # Step logic, collision, rewards
│   │   ├── ghost_ai.py        # Blinky/Pinky/Inky/Clyde targeting
│   │   ├── maze.py            # Grid operations, BFS pathfinding
│   │   └── maze_data.py       # Classic arcade maze layout
│   ├── env/
│   │   ├── pacman_env.py      # Single-game Gymnasium interface
│   │   └── vec_env.py         # N parallel games with auto-reset + frame stacking
│   ├── training/
│   │   ├── trainer.py         # PPO training loop + curriculum
│   │   ├── evaluator.py       # Greedy policy evaluation
│   │   ├── checkpoint.py      # Save/load model state
│   │   ├── wm_trainer.py      # World model training loop
│   │   └── dream_trainer.py   # Imagination PPO training loop
│   ├── world_model/
│   │   ├── rssm.py            # GRU dynamics + categorical stochastic state
│   │   ├── encoder.py         # 4-layer CNN encoder (obs -> latent)
│   │   ├── decoder.py         # Transposed CNN decoder (latent -> obs)
│   │   ├── heads.py           # Reward and continue prediction heads
│   │   ├── world_model.py     # Integrated model: train_step() + imagine()
│   │   └── replay_buffer.py   # Sequential episode storage
│   ├── utils/
│   │   └── config.py          # YAML config loader
│   └── viz/
│       ├── renderer.py        # Pygame game renderer
│       ├── sprites.py         # Pac-Man and ghost sprites
│       ├── dashboard.py       # Live metrics dashboard (Dash + Plotly)
│       └── dream_viewer.py    # Side-by-side real vs dream visualization
├── scripts/
│   ├── train.py               # PPO training entry point
│   ├── evaluate.py            # Evaluate a checkpoint
│   ├── watch.py               # Watch agent play in real-time
│   ├── dashboard.py           # Launch metrics dashboard
│   ├── collect_data.py        # Collect gameplay data for world model
│   ├── train_world_model.py   # Train RSSM from collected data
│   ├── train_dreamer.py       # Train dream agent in imagination
│   └── watch_dreams.py        # Launch dream viewer
├── tests/                     # 79 tests
├── Dockerfile
└── pyproject.toml

Key Design Decisions

Why PPO over DQN? PPO is more stable with the large batches produced by 128 parallel environments, explores through an explicitly stochastic policy rather than epsilon-greedy, and avoids replay-buffer staleness when the curriculum changes the environment distribution.

Why a custom engine? A pure NumPy engine runs 128 games in parallel without process overhead, achieving 2,800+ environment steps per second with full control over authentic ghost AI.

Why curriculum learning? Without it, ghosts kill the agent before it learns to eat pellets. Scatter-only ghosts let the agent learn navigation first.

Why frame stacking for PPO but not the world model? The PPO agent needs 4 stacked frames to perceive motion (ghost direction, speed). The world model's GRU hidden state serves the same purpose -- it maintains temporal memory across steps, so single frames suffice.

Why categorical stochastic state? Pac-Man is a discrete game -- positions are on a grid, ghosts are in specific modes. Categorical distributions (32 classes x 64 categoricals) model this better than Gaussian alternatives, matching the Dreamer V3 approach.


Research Context

The world model implementation draws from:

  • Dreamer V3 (Hafner et al., 2023) -- RSSM with categorical stochastic state
  • DIAMOND (NeurIPS 2024) -- world models for game simulation
  • Genie 3 (DeepMind) -- imagination-based agent training

Tech Stack

| Component | Technology |
| --- | --- |
| Game Engine | NumPy (vectorized, 128 parallel games) |
| PPO Agent | PyTorch CNN Actor-Critic (~4.2M params) |
| Curiosity | Random Network Distillation (RND) |
| World Model | RSSM with categorical latent state (~28M params) |
| Acceleration | Apple MPS / CUDA / CPU |
| Visualization | Pygame + Dash/Plotly |
| Tracking | TensorBoard |
| Testing | pytest (79 tests) |

License

MIT
