This repository contains coursework for BIOE70077 - Reinforcement Learning for Bioengineers at Imperial College London. It implements and compares Dynamic Programming and Monte Carlo methods in a maze environment that simulates a drug delivery problem in bioengineering.
The task was to train an agent to navigate a grid-like maze with obstacles and absorbing states representing biological barriers and drug targets. The project explored:
- Dynamic Programming (Policy Iteration)
- Monte Carlo Learning (First-Visit, with Decaying ε-Greedy Policy)
- Exploration Strategy Comparison (Fixed ε, Decaying ε, and SoftMax)
- Sensitivity Analysis
The environment is inspired by drug delivery systems, with RL techniques used to discover optimal paths under probabilistic movement constraints.
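For orientation, a minimal sketch of the kind of stochastic grid-world this implies is given below; the class name `MazeEnv`, the layout encoding (`#` for barriers, `G` for the drug target), and the reward values are illustrative assumptions rather than the coursework's actual interface.

```python
import numpy as np

class MazeEnv:
    """Toy grid maze with stochastic movement (illustrative sketch, not the coursework code).

    '#' cells are obstacles (biological barriers), 'G' is the absorbing drug target,
    '.' cells are free space the agent can occupy.
    """
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, layout, p_success=0.8, step_reward=-1.0, goal_reward=10.0):
        self.grid = [list(row) for row in layout]
        self.n_rows, self.n_cols = len(self.grid), len(self.grid[0])
        self.p_success = p_success        # probability the intended move is executed
        self.step_reward = step_reward
        self.goal_reward = goal_reward
        self.rng = np.random.default_rng(0)

    def step(self, state, action):
        """Apply an action; with probability 1 - p_success a random action fires instead."""
        if self.rng.random() > self.p_success:
            action = int(self.rng.integers(len(self.ACTIONS)))
        dr, dc = self.ACTIONS[action]
        r, c = state
        nr, nc = r + dr, c + dc
        # Moves into walls or off the grid leave the agent in place.
        if not (0 <= nr < self.n_rows and 0 <= nc < self.n_cols) or self.grid[nr][nc] == '#':
            nr, nc = r, c
        if self.grid[nr][nc] == 'G':
            return (nr, nc), self.goal_reward, True   # absorbing target reached
        return (nr, nc), self.step_reward, False
```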
- Approach: Policy Iteration (see the sketch below)
- Assumptions: Full knowledge of transition matrix (T) and reward function (R)
- Highlights:
  - Converges quickly in the small state space
  - Value function and policy grid visualizations
  - Explored the effect of different discount factors (γ) and transition probabilities
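A minimal policy-iteration sketch under the stated assumption of full access to T and R; the array shapes (`T[s, a, s']`, `R[s, a]`), function name, and default hyperparameters are illustrative assumptions, not the exact coursework implementation.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9, tol=1e-6):
    """Policy iteration for a finite MDP (illustrative sketch).

    T: transition tensor with T[s, a, s2] = P(s2 | s, a)
    R: expected immediate reward R[s, a]
    Returns the optimal policy (one action per state) and its value function.
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: repeat Bellman expectation backups until V stabilises.
        while True:
            T_pi = T[np.arange(n_states), policy]   # (S, S') under the current policy
            R_pi = R[np.arange(n_states), policy]   # (S,)
            V_new = R_pi + gamma * T_pi @ V
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily w.r.t. the one-step lookahead Q-values.
        Q = R + gamma * T @ V                        # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```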
- Approach: First-Visit Monte Carlo with Decaying ε-Greedy (see the sketch below)
- Assumptions: Model-free learning using sampled episodes
- Highlights:
  - Optimistic Q-value initialization
  - Decaying learning rate and exploration rate
  - Learning curve across 400 episodes
  - Policy and value function visualizations
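A minimal sketch of first-visit Monte Carlo control with optimistic initialization and decaying rates, matching the highlights above; the environment interface (`reset`, `step`, `n_actions`) and the default hyperparameters are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def mc_first_visit_control(env, n_episodes=400, gamma=0.9,
                           eps0=1.0, eps_decay=0.99, q_init=5.0):
    """First-visit Monte Carlo control with a decaying epsilon-greedy policy (sketch).

    Assumes env exposes reset() -> state, step(action) -> (next_state, reward, done),
    and a discrete action count env.n_actions. Q-values start optimistically at q_init,
    and the per-pair learning rate decays as 1 / visit count.
    """
    rng = np.random.default_rng(0)
    Q = defaultdict(lambda: np.full(env.n_actions, q_init))
    visits = defaultdict(lambda: np.zeros(env.n_actions))
    epsilon = eps0

    for _ in range(n_episodes):
        # Roll out one episode under the current epsilon-greedy policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Discounted return G_t at every step, computed backwards through the episode.
        returns = np.zeros(len(trajectory))
        G = 0.0
        for t in reversed(range(len(trajectory))):
            G = trajectory[t][2] + gamma * G
            returns[t] = G

        # First-visit rule: only the first occurrence of each (s, a) pair is updated.
        seen = set()
        for t, (s, a, _) in enumerate(trajectory):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            visits[s][a] += 1
            alpha = 1.0 / visits[s][a]        # decaying learning rate
            Q[s][a] += alpha * (returns[t] - Q[s][a])

        epsilon *= eps_decay                  # decaying exploration rate
    return Q
```

With a 1/N(s, a) step size, each Q-value is the running mean of its first-visit returns, which is the standard sample-average form of a decaying learning rate.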
| Strategy | Description | Notes |
|---|---|---|
| Fixed ε-Greedy | Constant exploration rate | High variance, did not converge efficiently |
| Decaying ε-Greedy | Exploration rate decays over time | Balanced convergence and exploration |
| SoftMax | Action probability ∝ exp(Q/τ) (temperature-based) | Best stability and learning performance |
Each strategy was evaluated by comparing learning curves and policy convergence; the corresponding action-selection rules are sketched below.
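The three rules could be implemented roughly as follows; the function names and default parameters are illustrative assumptions, not taken from the repository.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Fixed epsilon-greedy: explore uniformly at random with probability epsilon."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(eps0, decay, episode):
    """Decaying schedule: epsilon shrinks geometrically with the episode index."""
    return eps0 * decay ** episode

def softmax_action(q_values, temperature=1.0, rng=None):
    """Softmax (Boltzmann) selection: P(a) proportional to exp(Q(a) / temperature)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                                  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))
```

A lower temperature makes SoftMax increasingly greedy, while a higher temperature flattens the distribution toward uniform exploration.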
- SoftMax strategy yielded the highest total rewards and most stable convergence.
- The DP agent converged faster due to full environment access.
- Monte Carlo handled uncertainty well, especially with exploration strategy tuning.
- Policy Iteration is efficient in small, fully known MDPs.
- Model-free agents (like Monte Carlo) benefit from smart exploration strategies.
- Exploration-exploitation tradeoff is crucial in noisy environments.
- SoftMax performed best, balancing fast learning and policy optimality.