MSc Artificial Intelligence and Adaptive Systems · Dissertation School of Engineering and Informatics · University of Sussex · August 2025 Supervised by Dr. James Bennett, Informatics Department
"The goal of memory is not perfect retention, but optimising future decisions — which makes forgetting just as important as remembering."
— Richards & Frankland, Neuron (2017)
- Abstract
- Motivation — Why Forgetting?
- Key Contributions
- The Core Idea in 60 Seconds
- Algorithm — Full Technical Details
- Environment
- Results
- Project Structure
- Installation
- Usage
- Live Visualisations
- Hyperparameter Grid Search
- Hyperparameter Reference
- Connections to Prior Work
- Discussion Highlights
- Future Directions
- Full Dissertation
- Citation
- License
Brea et al. (2014) demonstrated that fast forgetting of punishment can be advantageous in a simple go–no-go task. This work tests that principle in a more complex, noisy, and dynamic environment. We present a model-free reinforcement learning method that integrates Q-learning with asymmetric decay of value estimates, inspired by biological evidence of active forgetting and differential retention of appetitive and aversive memories. The agent uses two independent value functions for positive and negative reinforcement. After each step, every entry in both tables is decayed by its channel's forgetting rate, so that outdated experience fades without requiring explicit context detection. We evaluate the algorithm in a custom 2D grid-world environment with four rewarding and four punishing states whose locations are resampled every 20 steps. Gaussian noise is added to the cells, and the agent selects actions via a softmax policy over the net value Q_net = Q+ − Q−. A grid search over forgetting rates reveals that specific asymmetric settings, notably faster forgetting of rewards than of punishments (φ+ > φ−), yield higher long-term reward accumulation. Results show that simple, biologically inspired forgetting mechanisms can be implemented efficiently in model-free RL to improve performance in noisy, changing environments.
Standard Q-learning assumes persistent memory: once a value is learned, it remains unchanged until it is explicitly overwritten by a new experience. This is fine for static environments, but most real-world problems are non-stationary — rewards move, dangers shift, and the optimal policy today may be disastrously wrong tomorrow.
Neuroscience tells us something striking: biological memory is actively and asymmetrically regulated. In Drosophila melanogaster (fruit flies), researchers found that:
- Aversive (punishment) memories decay faster than appetitive (reward) memories under normal conditions (Shuai et al., 2010).
- The protein Rac1 actively drives forgetting: blocking Rac1 makes flies remember punishment longer, while activating it makes them forget faster (Shuai et al., 2010).
- This is not a design flaw. It is computationally optimal: in a changing world, quickly forgetting a punishment allows the agent to re-explore that region once conditions change, potentially discovering new rewards.
Richards & Frankland (2017) formalised this insight: the purpose of memory is not perfect storage — it is optimising future decisions. Memory transience enhances generalisation and prevents overfitting to outdated experiences.
This project asks: can we transplant this biological principle into a reinforcement learning agent, and does it actually help?
The answer is yes.
- Novel Algorithm: Dual-Q learning with independent, asymmetric per-step forgetting rates for the reward (φ+) and punishment (φ-) channels, designed, implemented, and validated from scratch.
- Biologically Grounded Design: Every algorithmic choice is motivated by neuroscience. The decay mechanism mirrors Rac1-mediated forgetting (Brea et al., 2014). The dual-channel architecture mirrors the appetitive/aversive separation in Drosophila mushroom body circuits (Bennett et al., 2021). The persistence-transience balance mirrors the framework of Richards & Frankland (2017).
- Custom Gymnasium Environment: An 8×8 stochastic grid world with Gaussian noise, reward and punishment relocation every 20 steps, and separate reward/punishment maps, built from scratch on the Gymnasium API.
- Exhaustive Empirical Validation: 800+ hyperparameter configurations × 50 independent Monte Carlo runs × 2000 steps each, with every configuration's score averaged over its 50 runs.
- Parallel Grid Search: Fully parallelised hyperparameter sweep using Python's ProcessPoolExecutor, with all CPU cores utilised.
- Rich Visualisation Dashboard: Real-time live plots of Q+, Q−, Q_net, policy arrows, cumulative reward, and reward rate, all updating as the agent learns.
- Clear, Reproducible Result: Asymmetric forgetting (φ+ > φ-, with the exact optimum depending on the learning rate) consistently achieves higher cumulative reward than the no-decay and symmetric-decay baselines across all tested learning rates.
Standard Q-learning remembers everything forever. In a changing world, old information is worse than no information — it's actively misleading.
This agent maintains two separate value tables:
| Table | Tracks | Forgetting Rate |
|---|---|---|
| Q+ | Expected future rewards | φ+: moderate-to-high (forget reasonably fast) |
| Q- | Expected future punishments | φ-: lower than φ+ (forget more slowly) |
After every single step — regardless of which state was visited — both tables decay globally:
Q+ ← (1 − φ+) · Q+ # all reward memory fades slightly
Q- ← (1 − φ-) · Q- # all punishment memory fades slightly, but slower
The agent then selects actions based on the net value Q_net = Q+ − Q−, using a Boltzmann (softmax) policy.
Why does asymmetry help?
- High φ+: After rewards relocate, the agent's confidence in old reward locations fades quickly → it explores again → finds the new reward → higher return.
- Low φ-: The agent retains danger information longer → avoids repeatedly blundering into punishments → fewer costly mistakes.
This is exactly what fruit flies do. And it works in RL too.
The environment is modelled as a Markov Decision Process (MDP) defined by the tuple (S, A, P, r+, r−, γ):
- S: finite set of states (|S| = 64 for the 8×8 grid)
- A: finite set of actions (|A| = 4: Up, Down, Left, Right)
- P(s'|s, a): transition probability (deterministic movement with boundary constraints)
- r+(s, a) ≥ 0: positive reinforcement component (reward)
- r−(s, a) ≥ 0: negative reinforcement component (punishment)
- γ ∈ [0,1]: discount factor (γ = 0.95)
The agent's objective is to maximise the expected discounted cumulative return:
E_π [ Σ_{t=0}^{∞} γ^t · (r+_t − r-_t) ]
Two independent Q-tables are maintained:
Q+ ∈ ℝ^{|S|×|A|} (reward predictions)
Q- ∈ ℝ^{|S|×|A|} (punishment predictions)
Both initialised to zero. They are updated independently via separate TD streams, and they decay independently via separate forgetting rates.
Before action selection and before any TD update, both tables are multiplied by their retention factors:
Q+ ← D+ · Q+ where D+ = 1 − φ+
Q- ← D- · Q- where D- = 1 − φ-
This operation is applied globally — to every state-action pair, not just the one visited. This is the key difference from standard Q-learning. Every step, the entire Q-table "forgets" a small fraction of all accumulated knowledge. The decay is multiplicative (exponential decay over time), meaning the influence of an experience at time t decays as (D+)^{t_current - t}.
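To make the exponential decay concrete, here is a minimal NumPy sketch of the global decay step and the memory half-life it implies. The table shapes, the function names `decay` and `half_life`, and the starting value of 1.0 are illustrative assumptions rather than the repository's code; the two forgetting rates are the optimum reported for α = 0.9.

```python
import numpy as np

N_STATES, N_ACTIONS = 64, 4          # 8x8 grid, four moves
PHI_PLUS, PHI_MINUS = 0.183, 0.089   # forgetting rates reported for alpha = 0.9


def decay(q_plus, q_minus, phi_plus, phi_minus):
    """Scale every entry of both tables by its retention factor D = 1 - phi."""
    return (1.0 - phi_plus) * q_plus, (1.0 - phi_minus) * q_minus


def half_life(phi):
    """Steps until an un-refreshed value halves: solve (1 - phi)**n = 0.5."""
    return np.log(0.5) / np.log(1.0 - phi)


# Start both tables at 1.0 (arbitrary) and apply decay only, with no TD updates.
q_plus = np.ones((N_STATES, N_ACTIONS))
q_minus = np.ones((N_STATES, N_ACTIONS))
for _ in range(20):                  # one relocation period
    q_plus, q_minus = decay(q_plus, q_minus, PHI_PLUS, PHI_MINUS)

print(f"Q+ after 20 steps: {q_plus[0, 0]:.4f}")
print(f"Q- after 20 steps: {q_minus[0, 0]:.4f}")
print(f"half-life Q+: {half_life(PHI_PLUS):.2f} steps")
print(f"half-life Q-: {half_life(PHI_MINUS):.2f} steps")
```

With these rates, an entry that receives no TD refresh halves in roughly 3.4 steps on the reward channel and roughly 7.4 steps on the punishment channel, both well inside the 20-step relocation period.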
The agent computes the net action-value for the current state s:
Q_net(s, a) = Q+(s, a) − Q-(s, a) for all a ∈ A(s)
Valid actions are those that don't move the agent off the grid. Actions are selected via the Boltzmann (softmax) distribution:
exp( (Q_net(s,a) − max_b Q_net(s,b)) / τ )
π(a|s) = ─────────────────────────────────────────────
Σ_b exp( (Q_net(s,b) − max_c Q_net(s,c)) / τ )
The subtraction of the maximum value prevents numerical overflow without changing the probability distribution. Temperature τ = 0.1 produces a strongly exploitative policy while still allowing occasional exploration of suboptimal actions.
At high τ → uniform random policy (pure exploration). At low τ → greedy policy (pure exploitation). At τ = 0.1 → high-exploitation with controlled stochastic exploration.
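The action-selection rule can be sketched as follows. This is a hedged illustration, not the repository's implementation: the action encoding in `ACTIONS`, the `(row, col)` state convention, and the helper names `valid_actions` and `boltzmann_probs` are all assumptions made for the example.

```python
import numpy as np

# Assumed action encoding as (row, col) deltas; the repository may differ.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # Up, Down, Left, Right


def valid_actions(state, grid_size=8):
    """Return the actions that keep the agent on the grid."""
    row, col = state
    return [a for a, (dr, dc) in ACTIONS.items()
            if 0 <= row + dr < grid_size and 0 <= col + dc < grid_size]


def boltzmann_probs(q_net_row, actions, tau=0.1):
    """Softmax over the given actions, with the max subtracted for stability."""
    prefs = np.asarray([q_net_row[a] for a in actions], dtype=float)
    logits = (prefs - prefs.max()) / tau
    weights = np.exp(logits)
    return weights / weights.sum()


rng = np.random.default_rng(0)
q_net_row = np.array([0.0, 0.3, -0.2, 0.5])   # Q+(s, .) - Q-(s, .) for one state
acts = valid_actions((0, 0))                  # top-left corner: Down and Right only
probs = boltzmann_probs(q_net_row, acts)
action = int(rng.choice(acts, p=probs))
print(acts, probs.round(3), action)
```

Because the maximum is subtracted before exponentiation, the largest exponent is always 0, so very large Q_net values cannot overflow even at τ = 0.1.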
The selected action a is executed. The environment returns:
- s': next state
- r+: positive reinforcement signal (from reward map + Gaussian noise)
- r-: negative reinforcement signal (from punishment map + Gaussian noise)
Two separate temporal-difference errors are computed:
δ+ = r+ + γ · max_{a'} Q+(s', a') − Q+(s, a)
δ- = r- + γ · max_{a'} Q-(s', a') − Q-(s, a)
Q-values are then updated by moving a fraction α toward the target:
Q+(s, a) ← Q+(s, a) + α · δ+
Q-(s, a) ← Q-(s, a) + α · δ-
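Combining the TD update at step t with the global decay that follows it at the start of step t+1, the net change to the visited pair (s, a) is:
Q+_{t+1}(s,a) = D+ · (Q+_t(s,a) + α·δ+_t)
Q-_{t+1}(s,a) = D- · (Q-_t(s,a) + α·δ-_t)
Or equivalently, using the φ notation from the dissertation (D = 1 − φ):
Q+_{t+1}(s,a) = Q+_t(s,a) + α·δ+_t − φ+ · (Q+_t(s,a) + α·δ+_t)
Q-_{t+1}(s,a) = Q-_t(s,a) + α·δ-_t − φ- · (Q-_t(s,a) + α·δ-_t)
All other state-action pairs receive only the decay: Q_{t+1} = D · Q_t.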
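The per-step arithmetic can be checked numerically with a short sketch. It assumes flat integer state indices, a max over all four actions in the next state, and the helper names `td_update` and `dual_step`, none of which are taken from the repository. The TD update is applied first and the global decay second, which matches the combined form above when the decay at the start of the next loop iteration is counted with the current step.

```python
import numpy as np

GAMMA = 0.95


def td_update(q, s, a, r, s_next, alpha, gamma=GAMMA):
    """One Q-learning update on a single channel; returns the TD error."""
    delta = r + gamma * q[s_next].max() - q[s, a]
    q[s, a] += alpha * delta
    return delta


def dual_step(q_plus, q_minus, s, a, r_plus, r_minus, s_next,
              alpha, phi_plus, phi_minus):
    """TD-update both channels at (s, a), then decay both tables in place."""
    td_update(q_plus, s, a, r_plus, s_next, alpha)
    td_update(q_minus, s, a, r_minus, s_next, alpha)
    q_plus *= 1.0 - phi_plus
    q_minus *= 1.0 - phi_minus


rng = np.random.default_rng(1)
q_plus = rng.random((64, 4))
q_minus = rng.random((64, 4))
before = q_plus.copy()

s, a, s_next = 10, 3, 11              # arbitrary example transition
alpha, phi_plus, phi_minus = 0.9, 0.183, 0.089
delta_plus = 1.0 + GAMMA * before[s_next].max() - before[s, a]

dual_step(q_plus, q_minus, s, a, 1.0, 0.0, s_next, alpha, phi_plus, phi_minus)

# Visited pair: Q + alpha*delta - phi*(Q + alpha*delta)
updated = before[s, a] + alpha * delta_plus
expected = updated - phi_plus * updated
print(np.isclose(q_plus[s, a], expected))
```

Every entry other than the visited pair ends up at exactly (1 − φ+) times its previous value, which is the global part of the rule.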
Every u = 20 steps, the environment resamples reward and punishment locations:
if t % u == 0:
    env.update_reinforcement()  # randomly relocate all 4 rewards and 4 punishments
This creates the non-stationarity that forces the agent to adapt, and makes forgetting genuinely useful.
─────────────────────────────────────────────────────────────────
ASYMMETRIC DUAL-Q LEARNING WITH GLOBAL MEMORY DECAY
─────────────────────────────────────────────────────────────────
Hyperparameters: α, φ+, φ-, γ=0.95, τ=0.1, u=20, T=2000
Initialise:
    Q+(s,a) ← 0 for all s ∈ S, a ∈ A
    Q-(s,a) ← 0 for all s ∈ S, a ∈ A
    s ← s_0 = (0,0)
for t = 1, 2, ..., T:
    // Step 1: Global memory decay (entire tables)
    Q+ ← (1 − φ+) · Q+
    Q- ← (1 − φ-) · Q-
    // Step 2: Non-stationarity injection (relocation every u steps, noise every step)
    if t mod u == 0:
        env.resample_reward_and_punishment_locations()
    env.apply_gaussian_noise_to_all_cells()
    // Step 3: Action selection (Boltzmann over valid actions)
    Q_net(s, ·) ← Q+(s, ·) − Q-(s, ·)
    a ← sample from softmax(Q_net(s, ·), τ) [valid actions only]
    // Step 4: Execute action, observe reinforcement signals
    s', r+, r- ← env.step(a)
    // Step 5: Compute TD errors (independent channels)
    δ+ ← r+ + γ · max_{a'} Q+(s', a') − Q+(s, a)
    δ- ← r- + γ · max_{a'} Q-(s', a') − Q-(s, a)
    // Step 6: Update Q-values (independent channels)
    Q+(s, a) ← Q+(s, a) + α · δ+
    Q-(s, a) ← Q-(s, a) + α · δ-
    s ← s'
─────────────────────────────────────────────────────────────────
A key design choice is that decay is applied globally at every step, not just to visited states. This means:
- Even states the agent hasn't visited recently lose their Q-values gradually.
- No explicit "context detection" is needed — the agent doesn't need to know when rewards relocated.
- The forgetting naturally scales with time-since-visit: frequently visited states are constantly refreshed by new TD updates, while rarely visited states decay toward zero.
This is analogous to the biological finding that memory decay is a continuous active process, not a triggered event.
A custom 8×8 stochastic grid world built from scratch on the Gymnasium API:
  0   1   2   3   4   5   6   7
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ A │   │ R │   │   │ P │   │   │  0
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │ R │   │   │   │  1
├───┼───┼───┼───┼───┼───┼───┼───┤
│ P │   │   │   │   │   │ R │   │  2
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │ P │  3
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │ P │   │   │   │   │   │  4
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │ R │   │   │   │   │   │   │  5
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │  6
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │  7
└───┴───┴───┴───┴───┴───┴───┴───┘
A = Agent (starts at (0,0), blue in renderer)
R = Reward cell (+1, green) — 4 cells total, non-overlapping with P
P = Punishment cell (-1, red) — 4 cells total, non-overlapping with R
(blank) = Neutral cell (grey, noise-only)
↻ Every 20 steps: ALL reward and punishment locations are re-randomised
| Property | Value |
|---|---|
| Grid dimensions | 8 × 8 |
| State space | \|S\| = 64 |
| Action space | \|A\| = 4 (Up, Down, Left, Right) |
| Boundary behaviour | Wall bounce (invalid actions ignored, agent stays put) |
| Starting position | (0, 0) (top-left corner) |
| Episode length | T = 2000 steps |
| Termination | None (no early termination) |
- Reward cells (4): Value = +1 + N(0, 0.2) per step (floored at 0)
- Punishment cells (4): Value = +1 + N(0, 0.2) per step (floored at 0), subtracted from net reward
- Neutral cells (56): A single draw n from N(0, 0.2) per step, delivered as a small reward n when n ≥ 0 or as a small punishment |n| when n < 0, never both simultaneously
- Rewards and punishments are persistent (collecting them does not remove them)
- Reward and punishment locations are guaranteed non-overlapping (sampled without replacement)
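One way to guarantee non-overlapping locations is to draw all eight cells in a single call without replacement and then split them. The sketch below assumes this approach and a hypothetical function name `resample_locations`; the repository's own relocation routine may be written differently.

```python
import numpy as np


def resample_locations(rng, grid_size=8, n_reward=4, n_punish=4):
    """Draw distinct cells for rewards and punishments (no overlap)."""
    flat = rng.choice(grid_size * grid_size, size=n_reward + n_punish, replace=False)
    rows, cols = np.unravel_index(flat, (grid_size, grid_size))
    cells = list(zip(rows.tolist(), cols.tolist()))
    return cells[:n_reward], cells[n_reward:]


rng = np.random.default_rng(42)
reward_cells, punish_cells = resample_locations(rng)
print("rewards:    ", reward_cells)
print("punishments:", punish_cells)
```

Sampling flat indices and unravelling them keeps the draw uniform over all 64 cells while making duplicates impossible by construction.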
At every step, a noise matrix N ∈ ℝ^{8×8} is drawn from N(0, 0.2) and applied:
for each cell (r, c):
    n = N(0, 0.2)
    if cell is reward: rewards[r,c] = max(0, 1 + n)
    elif cell is punishment: punishments[r,c] = max(0, 1 + n)
    elif n >= 0: rewards[r,c] = n # neutral positive
    else: punishments[r,c] = |n| # neutral negative
This ensures that even neutral cells produce small signals, forcing the agent to continuously update and making the environment genuinely stochastic.
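The per-cell rule can also be written in vectorised form. The sketch below is an assumption-laden illustration: it treats 0.2 as the standard deviation of the noise (the text does not say whether 0.2 is a standard deviation or a variance), represents cell types as boolean masks, and uses the hypothetical name `apply_noise` and an arbitrary cell layout.

```python
import numpy as np


def apply_noise(reward_mask, punish_mask, rng, sigma=0.2):
    """Return (rewards, punishments) maps for one step of the noise rule."""
    n = rng.normal(0.0, sigma, size=reward_mask.shape)
    neutral = ~(reward_mask | punish_mask)
    rewards = np.zeros_like(n)
    punishments = np.zeros_like(n)
    # Special cells: base value 1 plus noise, floored at 0.
    rewards[reward_mask] = np.maximum(0.0, 1.0 + n[reward_mask])
    punishments[punish_mask] = np.maximum(0.0, 1.0 + n[punish_mask])
    # Neutral cells: the sign of the draw decides which channel receives it.
    pos = neutral & (n >= 0)
    neg = neutral & (n < 0)
    rewards[pos] = n[pos]
    punishments[neg] = -n[neg]
    return rewards, punishments


rng = np.random.default_rng(0)
reward_mask = np.zeros((8, 8), dtype=bool)
punish_mask = np.zeros((8, 8), dtype=bool)
reward_mask[[0, 1, 2, 5], [2, 4, 6, 1]] = True     # arbitrary example layout
punish_mask[[0, 2, 3, 4], [5, 0, 7, 2]] = True
rewards, punishments = apply_noise(reward_mask, punish_mask, rng)
print(rewards.round(2))
print(punishments.round(2))
```

No cell carries both a positive reward and a positive punishment in the same step, which matches the "never both simultaneously" rule for neutral cells.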
The environment includes a Pygame renderer that visualises:
- Green cells (reward intensity encoded in brightness)
- Red cells (punishment intensity encoded in brightness)
- Blue cell (agent position)
- Cumulative reward counter in real-time
Setup: α ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, φ+, φ- ∈ [10⁻⁵, 10⁰] (log-spaced), 50 runs per configuration, 2000 steps per run.
Summary results across baselines:
| Strategy | φ+ | φ- | Avg. Cumulative Reward |
|---|---|---|---|
| No decay | 0 | 0 | ~50 |
| Symmetric decay | ≈ 0.16 | ≈ 0.16 (same as φ+) | ~100–120 |
| Asymmetric (optimal) | ≈ 0.18 | ≈ 0.09 | ~155–165 |
Contour maps of average cumulative reward over (log₁₀ φ−, log₁₀ φ+) for each of the 8 tested learning rates (α = 0.3 to 1.0). Warmer colours = higher reward. Each panel is averaged over N=50 runs × T=2000 steps. The high-reward band consistently sits where φ+ ≥ φ− (rewards forgotten faster than punishments), shifting upward as α increases.
Key findings from Experiment 1:
- At all tested learning rates, the optimal configurations always had φ+ ≥ φ-: rewards are forgotten at least as fast as punishments.
- At low learning rates (α = 0.3–0.4), performance was relatively insensitive to φ- when φ- < 10⁻², but highly sensitive to φ+.
- At high learning rates (α ≥ 0.8), a clear plateau of high reward emerged around log₁₀(φ+) ∈ [−1.05, −0.5] and log₁₀(φ-) ∈ [−1.32, −0.5].
- Performance collapsed near φ+ → 1 and φ- → 1: too much forgetting erases useful knowledge before it can be acted upon.
- Performance also degraded at φ+ ≈ 0, φ- ≈ 0 (no-decay baseline), where the agent overfits to outdated reward locations.
Setup: α ∈ {0.9, 1.0}, φ+, φ- ∈ [10⁻², 10⁰] (20 log-spaced values each), same averaging.
Zoomed-in contour maps at α=0.9 (left) and α=1.0 (right), restricted to φ+, φ− ∈ [10⁻², 1]. The diagonal ridge of high reward (yellow) is clearly visible where φ− < φ+ — confirming that slower punishment forgetting paired with moderate reward forgetting is the robust optimum at high learning rates.
| α | Optimal φ+ | Optimal φ- | Avg. Cumulative Reward |
|---|---|---|---|
| 0.9 | 0.183 | 0.089 | 164.8 |
| 1.0 | 0.234 | 0.144 | 163.2 |
Both optima satisfy φ+ > φ-. The high-reward band in the contour maps forms a diagonal ridge where φ- < φ+, which indicates that the asymmetry is a robust requirement rather than a single lucky point.
The grid search generates contour heatmaps over (log₁₀ φ-, log₁₀ φ+) space:
- Yellow/warm regions (high reward): moderate φ+ and smaller φ-
- Purple/cold regions (low reward): either extreme forgetting or no forgetting
- Top-right corner (φ+, φ- → 1): total collapse, the agent forgets everything too fast
- Bottom-left corner (φ+, φ- → 0): persistent memory, the agent exploits outdated locations
The optimal ridge is stable across α ∈ {0.9, 1.0}, validating the robustness of the asymmetric forgetting principle.
asymmetric-forgetting-rl/
│
├── environment.py # GridWorldEnv — custom Gymnasium env (8×8 grid, Pygame renderer, noise)
│ # QLearningAgent — Dual-Q agent with independent asymmetric decay
│
├── main.py # Entry point: run a single simulation with live dashboard
│
├── grid_search.py # Parallel grid search over (α, φ+, φ-) using ProcessPoolExecutor
│
├── utils.py # Visualisation utilities:
│ # draw_q_values_v3() Q+ or Q- heatmap with direction arrows
│ # draw_q_net_difference() Q_net = Q+ − Q− diverging map
│ # init_all_live_plots_v1() initialise live dashboard
│ # update_all_live_plots_v3() update dashboard each step
│ # plot_cumulative_and_reward_rate_with_slider()
│ # plot_grid_search_results() contour heatmaps over φ space
│ # save_best_params_to_file() persist best config to txt
│ # save_numbers_to_csv() persist grid search results
│
├── config.py # All fixed hyperparameters (GRID_SIZE, DISCOUNT_FACTOR, TEMPERATURE, etc.)
│
├── requirements.txt # Python dependencies
│
├── results/ # Experiment outputs
│ ├── grid_search_results_u20t01_0804_2003.csv Experiment 1 full results (800 configs × 50 runs)
│ ├── grid_search_results_u20t01_0804_2052.csv Experiment 2 refined results
│ ├── best_params_0804_2003.txt Best hyperparameter config from Experiment 1
│ ├── best_params_0804_2052.txt Best hyperparameter config from Experiment 2
│ └── run1-8/ Individual run visualisations
│
├── assets/ # Images used in this README
│ ├── live_training_dashboard.png Screenshot of the live training dashboard
│ ├── experiment1_grid_search_contours.png Contour maps from Experiment 1 (all α)
│ └── experiment2_refined_search_contours.png Contour maps from Experiment 2 (high α)
│
├── README.md # This file
├── LICENSE # Custom Source-Available License
└── .gitignore
Requirements: Python 3.11+
# Clone the repository
git clone https://github.com/alinjfz/asymmetric-forgetting-rl.git
cd asymmetric-forgetting-rl
# (Optional but recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Dependencies:
| Package | Purpose |
|---|---|
| gymnasium==1.1.1 | RL environment API |
| numpy | Array operations, Q-tables |
| matplotlib | Visualisation, heatmaps |
| pygame | Real-time grid world renderer |
| pandas | CSV I/O for grid search results |
| seaborn | Statistical plots |
| mpl-tools | Interactive matplotlib widgets |
python main.py
The live dashboard opens automatically during training:
Top row (left to right): cumulative reward curve, Pygame grid world renderer (green = rewards, red = punishments, blue = agent), greedy policy arrows derived from Q_net. Bottom row: Q+ value heatmap (blue intensity), Q− value heatmap (red intensity), Q_net = Q+ − Q− diverging map.
This runs the agent with the optimal hyperparameters discovered by the grid search:
LEARNING_RATE = 0.9
FORGET_REWARD = 0.183 # φ+ — reward forgetting rate
FORGET_PUNISH = 0.089 # φ- — punishment forgetting rate
The simulation runs for 2000 steps. A live visualisation dashboard opens automatically (see Live Visualisations below).
Edit main.py (or pass parameters programmatically):
from environment import GridWorldEnv, QLearningAgent
from main import run_simulation
rewards, agent, path, reward_map, punishment_map = run_simulation(
    LEARNING_RATE=0.7,
    FORGET_REWARD=0.05,
    FORGET_PUNISH=0.01,
    animate=False,   # track agent path
    render=True,     # show pygame window
    live_plot=True   # show matplotlib dashboard
)
To run without the Pygame window or the live dashboard:
from main import run_simulation
rewards, agent, path, reward_map, punishment_map = run_simulation(
    LEARNING_RATE=0.9,
    FORGET_REWARD=0.183,
    FORGET_PUNISH=0.089,
    animate=False,
    render=False,
    live_plot=False
)
print(f"Total reward: {sum(rewards):.2f}")
To launch the hyperparameter sweep:
python grid_search.py
This runs a parallel sweep (uses all available CPU cores). Progress is printed in real time:
Starting parallel grid search on 800 configurations...
[1/800] LR=0.900, FR=1.000e-02, FP=1.000e-02 → Avg Reward: 47.23
[2/800] LR=0.900, FR=1.389e-02, FP=1.000e-02 → Avg Reward: 89.11
...
[800/800] LR=1.000, FR=2.154e-01, FP=1.000e-02 → Avg Reward: 163.21
Grid Search completed in 847.32 seconds
Best Hyperparameter Configuration:
LEARNING_RATE: 9.000e-01
FORGET_REWARD: 1.833e-01
FORGET_PUNISH: 8.859e-02
Average Reward: 164.80
Results are saved to results/grid_search_results_<timestamp>.csv. Interactive contour heatmaps are displayed automatically.
When live_plot=True, a multi-panel Matplotlib dashboard updates in real time:
┌────────────────┬────────────────┐
│ Q+ Heatmap │ Q- Heatmap │ Blue arrows = high reward value
│ (blue tones) │ (red tones) │ Red arrows = high punishment value
│ per-state │ per-state │
├────────────────┼────────────────┤
│ Q_net = Q+−Q- │ Policy Arrows │ Q_net > 0 → agent prefers this state
│ (diverging │ (greedy │ Q_net < 0 → agent avoids this state
│ colormap) │ policy) │
├────────────────┴────────────────┤
│ Cumulative Reward Over Time │ Trend shows learning progress
├─────────────────────────────────┤
│ Reward Rate (sliding window)│ Detects adaptation after resets
└─────────────────────────────────┘
After the simulation completes, an interactive slider plot allows you to inspect the cumulative reward and reward rate at any window size.
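For reference, a windowed reward rate of this kind can be computed from the per-step rewards with a moving average. The sketch below is an assumption about how such a curve might be produced, not the code behind the slider plot; `reward_rate` is a hypothetical helper and the input values are toy numbers, not results.

```python
import numpy as np


def reward_rate(step_rewards, window):
    """Mean net reward per step over each full window of consecutive steps."""
    kernel = np.ones(window) / window
    return np.convolve(step_rewards, kernel, mode="valid")


step_rewards = np.array([0.0, 1.0, 1.0, -1.0, 0.0, 1.0])   # toy values, not real results
cumulative = np.cumsum(step_rewards)
rate = reward_rate(step_rewards, window=3)
print(cumulative)
print(rate)
```

A shorter window reacts faster to each relocation but is noisier; a longer window smooths the noise at the cost of blurring the dips that follow a relocation.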
The Pygame renderer (when render=True) shows the grid world in real time with:
- Colour-coded cells (green = reward, red = punishment, intensity = magnitude)
- Agent position (blue square)
- Live cumulative reward counter
The grid search (grid_search.py) explores the joint effect of three hyperparameters:
| Parameter | Symbol | Search Range | Spacing |
|---|---|---|---|
| Learning rate | α | {0.9, 1.0} | Discrete |
| Reward forgetting rate | φ+ | [10⁻², 10⁰] | 20 log-spaced values |
| Punishment forgetting rate | φ- | [10⁻², 10⁰] | 20 log-spaced values |
Total configurations: 2 × 20 × 20 = 800. Each averaged over 50 runs = 40,000 individual simulations.
Results are saved as CSV with columns [learning_rate, forget_reward, forget_punish, avg_reward] and can be reloaded for further analysis.
The plot_grid_search_results() utility generates contour heatmaps for each α value, plotting average reward over (log₁₀ φ-, log₁₀ φ+) space.
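The search space described above can be enumerated in a few lines. This sketch assumes the log-spaced values are generated with `np.logspace(-2, 0, 20)` and combined with `itertools.product`; the variable names are illustrative and `grid_search.py` may build its grid differently.

```python
from itertools import product

import numpy as np

learning_rates = [0.9, 1.0]
forget_values = np.logspace(-2, 0, 20)      # 20 log-spaced values in [1e-2, 1]

# Every (alpha, phi+, phi-) combination: 2 * 20 * 20 configurations.
configs = list(product(learning_rates, forget_values, forget_values))
print(len(configs))
print(f"{forget_values[12]:.3e}, {forget_values[9]:.3e}")
```

Under this assumption, grid points 12 and 9 come out at about 1.833e-01 and 8.859e-02, which coincide with the best FORGET_REWARD and FORGET_PUNISH values reported for α = 0.9.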
| Parameter | Symbol | Default | Config Key | Description |
|---|---|---|---|---|
| Grid size | n/a | 8 | GRID_SIZE | Grid dimensions (8×8 = 64 states) |
| Episode length | T | 2000 | MAX_STEPS | Steps per simulation run |
| Discount factor | γ | 0.95 | DISCOUNT_FACTOR | Future reward weighting |
| Softmax temperature | τ | 0.1 | TEMPERATURE | Action selection entropy |
| Reward update frequency | u | 20 | UPDATE_SCHEDULE | Steps between reward relocations |
| MC averaging runs | N | 50 | NUM_RUNS | Runs averaged per config |
| Learning rate | α | 0.9 | LEARNING_RATE (main.py) | TD update step size |
| Reward forgetting rate | φ+ | 0.183 | FORGET_REWARD (main.py) | Per-step Q+ decay |
| Punishment forgetting rate | φ- | 0.089 | FORGET_PUNISH (main.py) | Per-step Q- decay |
All fixed hyperparameters live in config.py. Tunable parameters (LEARNING_RATE, FORGET_REWARD, FORGET_PUNISH) are set in main.py or passed to run_simulation().
This work sits at the intersection of three research traditions:
| Work | Relevance |
|---|---|
| Brea et al. (2014) | Normative theory of forgetting from Drosophila; active, asymmetric forgetting is optimal in changing environments |
| Shuai et al. (2010) | Rac1 protein actively drives forgetting; aversive memories decay faster than appetitive ones |
| Richards & Frankland (2017) | Persistence-transience framework; forgetting enhances generalisation |
| Bennett, Philippides & Nowotny (2021) | Drosophila mushroom body circuits implement separate appetitive/aversive dopamine pathways |
| Kato & Morita (2016) | Dopamine signals link value decay to sustained motivation; slow and fast timescales |
| Work | Relationship |
|---|---|
| Watkins & Dayan (1992) | Q-learning baseline; this work extends it with dual channels and decay |
| Elfwing & Seymour (2017) — MaxPain | Also uses Q+ and Q-; but no forgetting — values persist indefinitely. This work adds decay to address overfitting |
| Lin, Bouneffouf & Cecchi (2019) — Split Q-learning | Dual-stream Q-learning with asymmetric learning rates; this work uses asymmetric decay rates instead, which affects all states globally |
| Kandroodi et al. (2021) | Asymmetric learning rates (α+ ≠ α-); affects only visited states on update. This work's decay affects all states continuously |
| Kantasewi et al. (2019) — MQQL | Multi Q-table with sub-reward forgetting; periodic resets of sub-tables. This work uses continuous smooth decay |
| Fuchida et al. (2010) | Modified TD target with absolute-value operator to better handle negative Q-values |
| Work | Connection |
|---|---|
| Tsuruhara & Ito (2024) | FRIT-based adaptive control with forgetting factors; forgetting as leaky integration |
| Soleimani et al. (2025) | RL-PID with forgetting-factor iterative learning control |
Key distinction from asymmetric learning rates (Kandroodi et al., 2021):
Asymmetric learning rates (α+ ≠ α-) only change how strongly new experiences update the Q-values of visited state-action pairs. Asymmetric decay rates (φ+ ≠ φ-) change how quickly all accumulated knowledge fades across the entire state-action space, including states not recently visited. This is a global, continuous process, analogous to the biological finding that forgetting is an ongoing active mechanism, not a triggered event.
When reward locations reset every 20 steps, an agent with high φ+ quickly loses confidence in the old reward locations. This increases the TD error for those states when the agent next visits them, which drives faster learning of the new locations. The net effect is faster adaptation and higher cumulative return.
Punishment cells are also relocated, but the cost of stumbling into one repeatedly is high. A small φ- means the agent retains danger information longer — even after resampling, the memory of "this area was dangerous" persists, biasing the agent away from risky regions until it has evidence they are now safe. This avoidance memory prevents costly repeated punishments.
Fruit fly experiments show that aversive memories decay faster than appetitive ones, which is the opposite of what this work finds optimal (slow punishment forgetting). This divergence is explained by the noise structure of the environment: neutral cells randomly produce punishment-like signals, making it hard to distinguish real dangers from false alarms. Retaining punishment memory counteracts this noise. When noise is reduced in the simulation, optimal parameters shift back toward the biological prediction, confirming that the divergence is task-specific, not a failure of the biological analogy.
The global multiplicative decay Q ← (1-φ) · Q is mathematically equivalent to a leaky integrator — a well-known component in PID and feedback control systems. The balance between φ+ and φ- tunes the plasticity-stability trade-off: high φ+ makes reward predictions plastic (adapt fast), low φ- makes punishment predictions stable (don't overwrite hard-won safety knowledge).
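The leaky-integrator reading can be illustrated in isolation. The sketch below replaces the TD update with a constant input of 1 per step, which is a deliberate simplification, and uses a hypothetical helper `leaky_integrate`; it shows only how the decay term bounds the accumulated value.

```python
def leaky_integrate(inputs, phi, x0=0.0):
    """Apply x <- (1 - phi) * x + u for each input u and return the trajectory."""
    x, trajectory = x0, []
    for u in inputs:
        x = (1.0 - phi) * x + u
        trajectory.append(x)
    return trajectory


phi = 0.183
trajectory = leaky_integrate([1.0] * 200, phi)
steady_state = 1.0 / phi        # fixed point of x = (1 - phi) * x + 1
print(trajectory[-1], steady_state)
```

The trajectory settles at input / φ, so a larger forgetting rate gives both a lower ceiling and a faster response to a change in input, which is the plasticity side of the trade-off described above.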
- Deep RL extension: Replace tabular Q-tables with neural networks (DQN + asymmetric replay buffer decay)
- Adaptive forgetting: Meta-learn φ+ and φ- as a function of detected environmental volatility (change-point detection)
- RL-PID hybrid: Embed forgetting as a derivative term in a PID-style controller on the TD error signal
- QF-Tuner integration: Replace manual grid search with the FOX optimisation framework (Jumaah et al., 2025) for faster and more principled hyperparameter optimisation
- Richer environments: Validate on MiniGrid, Atari non-stationary wrappers, or continuous-state environments with function approximation
- State-dependent forgetting: Allow φ to vary per state based on visit frequency or recency, rather than global uniform decay
The complete academic dissertation behind this repository — including the full literature review, formal proofs, extended results analysis, and all figures — is available on request.
I have chosen not to publish the PDF publicly at this stage, as the work has not yet undergone formal peer review and I want to ensure proper attribution before wider distribution.
If you are a researcher, recruiter, or fellow student and would like access to the full paper, please reach out:
- GitHub: @alinjfz
- Open an issue on this repository with the subject "Dissertation Request"
The abstract of the dissertation is reproduced in the Abstract section above.
If you build on this work, please cite it as:
@thesis{najafzadeh2025asymmetric,
title = {Asymmetric Forgetting in Dual-Q Reinforcement Learning:
From Biological Theories to Computational Implementation},
author = {Najafzadeh, Ali},
school = {University of Sussex, School of Engineering and Informatics},
year = {2025},
month = {August},
supervisor = {Dr. James Bennett},
type = {MSc Dissertation}
}
This project is released under a Custom Source-Available License; see LICENSE for the full terms.
Summary:
| Use | Allowed |
|---|---|
| Run the code as-is (any purpose, including commercial) | ✅ Yes — with attribution |
| Redistribute unmodified copies (courses, tutorials, papers) | ✅ Yes — with attribution |
| Research and academic use | ✅ Yes — with citation |
| Modify, adapt, or build derivatives | ❌ Not without written permission |
| Sublicense or sell under different terms | ❌ No |
Attribution is always required: credit Ali Najafzadeh, link to this repository, and cite the dissertation in academic work. To request permission for derivative use or collaboration, open an issue or contact via GitHub.
Built with curiosity, rigour, and a deep belief that the brain's "design flaws" are actually its greatest features.
Ali Najafzadeh · University of Sussex · School of Engineering and Informatics · 2025


