Asymmetric Forgetting in Dual-Q Reinforcement Learning

From Biological Theories to Computational Implementation

MSc Artificial Intelligence and Adaptive Systems · Dissertation
School of Engineering and Informatics · University of Sussex · August 2025
Supervised by Dr. James Bennett, Informatics Department

"The goal of memory is not perfect retention, but optimising future decisions — which makes forgetting just as important as remembering."

— Richards & Frankland, Neuron (2017)

Table of Contents

Abstract
Motivation — Why Forgetting?
Key Contributions
The Core Idea in 60 Seconds
Algorithm — Full Technical Details
Environment
Results
Project Structure
Installation
Usage
Live Visualisations
Hyperparameter Grid Search
Hyperparameter Reference
Connections to Prior Work
Discussion Highlights
Future Directions
Full Dissertation
Citation
License

Abstract

Brea et al. (2014) demonstrated that fast forgetting of punishment can be advantageous in a simple go–no-go task. This work tests that principle in a more complex, noisy, and dynamic environment. We present a model-free reinforcement learning method that integrates Q-learning with asymmetric decay of value estimates, inspired by biological evidence of active forgetting and differential retention of appetitive and aversive memories. The agent uses two independent value functions for positive and negative reinforcement. After each step, every entry of each table is decayed by that table's forgetting rate, so outdated experience fades without requiring explicit context detection. We evaluate the algorithm in a custom 2D grid-world environment with four rewarding and four punishing states whose locations are resampled every 20 steps. Gaussian noise is added to the cells, and the agent selects actions via a softmax policy over the net value Q_net = Q+ − Q−. A grid search over forgetting rates reveals that specific asymmetric settings, notably faster forgetting of rewards than of punishments in this environment, yield higher long-term reward accumulation. Results show that simple, biologically inspired forgetting mechanisms can be implemented efficiently in model-free RL to improve performance in noisy, changing environments.

Motivation — Why Forgetting?

Standard Q-learning assumes persistent memory: once a value is learned, it remains unchanged until it is explicitly overwritten by a new experience. This is fine for static environments, but most real-world problems are non-stationary — rewards move, dangers shift, and the optimal policy today may be disastrously wrong tomorrow.

Neuroscience tells us something striking: biological memory is actively and asymmetrically regulated. In Drosophila melanogaster (fruit flies), researchers found that:

Aversive (punishment) memories decay faster than appetitive (reward) memories under normal conditions (Shuai et al., 2010).
The protein Rac1 actively drives forgetting: blocking Rac1 makes flies remember punishment longer, while activating it causes them to forget faster (Shuai et al., 2010).
This is not a design flaw. It is computationally optimal: in a changing world, quickly forgetting a punishment allows the agent to re-explore that region once conditions change, potentially discovering new rewards.

Richards & Frankland (2017) formalised this insight: the purpose of memory is not perfect storage — it is optimising future decisions. Memory transience enhances generalisation and prevents overfitting to outdated experiences.

This project asks: can we transplant this biological principle into a reinforcement learning agent, and does it actually help?

The answer is yes.

Key Contributions

Novel Algorithm: Dual-Q learning with independent, asymmetric per-step forgetting rates for reward (φ+) and punishment (φ-) channels — designed, implemented, and validated from scratch.
Biologically Grounded Design: Every algorithmic choice is motivated by neuroscience. The decay mechanism mirrors Rac1-mediated forgetting (Shuai et al., 2010; Brea et al., 2014). The dual-channel architecture mirrors the appetitive/aversive separation in Drosophila mushroom body circuits (Bennett et al., 2021). The persistence-transience balance mirrors the framework of Richards & Frankland (2017).
Custom Gymnasium Environment: An 8×8 stochastic grid world with Gaussian noise, dynamic reward relocation every 20 steps, and separate reward/punishment maps — built entirely from scratch on the Gymnasium API.
Extensive Empirical Validation: 800 hyperparameter configurations per grid search × 50 independent Monte Carlo runs × 2000 steps each. Every reported reward is the average over the 50 runs of a configuration.
Parallel Grid Search: Fully parallelised hyperparameter sweep using Python's ProcessPoolExecutor — all CPU cores utilised.
Rich Visualisation Dashboard: Real-time live plots of Q+, Q−, Q_net, policy arrows, cumulative reward, and reward rate — all updating as the agent learns.
Clear, Reproducible Result: Asymmetric forgetting with φ+ > φ- (the optimal values depend on the learning rate) consistently achieves higher cumulative reward than the no-decay and symmetric-decay baselines across all tested learning rates.

The Core Idea in 60 Seconds

Standard Q-learning remembers everything forever. In a changing world, old information is worse than no information — it's actively misleading.

This agent maintains two separate value tables:

Table	Tracks	Forgetting Rate
`Q+`	Expected future rewards	`φ+` — moderate-to-high (forget reasonably fast)
`Q-`	Expected future punishments	`φ-` — lower than `φ+` (forget more slowly)

After every single step — regardless of which state was visited — both tables decay globally:

Q+ ← (1 − φ+) · Q+        # all reward memory fades slightly
Q- ← (1 − φ-) · Q-        # all punishment memory fades slightly, but slower

The agent then selects actions based on the net value Q_net = Q+ − Q−, using a Boltzmann (softmax) policy.

Why does asymmetry help?

High φ+: After rewards relocate, the agent's confidence in old reward locations fades quickly → it explores again → finds the new reward → higher return.
Low φ-: The agent retains danger information longer → avoids repeatedly blundering into punishments → fewer costly mistakes.

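Fruit flies forget asymmetrically too, although with the asymmetry reversed (see Divergence from Biological Predictions below). The same principle of asymmetric forgetting works in RL.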

Algorithm — Full Technical Details

Formal Problem Definition

The environment is modelled as a Markov Decision Process (MDP) defined by the tuple (S, A, P, r+, r−, γ):

S — finite set of states (|S| = 64 for the 8×8 grid)
A — finite set of actions (|A| = 4: Up, Down, Left, Right)
P(s'|s, a) — transition probability (deterministic movement with boundary constraints)
r+(s, a) ≥ 0 — positive reinforcement component (reward)
r−(s, a) ≥ 0 — negative reinforcement component (punishment)
γ ∈ [0,1] — discount factor (γ = 0.95)

The agent's objective is to maximise the expected discounted cumulative return:

E_π [ Σ_{t=0}^{∞} γ^t · (r+_t − r-_t) ]
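As a small worked instance of this objective, the sketch below evaluates the discounted net return of a finite trajectory. The function name `discounted_net_return` and the three-step signals are illustrative assumptions and do not come from the repository.

```python
def discounted_net_return(r_plus, r_minus, gamma=0.95):
    """Sum of gamma^t * (r+_t - r-_t) over a finite trajectory."""
    if len(r_plus) != len(r_minus):
        raise ValueError("both reinforcement streams need the same length")
    return sum(
        (gamma ** t) * (rp - rm)
        for t, (rp, rm) in enumerate(zip(r_plus, r_minus))
    )


# Hypothetical three-step trajectory: a reward, nothing, then a punishment.
example_return = discounted_net_return([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

With γ = 0.95 the punishment two steps later is weighted by 0.95² = 0.9025, so the net return is 1 − 0.9025 = 0.0975.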

Dual-Q Architecture

Two independent Q-tables are maintained:

Q+ ∈ ℝ^{|S|×|A|}    (reward predictions)
Q- ∈ ℝ^{|S|×|A|}    (punishment predictions)

Both initialised to zero. They are updated independently via separate TD streams, and they decay independently via separate forgetting rates.

Step 1 — Global Memory Decay

Before action selection and before any TD update, both tables are multiplied by their retention factors:

Q+ ← D+ · Q+     where D+ = 1 − φ+
Q- ← D- · Q-     where D- = 1 − φ-

This operation is applied globally — to every state-action pair, not just the one visited. This is the key difference from standard Q-learning. Every step, the entire Q-table "forgets" a small fraction of all accumulated knowledge. The decay is multiplicative (exponential decay over time), meaning the influence of an experience at time t decays as (D+)^{t_current - t}.
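A minimal sketch of this global decay, assuming the tables are stored as nested lists indexed as `table[state][action]`. The repository lists NumPy for its Q-tables, so its actual code most likely uses array multiplication instead. `decay_tables` and the toy values are hypothetical.

```python
def decay_tables(q_plus, q_minus, phi_plus, phi_minus):
    """Multiply every entry of both tables by its retention factor, in place."""
    d_plus, d_minus = 1.0 - phi_plus, 1.0 - phi_minus
    for row in q_plus:
        row[:] = [d_plus * v for v in row]
    for row in q_minus:
        row[:] = [d_minus * v for v in row]


# Tiny 2-state, 2-action tables with made-up starting values.
q_plus = [[1.0, 0.5], [0.0, 2.0]]
q_minus = [[1.0, 0.5], [0.0, 2.0]]
for _ in range(10):
    decay_tables(q_plus, q_minus, phi_plus=0.183, phi_minus=0.089)
```

After ten untouched steps, roughly 13% of a Q+ entry survives compared with roughly 39% of the matching Q- entry, which is the asymmetry the method relies on.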

Step 2 — Action Selection via Boltzmann Policy

The agent computes the net action-value for the current state s:

Q_net(s, a) = Q+(s, a) − Q-(s, a)    for all a ∈ A(s)

Valid actions are those that don't move the agent off the grid. Actions are selected via the Boltzmann (softmax) distribution:

         exp( (Q_net(s,a) − max_b Q_net(s,b)) / τ )
π(a|s) = ─────────────────────────────────────────────
         Σ_b exp( (Q_net(s,b) − max_c Q_net(s,c)) / τ )

The subtraction of the maximum value prevents numerical overflow without changing the probability distribution. Temperature τ = 0.1 produces a strongly exploitative policy while still allowing occasional exploration of suboptimal actions.

At high τ → uniform random policy (pure exploration). At low τ → greedy policy (pure exploitation). At τ = 0.1 → high-exploitation with controlled stochastic exploration.
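The policy above can be sketched in a few lines of standard-library Python. The action ordering in `ACTIONS` and all function names are assumptions made for illustration. Restricting the softmax to on-grid moves follows the description of valid actions above.

```python
import math
import random

# Action indices and their (row, col) displacements; the ordering is an assumption.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # Up, Down, Left, Right


def valid_actions(state, grid_size=8):
    """Actions that keep the agent on the grid."""
    r, c = state
    return [a for a, (dr, dc) in ACTIONS.items()
            if 0 <= r + dr < grid_size and 0 <= c + dc < grid_size]


def boltzmann_probs(q_net_row, actions, tau=0.1):
    """Softmax over the given actions, shifted by the max for numerical stability."""
    best = max(q_net_row[a] for a in actions)
    weights = [math.exp((q_net_row[a] - best) / tau) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_plus_row, q_minus_row, state, tau=0.1, rng=random):
    """Sample an action from the Boltzmann policy over Q_net = Q+ - Q-."""
    q_net_row = [p - m for p, m in zip(q_plus_row, q_minus_row)]
    actions = valid_actions(state)
    probs = boltzmann_probs(q_net_row, actions, tau)
    return rng.choices(actions, weights=probs, k=1)[0]


# In the top-left corner only Down and Right are valid.
corner_actions = valid_actions((0, 0))
corner_probs = boltzmann_probs([0.0, 0.2, 0.0, 0.1], corner_actions)
```

With τ = 0.1, a gap of 0.1 in Q_net already gives the better action about 73% of the probability mass, which illustrates how exploitative this temperature is.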

Step 3 — Environment Interaction

The selected action a is executed. The environment returns:

s' — next state
r+ — positive reinforcement signal (from reward map + Gaussian noise)
r- — negative reinforcement signal (from punishment map + Gaussian noise)

Step 4 — Independent TD Updates

Two separate temporal-difference errors are computed:

δ+ = r+ + γ · max_{a'} Q+(s', a') − Q+(s, a)
δ- = r- + γ · max_{a'} Q-(s', a') − Q-(s, a)

Q-values are then updated by moving a fraction α toward the target:

Q+(s, a) ← Q+(s, a) + α · δ+
Q-(s, a) ← Q-(s, a) + α · δ-
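The two updates can be written as one small function. This is a sketch under the assumption that states are integer indices into nested-list tables, and `td_update` is a hypothetical name. As in the formulas above, each channel bootstraps from the maximum of its own table.

```python
def td_update(q_plus, q_minus, s, a, r_plus, r_minus, s_next,
              alpha=0.9, gamma=0.95):
    """Apply one Q-learning update to each channel and return both TD errors."""
    delta_plus = r_plus + gamma * max(q_plus[s_next]) - q_plus[s][a]
    delta_minus = r_minus + gamma * max(q_minus[s_next]) - q_minus[s][a]
    q_plus[s][a] += alpha * delta_plus
    q_minus[s][a] += alpha * delta_minus
    return delta_plus, delta_minus


# Two states, two actions, all zeros; the signals are made-up values.
q_plus = [[0.0, 0.0], [0.0, 0.0]]
q_minus = [[0.0, 0.0], [0.0, 0.0]]
deltas = td_update(q_plus, q_minus, s=0, a=1, r_plus=1.0, r_minus=0.25, s_next=1)
```

Starting from empty tables, the TD errors equal the raw signals (1.0 and 0.25), and with α = 0.9 the visited entries move to 0.9 and 0.225 respectively.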

Combined Update (Compact Form)

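Combining the TD update with the decay that follows it at the start of the next step, the net change to the visited pair (s, a) is:

Q+_t(s,a) ← Q+_t(s,a) + α·δ+_t − φ+ · (Q+_t(s,a) + α·δ+_t)
Q-_t(s,a) ← Q-_t(s,a) + α·δ-_t − φ- · (Q-_t(s,a) + α·δ-_t)

Or equivalently, using the retention factors D+ = 1 − φ+ and D- = 1 − φ- from Step 1:

Q+_t(s,a) ← D+ · (Q+_t(s,a) + α·δ+_t)
Q-_t(s,a) ← D- · (Q-_t(s,a) + α·δ-_t)

All other state-action pairs receive only the decay, Q ← D · Q. Because both tables start at zero, applying the decay after the update rather than before it produces the same sequence of values seen by the policy.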

Step 5 — Non-Stationarity Injection

Every u = 20 steps, the environment resamples reward and punishment locations:

if t % u == 0:
    env.update_reinforcement()   # randomly relocate all 4 rewards and 4 punishments

This creates the non-stationarity that forces the agent to adapt — and makes forgetting genuinely useful.
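One way to realise the relocation is to sample eight distinct cells and split them, which guarantees that rewards and punishments never overlap. The sketch below is a hypothetical stand-in for `env.update_reinforcement()`, whose internals are not shown here. It assumes the agent's current cell receives no special treatment.

```python
import random


def resample_locations(grid_size=8, n_rewards=4, n_punishments=4, rng=random):
    """Draw distinct reward and punishment cells by sampling without replacement."""
    cells = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    chosen = rng.sample(cells, n_rewards + n_punishments)
    return set(chosen[:n_rewards]), set(chosen[n_rewards:])


rng = random.Random(0)  # seeded only so the sketch is repeatable
reward_cells, punishment_cells = resample_locations(rng=rng)
```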

Full Algorithm Pseudocode

─────────────────────────────────────────────────────────────────
  ASYMMETRIC DUAL-Q LEARNING WITH GLOBAL MEMORY DECAY
─────────────────────────────────────────────────────────────────
  Hyperparameters: α, φ+, φ-, γ=0.95, τ=0.1, u=20, T=2000

  Initialise:
    Q+(s,a) ← 0   for all s ∈ S, a ∈ A
    Q-(s,a) ← 0   for all s ∈ S, a ∈ A
    s ← s_0 = (0,0)

  for t = 1, 2, ..., T:

    // Step 1: Global memory decay (entire tables)
    Q+ ← (1 − φ+) · Q+
    Q- ← (1 − φ-) · Q-

    // Step 2: Non-stationarity injection and per-step noise
    if t mod u == 0:
        env.resample_reward_and_punishment_locations()
    env.apply_gaussian_noise_to_all_cells()    // drawn at every step

    // Step 3: Action selection (Boltzmann over valid actions)
    Q_net(s, ·) ← Q+(s, ·) − Q-(s, ·)
    a ← sample from softmax(Q_net(s, ·), τ)  [valid actions only]

    // Step 4: Execute action, observe reinforcement signals
    s', r+, r- ← env.step(a)

    // Step 5: Compute TD errors (independent channels)
    δ+ ← r+ + γ · max_{a'} Q+(s', a') − Q+(s, a)
    δ- ← r- + γ · max_{a'} Q-(s', a') − Q-(s, a)

    // Step 6: Update Q-values (independent channels)
    Q+(s, a) ← Q+(s, a) + α · δ+
    Q-(s, a) ← Q-(s, a) + α · δ-

    s ← s'
─────────────────────────────────────────────────────────────────
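For readers who want to trace the loop end to end, here is a compact standard-library sketch of the pseudocode. It is not the repository's `QLearningAgent` or `GridWorldEnv`. `run_episode` is a hypothetical name, 0.2 is assumed to be the standard deviation of the noise, and noise is drawn only for the cell the agent enters rather than for the whole grid.

```python
import math
import random

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right


def run_episode(alpha=0.9, phi_plus=0.183, phi_minus=0.089, gamma=0.95,
                tau=0.1, u=20, steps=2000, size=8, sigma=0.2, seed=0):
    """Run one simplified episode and return the cumulative net reward."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    q_plus = {cell: [0.0] * 4 for cell in cells}
    q_minus = {cell: [0.0] * 4 for cell in cells}

    def resample():
        chosen = rng.sample(cells, 8)
        return set(chosen[:4]), set(chosen[4:])

    rewards, punishments = resample()
    s, total = (0, 0), 0.0

    for t in range(1, steps + 1):
        # Step 1: global decay of every entry in both tables.
        for cell in cells:
            q_plus[cell] = [(1 - phi_plus) * v for v in q_plus[cell]]
            q_minus[cell] = [(1 - phi_minus) * v for v in q_minus[cell]]

        # Step 2: relocate rewards and punishments every u steps.
        if t % u == 0:
            rewards, punishments = resample()

        # Step 3: Boltzmann action selection over valid moves.
        valid = [a for a, (dr, dc) in enumerate(MOVES)
                 if 0 <= s[0] + dr < size and 0 <= s[1] + dc < size]
        q_net = [q_plus[s][a] - q_minus[s][a] for a in valid]
        best = max(q_net)
        weights = [math.exp((q - best) / tau) for q in q_net]
        a = rng.choices(valid, weights=weights, k=1)[0]

        # Step 4: move and draw the noisy signals of the cell entered.
        s_next = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
        n = rng.gauss(0.0, sigma)
        if s_next in rewards:
            r_plus, r_minus = max(0.0, 1.0 + n), 0.0
        elif s_next in punishments:
            r_plus, r_minus = 0.0, max(0.0, 1.0 + n)
        elif n >= 0:
            r_plus, r_minus = n, 0.0
        else:
            r_plus, r_minus = 0.0, -n

        # Steps 5-6: independent TD errors and updates.
        d_plus = r_plus + gamma * max(q_plus[s_next]) - q_plus[s][a]
        d_minus = r_minus + gamma * max(q_minus[s_next]) - q_minus[s][a]
        q_plus[s][a] += alpha * d_plus
        q_minus[s][a] += alpha * d_minus

        total += r_plus - r_minus
        s = s_next
    return total


total_reward = run_episode(steps=200)
```

Because of these simplifications, the rewards this sketch produces should not be expected to match the figures reported in the Results section.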

Why Global Decay (Not State-Specific)?

A key design choice is that decay is applied globally at every step, not just to visited states. This means:

Even states the agent hasn't visited recently lose their Q-values gradually.
No explicit "context detection" is needed — the agent doesn't need to know when rewards relocated.
The forgetting naturally scales with time-since-visit: frequently visited states are constantly refreshed by new TD updates, while rarely visited states decay toward zero.

This is analogous to the biological finding that memory decay is a continuous active process, not a triggered event.
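To make "time-since-visit" concrete, the half-life of an unrefreshed value follows directly from the multiplicative decay: it is the number of steps n with (1 − φ)^n = 0.5. The helper below is an illustrative sketch and `half_life` is a hypothetical name.

```python
import math


def half_life(phi):
    """Number of decay steps after which an untouched Q-value has halved."""
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(1.0 - phi)


reward_half_life = half_life(0.183)      # phi+ reported for alpha = 0.9
punishment_half_life = half_life(0.089)  # phi- reported for alpha = 0.9
```

At these settings an unvisited reward estimate halves in roughly 3.4 steps and a punishment estimate in roughly 7.4 steps. Both are well inside the 20-step relocation interval.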

Environment

Overview

A custom 8×8 stochastic grid world built from scratch on the Gymnasium API:

  0   1   2   3   4   5   6   7
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ A │   │ R │   │   │ P │   │   │  0
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │ R │   │   │   │  1
├───┼───┼───┼───┼───┼───┼───┼───┤
│ P │   │   │   │   │   │ R │   │  2
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │ P │  3
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │ P │   │   │   │   │   │  4
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │ R │   │   │   │   │   │   │  5
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │  6
├───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │  7
└───┴───┴───┴───┴───┴───┴───┴───┘

  A = Agent (starts at (0,0), blue in renderer)
  R = Reward cell (+1, green)       — 4 cells total, non-overlapping with P
  P = Punishment cell (-1, red)     — 4 cells total, non-overlapping with R
  (blank) = Neutral cell (grey, noise-only)

  ↻ Every 20 steps: ALL reward and punishment locations are re-randomised

State and Action Space

Property	Value
Grid dimensions	8 × 8
State space	|S| = 64
Action space	|A| = 4 (Up, Down, Left, Right)
Boundary behaviour	Wall bounce (invalid actions ignored, agent stays put)
Starting position	`(0, 0)` (top-left corner)
Episode length	T = 2000 steps
Termination	None (no early termination)

Reward Structure

Reward cells (4): Value = +1 + N(0, 0.2) per step (floored at 0)
Punishment cells (4): Value = +1 + N(0, 0.2) per step (floored at 0), subtracted from net reward
Neutral cells (56): a sample n ~ N(0, 0.2) is drawn per step; the cell yields a small reward n if n ≥ 0 and a small punishment |n| otherwise, never both simultaneously
Rewards and punishments are persistent (collecting them does not remove them)
Reward and punishment locations are guaranteed non-overlapping (sampled without replacement)

Gaussian Noise

At every step, a noise matrix N ∈ ℝ^{8×8} is drawn from N(0, 0.2) and applied:

for each cell (r, c):
    n = N(0, 0.2)
    if cell is reward:      rewards[r,c] = max(0, 1 + n)
    elif cell is punishment: punishments[r,c] = max(0, 1 + n)
    elif n >= 0:             rewards[r,c] = n         # neutral positive
    else:                    punishments[r,c] = |n|    # neutral negative

This ensures even neutral cells occasionally produce small signals, forcing the agent to continuously update and making the environment genuinely stochastic.
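A runnable rendering of the rule above, using only the standard library. It is a sketch rather than the repository's implementation: `noisy_maps` is a hypothetical name and 0.2 is assumed to be the standard deviation of the Gaussian.

```python
import random


def noisy_maps(reward_cells, punishment_cells, grid_size=8, sigma=0.2, rng=random):
    """Build per-step reward and punishment maps following the rule above."""
    rewards = [[0.0] * grid_size for _ in range(grid_size)]
    punishments = [[0.0] * grid_size for _ in range(grid_size)]
    for r in range(grid_size):
        for c in range(grid_size):
            n = rng.gauss(0.0, sigma)
            if (r, c) in reward_cells:
                rewards[r][c] = max(0.0, 1.0 + n)
            elif (r, c) in punishment_cells:
                punishments[r][c] = max(0.0, 1.0 + n)
            elif n >= 0:
                rewards[r][c] = n        # neutral cell, small positive signal
            else:
                punishments[r][c] = -n   # neutral cell, small negative signal
    return rewards, punishments


rng = random.Random(1)  # seeded only so the sketch is repeatable
rewards, punishments = noisy_maps({(0, 2), (1, 4)}, {(0, 5), (2, 0)}, rng=rng)
```

Both maps stay non-negative and no cell is positive in both, which matches the "never both simultaneously" property stated under Reward Structure.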

Rendering

The environment includes a Pygame renderer that visualises:

Green cells (reward intensity encoded in brightness)
Red cells (punishment intensity encoded in brightness)
Blue cell (agent position)
Cumulative reward counter in real-time

Results

Experiment 1 — Broad Hyperparameter Search

Setup: α ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, φ+, φ- ∈ [10⁻⁵, 10⁰] (log-spaced), 50 runs per configuration, 2000 steps per run.

Summary results across baselines:

Strategy	`φ+`	`φ-`	Avg. Cumulative Reward
No decay	0	0	~50
Symmetric decay	`φ+ = φ-` ≈ 0.16	same	~100–120
Asymmetric (optimal)	≈ 0.18	≈ 0.09	~155–165

Contour maps of average cumulative reward over (log₁₀ φ−, log₁₀ φ+) for each of the 8 tested learning rates (α = 0.3 to 1.0). Warmer colours = higher reward. Each panel is averaged over N=50 runs × T=2000 steps. The high-reward band consistently sits where φ+ ≥ φ− (rewards forgotten faster than punishments), shifting upward as α increases.

Key findings from Experiment 1:

At all tested learning rates, the optimal configurations always had φ+ ≥ φ- — rewards are forgotten faster than punishments.
At low learning rates (α = 0.3–0.4), performance was relatively insensitive to φ- when φ- < 10⁻², but highly sensitive to φ+.
At high learning rates (α ≥ 0.8), a clear plateau of high reward emerged around log₁₀(φ+) ∈ [−1.05, −0.5] and log₁₀(φ-) ∈ [−1.32, −0.5].
Performance collapsed near φ+ → 1 and φ- → 1 — too much forgetting erases useful knowledge before it can be acted upon.
Performance also degraded at φ+ ≈ 0, φ- ≈ 0 (no decay baseline) — overfitting to outdated reward locations.

Experiment 2 — Refined Search at High Learning Rates

Setup: α ∈ {0.9, 1.0}, φ+, φ- ∈ [10⁻², 10⁰] (20 log-spaced values each), same averaging.

Zoomed-in contour maps at α=0.9 (left) and α=1.0 (right), restricted to φ+, φ− ∈ [10⁻², 1]. The diagonal ridge of high reward (yellow) is clearly visible where φ− < φ+ — confirming that slower punishment forgetting paired with moderate reward forgetting is the robust optimum at high learning rates.

α	Optimal `φ+`	Optimal `φ-`	Avg. Cumulative Reward
0.9	0.183	0.089	164.8
1.0	0.234	0.144	163.2

Both optima satisfy φ+ > φ-. The high-reward band in the contour maps forms a diagonal ridge where φ- < φ+, indicating that the asymmetry is a robust feature of the optimum rather than an isolated lucky point.

What the Contour Maps Show

The grid search generates contour heatmaps over (log₁₀ φ-, log₁₀ φ+) space:

Yellow/warm regions (high reward): moderate φ+ and smaller φ-
Purple/cold regions (low reward): either extreme forgetting or no forgetting
Top-right corner (φ+, φ- → 1): total collapse — agent forgets everything too fast
Bottom-left corner (φ+, φ- → 0): persistent memory — agent exploits outdated locations

The optimal ridge is stable across α ∈ {0.9, 1.0}, validating the robustness of the asymmetric forgetting principle.

Project Structure

asymmetric-forgetting-rl/
│
├── environment.py          # GridWorldEnv  — custom Gymnasium env (8×8 grid, Pygame renderer, noise)
│                           # QLearningAgent — Dual-Q agent with independent asymmetric decay
│
├── main.py                 # Entry point: run a single simulation with live dashboard
│
├── grid_search.py          # Parallel grid search over (α, φ+, φ-) using ProcessPoolExecutor
│
├── utils.py                # Visualisation utilities:
│                           #   draw_q_values_v3()                    Q+ or Q- heatmap with direction arrows
│                           #   draw_q_net_difference()               Q_net = Q+ − Q− diverging map
│                           #   init_all_live_plots_v1()              initialise live dashboard
│                           #   update_all_live_plots_v3()            update dashboard each step
│                           #   plot_cumulative_and_reward_rate_with_slider()
│                           #   plot_grid_search_results()            contour heatmaps over φ space
│                           #   save_best_params_to_file()            persist best config to txt
│                           #   save_numbers_to_csv()                 persist grid search results
│
├── config.py               # All fixed hyperparameters (GRID_SIZE, DISCOUNT_FACTOR, TEMPERATURE, etc.)
│
├── requirements.txt        # Python dependencies
│
├── results/                # Experiment outputs
│   ├── grid_search_results_u20t01_0804_2003.csv   Experiment 1 full results (800 configs × 50 runs)
│   ├── grid_search_results_u20t01_0804_2052.csv   Experiment 2 refined results
│   ├── best_params_0804_2003.txt                  Best hyperparameter config from Experiment 1
│   ├── best_params_0804_2052.txt                  Best hyperparameter config from Experiment 2
│   └── run1-8/                                    Individual run visualisations
│
├── assets/                 # Images used in this README
│   ├── live_training_dashboard.png          Screenshot of the live training dashboard
│   ├── experiment1_grid_search_contours.png Contour maps from Experiment 1 (all α)
│   └── experiment2_refined_search_contours.png Contour maps from Experiment 2 (high α)
│
├── README.md               # This file
├── LICENSE                 # Custom Source-Available License (see License below)
└── .gitignore

Installation

Requirements: Python 3.11+

# Clone the repository
git clone https://github.com/alinjfz/asymmetric-forgetting-rl.git
cd asymmetric-forgetting-rl

# (Optional but recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate          # macOS/Linux
# venv\Scripts\activate           # Windows

# Install dependencies
pip install -r requirements.txt

Dependencies:

Package	Purpose
`gymnasium==1.1.1`	RL environment API
`numpy`	Array operations, Q-tables
`matplotlib`	Visualisation, heatmaps
`pygame`	Real-time grid world renderer
`pandas`	CSV I/O for grid search results
`seaborn`	Statistical plots
`mpl-tools`	Interactive matplotlib widgets

Usage

Run a Single Simulation (Recommended Starting Point)

python main.py

The live dashboard opens automatically during training:

Top row (left to right): cumulative reward curve, Pygame grid world renderer (green = rewards, red = punishments, blue = agent), greedy policy arrows derived from Q_net. Bottom row: Q+ value heatmap (blue intensity), Q− value heatmap (red intensity), Q_net = Q+ − Q− diverging map.

This runs the agent with the optimal hyperparameters discovered by the grid search:

LEARNING_RATE  = 0.9
FORGET_REWARD  = 0.183   # φ+ — reward forgetting rate
FORGET_PUNISH  = 0.089   # φ- — punishment forgetting rate

The simulation runs for 2000 steps. The dashboard panels are described under Live Visualisations below.

Run with Custom Parameters

Edit main.py (or pass parameters programmatically):

from environment import GridWorldEnv, QLearningAgent
from main import run_simulation

rewards, agent, path, reward_map, punishment_map = run_simulation(
    LEARNING_RATE=0.7,
    FORGET_REWARD=0.05,
    FORGET_PUNISH=0.01,
    animate=False,   # track agent path
    render=True,     # show pygame window
    live_plot=True   # show matplotlib dashboard
)

Run Without Rendering (Faster, for Experiments)

from main import run_simulation

# Headless run: no Pygame window and no Matplotlib dashboard.
rewards, agent, path, reward_map, punishment_map = run_simulation(
    LEARNING_RATE=0.9,
    FORGET_REWARD=0.183,
    FORGET_PUNISH=0.089,
    animate=False,
    render=False,
    live_plot=False
)
print(f"Total reward: {sum(rewards):.2f}")

Run the Hyperparameter Grid Search

python grid_search.py

This runs a parallel sweep (uses all available CPU cores). Progress is printed in real time:

Starting parallel grid search on 800 configurations...

[1/800]  LR=0.900, FR=1.000e-02, FP=1.000e-02  → Avg Reward: 47.23
[2/800]  LR=0.900, FR=1.389e-02, FP=1.000e-02  → Avg Reward: 89.11
...
[800/800] LR=1.000, FR=2.154e-01, FP=1.000e-02  → Avg Reward: 163.21

Grid Search completed in 847.32 seconds

Best Hyperparameter Configuration:
  LEARNING_RATE:   9.000e-01
  FORGET_REWARD:   1.833e-01
  FORGET_PUNISH:   8.859e-02
  Average Reward:  164.80

Results are saved to results/grid_search_results_<timestamp>.csv. Interactive contour heatmaps are displayed automatically.

Live Visualisations

When live_plot=True, a multi-panel Matplotlib dashboard updates in real time:

┌────────────────┬────────────────┐
│   Q+ Heatmap   │   Q- Heatmap   │   Blue arrows = high reward value
│  (blue tones)  │  (red tones)   │   Red arrows = high punishment value
│  per-state     │  per-state     │
├────────────────┼────────────────┤
│  Q_net = Q+−Q- │  Policy Arrows │   Q_net > 0 → agent prefers this state
│  (diverging    │  (greedy       │   Q_net < 0 → agent avoids this state
│  colormap)     │  policy)       │
├────────────────┴────────────────┤
│     Cumulative Reward Over Time │   Trend shows learning progress
├─────────────────────────────────┤
│     Reward Rate (sliding window)│   Detects adaptation after resets
└─────────────────────────────────┘

After the simulation completes, an interactive slider plot allows you to inspect the cumulative reward and reward rate at any window size.
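The reward-rate panel is a sliding-window average of per-step net reward. The sketch below shows one way to compute such a curve. It is an assumption about the general technique, not the code inside `plot_cumulative_and_reward_rate_with_slider()`, and `reward_rate` is a hypothetical helper.

```python
from collections import deque


def reward_rate(step_rewards, window=20):
    """Mean net reward over the most recent `window` steps, for each step."""
    recent = deque(maxlen=window)
    running = 0.0
    rates = []
    for r in step_rewards:
        if len(recent) == window:
            running -= recent[0]  # value about to be evicted by append
        recent.append(r)
        running += r
        rates.append(running / len(recent))
    return rates


rates = reward_rate([1.0, 0.0, 0.0, 1.0, 1.0], window=2)
```

A window equal to the relocation interval (20 steps) would average over roughly one reward configuration at a time, which makes dips after each relocation easier to see.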

The Pygame renderer (when render=True) shows the grid world in real time with:

Colour-coded cells (green = reward, red = punishment, intensity = magnitude)
Agent position (blue square)
Live cumulative reward counter

Hyperparameter Grid Search

The grid search (grid_search.py) explores the joint effect of three hyperparameters:

Parameter	Symbol	Search Range	Spacing
Learning rate	α	`{0.9, 1.0}`	Discrete
Reward forgetting rate	φ+	`[10⁻², 10⁰]`	20 log-spaced values
Punishment forgetting rate	φ-	`[10⁻², 10⁰]`	20 log-spaced values

Total configurations: 2 × 20 × 20 = 800. Each averaged over 50 runs = 40,000 individual simulations.
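The configuration count can be reproduced with a short sketch. The helper `log_spaced` is hypothetical. It assumes both endpoints are included, in the manner of NumPy's `logspace`, which is consistent with the best values printed by the grid search above.

```python
from itertools import product


def log_spaced(low_exp, high_exp, count):
    """`count` values from 10**low_exp to 10**high_exp, evenly spaced in log10."""
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + i * step) for i in range(count)]


learning_rates = [0.9, 1.0]
forget_rates = log_spaced(-2, 0, 20)

# Every (alpha, phi+, phi-) combination in the refined search.
configs = list(product(learning_rates, forget_rates, forget_rates))
```

Under this assumption the grid values at indices 12 and 9 are approximately 0.1833 and 0.0886, matching the reported optimum for α = 0.9.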

Results are saved as CSV with columns [learning_rate, forget_reward, forget_punish, avg_reward] and can be reloaded for further analysis.

The plot_grid_search_results() utility generates contour heatmaps for each α value, plotting average reward over (log₁₀ φ-, log₁₀ φ+) space.
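Reloading a results file for further analysis needs nothing beyond the standard library. The sketch below assumes the CSV has a header row with the four column names listed above. The rows in `SAMPLE_CSV` are toy values, not experimental results, and `best_config` is a hypothetical helper.

```python
import csv
import io

# Toy excerpt in the column layout described above; the numbers are made up.
SAMPLE_CSV = """learning_rate,forget_reward,forget_punish,avg_reward
0.9,0.1,0.1,1.0
0.9,0.2,0.1,3.0
1.0,0.2,0.2,2.0
"""


def best_config(csv_text):
    """Return the row (as a dict of floats) with the highest avg_reward."""
    rows = [
        {key: float(value) for key, value in row.items()}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    return max(rows, key=lambda row: row["avg_reward"])


best = best_config(SAMPLE_CSV)
```

For a real file, the text of the CSV under results/ would be read from disk and passed in place of `SAMPLE_CSV`.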

Hyperparameter Reference

Parameter	Symbol	Default	Config Key	Description
Grid size	—	8	`GRID_SIZE`	Grid dimensions (8×8 = 64 states)
Episode length	T	2000	`MAX_STEPS`	Steps per simulation run
Discount factor	γ	0.95	`DISCOUNT_FACTOR`	Future reward weighting
Softmax temperature	τ	0.1	`TEMPERATURE`	Action selection entropy
Reward update frequency	u	20	`UPDATE_SCHEDULE`	Steps between reward relocations
MC averaging runs	N	50	`NUM_RUNS`	Runs averaged per config
Learning rate	α	0.9	`LEARNING_RATE` (main.py)	TD update step size
Reward forgetting rate	φ+	0.183	`FORGET_REWARD` (main.py)	Per-step Q+ decay
Punishment forgetting rate	φ-	0.089	`FORGET_PUNISH` (main.py)	Per-step Q- decay

All fixed hyperparameters live in config.py. Tunable parameters (LEARNING_RATE, FORGET_REWARD, FORGET_PUNISH) are set in main.py or passed to run_simulation().

Connections to Prior Work

This work sits at the intersection of three research traditions:

Neuroscience

Work	Relevance
Brea et al. (2014)	Normative theory of forgetting from Drosophila; active, asymmetric forgetting is optimal in changing environments
Shuai et al. (2010)	Rac1 protein actively drives forgetting; aversive memories decay faster than appetitive ones
Richards & Frankland (2017)	Persistence-transience framework; forgetting enhances generalisation
Bennett, Philippides & Nowotny (2021)	Drosophila mushroom body circuits implement separate appetitive/aversive dopamine pathways
Kato & Morita (2016)	Dopamine signals link value decay to sustained motivation; slow and fast timescales

Reinforcement Learning

Work	Relationship
Watkins & Dayan (1992)	Q-learning baseline; this work extends it with dual channels and decay
Elfwing & Seymour (2017) — MaxPain	Also uses Q+ and Q-; but no forgetting — values persist indefinitely. This work adds decay to address overfitting
Lin, Bouneffouf & Cecchi (2019) — Split Q-learning	Dual-stream Q-learning with asymmetric learning rates; this work uses asymmetric decay rates instead, which affects all states globally
Kandroodi et al. (2021)	Asymmetric learning rates (α+ ≠ α-); affects only visited states on update. This work's decay affects all states continuously
Kantasewi et al. (2019) — MQQL	Multi Q-table with sub-reward forgetting; periodic resets of sub-tables. This work uses continuous smooth decay
Fuchida et al. (2010)	Modified TD target with absolute-value operator to better handle negative Q-values

Control Theory

Work	Connection
Tsuruhara & Ito (2024)	FRIT-based adaptive control with forgetting factors; forgetting as leaky integration
Soleimani et al. (2025)	RL-PID with forgetting-factor iterative learning control

Key distinction from asymmetric learning rates (Kandroodi et al., 2021):

Asymmetric learning rates (α+ ≠ α-) only change how strongly new experiences update the Q-values of visited state-action pairs. Asymmetric decay rates (φ+ ≠ φ-) change how quickly all accumulated knowledge fades — across the entire state-action space, including states not recently visited. This is a global, continuous process, analogous to the biological finding that forgetting is an ongoing active mechanism, not a triggered event.

Discussion Highlights

Why Faster Reward Forgetting Helps

When reward locations reset every 20 steps, an agent with high φ+ quickly loses confidence in the old reward locations. As the stale Q+ values shrink toward zero they stop dominating Q_net, so the softmax policy is no longer pulled toward cells that have stopped paying out, and fresh reward observations elsewhere can take over after only a few TD updates. The net effect is faster adaptation and higher cumulative return.

Why Slower Punishment Forgetting Helps

Punishment cells are also relocated, but the cost of stumbling into one repeatedly is high. A small φ- means the agent retains danger information longer — even after resampling, the memory of "this area was dangerous" persists, biasing the agent away from risky regions until it has evidence they are now safe. This avoidance memory prevents costly repeated punishments.

Divergence from Biological Predictions

Fruit fly experiments show aversive memories decay faster than appetitive ones — the opposite of what this work finds optimal (slow punishment forgetting). This divergence is explained by the noise structure of the environment: neutral cells randomly produce punishment-like signals, making it noisy to distinguish real dangers from false alarms. Retaining punishment memory counteracts this noise. When noise is reduced in the simulation, optimal parameters shift back toward the biological prediction, confirming the divergence is task-specific, not a failure of the biological analogy.

Connection to Control Theory

The global multiplicative decay Q ← (1-φ) · Q is mathematically equivalent to a leaky integrator — a well-known component in PID and feedback control systems. The balance between φ+ and φ- tunes the plasticity-stability trade-off: high φ+ makes reward predictions plastic (adapt fast), low φ- makes punishment predictions stable (don't overwrite hard-won safety knowledge).
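The leaky-integrator reading can be illustrated with a scalar sketch. Here a constant additive drive `u` stands in for the α·δ term. That is a simplification made for illustration, since the real TD error depends on the current value. `leaky_integrate` is a hypothetical name.

```python
def leaky_integrate(inputs, phi):
    """Apply x <- (1 - phi) * x + u for each input u and return the trace."""
    x, trace = 0.0, []
    for u in inputs:
        x = (1.0 - phi) * x + u
        trace.append(x)
    return trace


# Constant drive followed by silence: the state saturates, then leaks away.
trace = leaky_integrate([1.0] * 200 + [0.0] * 20, phi=0.183)
```

Under constant drive the state saturates at u / φ (about 5.46 here). Once the drive stops, 20 steps of leak bring it back below 2% of that level.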

Future Directions

Deep RL extension — Replace tabular Q-tables with neural networks (DQN + asymmetric replay buffer decay)
Adaptive forgetting — Meta-learn φ+ and φ- as a function of detected environmental volatility (change-point detection)
RL-PID hybrid — Embed forgetting as a derivative term in a PID-style controller on the TD error signal
QF-Tuner integration — Replace manual grid search with the FOX optimisation framework (Jumaah et al., 2025) for faster and more principled hyperparameter optimisation
Richer environments — Validate on MiniGrid, Atari non-stationary wrappers, or continuous-state environments with function approximation
State-dependent forgetting — Allow φ to vary per state based on visit frequency or recency, rather than global uniform decay

Full Dissertation

The complete academic dissertation behind this repository — including the full literature review, formal proofs, extended results analysis, and all figures — is available on request.

I have chosen not to publish the PDF publicly at this stage, as the work has not yet undergone formal peer review and I want to ensure proper attribution before wider distribution.

If you are a researcher, recruiter, or fellow student and would like access to the full paper, please reach out:

GitHub: @alinjfz
Open an issue on this repository with the subject "Dissertation Request"

The abstract of the dissertation is reproduced in the Abstract section above.

Citation

If you build on this work, please cite it as:

@thesis{najafzadeh2025asymmetric,
  title     = {Asymmetric Forgetting in Dual-Q Reinforcement Learning:
               From Biological Theories to Computational Implementation},
  author    = {Najafzadeh, Ali},
  school    = {University of Sussex, School of Engineering and Informatics},
  year      = {2025},
  month     = {August},
  supervisor = {Dr. James Bennett},
  type      = {MSc Dissertation}
}

License

This project is released under a Custom Source-Available License — see LICENSE for the full terms.

Summary:

	Allowed
Run the code as-is (any purpose, including commercial)	✅ Yes — with attribution
Redistribute unmodified copies (courses, tutorials, papers)	✅ Yes — with attribution
Research and academic use	✅ Yes — with citation
Modify, adapt, or build derivatives	❌ Not without written permission
Sublicense or sell under different terms	❌ No

Attribution is always required: credit Ali Najafzadeh, link to this repository, and cite the dissertation in academic work. To request permission for derivative use or collaboration, open an issue or contact via GitHub.

Built with curiosity, rigour, and a deep belief that the brain's "design flaws" are actually its greatest features.

Ali Najafzadeh · University of Sussex · School of Engineering and Informatics · 2025
