Tiny, transparent experiments in Active Inference using pymdp and gymnasium.
Start from the simplest possible environments and agents, then add complexity step-by-step—so you can reason about each design choice and its cognitive implications.
Goal: a parsimonious, scientifically minded playground to study cognition, building up difficulty “evolutionarily” (fully observable → partially observable; single factor → multi-factor; single modality → multi-modal; deterministic → noisy; etc.).
Variational Free Energy starts with the following expression:

$$F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] = D_{KL}\big(q(s) \parallel p(s)\big) - \mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]$$
Let us extract the meaning of the Variational Free Energy.
The following expression is the Complexity term, which is the penalty for moving far from prior beliefs:

$$D_{KL}\big(q(s) \parallel p(s)\big)$$
And the following one is the Accuracy term, which tells us how well the model predicts current observations:

$$\mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]$$
Expected Free Energy starts with the following expression:

$$G(\pi) = -\,\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o \mid C)\big] \;-\; \mathbb{E}_{q(o \mid \pi)}\Big[D_{KL}\big(q(s \mid o, \pi) \parallel q(s \mid \pi)\big)\Big]$$
Let us extract the meaning from the Expected Free Energy expression.
The following term is the Extrinsic (utility) term:

$$-\,\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o \mid C)\big]$$

It measures how much the predicted outcomes $o$ deviate from the preferred outcomes encoded in $p(o \mid C)$, which comes from the $C$ matrix. If $p(o \mid C)$ is high for some outcomes, policies leading to those outcomes have lower risk. It is goal-directed: the instrumental part of planning.
The following term is the Epistemic term (state information gain):

$$\mathbb{E}_{q(o \mid \pi)}\Big[D_{KL}\big(q(s \mid o, \pi) \parallel q(s \mid \pi)\big)\Big]$$

It measures how much the agent expects to learn about the hidden states $s$ from future observations under policy $\pi$. It is high when predicted observations would strongly reduce uncertainty about the hidden causes of sensory input. It is curiosity-driven: the exploratory part of planning.
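As a concrete illustration, here is a minimal numpy sketch (not the pymdp implementation) that computes both terms for a toy one-step policy with two hidden states and two observations; all numbers below are arbitrary examples:

```python
import numpy as np

def kl(p, q, eps=1e-16):
    """KL divergence between two categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy one-step lookahead: 2 hidden states, 2 observations.
A = np.array([[0.9, 0.2],        # p(o=0 | s); columns sum to 1
              [0.1, 0.8]])       # p(o=1 | s)
qs = np.array([0.5, 0.5])        # predicted state belief under a policy
C = np.array([0.8, 0.2])         # preferred outcome distribution p(o | C)

qo = A @ qs                      # predicted observation distribution

# Extrinsic (risk): expected surprise of predicted outcomes under preferences
extrinsic = -float(qo @ np.log(C + 1e-16))

# Epistemic (info gain): expected KL from prior to posterior state beliefs
posteriors = [A[o] * qs / qo[o] for o in range(2)]
epistemic = sum(qo[o] * kl(posteriors[o], qs) for o in range(2))

G = extrinsic - epistemic        # expected free energy of this policy
print(f"extrinsic={extrinsic:.3f}, epistemic={epistemic:.3f}, G={G:.3f}")
```

Lowering $G$ thus trades off reaching preferred outcomes (extrinsic) against resolving uncertainty about hidden states (epistemic).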
- Minimal N×M GridWorld (deterministic walls, reward cell, punish cell)
- Text / graphical renderer (single persistent window, dynamic updates)
- Active Inference agent factory: (A, B, C, D) tailored to GridWorld
- Batch experiments & plots (random vs AIF; sequential & parallel)
- Live demo: watch random episodes then AIF episodes in one window
- Roadmap: POMDP variants, multi-factor control, multi-modal outcomes, learning
Active_Inference_for_Fun/Environments/
├─ gridworld_env.py # Gymnasium env: N×M grid, reward & punish tiles
├─ ai_agent_factory.py # build_gridworld_agent(): constructs A,B,C,D & Agent
├─ run_gridworld_stats.py # random baseline: many episodes, plots results
├─ run_gridworld_aif_vs_random.py # AIF vs Random, sequential or parallel (processes)
├─ run_gridworld_live_demo.py # live dynamic render: random then AIF episodes
└─ README.md
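For orientation, the generative model built by ai_agent_factory.py plausibly takes the following shape; all names, sizes, and preference values in this sketch are illustrative assumptions, not the repo's actual code:

```python
import numpy as np

rows, cols = 3, 3
n = rows * cols                      # one hidden-state factor: the grid cell
idx = lambda r, c: r * cols + c      # flatten (row, col) to a state index

# A: likelihood p(o | s). Fully observable, so observation = cell index.
A = np.eye(n)

# B: transitions p(s' | s, u) for up/down/left/right/stay, with walls
# modeled by clamping moves at the grid edges.
moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}
B = np.zeros((n, n, len(moves)))
for u, (dr, dc) in moves.items():
    for r in range(rows):
        for c in range(cols):
            r2 = min(max(r + dr, 0), rows - 1)
            c2 = min(max(c + dc, 0), cols - 1)
            B[idx(r2, c2), idx(r, c), u] = 1.0

# C: log-preferences over observations (values are arbitrary examples).
C = np.zeros(n)
C[idx(rows - 1, cols - 1)] = 2.0     # attractive reward cell
C[idx(0, cols - 1)] = -2.0           # aversive punish cell

# D: prior over the initial state (start in the top-left corner).
D = np.zeros(n)
D[idx(0, 0)] = 1.0
```

With arrays of this shape, a pymdp Agent can be constructed directly; the repo's factory presumably also handles parameterized grid sizes and tile positions.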
Python ≥ 3.9 recommended.
# (optional) create a fresh environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# core deps
pip install --upgrade pip
pip install numpy matplotlib gymnasium
# pymdp (pick one)
pip install pymdp # if available in your index
# or from source:
# pip install git+https://github.com/infer-actively/pymdp.git

If the pymdp API differs across versions, the code includes small compatibility shims (e.g., scalar vs list observations, sample_action() return types).
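Such shims usually amount to small normalizers like the following sketch (hypothetical helpers, not the repo's exact code): one for observations, one for sampled actions:

```python
import numpy as np

def as_obs_list(obs):
    """Some pymdp versions expect a list of observation indices (one per
    modality); others accept a bare int for a single modality.
    Normalize everything to a list of ints."""
    if isinstance(obs, (int, np.integer)):
        return [int(obs)]
    return list(obs)

def as_action_index(action, factor=0):
    """sample_action() may return a float, a 0-d array, or an array with
    one entry per control factor. Normalize to a plain int for the env."""
    arr = np.atleast_1d(np.asarray(action))
    return int(arr[factor])
```

Keeping these conversions in one place means the rest of the experiment scripts never need to care which pymdp version is installed.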
python run_gridworld_human.py
python run_gridworld_stats.py --episodes 2000
Episodes: 2000
Success rate: 0.429
Punish rate: 0.455
Timeout rate: 0.117
Avg return: -0.025
Avg steps: 91.95

python run_gridworld_aif_vs_random.py --episodes 1000 --workers 40
=== Summary (Random) ===
success_rate: 0.431
punish_rate: 0.455
timeout_rate: 0.114
avg_return: -0.024
avg_steps: 91.964
counts: {'reward': 431, 'timeout': 114, 'punish': 455}
=== Summary (AIF) ===
success_rate: 1.0
punish_rate: 0.0
timeout_rate: 0.0
avg_return: 1.0
avg_steps: 33.08
counts: {'reward': 1000}

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"
=== Summary (Random) ===
success_rate: 0.1875
punish_rate: 0.2635
timeout_rate: 0.549
avg_return: -0.076
avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}
=== Summary (AIF) ===
success_rate: 0.69
punish_rate: 0.0
timeout_rate: 0.31
avg_return: 0.69
avg_steps: 126.71
counts: {'timeout': 620, 'reward': 1380}

python run_gridworld_aif_vs_random.py --episodes 2000 --workers 60 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9" --policy-len 5
=== Summary (Random) ===
success_rate: 0.1875
punish_rate: 0.2635
timeout_rate: 0.549
avg_return: -0.076
avg_steps: 160.0365
counts: {'timeout': 1098, 'reward': 375, 'punish': 527}
=== Summary (AIF) ===
success_rate: 0.82
punish_rate: 0.0
timeout_rate: 0.18
avg_return: 0.82
avg_steps: 120.96
counts: {'reward': 1640, 'timeout': 360}
python run_gridworld_obs_noise.py --workers 10 --episodes 1000
=== Summary (Random) ===
success_rate: 0.408
punish_rate: 0.453
timeout_rate: 0.139
avg_return: -0.045
avg_steps: 94.594
counts: {'reward': 408, 'timeout': 139, 'punish': 453}
=== Summary (AIF, noisy A) ===
success_rate: 1.0
punish_rate: 0.0
timeout_rate: 0.0
avg_return: 1.0
avg_steps: 33.41
counts: {'reward': 1000}
belief_error_ratio (mean over episodes): 0.917

python run_gridworld_obs_noise.py --workers 20 --episodes 1000 --cols 10 --rows 10 --reward-pos "9, 9" --punish-pos "0, 9"
=== Summary (Random) ===
success_rate: 0.189
punish_rate: 0.259
timeout_rate: 0.552
avg_return: -0.07
avg_steps: 161.987
counts: {'timeout': 552, 'reward': 189, 'punish': 259}
=== Summary (AIF, noisy A) ===
success_rate: 0.52
punish_rate: 0.06
timeout_rate: 0.42
avg_return: 0.46
avg_steps: 153.3
counts: {'reward': 520, 'timeout': 420, 'punish': 60}
belief_error_ratio (mean over episodes): 0.985

python run_gridworld_live_demo.py --episodes-random 4 --episodes-aif 3 --fps 12 --seed 58457
[RANDOM] Episode 1: return=1.00, steps=184
[RANDOM] Episode 2: return=1.00, steps=31
[RANDOM] Episode 3: return=1.00, steps=24
[RANDOM] Episode 4: return=0.00, steps=200
[AIF] Episode 1: return=1.00, steps=11
[AIF] Episode 2: return=1.00, steps=10
[AIF] Episode 3: return=1.00, steps=15
This section analyzes how the four terms defined above vary as we run an episode of this very simple grid world. The main idea is to demonstrate conceptually what each of these four terms captures with regard to active inference.
As shown in the examples below, when we run an episode the Extrinsic (utility) term is the only value greater than 0 throughout the whole episode, while the other three are all 0.
python run_gridworld_live_metrics.py --fps 10
[Episode 1] return=1.00, steps=14
[Episode 2] return=1.00, steps=33
[Episode 3] return=1.00, steps=27
All episodes complete: [(1.0, 14), (1.0, 33), (1.0, 27)]

Why is this happening?
Short answer: that pattern is exactly what we should see in our current setup (fully observable, deterministic dynamics, sharp prior), so nothing’s “wrong.”
Let's unfold this in more detail.
- Fully observable A (≈ identity) & deterministic B. After each step we set the prior for the next step as $\text{prior}_s = B_u q_s$. The new observation is perfectly informative: the posterior $q(s)$ collapses to the true state, which equals the predicted state.
  - $\Rightarrow$ Complexity $D_{KL}(q(s) \parallel \text{prior}_s) = 0$ (posterior matches prior).
  - With $A \approx$ identity, $p(o_t \mid s^*) = 1$ at the true state $\rightarrow -\mathbb{E}_q[\ln p(o_t \mid s)] = 0$, so Accuracy is saturated.
  - Epistemic (1-step info gain) vanishes: for each possible $o$, $q(s \mid o) \approx q(s)$ (no uncertainty to reduce), so the expected $KL = 0$.
- Extrinsic > 0. We compute $\mathbb{E}_{q(o)}[-\ln p(o \mid C)]$. Unless our preferences $p(o \mid C)$ put probability $\sim 1$ on the actually observed outcome at every step, this expectation is positive. That's why our utility bar moves, while the others don't.
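This zero pattern is easy to verify numerically; the following standalone sketch (independent of pymdp) evaluates all three quantities for an identity $A$ and a delta-shaped prediction:

```python
import numpy as np

def kl(p, q, eps=1e-16):
    """KL divergence between two categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

eps = 1e-16
n = 4
A = np.eye(n)                         # identity likelihood (fully observable)
prior = np.zeros(n); prior[2] = 1.0   # deterministic B: prediction is a delta
o = 2                                 # the observation reveals the true state

post = A[o] * prior                   # exact posterior p(s|o) ∝ p(o|s) p(s)
post /= post.sum()

complexity = kl(post, prior)          # posterior matches prior
neg_accuracy = -float(np.sum(post * np.log(A[o] + eps)))  # p(o|s*) = 1

qo = A @ prior                        # predicted observations: also a delta
epistemic = sum(                      # expected info gain over possible o
    qo[i] * kl(A[i] * prior / (A[i] * prior).sum(), prior)
    for i in range(n) if qo[i] > 0)

print(complexity, neg_accuracy, epistemic)   # all ≈ 0
```

Replacing `np.eye(n)` with a noisy likelihood or the delta prior with a spread-out belief makes all three quantities positive, which is exactly what the experiments below demonstrate.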
If we want non-zero Complexity, Accuracy, Epistemic, we have to introduce mismatch or uncertainty:
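A simple way to introduce such uncertainty is to blend a stochastic matrix toward the uniform distribution; this is a plausible reading of what flags like --a-noise and --b-noise do (an assumption; check the scripts for the exact scheme):

```python
import numpy as np

def mix_with_uniform(M, level):
    """Blend a column-stochastic matrix with the uniform distribution.
    level=0.0 returns M unchanged; level=1.0 is maximally noisy."""
    uniform = np.ones_like(M, dtype=float) / M.shape[0]
    return (1.0 - level) * M + level * uniform

A_noisy = mix_with_uniform(np.eye(3), 0.2)
print(A_noisy[:, 0])   # the true state now leaks onto other observations
```

Because each column is still a proper distribution, the blended matrix can be dropped into the generative model without further normalization.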
Pure A-noise (--a-noise 0.2, so observations are noisy), see Accuracy and Epistemic move:
python run_gridworld_live_metrics.py --fps 10 --a-noise 0.2
[Episode 1] return=1.00, steps=24
[Episode 2] return=1.00, steps=16
[Episode 3] return=1.00, steps=10
All episodes complete: [(1.0, 24), (1.0, 16), (1.0, 10)]

Pure B-noise (no env slips), see Complexity move:
python run_gridworld_live_metrics.py --b-noise 0.2 --fps 10
[Episode 1] return=0.00, steps=200
[Episode 2] return=0.00, steps=200
[Episode 3] return=0.00, steps=200
All episodes complete: [(0.0, 200), (0.0, 200), (0.0, 200)]

Combine A-noise with B-noise to also see Accuracy and Epistemic move; let's also keep preferences gentle so exploration isn't crushed:
python run_gridworld_live_metrics.py --a-noise 0.5 --b-noise 0.5 --c-reward 0.1 --c-punish -0.1 --fps 12 --policy-len 6 --sophisticated --max-steps 10
[Episode 1] return=0.00, steps=10
[Episode 2] return=0.00, steps=10
[Episode 3] return=0.00, steps=10
All episodes complete: [(0.0, 10), (0.0, 10), (0.0, 10)]

Produce a Complexity spike by inserting a random starting position (--start-pos random), plus non-zero Accuracy and Epistemic during re-localization with noisy actions (--slip-prob 0.1):
python run_gridworld_live_metrics.py --start-pos random --fps 1 --slip-prob 0.1
[Episode 1] return=1.00, steps=8
[Episode 2] return=1.00, steps=42
[Episode 3] return=1.00, steps=42
All episodes complete: [(1.0, 8), (1.0, 42), (1.0, 42)]

You can also run some statistics in this regard:
python run_gridworld_metrics_stats.py --episodes 100 --workers 20 --a-noise 0.1 --b-noise 0.1 --c-reward 0.1 --c-punish -0.1 --max-steps 200 --policy-len 6 --sophisticated --cols 5 --rows 5 --reward-pos "4,4" --punish-pos "0,4"
=== Summary ===
success_rate: 0.0
punish_rate: 0.0
timeout_rate: 1.0
avg_return: 0.0
avg_steps: 200.0
counts: {'timeout': 100}
Per-episode means (overall): Complexity=0.098, Accuracy=0.103, Extrinsic=3.219, Epistemic=0.002

python run_nav3_live_demo.py --start-ori E --fps 12 --episodes 5 --rows 6 --cols 6 --reward-pos "5,5" --punish-pos "0,5" --policy-len 3
[Episode 1] return=1.00, steps=92
[Episode 2] return=1.00, steps=73
[Episode 3] return=0.00, steps=200
[Episode 4] return=1.00, steps=65
[Episode 5] return=1.00, steps=37









