bayantuffaha/human-feedback-rl
Human Feedback in Reinforcement Learning

Evaluating Feedback Strategy in Interactive RL (TAMER Framework)

This project investigates how different structures of human feedback influence learning behavior in reinforcement learning agents. Using the TAMER (Training an Agent Manually via Evaluative Reinforcement) framework, we systematically compare event-based, navigation-guided, and pattern-based feedback across two classic navigation environments.

Rather than focusing on reward magnitude or feedback granularity alone, this work shows that how and when feedback is delivered plays a far more important role in shaping agent behavior, learning speed, and stability.

Key Takeaways

  • Feedback structure matters more than numeric granularity.
  • Navigation-guided feedback leads to faster and more stable learning than purely event-based feedback.
  • Excessive or poorly timed feedback can destabilize learning.
  • Different feedback strategies shape how agents behave, not just whether they succeed.

Environments

Experiments were conducted in two discrete navigation environments from Gymnasium’s Toy Text suite.

Cliff Walking

  • Grid world with a hazardous cliff region
  • Goal: reach the terminal state without falling
  • Evaluates risk-aware navigation and path planning

Taxi

  • Structured pickup-and-dropoff task
  • Goal: pick up a passenger and deliver them to a target location
  • Evaluates long-horizon planning and illegal action avoidance

Methods

Learning Paradigms

  • TAMER agents trained exclusively on evaluative feedback
  • Q-learning baseline trained on environment rewards
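The key difference between the two paradigms is the learning target: TAMER regresses a model H(s, a) of the human's evaluative signal and acts greedily on it, with no bootstrapped value target. A minimal tabular sketch (class and parameter names are illustrative, not the project's actual code):

```python
import numpy as np

class TabularTamer:
    """Minimal tabular TAMER-style agent (illustrative sketch):
    learns a model H(s, a) of the human's evaluative feedback and
    acts greedily with respect to it."""

    def __init__(self, n_states, n_actions, lr=0.1, epsilon=0.1):
        self.H = np.zeros((n_states, n_actions))
        self.lr = lr            # step size for the feedback regression
        self.epsilon = epsilon  # small exploration rate

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.H.shape[1]))
        return int(np.argmax(self.H[state]))

    def update(self, state, action, feedback):
        # Regress H(s, a) toward the evaluative signal; unlike
        # Q-learning there is no bootstrapped next-state target.
        self.H[state, action] += self.lr * (feedback - self.H[state, action])
```

A Q-learning baseline differs only in its update rule, which bootstraps from the environment reward and the maximum value of the next state.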

Feedback Strategies

Five synthetic feedback functions were implemented:

  • Binary Dense (navigation-guided)
  • Binary Sparse (event-based)
  • Multilevel Dense (navigation-guided)
  • Multilevel Sparse (event-based)
  • Pattern-Based Feedback (behavioral pattern evaluation)

Dense feedback variants provide continuous, task-aligned guidance (e.g., safe path or corridor navigation), while sparse variants focus primarily on milestone or terminal events.
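To make the dense/sparse split concrete, the binary variants can be sketched roughly as follows (function names and the distance-to-goal shaping quantity are hypothetical, not the project's actual feedback functions):

```python
def binary_sparse_feedback(reached_goal, fell_off_cliff):
    """Event-based: a signal only at milestone or terminal events."""
    if reached_goal:
        return 1.0
    if fell_off_cliff:
        return -1.0
    return 0.0  # silent on ordinary steps

def binary_dense_feedback(prev_dist_to_goal, new_dist_to_goal):
    """Navigation-guided: every step is judged by whether it makes
    progress along the intended safe path. The distance measure is a
    hypothetical shaping quantity for illustration only."""
    return 1.0 if new_dist_to_goal < prev_dist_to_goal else -1.0
```

The multilevel variants would return graded values (e.g. larger magnitudes for terminal events than for intermediate progress) rather than a single binary level.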

Results

Learning Curves

Representative learning curves comparing feedback strategies:

Cliff Walking Learning Curves
Taxi Learning Curves

All curves are averaged over 20 seeds; shaded region is ±1 standard error. Across both environments, navigation-guided binary and multilevel feedback consistently achieved the fastest convergence and highest stability, outperforming sparse and pattern-based feedback in early learning.
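The seed-averaged curves and their ±1 standard-error band can be computed along these lines (a generic sketch, not the notebooks' plotting code):

```python
import numpy as np

def mean_and_stderr(curves):
    """curves: array of shape (n_seeds, n_episodes), one learning
    curve per seed. Returns the mean curve and the ±1 standard-error
    band drawn as the shaded region."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr
```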

Learned Policies (Qualitative Examples)

To complement the quantitative results, the following GIFs show learned agent behavior after training, highlighting how feedback structure shapes navigation strategies.

Cliff Walking

Binary Dense Feedback
The agent follows the safe upper path, closely matching human intuition about optimal risk-aware behavior.

Cliff Binary Multilevel

Pattern-Based Feedback
The agent reaches the goal using a less conventional trajectory, initially moving closer to the cliff before transitioning upward.
This reflects higher-level behavioral evaluation rather than direct navigation guidance.

Cliff Pattern Based

Taxi

Binary Dense Feedback
The agent successfully completes the task but exhibits oscillatory behavior when navigating toward the passenger and destination.

Taxi Binary

Pattern-Based Feedback
The agent follows a more direct path with reduced oscillation, although agents trained with this strategy learned less reliably across runs.

Taxi Pattern Based

Multilevel Dense Feedback
The agent successfully completes the task but exhibits oscillatory behavior when navigating toward the passenger and destination.

Taxi Multilevel

Multilevel Sparse Feedback
The agent repeatedly performs illegal drop-offs before eventually completing the task, illustrating how sparse milestone-based feedback can fail to guide effective navigation.

Taxi Multilevel Sparse

These examples illustrate that feedback strategy affects behavioral style, not just task completion.

How to Run

  1. Install dependencies: pip install -r requirements.txt
  2. Run the experiment notebooks:
    • experiments/cliffwalking_experiments.ipynb
    • experiments/taxi_experiments.ipynb

Each notebook runs all feedback strategies, evaluates performance across multiple seeds, and produces plots for comparative analysis.

Project Context

This project was completed as a final course project in Reinforcement Learning.

Collaboration:
This was a collaborative project. I designed and implemented the full experimental pipeline (agents, feedback functions, training loops, evaluation, and visualizations), and co-authored the project report.

A full technical report (co-authored) is available at report/final_report.pdf.
Note: the README is the most up-to-date high-level summary of the project and results.

Future Directions

Potential extensions include:

  • Evaluating feedback strategies with real human participants
  • Scaling TAMER agents to larger environments (e.g., Atari)
  • Studying transfer learning from simple to complex tasks
  • Exploring hybrid reward + feedback models

References

  • Knox, W. B., & Stone, P. (2009). Interactively Shaping Agents via Human Reward: The TAMER Framework.
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction.
