bayantuffaha/human-feedback-rl
Human Feedback in Reinforcement Learning

Evaluating Feedback Strategy in Interactive RL (TAMER Framework)

This project investigates how different structures of human feedback influence learning behavior in reinforcement learning agents. Using the TAMER (Training an Agent Manually via Evaluative Reinforcement) framework, we systematically compare event-based, navigation-guided, and pattern-based feedback across two classic navigation environments.

Rather than focusing on reward magnitude or feedback granularity alone, this work shows that how and when feedback is delivered plays a far more important role in shaping agent behavior, learning speed, and stability.

Key Takeaways

  • Feedback structure matters more than numeric granularity.
  • Navigation-guided feedback leads to faster and more stable learning than purely event-based feedback.
  • Excessive or poorly timed feedback can destabilize learning.
  • Different feedback strategies shape how agents behave, not just whether they succeed.

Environments

Experiments were conducted in two discrete navigation environments from Gymnasium’s Toy Text suite.

Cliff Walking

  • Grid world with a hazardous cliff region
  • Goal: reach the terminal state without falling
  • Evaluates risk-aware navigation and path planning

Taxi

  • Structured pickup-and-dropoff task
  • Goal: pick up a passenger and deliver them to a target location
  • Evaluates long-horizon planning and illegal action avoidance

Methods

Learning Paradigms

  • TAMER agents trained exclusively on evaluative feedback
  • Q-learning baseline trained on environment rewards
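The key difference between the two paradigms is the learning target: TAMER regresses a model H(s, a) of the human's evaluative signal and acts greedily on it, with no bootstrapped value target. A minimal tabular sketch (class and parameter names are illustrative, not the project's actual code):

```python
import numpy as np

class TabularTamer:
    """Minimal tabular TAMER-style agent (illustrative sketch):
    learns a model H(s, a) of the human's evaluative feedback and
    acts greedily with respect to it."""

    def __init__(self, n_states, n_actions, lr=0.1, epsilon=0.1):
        self.H = np.zeros((n_states, n_actions))
        self.lr = lr            # step size for the feedback regression
        self.epsilon = epsilon  # small exploration rate

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.H.shape[1]))
        return int(np.argmax(self.H[state]))

    def update(self, state, action, feedback):
        # Regress H(s, a) toward the evaluative signal; unlike
        # Q-learning there is no bootstrapped next-state target.
        self.H[state, action] += self.lr * (feedback - self.H[state, action])
```

A Q-learning baseline differs only in its update rule, which bootstraps from the environment reward and the maximum value of the next state.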

Feedback Strategies

Five synthetic feedback functions were implemented:

  • Binary Dense (navigation-guided)
  • Binary Sparse (event-based)
  • Multilevel Dense (navigation-guided)
  • Multilevel Sparse (event-based)
  • Pattern-Based Feedback (behavioral pattern evaluation)

Dense feedback variants provide continuous, task-aligned guidance (e.g., safe path or corridor navigation), while sparse variants focus primarily on milestone or terminal events.
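To make the dense/sparse split concrete, the binary variants can be sketched roughly as follows (function names and the distance-to-goal shaping quantity are hypothetical, not the project's actual feedback functions):

```python
def binary_sparse_feedback(reached_goal, fell_off_cliff):
    """Event-based: a signal only at milestone or terminal events."""
    if reached_goal:
        return 1.0
    if fell_off_cliff:
        return -1.0
    return 0.0  # silent on ordinary steps

def binary_dense_feedback(prev_dist_to_goal, new_dist_to_goal):
    """Navigation-guided: every step is judged by whether it makes
    progress along the intended safe path. The distance measure is a
    hypothetical shaping quantity for illustration only."""
    return 1.0 if new_dist_to_goal < prev_dist_to_goal else -1.0
```

The multilevel variants would return graded values (e.g. larger magnitudes for terminal events than for intermediate progress) rather than a single binary level.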

Results

Learning Curves

Representative learning curves comparing feedback strategies:

Cliff Walking Learning Curves
Taxi Learning Curves

All curves are averaged over 20 seeds; shaded region is ±1 standard error. Across both environments, navigation-guided binary and multilevel feedback consistently achieved the fastest convergence and highest stability, outperforming sparse and pattern-based feedback in early learning.
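The seed-averaged curves and their ±1 standard-error band can be computed along these lines (a generic sketch, not the notebooks' plotting code):

```python
import numpy as np

def mean_and_stderr(curves):
    """curves: array of shape (n_seeds, n_episodes), one learning
    curve per seed. Returns the mean curve and the ±1 standard-error
    band drawn as the shaded region."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr
```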

Learned Policies (Qualitative Examples)

To complement the quantitative results, the following GIFs show learned agent behavior after training, highlighting how feedback structure shapes navigation strategies.

Cliff Walking

Binary Dense Feedback
The agent follows the safe upper path, closely matching human intuition about optimal risk-aware behavior.

Cliff Binary Multilevel

Pattern-Based Feedback
The agent reaches the goal using a less conventional trajectory, initially moving closer to the cliff before transitioning upward.
This reflects higher-level behavioral evaluation rather than direct navigation guidance.

Cliff Pattern Based

Taxi

Binary Dense Feedback
The agent successfully completes the task but exhibits oscillatory behavior when navigating toward the passenger and destination.

Taxi Binary

Pattern-Based Feedback
The agent follows a more direct path with reduced oscillation, although agents trained with this strategy learned less reliably across runs.

Taxi Pattern Based

Multilevel Dense Feedback
The agent successfully completes the task but exhibits oscillatory behavior when navigating toward the passenger and destination.

Taxi Multilevel

Multilevel Sparse Feedback
The agent repeatedly performs illegal drop-offs before eventually completing the task, illustrating how sparse milestone-based feedback can fail to guide effective navigation.

Taxi Multilevel Sparse

These examples illustrate that feedback strategy affects behavioral style, not just task completion.

How to Run

  1. Install dependencies: pip install -r requirements.txt
  2. Run the experiment notebooks:
    • experiments/cliffwalking_experiments.ipynb
    • experiments/taxi_experiments.ipynb

Each notebook runs all feedback strategies, evaluates performance across multiple seeds, and produces plots for comparative analysis.

Project Context

This project was completed as a final course project in Reinforcement Learning.

Collaboration:
This was a collaborative project. I designed and implemented the full experimental pipeline (agents, feedback functions, training loops, evaluation, and visualizations), and co-authored the project report.

A full technical report (co-authored) is available at report/final_report.pdf.
Note: the README is the most up-to-date high-level summary of the project and results.

Future Directions

Potential extensions include:

  • Evaluating feedback strategies with real human participants
  • Scaling TAMER agents to larger environments (e.g., Atari)
  • Studying transfer learning from simple to complex tasks
  • Exploring hybrid reward + feedback models

References

  • Knox, W. B., & Stone, P. (2009). Interactively Shaping Agents via Human Reward: The TAMER Framework.
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction.
