
Feature Request: Native Support for Recurrent Policies (LSTM/GRU) #639

@samindaa

Objective

It would be highly beneficial for Brax to have well-documented support for recurrent policies, such as those using LSTMs or GRUs.

Currently, the default policies and examples (e.g., for PPO and SAC) are built around feedforward multilayer perceptrons (MLPs). This architecture works well for tasks that are fully observable (Markov Decision Processes, or MDPs).

However, a vast number of challenging control and robotics problems are partially observable (POMDPs). In these cases, the optimal action depends not just on the current observation, but on a history of past observations. A feedforward policy, which is purely reactive, cannot solve these tasks effectively.

Describe the solution you'd like

We propose adding native support for recurrent policies within Brax's training frameworks. This would ideally include:

Policy Architecture: A simple way to define a recurrent policy (e.g., LSTMPolicy or GRUPolicy) that can be used with trainers like PPO.

State Management: Correctly handling the recurrent hidden state (e.g., h_t and c_t for an LSTM) during trajectory collection. The state from step t must be passed as an input to the policy at step t+1.

Training: Properly managing Backpropagation Through Time (BPTT) during the gradient update, including resetting the hidden state at episode boundaries. A sketch covering all three points follows below.
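
To make these three points concrete, here is a minimal sketch of what such a policy and rollout could look like. This is not Brax code: `GRUPolicy`, `unroll`, and all shapes are hypothetical, and it assumes a recent Flax version where `nn.GRUCell` takes a `features` argument.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class GRUPolicy(nn.Module):
    """Recurrent policy: (carry, observation) -> (new carry, action logits)."""
    hidden_size: int
    action_size: int

    @nn.compact
    def __call__(self, carry, obs):
        carry, x = nn.GRUCell(features=self.hidden_size)(carry, obs)
        logits = nn.Dense(self.action_size)(x)
        return carry, logits

    def initial_carry(self, batch_size):
        # Zeros are the conventional initial hidden state for a GRU.
        return jnp.zeros((batch_size, self.hidden_size))


def unroll(policy, params, init_carry, observations, dones):
    """Unroll the policy over a trajectory with jax.lax.scan.

    observations: [T, batch, obs_size]; dones: [T, batch] floats, 1.0 where
    the preceding step ended an episode. scan is differentiable, so taking
    gradients through this function performs BPTT, and the done mask
    truncates gradients at episode boundaries.
    """
    def step(carry, xs):
        obs, done = xs
        # Reset the hidden state wherever a new episode begins.
        carry = carry * (1.0 - done)[:, None]
        carry, logits = policy.apply(params, carry, obs)
        return carry, logits

    final_carry, logits = jax.lax.scan(step, init_carry, (observations, dones))
    return final_carry, logits


# Example: initialize parameters for a batch of one environment.
policy = GRUPolicy(hidden_size=128, action_size=8)
carry = policy.initial_carry(batch_size=1)
obs = jnp.zeros((1, 27))  # dummy observation
params = policy.init(jax.random.PRNGKey(0), carry, obs)
```

During trajectory collection, the same carry-threading applies: the acting loop would store the per-step hidden states (or at least the initial carry of each unroll window) so the learner can replay the sequence for BPTT.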

Evidence and Justification

Supporting recurrent policies is critical for tackling more complex and realistic problems, especially in robotics.

  1. Solves Partially Observable (POMDP) Tasks

LSTMs/GRUs act as a memory for the agent, allowing it to build an internal "belief" of the true environmental state. Common POMDP scenarios include:

Inferring Velocity/Acceleration: When an agent only observes joint positions (common in many sims), a recurrent policy can infer velocities and accelerations from the history of observations.

Sensor Noise: A memory-based policy can filter noisy sensor readings over time, leading to more stable and robust control.

System Identification: The agent can learn to infer unobserved environmental parameters (e.g., ground friction, the mass of an object it's carrying, or external forces) and adapt its behavior accordingly.

  2. Crucial for Sim-to-Real Transfer

This is perhaps the strongest argument. The real world is always partially observable due to sensor noise, delays, and unmodeled dynamics. A policy trained on a "perfect" MDP in simulation is often brittle when transferred to reality. Policies trained with memory (LSTMs) are often significantly more robust to this "sim-to-real" gap, as they are already designed to handle uncertainty and incomplete information.

  3. Community Standard and Precedent

Support for recurrent policies is a standard, expected feature in nearly all major reinforcement learning libraries, as it's the default solution for POMDPs.

Ray RLlib has extensive support for RNNs.

Stable-Baselines3 provides RecurrentPPO.

Tianshou has built-in support for recurrent policies.

CleanRL includes ppo_rnn.py as a standard example.

Other specialized frameworks, such as RSL-RL, also support LSTM policies, demonstrating their clear utility in the field.

Describe alternatives you've considered

The most common alternative is frame-stacking (or state-stacking), where the user manually concatenates the last k observations and feeds them to an MLP (see the sketch after the list below).

While this is a partial workaround, it is inferior to true recurrent policies:

Inefficient: It dramatically increases the size of the observation space (by a factor of k).

Fixed History: It assumes a fixed, finite history window, which may not be optimal.

Less Powerful: LSTMs are designed to learn what to remember and for how long, making them more flexible and powerful at learning long-term dependencies.
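
For comparison, a minimal sketch of frame-stacking in plain JAX (hypothetical helper names, not a Brax API):

```python
import jax.numpy as jnp


def init_stack(obs, k):
    """Begin an episode by repeating the first observation k times."""
    return jnp.tile(obs[None], (k, 1))  # [k, obs_size]


def push(stack, obs):
    """Drop the oldest frame and append the newest."""
    return jnp.concatenate([stack[1:], obs[None]], axis=0)


# An MLP then consumes stack.reshape(-1): an input of size k * obs_size
# rather than obs_size, which is the k-fold blow-up noted above.
```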

Adding this feature would significantly broaden the range of problems Brax can solve effectively and align it with the capabilities of other leading RL libraries, making it a more powerful tool for complex robotics research.

Thank you for considering this request!
