Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
🤖 Enhance reinforcement learning stability and efficiency with advanced algorithms like TRPO, PPO, DPO, GRPO, DAPO, and GSPO for optimized policy training.
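Of the policy-optimization algorithms listed above, PPO is the most widely used baseline. As a hedged illustration (not code from any of the repositories listed here), a minimal sketch of PPO's clipped surrogate objective in PyTorch might look like this; the function name and arguments are illustrative:

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss from PPO (illustrative sketch)."""
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms; take the pessimistic minimum
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize, while PPO maximizes the surrogate
    return -torch.min(unclipped, clipped).mean()
```

The clip keeps each policy update close to the old policy, which is the main source of PPO's training stability relative to unconstrained policy gradients.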