🚀 C²GSPG

This repository contains the official implementation of the paper:
C²GSPG: Confidence-Calibrated Group Sequence Policy Gradient towards Self-Aware Reasoning.


📌 Introduction

Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models.
However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models.
In this study, we propose a simple yet effective confidence-calibrated group sequence policy gradient method, called C²GSPG, which enhances reasoning performance while suppressing overconfidence.
Specifically, we introduce a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly observed in GRPO and its variants.
Within this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence’s reward.
We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction.
For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating potential conflicts between the two objectives.
Applying C²GSPG to post-train large language models on logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration.
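To make the core idea concrete, here is a minimal sketch of the two quantities described above: the per-sequence confidence (the length-normalized sequence probability) and the cross-entropy regularizer that calibrates it to a binary reward. This is an illustrative assumption-laden simplification, not the repository's actual implementation; the real training code operates on batched tensors inside the RL loop.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalized sequence probability: exp(mean token log-prob).

    token_logprobs: list of log-probabilities of the sampled response tokens.
    Normalizing by length keeps the confidence comparable across
    responses of different lengths.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def calibration_loss(confidence, reward, eps=1e-6):
    """Cross-entropy regularizer pulling confidence toward the 0/1 reward.

    For binary rewards this is binary cross-entropy between the model's
    confidence and the sequence's reward; eps-clamping avoids log(0).
    """
    c = min(max(confidence, eps), 1.0 - eps)
    return -(reward * math.log(c) + (1.0 - reward) * math.log(1.0 - c))
```

For example, a correct response (reward 1) produced with confidence 0.9 incurs a much smaller regularizer penalty than one produced with confidence 0.5, so minimizing this term pushes the model's confidence toward its actual success rate.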


📂 Repository Structure

This repository contains two major components of our experiments:


🙏 Acknowledgements

This repository builds upon the following open-source projects, to which we are deeply grateful: verl, AR-Lopti, LogicRL, DeepScaleR, and AdaRFT.