🚀 C²GSPG

This repository contains the official implementation of the paper:
C²GSPG: Confidence-Calibrated Group Sequence Policy Gradient towards Self-Aware Reasoning.


📌 Introduction

Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models.
However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models.
In this study, we propose a simple yet effective confidence-calibrated group sequence policy gradient method, called C²GSPG, which enhances reasoning performance while suppressing overconfidence.
Specifically, we introduce a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly observed in GRPO and its variants.
Within this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence’s reward.
We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction.
For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating potential conflicts between the two objectives.
Applying C²GSPG to post-train large language models on logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration.
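To make the core idea concrete, here is a minimal sketch of the two quantities described above: the per-sequence confidence (the length-normalized sequence probability) and the cross-entropy regularizer that calibrates it to a binary reward. This is an illustrative assumption-laden simplification, not the repository's actual implementation; the real training code operates on batched tensors inside the RL loop.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalized sequence probability: exp(mean token log-prob).

    token_logprobs: list of log-probabilities of the sampled response tokens.
    Normalizing by length keeps the confidence comparable across
    responses of different lengths.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def calibration_loss(confidence, reward, eps=1e-6):
    """Cross-entropy regularizer pulling confidence toward the 0/1 reward.

    For binary rewards this is binary cross-entropy between the model's
    confidence and the sequence's reward; eps-clamping avoids log(0).
    """
    c = min(max(confidence, eps), 1.0 - eps)
    return -(reward * math.log(c) + (1.0 - reward) * math.log(1.0 - c))
```

For example, a correct response (reward 1) produced with confidence 0.9 incurs a much smaller regularizer penalty than one produced with confidence 0.5, so minimizing this term pushes the model's confidence toward its actual success rate.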


📂 Repository Structure

This repository contains two major components of our experiments:


🙏 Acknowledgements

This repository builds upon the following open-source projects, to which we are deeply grateful: verl, AR-Lopti, LogicRL, DeepScaleR, and AdaRFT.