Student Name
Huy Dang
Model Length
256
Accuracy
54.82
Improvement Description
Reward Function, Advantage + Log ratio clamping
Detailed Write-up
- Advantage clamping prevents NaN/Inf values caused by extreme advantages.
- Log-ratio clamping prevents the probability ratio from exploding.
  --> RL runs became much more stable.
- Distance-based and negative rewards discourage bad behaviors (missing answer tags, wrong numbers).
- Accuracy plateaued at around 51%, so I made a final push by changing the learning rate for more exploration.
Unfortunately, I didn't have enough free GPU credits to test 512 tokens.
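As a rough illustration of the two ideas in the write-up, here is a minimal sketch in plain Python. It is not the submitted code: the clamp bounds (`ADV_CLIP`, `LOG_RATIO_CLIP`), the PPO-style clip range (`PPO_EPS`), the `<answer>...</answer>` tag format, the reward magnitudes, and the function names `clamped_policy_loss` and `shaped_reward` are all assumptions made for the example.

```python
import math
import re

# All constants below are assumed values for illustration only.
ADV_CLIP = 5.0          # bound on |advantage|
LOG_RATIO_CLIP = 10.0   # bound on |log-probability ratio|
PPO_EPS = 0.2           # PPO-style clip range on the ratio

# Assumed answer format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clamped_policy_loss(logp_new, logp_old, advantages):
    """Mean clipped surrogate loss with advantage and log-ratio clamping."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Clamp the advantage so one outlier cannot dominate the update.
        adv = clamp(adv, -ADV_CLIP, ADV_CLIP)
        # Clamp the log ratio before exponentiating so exp() cannot overflow.
        log_ratio = clamp(lp_new - lp_old, -LOG_RATIO_CLIP, LOG_RATIO_CLIP)
        ratio = math.exp(log_ratio)
        clipped = clamp(ratio, 1.0 - PPO_EPS, 1.0 + PPO_EPS)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


def shaped_reward(completion, target):
    """Distance-based reward with penalties for malformed or wrong answers."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0  # no answer tags at all
    try:
        value = float(match.group(1).strip())
    except ValueError:
        return -0.5  # tags present, but the content is not a number
    if value == target:
        return 1.0
    # Wrong number: the penalty grows with relative distance, capped at -0.5.
    distance = abs(value - target) / max(abs(target), 1.0)
    return -0.5 * min(distance, 1.0)
```

Without the log-ratio clamp, a log-probability difference of 1000 would make `math.exp` raise an `OverflowError`; with it, the ratio stays finite and the surrogate clip takes over. In the reward sketch, a near miss is penalized less than a far miss, which gives the policy a gradient toward the right answer even when it is wrong.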
GPU Hours
1.5 (H100)
Submission Agreement