[Submission]

### Student Name

Huy Dang

### Model Length

256

### Accuracy

54.82

### Improvement Description

Reward function; advantage and log-ratio clamping

### Detailed Write-up

- Advantage clamping keeps extreme advantage values from producing NaN/Inf in the loss.
- Log-ratio clamping keeps the probability ratio from exploding.

With both clamps in place, RL runs became much more stable.
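
For readers who want to see the two clamps concretely, here is a minimal sketch in plain Python. It assumes a PPO-style clipped surrogate loss; the function name `clamped_policy_loss` and the thresholds `adv_clip`, `log_ratio_clip`, and `eps` are illustrative placeholders, not the values or the code used in this submission.

```python
import math

def clamped_policy_loss(new_logps, old_logps, advantages,
                        adv_clip=5.0, log_ratio_clip=10.0, eps=0.2):
    """PPO-style clipped surrogate loss with two extra stability clamps.

    All thresholds are assumed values chosen for illustration only.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Clamp 1: bound the advantage so one outlier cannot dominate the loss.
        adv = max(-adv_clip, min(adv_clip, adv))
        # Clamp 2: bound the log ratio BEFORE exponentiating, so exp() stays finite.
        log_ratio = max(-log_ratio_clip, min(log_ratio_clip, new_lp - old_lp))
        ratio = math.exp(log_ratio)
        # Standard PPO clipping of the probability ratio.
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)

# An extreme sample: without clamp 2, math.exp(1000.0) would raise OverflowError.
loss = clamped_policy_loss([1000.0], [0.0], [1e9])
print(math.isfinite(loss))
```

Clamping the log ratio rather than the ratio itself matters because the overflow happens inside the exponential, before any later clipping can take effect.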

- Distance-based rewards plus negative rewards discourage bad behaviors (missing answer tags, wrong numbers).
- Accuracy plateaued at around 51%, so I made a final push by changing the learning rate to encourage more exploration.
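
As a rough illustration of the reward shaping described above, the sketch below combines a distance-based partial reward with fixed penalties. It assumes completions wrap a single numeric answer in `<answer>...</answer>` tags; the function name `shaped_reward`, the penalty values, and the decay formula are all hypothetical and are not taken from the submitted code.

```python
import math
import re

# Assumed output format: the final answer appears inside <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def shaped_reward(completion, target,
                  no_tag_penalty=-1.0, bad_number_penalty=-0.5):
    """Return a shaped reward for one completion (illustrative values only)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        # Negative reward: the model did not produce answer tags at all.
        return no_tag_penalty
    try:
        value = float(match.group(1).strip())
    except ValueError:
        # Negative reward: tags are present but the content is not a number.
        return bad_number_penalty
    if not math.isfinite(value):
        return bad_number_penalty
    if value == target:
        return 1.0
    # Distance-based partial credit: decays smoothly as the answer moves away.
    distance = abs(value - target) / max(abs(target), 1.0)
    return 0.5 / (1.0 + distance)

print(shaped_reward("<answer>42</answer>", 42.0))
print(shaped_reward("I think it is 42", 42.0))
```

The partial credit gives the policy a gradient toward the correct answer even when exact matches are rare, while the penalties make malformed outputs strictly worse than any well-formed attempt.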

Unfortunately, I don't have enough free GPU credits to test a model length of 512 tokens.

### GPU Hours

1.5 (H100)

### Submission Agreement

- [x] #5
