[Submission] Thanh 512

### Student Name

Thanh

### Model Length

512

### Accuracy

60.90%

### Improvement Description

Default config, with vLLM logprobs correction

### Detailed Write-up


Based on this write-up:
https://fengyao.notion.site/off-policy-rl

Also based on the TRL implementation, which takes the log probs from vLLM and applies importance sampling:

https://github.com/huggingface/trl/blob/e086f073cf6dee30acc2d3fe357db21e1901c2be/trl/trainer/grpo_trainer.py#L1258
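A rough sketch of the correction, for readers who have not seen it. This is not the TRL code or the exact code used for this run. It assumes the truncated form of importance sampling, where each token's loss is scaled by the ratio between the trainer's probability and the vLLM sampler's probability, capped at a constant. The function names and the `cap` value of 2.0 are hypothetical and chosen only for illustration.

```python
import math


def truncated_is_weights(train_logps, sampler_logps, cap=2.0):
    """Per-token weights min(exp(train_logp - sampler_logp), cap).

    train_logps:   log probs of the sampled tokens under the training model
    sampler_logps: log probs of the same tokens as reported by the sampler (vLLM)
    cap:           upper bound on the ratio (illustrative value, an assumption)
    """
    if len(train_logps) != len(sampler_logps):
        raise ValueError("log prob sequences must have the same length")
    return [min(math.exp(t - s), cap) for t, s in zip(train_logps, sampler_logps)]


def corrected_mean_loss(per_token_loss, train_logps, sampler_logps, cap=2.0):
    """Mean of the per-token losses after scaling each by its truncated weight.

    In a real trainer the weights would be treated as constants (no gradient
    flows through them); plain Python floats have no gradient, so that is
    implicit here.
    """
    weights = truncated_is_weights(train_logps, sampler_logps, cap)
    if len(weights) != len(per_token_loss):
        raise ValueError("loss and log prob sequences must have the same length")
    return sum(w * l for w, l in zip(weights, per_token_loss)) / len(per_token_loss)
```

When the two engines agree on a token, its weight is 1 and the loss is unchanged. The cap keeps a single token with a large mismatch from dominating the update.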

240 steps on 1x H100


### GPU Hours

1.5 hours on an H100

### Submission Agreement

- [x] I confirm that these results are from my own work
