Student Name
Thanh
Model Length
512
Accuracy
60.90%
Improvement Description
Default config, with vLLM logprobs correction
Detailed Write-up
Based on this write-up:
https://fengyao.notion.site/off-policy-rl
Also based on the TRL implementation, which gets the log probs from vLLM and applies importance sampling:
https://github.com/huggingface/trl/blob/e086f073cf6dee30acc2d3fe357db21e1901c2be/trl/trainer/grpo_trainer.py#L1258
240 steps on 1x H100
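A minimal sketch of the kind of correction described above, assuming a truncated importance sampling weight per token: the ratio between the trainer's probability and the vLLM sampler's probability, capped at a constant. The function names, the `cap` value, and the plain-list inputs are hypothetical and chosen for illustration; this is not the TRL code itself.

```python
import math

def truncated_is_weights(train_logps, vllm_logps, cap=2.0):
    """Per-token weight min(exp(train_logp - vllm_logp), cap).

    train_logps: log probs of the sampled tokens under the training model.
    vllm_logps: log probs of the same tokens as reported by the sampler.
    cap: hypothetical upper bound that keeps large ratios from dominating.
    """
    return [min(math.exp(t - v), cap) for t, v in zip(train_logps, vllm_logps)]

def corrected_loss(per_token_loss, train_logps, vllm_logps, mask, cap=2.0):
    """Masked mean of the per-token loss, reweighted by the truncated ratios.

    In an autograd framework the weights would be treated as constants
    (no gradient flows through them); plain floats make that implicit here.
    """
    weights = truncated_is_weights(train_logps, vllm_logps, cap)
    total = sum(w * l * m for w, l, m in zip(weights, per_token_loss, mask))
    return total / max(sum(mask), 1)

if __name__ == "__main__":
    train = [-1.0, -0.5]
    sampler = [-1.0, -3.0]  # second token: sampler assigned a much lower prob
    print(truncated_is_weights(train, sampler))
    print(corrected_loss([1.0, 1.0], train, sampler, mask=[1, 1]))
```

When the two sets of log probs agree, every weight is 1 and the loss is unchanged; the correction only matters where the sampler and the trainer disagree.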
GPU Hours
1.5 hours on 1x H100
Submission Agreement