[Submission] Thanh 512 #2

@Luvata

Student Name

Thanh

Model Length

512

Accuracy

60.90%

Improvement Description

Default config, plus an importance-sampling correction based on the log probs returned by vLLM

Detailed Write-up

Based on this write-up on off-policy RL:
https://fengyao.notion.site/off-policy-rl

The implementation follows TRL's approach of taking the log probs from vLLM and applying importance sampling:

https://github.com/huggingface/trl/blob/e086f073cf6dee30acc2d3fe357db21e1901c2be/trl/trainer/grpo_trainer.py#L1258
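
For readers unfamiliar with the idea, here is a minimal sketch of a truncated importance sampling correction, not the code used for this submission. It assumes you already have per-token log probs for the sampled tokens from both the sampler (vLLM) and the trainer. The function names and the `cap` value of 2.0 are hypothetical choices made for illustration.

```python
import math

def truncated_is_weights(train_logps, sampler_logps, cap=2.0):
    """Per-token weights min(exp(train_logp - sampler_logp), cap).

    train_logps:   log probs of the sampled tokens under the trainer's policy
    sampler_logps: log probs of the same tokens as reported by the sampler
    cap:           upper bound on the ratio (illustrative value, an assumption)
    """
    if len(train_logps) != len(sampler_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [min(math.exp(t - s), cap) for t, s in zip(train_logps, sampler_logps)]

def corrected_token_losses(token_losses, train_logps, sampler_logps, cap=2.0):
    """Scale each per-token loss by its truncated importance weight."""
    weights = truncated_is_weights(train_logps, sampler_logps, cap)
    return [w * loss for w, loss in zip(weights, token_losses)]

# Toy values: the first token matches exactly, the second differs a lot.
weights = truncated_is_weights([-1.0, -0.5], [-1.0, -3.0], cap=2.0)
losses = corrected_token_losses([0.4, 0.1], [-1.0, -0.5], [-1.0, -3.0], cap=2.0)
```

When the two sets of log probs agree, every weight is 1 and the loss is unchanged. The cap keeps a single badly mismatched token from dominating the update. In a real trainer the weights would normally be treated as constants (no gradient flows through them).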

240 steps on 1x H100

GPU Hours

1.5 hours on one H100

Submission Agreement

  • I confirm that these results are from my own work
