Hi team,
I noticed that in the official evaluation code you released, for the LAMBADA dataset, the correct answer is counted twice, at https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L75 and https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L79. Doesn't this potentially double the reported accuracy?
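For illustration only, here is a minimal sketch of the kind of double counting I mean (the variable names and structure are my own assumptions, not the actual eval-diffugpt.py code): if the correct counter is incremented in two places for the same example, the accuracy is roughly doubled.

```python
# Hypothetical sketch of the suspected pattern; names are placeholders,
# not copied from eval-diffugpt.py.
predictions = ["cat", "dog", "bird"]
targets = ["cat", "dog", "fish"]

correct = 0
for pred, tgt in zip(predictions, targets):
    if pred == tgt:
        correct += 1  # counted once (analogous to the check around L75)
    # ... other per-example processing ...
    if pred == tgt:
        correct += 1  # counted again (analogous to the check around L79)

# Prints 4/3 ≈ 1.33 instead of the true accuracy 2/3 ≈ 0.67.
print(correct / len(targets))
```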
In addition, the official accuracy reported in the GPT-2 paper on LAMBADA is 45.99 for GPT-2-S, but your paper reports 25.9. For GPT-2-M the official number is 55.48, while your paper reports 37.7.
Could you please clarify:
- why the number of correct cases is counted twice in the official evaluation code?
- why the accuracy of the baselines is much lower than in the official paper?
Thanks!