Hi team,
I noticed that in the official evaluation code you released, for the LAMBADA dataset, the correct answer is counted twice, at https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L75 and https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L79. Doesn't this potentially double the reported accuracy?
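For illustration only, here is a minimal sketch of the kind of double counting I mean (the variable names and structure are my own assumptions, not the actual eval-diffugpt.py code): if the correct counter is incremented in two places for the same example, the accuracy is roughly doubled.

```python
# Hypothetical sketch of the suspected pattern; names are placeholders,
# not copied from eval-diffugpt.py.
predictions = ["cat", "dog", "bird"]
targets = ["cat", "dog", "fish"]

correct = 0
for pred, tgt in zip(predictions, targets):
    if pred == tgt:
        correct += 1  # counted once (analogous to the check around L75)
    # ... other per-example processing ...
    if pred == tgt:
        correct += 1  # counted again (analogous to the check around L79)

# Prints 4/3 ≈ 1.33 instead of the true accuracy 2/3 ≈ 0.67.
print(correct / len(targets))
```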
In addition, the official accuracy reported in the GPT-2 paper on LAMBADA is 45.99 for GPT-2-S, but your paper reports 25.9. For GPT-2-M the official number is 55.48, while your paper reports 37.7.
Could you please clarify:
- why the number of correct cases is counted twice in the official evaluation code?
- why the accuracy of the baselines is much lower than in the official paper?
Thanks!