Potential Issue with Lambada evaluation #14

@iloverdl

Description

Hi team,
I noticed that in the official evaluation code you released, for the LAMBADA dataset, a correct answer is counted twice: once at https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L75 and again at https://github.com/HKUNLP/DiffuLLaMA/blob/main/evaluation/eval-diffugpt.py#L79. Doesn't this potentially double the reported accuracy? A small sketch of what I mean is below.

In addition, the official accuracy reported in the GPT-2 paper on LAMBADA is 45.99 for GPT2-S, but your paper reports 25.9. For GPT2-M the official number is 55.48, but your paper reports 37.7.

Could you please clarify:

  • why, in the official evaluation code, the number of correct cases is counted twice?
  • why the accuracy of the baselines is much lower than in the official paper?

Thanks!
