Issues with ground truth data #71

@adbar

Hi, I spotted the following issues while trying to evaluate extractors on the benchmark:

  1. Annotation artifacts in the source HTML: Ground-truth pages contain wrapper tags (<marked-text>, <marked-tail>) that bias scoring against any tool that renders inline formatting.
  2. Truncated ground truth: Some reference texts are partial excerpts, so a tool that extracts the full article scores lower than one that stops after the first section, even though the fuller extraction is more correct.
  3. Layout tables penalized: Articles whose body sits inside <td> elements are evaluated against a reference that ignores cell boundaries. This penalizes tools that handle table structure correctly: a tool that emits | paragraph | rows scores worse than one that strips the table structure.
  4. Links, images and math not represented: Links and images are not evaluated and math is absent from the references, so these items are invisible to the metric. A page with \(E=mc^2\) in the source has E=mc2 in the reference (rendered plain text); any tool that outputs $E=mc^2$ or $$E=mc^2$$ is penalized by edit distance for the extra characters.
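
For the first point, one possible workaround is to unwrap the annotation tags before handing the page to an extractor. The following is only a rough sketch: the function name `unwrap_annotation_tags` is hypothetical, and it assumes the wrappers are plain <marked-text> and <marked-tail> elements whose opening and closing tags can be dropped while their inner content is kept. I have not checked whether the benchmark's own tooling already does something similar.

```python
import re

# Matches opening and closing <marked-text> / <marked-tail> tags,
# with optional attributes. Tag names are taken from the benchmark pages;
# everything else here is an assumption made for illustration.
_ANNOTATION_TAG = re.compile(
    r"</?marked-(?:text|tail)(?:\s[^>]*)?>",
    flags=re.IGNORECASE,
)


def unwrap_annotation_tags(html: str) -> str:
    """Remove the annotation wrapper tags but keep the text they enclose."""
    return _ANNOTATION_TAG.sub("", html)


sample = "<p>Hello <marked-text>world</marked-text><marked-tail>!</marked-tail></p>"
print(unwrap_annotation_tags(sample))
```

A regular expression is enough here because the tags are removed without touching their content; anything that needs to inspect nesting would be better served by a real HTML parser.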

A more general comment on scoring: edit distance favours brevity. A short extraction that closely matches a partial reference outscores a complete extraction that diverges from it.
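
To make the effect concrete, here is a minimal sketch. It assumes plain Levenshtein distance normalised by the length of the longer string, which may differ from what the benchmark actually computes; the helper names `levenshtein` and `similarity` and the sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(extracted: str, reference: str) -> float:
    """1.0 means identical; assumed normalisation by the longer string."""
    longest = max(len(extracted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(extracted, reference) / longest


# Hypothetical page: the reference only covers the first section.
full_article = "First section. Second section. Third section."
truncated_reference = "First section."

print(similarity("First section.", truncated_reference))  # tool that stops early
print(similarity(full_article, truncated_reference))      # tool that extracts everything

# Math case from point 4: the delimiters and the caret count as edits.
print(levenshtein("$E=mc^2$", "E=mc2"))
```

Under these assumptions the tool that stops after the first section gets a perfect score, while the complete extraction is charged one edit for every character missing from the reference.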

In any case: Thank you for your contribution, ground truth data is essential in this field!
