Issues with ground truth data

Hi, I spotted the following issues while trying to evaluate extractors on the benchmark:

1. Annotation artifacts in the source HTML: Ground-truth pages contain wrapper tags (`<marked-text>`, `<marked-tail>`) that bias scoring against any tool that renders inline formatting.
2. Truncated ground truth: Some reference texts are partial excerpts, so a more complete extraction scores lower even when it is more correct. For example, a tool that extracts the full article scores lower than one that stops after the first section.
3. Layout tables penalized: Articles whose body sits inside `<td>` elements are evaluated against a reference that ignores cell boundaries, which penalizes tools that handle table structure correctly. For example, a tool that emits `| paragraph |` rows scores worse than one that strips the table structure.
4. Links, images, and math not evaluated: Links and images are invisible to the metric, and math markup is absent from the references. A page with `$E=mc^2$` in the source has `E=mc2` in the reference (rendered plain text), so any tool that outputs `$E=mc^2$` or `$$E=mc^2$$` is penalized by edit distance for the extra characters.
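For item 1, one possible workaround is to unwrap the annotation tags before scoring. The snippet below is only a sketch: the tag names are the two mentioned above, while the function name `strip_annotation_wrappers`, the sample markup, and the assumption that the tags may carry attributes are mine and not taken from the benchmark code.

```python
import re

# Matches opening and closing forms of the two wrapper tags named above.
# Allowing attributes inside the tag is an assumption, not something
# confirmed from the benchmark's pages.
_WRAPPER_TAG = re.compile(r"</?marked-(?:text|tail)\b[^>]*>", re.IGNORECASE)


def strip_annotation_wrappers(html: str) -> str:
    """Remove the wrapper tags but keep the text they enclose (hypothetical helper)."""
    return _WRAPPER_TAG.sub("", html)


# Made-up sample markup for illustration only.
sample = "<p>An <marked-text>important</marked-text> point<marked-tail>.</marked-tail></p>"
cleaned = strip_annotation_wrappers(sample)
print(cleaned)
```

A regex is enough here because the tags are removed without touching their contents; a real fix might prefer an HTML parser if the wrappers can nest in unexpected ways.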

A more general comment on scoring: Edit distance favours brevity. A short extraction that closely matches a partial reference outscores a complete extraction that diverges from it.
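To illustrate the point, here is a minimal sketch that assumes a plain, unnormalised Levenshtein distance; the benchmark's actual metric may be computed or normalised differently. The strings are invented for the example and do not come from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Invented example: the reference is a truncated excerpt of the article.
full_article = "Intro paragraph. Second section. Third section."
reference = "Intro paragraph."
short_extraction = "Intro paragraph"   # stops early and even drops the period
full_extraction = full_article         # complete and correct

d_short = levenshtein(short_extraction, reference)
d_full = levenshtein(full_extraction, reference)
print("short:", d_short, "full:", d_full)
```

Under this assumption the incomplete extraction ends up closer to the reference than the complete one, simply because every correct character beyond the excerpt counts as an edit.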

In any case, thank you for your contribution. Ground truth data is essential in this field!

Issues with ground truth data #71
