Hi, I spotted the following issues while trying to evaluate extractors on the benchmark:
- Annotation artifacts in the source HTML: Ground-truth pages contain wrapper tags (`<marked-text>`, `<marked-tail>`) that bias scoring against any tool that renders inline formatting.
- Truncated ground truth: Some reference texts are partial excerpts, so a more complete extraction scores lower even when it is more correct. For example, a tool that extracts the full article scores lower than one that stops after the first section.
- Layout tables penalized: Articles whose body sits inside `<td>` elements are evaluated against a reference that ignores cell boundaries. This penalizes tools that handle table structure correctly: a tool that emits `| paragraph |` rows scores worse than one that strips the table structure.
- Links and images not evaluated, and math absent from references: These items are invisible to the metric. A page with `\(E=mc^2\)` in the source has `E=mc2` in the reference (rendered plain text); any tool that outputs `$E=mc^2$` or `$$E=mc^2$$` gets edit-distance penalized for the extra characters.
A more general comment on scoring: Edit distance favours brevity. A short extraction that closely matches a partial reference outscores a complete extraction that diverges from it.
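
To make the math and brevity points concrete, here is a minimal sketch using a plain character-level Levenshtein distance. I don't know the exact metric or normalization the benchmark uses, so `similarity` (1 minus distance divided by the longer length) and the sample strings are my own assumptions, not the benchmark's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similarity(extracted: str, reference: str) -> float:
    """Hypothetical normalization (an assumption): 1 - distance / longer length."""
    longest = max(len(extracted), len(reference)) or 1
    return 1.0 - levenshtein(extracted, reference) / longest


# Math case: the reference holds rendered plain text, the tools keep TeX markup.
math_reference = "E=mc2"
math_distances = {
    output: levenshtein(output, math_reference)
    for output in ("E=mc2", "$E=mc^2$", "$$E=mc^2$$")
}

# Brevity case: made-up strings where the reference is only the first section.
partial_reference = "Intro paragraph."
short_extraction = "Intro paragraph."
full_extraction = "Intro paragraph. Second section. Third section."
short_score = similarity(short_extraction, partial_reference)
full_score = similarity(full_extraction, partial_reference)

print(math_distances)
print(short_score, full_score)
```

Under these assumptions the TeX outputs are 3 and 5 edits away from the reference while the plain-text output is 0 away, and the complete extraction scores well below the truncated one even though it contains everything the reference does.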
In any case: Thank you for your contribution. Ground truth data is essential in this field!