Six tools for extracting clean text and mathematics from LaTeX source files, each using a different parsing engine.
Given a .tex file, every tool produces the same outputs:
| File | Contents |
|---|---|
| `STEM.LANG.txt` | Plain prose, math replaced by placeholders |
| `STEM.LANG.math.txt` | Math expressions, numbered and labelled |
| `STEM.LANG.annotated.txt` | Prose with inline math markers |
| `STEM.extracted.json` | All languages and fragment types |
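The naming scheme in the table can be sketched in a few lines of Python. This is an illustration only: it assumes `STEM` is the input file name without its extension and `LANG` is the language tag, and the function name `expected_outputs` is hypothetical, not part of any of the tools.

```python
from pathlib import Path


def expected_outputs(tex_path: str, lang: str = "en", out_dir: str = ".") -> dict[str, Path]:
    """Return the output paths a tool would be expected to write.

    Assumption: STEM is the .tex file name without extension and
    LANG is the language tag used for the per-language files.
    """
    stem = Path(tex_path).stem
    base = Path(out_dir)
    return {
        "plain": base / f"{stem}.{lang}.txt",
        "math": base / f"{stem}.{lang}.math.txt",
        "annotated": base / f"{stem}.{lang}.annotated.txt",
        # The JSON file is shared across languages, so it has no LANG part.
        "json": base / f"{stem}.extracted.json",
    }


if __name__ == "__main__":
    for kind, path in expected_outputs("paper.tex", "en", "out").items():
        print(f"{kind:10s} {path}")
```

A helper like this can be handy in scripts that need to check which files a run produced.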
Multilingual documents are handled automatically: content marked up with Babel or Polyglossia is written to separate per-language files labelled with their BCP 47 tags.
| Tool | Language | Requires |
|---|---|---|
| `Perl/perl_extract.pl` | Perl | Nothing (`latexpand` optional) |
| `LaTeX-TOM/latex_tom_extract.pl` | Perl | `LaTeX::TOM` (CPAN) |
| `Python/python_extract.py` | Python ≥ 3.10 | Nothing (`latexpand` optional) |
| `TexSoup/texsoup_extract.py` | Python ≥ 3.10 | `pip install TexSoup` |
| `Pandoc/pandoc_extract.py` | Python ≥ 3.10 | `pandoc` ≥ 2.0 |
| `LaTeXML/latexml_extract.py` | Python ≥ 3.10 | `latexml` ≥ 0.8 |
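To see which of the requirements in the table are available on a machine, a small check along these lines may help. It is a sketch, not part of the toolkit: the function name `check_dependencies` is hypothetical, it only tests for presence (not the minimum versions listed above), and it does not cover the Perl module `LaTeX::TOM`.

```python
import importlib.util
import shutil

# External programs named in the requirements table.
EXECUTABLES = ("perl", "latexpand", "pandoc", "latexml")
# Python packages named in the requirements table (import name).
PYTHON_MODULES = ("TexSoup",)


def check_dependencies() -> dict[str, bool]:
    """Report which executables are on PATH and which modules are importable."""
    found = {name: shutil.which(name) is not None for name in EXECUTABLES}
    for module in PYTHON_MODULES:
        found[module] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name:10s} {'found' if ok else 'missing'}")
```

For the Perl module, running `perl -MLaTeX::TOM -e 1` exits with an error if the module is not installed.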
All tools share the same core options:
    perl Perl/perl_extract.pl paper.tex --output-dir ./out
    python3 Python/python_extract.py paper.tex --output-dir ./out

    --format plain | annotated | json | all   (default: all)
    --default-lang LANG                       (default: en)
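Because the core options are shared, any of the tools can be driven from one wrapper. The sketch below builds the command line from the options listed above. The names `build_command` and `run_tool` are hypothetical, and picking the interpreter from the file extension is an assumption based on the two example invocations.

```python
import subprocess
from pathlib import Path

FORMATS = ("plain", "annotated", "json", "all")


def build_command(tool: str, tex_file: str, out_dir: str = "./out",
                  fmt: str = "all", default_lang: str = "en") -> list[str]:
    """Assemble the argument list for one extraction tool."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {FORMATS}")
    # Assumption: .pl scripts run under perl, everything else under python3.
    interpreter = "perl" if Path(tool).suffix == ".pl" else "python3"
    return [
        interpreter, tool, tex_file,
        "--output-dir", out_dir,
        "--format", fmt,
        "--default-lang", default_lang,
    ]


def run_tool(tool: str, tex_file: str, **options: str) -> None:
    """Run the tool and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(tool, tex_file, **options), check=True)


if __name__ == "__main__":
    print(" ".join(build_command("Python/python_extract.py", "paper.tex", fmt="json")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.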
See each tool's manual for the full option reference.