pauloney/strippers

Strippers

Six tools for extracting clean text and mathematics from LaTeX source files, each using a different parsing engine.

Given a .tex file, every tool produces the same outputs:

File                      Contents
STEM.LANG.txt             Plain prose, math replaced by placeholders
STEM.LANG.math.txt        Math expressions, numbered and labelled
STEM.LANG.annotated.txt   Prose with inline math markers
STEM.extracted.json       All languages and fragment types
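
To make the split between prose and math concrete, here is a minimal Python sketch of the idea. It is not the code used by any of the six tools: the regex, the function name `strip_math`, and the `[MATH_n]` placeholder format are all assumptions made for illustration, and math environments such as `equation` or `align` are not handled.

```python
import re

# Matches $$...$$, \[...\], \(...\) and $...$ (assumed delimiters only;
# environments such as equation or align are ignored in this sketch).
MATH_RE = re.compile(
    r"\$\$(.+?)\$\$"              # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"             # display math: \[ ... \]
    r"|\\\((.+?)\\\)"             # inline math:  \( ... \)
    r"|(?<!\\)\$(.+?)(?<!\\)\$",  # inline math:  $ ... $, skipping escaped \$
    re.DOTALL,
)


def strip_math(tex: str) -> tuple[str, list[str]]:
    """Return (prose with placeholders, math expressions in document order)."""
    found: list[str] = []

    def replace(match: re.Match) -> str:
        # Exactly one of the four groups is set for any successful match.
        body = next(g for g in match.groups() if g is not None)
        found.append(body.strip())
        return f"[MATH_{len(found)}]"  # hypothetical placeholder format

    return MATH_RE.sub(replace, tex), found


if __name__ == "__main__":
    prose, maths = strip_math(r"Let $x$ be real, so \[ x^2 \ge 0 \] holds.")
    print(prose)
    for number, expr in enumerate(maths, start=1):
        print(number, expr)
```

The numbered list returned alongside the prose corresponds loosely to the idea behind the .math.txt output, while the placeholder text corresponds to the plain .txt output.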

Multilingual documents are handled automatically: content in each language declared through Babel or Polyglossia is written to separate per-language files labelled with BCP 47 tags.
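
As a rough illustration of the naming scheme, the sketch below maps a few Babel/Polyglossia language names to BCP 47 tags and builds the output file names from the table above. The mapping is a small hand-picked subset, and `BCP47` and `output_paths` are hypothetical names; the tools' actual lookup tables and fallback behaviour are not documented here, so the fallback to a default language is an assumption.

```python
from pathlib import Path

# Illustrative subset of Babel/Polyglossia language names and BCP 47 tags.
# The real tools may use a larger or different table.
BCP47 = {
    "english": "en",
    "british": "en-GB",
    "french": "fr",
    "german": "de",
    "ngerman": "de",
    "spanish": "es",
}


def output_paths(tex_file: str, language: str,
                 out_dir: str = "./out",
                 default_lang: str = "en") -> dict[str, Path]:
    """Build the expected output file names for one language of one document."""
    stem = Path(tex_file).stem
    # Assumption: unknown language names fall back to the default language.
    lang = BCP47.get(language.lower(), default_lang)
    base = Path(out_dir)
    return {
        "plain": base / f"{stem}.{lang}.txt",
        "math": base / f"{stem}.{lang}.math.txt",
        "annotated": base / f"{stem}.{lang}.annotated.txt",
        "json": base / f"{stem}.extracted.json",  # shared by all languages
    }


if __name__ == "__main__":
    for kind, path in output_paths("paper.tex", "french").items():
        print(kind, path)
```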

Tools

Tool                             Language        Requires
Perl/perl_extract.pl             Perl            Nothing (latexpand optional)
LaTeX-TOM/latex_tom_extract.pl   Perl            LaTeX::TOM (CPAN)
Python/python_extract.py         Python ≥ 3.10   Nothing (latexpand optional)
TexSoup/texsoup_extract.py       Python ≥ 3.10   pip install TexSoup
Pandoc/pandoc.extract.py         Python ≥ 3.10   pandoc ≥ 2.0
LaTeXML/latexml_extract.py       Python ≥ 3.10   latexml ≥ 0.8

Usage

All tools share the same core options:

perl    Perl/perl_extract.pl      paper.tex --output-dir ./out
python3 Python/python_extract.py  paper.tex --output-dir ./out

--format         plain | annotated | json | all   (default: all)
--default-lang   LANG                             (default: en)

See each tool's manual for the full option reference.
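
For scripted use, a small wrapper can assemble the command line from the shared options shown above. This is only a sketch: `build_command` is a hypothetical helper, not part of the repository, and choosing the interpreter from the file extension is an assumption based on the tool table.

```python
from pathlib import Path

FORMATS = {"plain", "annotated", "json", "all"}


def build_command(tool: str, tex_file: str, out_dir: str = "./out",
                  fmt: str = "all", default_lang: str = "en") -> list[str]:
    """Assemble an argument list using only the options listed in this README."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    # Assumption: .pl tools run under perl, everything else under python3.
    interpreter = "perl" if Path(tool).suffix == ".pl" else "python3"
    return [
        interpreter, tool, tex_file,
        "--output-dir", out_dir,
        "--format", fmt,
        "--default-lang", default_lang,
    ]


if __name__ == "__main__":
    cmd = build_command("Python/python_extract.py", "paper.tex", fmt="json")
    print(" ".join(cmd))
    # To actually run the tool: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` rather than a shell string avoids quoting problems with file names that contain spaces.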
