Evidence-based scientific paper figure extraction for arXiv papers, local PDFs, and journal article pages.
This repository contains a reusable Claude Code skill and a Python package named paperfig. The goal is to extract real paper figures with strict provenance, while avoiding hallucinated figure numbers, captions, panel labels, or reconstructed images.
Short description
Extract real figures from scientific papers with strict provenance and non-hallucination guarantees.
Long description
Paper Figure Extractor is a local-first tool for extracting figures from scientific papers. It supports arXiv IDs/URLs, local PDFs, local LaTeX source trees, and journal HTML pages such as Nature-style article pages. The extractor prefers source-backed evidence, records provenance in metadata.json, separates verified figures from uncertain candidates, and avoids inventing missing figures, captions, figure numbers, or panel labels.
Suggested GitHub topics:
paper-figures, scientific-papers, arxiv, pdf-extraction, latex, research-tools, provenance, claude-code-skill
paperfig extracts figures using a strict source reliability hierarchy:
- arXiv / LaTeX source: parses figure environments, \includegraphics, and \caption.
- Journal HTML: parses <figure>/<figcaption> and source image URLs.
- PDF embedded images: extracts embedded image objects as candidates when the figure mapping is uncertain.
- Rendered PDF crops: renders caption-nearby crops as low-confidence candidates.
Verified outputs go to figures/. Uncertain outputs go to candidates/.
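The hierarchy and the figures/ versus candidates/ split can be pictured with a small sketch. The names SourceKind, best_evidence, and output_dir are hypothetical and are not taken from the paperfig package; the ranking simply mirrors the order of the list above.

```python
from enum import IntEnum


class SourceKind(IntEnum):
    """Hypothetical ranking: a lower value means more reliable evidence."""
    LATEX_SOURCE = 1
    JOURNAL_HTML = 2
    PDF_EMBEDDED_IMAGE = 3
    RENDERED_PDF_CROP = 4


def best_evidence(kinds):
    """Return the most reliable source kind available, or None if there is none."""
    return min(kinds, default=None)


def output_dir(verified: bool) -> str:
    """Verified records go to figures/, everything else to candidates/."""
    return "figures" if verified else "candidates"
```

With this ordering, a figure backed by both journal HTML and a rendered PDF crop would be attributed to the HTML evidence.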
The tool is intentionally conservative:
- It does not fabricate figures.
- It does not fabricate captions.
- It does not fabricate figure numbers.
- It does not generate, redraw, or reconstruct figures.
- It does not promote uncertain PDF crops to final FigXX outputs in strict mode.
- Every output record includes provenance pointing back to a source file, HTML location, LaTeX location, or PDF page/object location.
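As an illustration of the strict-mode rule, the check below promotes a record to a final FigXX output only when it carries provenance and its figure mapping is backed by source evidence. The Record class and its field names are hypothetical and do not reflect the actual models in models.py; behaviour outside strict mode is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # All field names are assumptions made for this sketch.
    source_kind: str                  # e.g. "latex", "html", "pdf_image", "pdf_crop"
    mapping_verified: bool            # True only when source evidence ties the image to a figure
    provenance: list = field(default_factory=list)


def can_promote_strict(record: Record) -> bool:
    """In strict mode, require both provenance and a verified figure mapping."""
    return bool(record.provenance) and record.mapping_verified
```

A rendered PDF crop with an unverified mapping fails this check and would stay in candidates/.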
paper_figure_extractor/
  SKILL.md                  # Claude Code skill instructions
  prompt.txt                # Original build prompt / requirements
  tools/
    paperfig/
      pyproject.toml        # Python package metadata
      paperfig/
        api.py              # Python API orchestration
        cli.py              # CLI entry point
        arxiv_source.py     # arXiv source download/extraction
        latex_parser.py     # LaTeX figure parser
        journal_html.py     # Journal HTML parser
        pdf_extract.py      # Conservative PDF candidate extraction
        verify.py           # Verification and source hierarchy
        export.py           # Output writer
        models.py           # Pydantic data models
      tests/
From this repository root:
pip install -e tools/paperfig
For development and tests:
pip install -e "tools/paperfig[dev]"
pytest tools/paperfig
To install directly from GitHub, use the subdirectory fragment, because the Python package lives in tools/paperfig:
pip install "git+https://github.com/Yun532/paper_figure_extractor.git#subdirectory=tools/paperfig"
Command-line usage:
paperfig extract INPUT --out OUTPUT_DIR --dpi 400 --strict
Examples:
# arXiv ID
paperfig extract 1706.03762 --out outputs/attention --dpi 400 --strict
# arXiv URL
paperfig extract https://arxiv.org/abs/2103.00020 --out outputs/clip --dpi 400 --strict
# local PDF
paperfig extract path/to/paper.pdf --out outputs/paper --dpi 400 --strict
# journal article page
paperfig extract https://www.nature.com/articles/example --out outputs/nature_article --dpi 400 --strict
The command prints counts for verified figures, verified panels, candidates, and warnings, along with the metadata path.
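The input forms used in these examples (arXiv ID, arXiv URL, local PDF, journal page) could be told apart roughly as follows. This is a sketch only: classify_input and its return labels are hypothetical, and the real CLI may dispatch differently. The regular expression covers new-style arXiv identifiers such as 1706.03762.

```python
import re
from urllib.parse import urlparse

# New-style arXiv identifiers, optionally with a version suffix (e.g. 2103.00020v2).
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def classify_input(value: str) -> str:
    """Guess the input type from its textual form (labels are assumptions)."""
    if ARXIV_ID.match(value):
        return "arxiv"
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host == "arxiv.org" or host.endswith(".arxiv.org"):
            return "arxiv"
        return "journal_html"
    if value.lower().endswith(".pdf"):
        return "local_pdf"
    return "local_source"
```

This sketch treats every non-arXiv URL as a journal page, including URLs that point directly at a PDF, so a real implementation would need a finer check there.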
from pathlib import Path

from paperfig import extract_figures

# Run the same extraction as the first CLI example, from Python.
result = extract_figures(
    "1706.03762",
    outdir=Path("outputs/attention"),
    dpi=400,
    strict=True,
)

print(result.metadata_path)

# Verified figures, with their exported image and caption files.
for figure in result.figures:
    print(figure.fig_id, figure.output_file, figure.caption_file)

# Uncertain candidates, with the recorded candidate reason.
for candidate in result.candidates:
    print("candidate", candidate.output_file, candidate.candidate_reason)
OUTPUT_DIR/
  metadata.json
  figures/
    Fig01.png
    Fig01_panel_a.png
  candidates/
    candidate_001.png
    page_5_figure_candidate_02.png
  captions/
    Fig01.txt
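A quick way to sanity-check an output directory against this layout is to count the files in each subdirectory. The helper below is a hypothetical convenience function, not part of paperfig; it relies only on the directory names shown above.

```python
from pathlib import Path


def inventory(outdir: Path) -> dict:
    """Count the files in each documented subdirectory of an output directory."""
    counts = {}
    for name in ("figures", "candidates", "captions"):
        folder = outdir / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[name] = 0
    counts["has_metadata"] = (outdir / "metadata.json").is_file()
    return counts
```

For example, inventory(Path("outputs/attention")) returns a dictionary with one count per subdirectory and a flag for metadata.json.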
metadata.json contains:
- input
- strict mode and DPI
- verified figures
- candidate records
- figure IDs
- source type
- original/source files
- output paths
- captions and caption files
- confidence
- verification status
- panel metadata
- provenance records
- warnings
arXiv source extraction is the highest-confidence path. The tool downloads the arXiv source package, finds the main LaTeX file, parses figure environments, resolves graphics paths, and exports verified FigXX.png files when source evidence is sufficient.
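To make the parsing step concrete, here is a minimal regex-based sketch that pulls graphics paths and captions out of figure environments. It is not the code in latex_parser.py; the function names are hypothetical, and the sketch ignores LaTeX comments, escaped braces, \input files, and graphics path resolution. It records document order only and does not claim to compute figure numbers.

```python
import re

FIGURE_ENV = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")


def read_braced(text: str, start: int) -> str:
    """Return the contents of the brace group whose '{' is at index start."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    return text[start + 1:]  # unbalanced braces: return the remainder


def parse_figures(tex: str) -> list:
    """Collect graphics paths and the first caption of each figure environment."""
    figures = []
    for order, match in enumerate(FIGURE_ENV.finditer(tex), start=1):
        body = match.group(1)
        cap = body.find("\\caption{")
        # cap + len("\\caption") is the index of the opening brace of the caption.
        caption = read_braced(body, cap + len("\\caption")) if cap != -1 else None
        figures.append({
            "order": order,
            "graphics": INCLUDEGRAPHICS.findall(body),
            "caption": caption,
        })
    return figures
```

A figure environment with no \includegraphics or no \caption yields an empty list or None here, which is the kind of case a conservative extractor would leave unverified rather than fill in.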
For local LaTeX source trees, the tool parses the source directly. This is useful when another project has already downloaded or unpacked the paper's source files.
For journal HTML pages, the tool parses <figure> elements and their associated captions. Figures with explicit source-backed mappings can become verified outputs; ambiguous mappings remain candidates.
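The <figure>/<figcaption> pairing can be sketched with the standard library's html.parser. This is an illustrative parser, not the one in journal_html.py; the FigureParser class and its output shape are assumptions, and it does not handle nested figures or lazily loaded images.

```python
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect image URLs and caption text for each <figure> element."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"images": [], "caption": ""}
        elif self._current is not None and tag == "img":
            src = dict(attrs).get("src")
            if src:
                self._current["images"].append(src)
        elif self._current is not None and tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Collapse runs of whitespace in the collected caption text.
            self._current["caption"] = " ".join(self._current["caption"].split())
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data
```

After parser.feed(html), each entry in parser.figures pairs the image URLs with the caption found inside the same <figure>. An entry with no image, or with several, is the sort of ambiguous mapping that would remain a candidate.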
The PDF path is intentionally conservative. Embedded PDF images and rendered caption-nearby crops are usually candidates because PDF layout alone often cannot prove the figure number/image mapping.
SKILL.md describes the Claude Code skill behavior. When using this repository as a skill, Claude should call the bundled paperfig tool rather than visually inventing or redrawing paper figures.
- PDF extraction is conservative and usually produces candidates rather than verified FigXX outputs.
- arXiv source extraction is strongest when the source package contains clear figure environments and resolvable graphics files.
- Panel extraction only treats source-backed multi-image figures as verified; visual/grid inference should remain candidate-only.
- The tool does not perform OCR-heavy figure reconstruction or visual hallucination.
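The panel limitation can be illustrated with a naming sketch: panel files are produced only when the source lists more than one image for a figure. The function below is hypothetical, and assigning letters by source order is an assumption of this sketch, not documented paperfig behaviour; only the Fig01_panel_a.png naming pattern comes from the output layout.

```python
import string


def panel_output_names(figure_number: int, source_images: list) -> list:
    """Return panel file names for a source-backed multi-image figure."""
    stem = f"Fig{figure_number:02d}"
    if len(source_images) <= 1:
        return []  # a single image gives no source evidence of panels
    if len(source_images) > len(string.ascii_lowercase):
        raise ValueError("more panels than single-letter labels")
    # Assumption: letters follow the order of the images in the source.
    return [
        f"{stem}_panel_{letter}.png"
        for letter, _ in zip(string.ascii_lowercase, source_images)
    ]
```

A figure whose panels exist only as regions of one composite image returns an empty list here, matching the rule that visual or grid inference stays candidate-only.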
No license has been specified yet. Add a license before distributing the project broadly.