This repository benchmarks various libraries for converting PDF files into HTML. The primary focus is on assessing the quality of the conversion, particularly for complex document structures such as tables, hierarchical sections, and text formatting.
The aim of this project is to:
- Compare multiple libraries for PDF to HTML conversion.
- Evaluate performance based on fidelity, structure preservation, and ease of use.
- Generate outputs compatible with the TipTap editor used in a React-based application.
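One rough way to compare structure preservation across libraries is to count the structural tags in each generated HTML file. The sketch below is a hypothetical helper, not part of the repository's scripts; the tag groups and the function name `summarize_structure` are assumptions chosen to match the three areas named above (tables, hierarchical sections, text formatting).

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical grouping of HTML tags into the three areas the benchmark focuses on.
TAG_GROUPS = {
    "tables": {"table"},
    "headings": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "formatting": {"strong", "b", "em", "i", "u"},
}


class StructureCounter(HTMLParser):
    """Count opening tags that indicate preserved document structure."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        for group, tags in TAG_GROUPS.items():
            if tag in tags:
                self.counts[group] += 1


def summarize_structure(html: str) -> dict:
    """Return the number of tables, headings and formatting tags in the HTML."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return {group: parser.counts[group] for group in TAG_GROUPS}
```

Running `summarize_structure` on the output of each library for the same PDF gives a quick side-by-side view, for example of whether a table survived as a `<table>` element or was flattened into paragraphs. It says nothing about whether the content inside those tags is correct.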
Ensure you have the following installed:
- Python: version 3.10 is recommended to get the latest release of MinerU (magic_pdf), and version 3.11 for the other libraries. Use a separate virtual environment for each library (this documentation will be updated).
- Any required dependencies for the libraries being tested (see below).
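Because each library is tied to a specific Python version, it can help to check the interpreter before installing anything into a virtual environment. The following is a minimal sketch; the dictionary keys and the function name `python_matches` are illustrative, and the versions are the ones recommended above.

```python
import sys

# Recommended versions from the prerequisites; the keys are illustrative labels.
RECOMMENDED_PYTHON = {
    "mineru": (3, 10),
    "docling": (3, 11),
    "megaparse": (3, 11),
}


def python_matches(library: str, version_info=sys.version_info) -> bool:
    """Return True if the interpreter's major.minor matches the recommendation."""
    expected = RECOMMENDED_PYTHON[library]
    return tuple(version_info[:2]) == expected
```

For example, calling `python_matches("mineru")` from inside an activated environment tells you whether that environment was created with Python 3.10.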
Run these commands:
sudo apt update
sudo apt install libgl1-mesa-glx -y
sudo apt install poppler-utils
git clone git@github.com:TomQuez/Benchmarking_libs_pdf_to_HTML.git
cd Benchmarking_libs_pdf_to_HTML
Use a .env file:
cp .env.example .env
Update the environment variables in .env with the values you need.
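The values in .env can be read from Python without extra dependencies. The sketch below is a minimal, hypothetical loader and is not taken from the repository's scripts, which may load the file differently. It handles only simple `KEY=value` lines, comments, and optional surrounding quotes.

```python
from pathlib import Path


def load_env_file(path=".env") -> dict:
    """Parse simple KEY=value lines from an env file into a dictionary."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove whitespace and optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values
```

The returned dictionary can then be queried for whichever variables .env.example defines; this sketch deliberately does not assume their names.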
sudo apt install python3.10-venv
python3.10 -m venv env_magic_pdf
source env_magic_pdf/bin/activate
MinerU documentation: https://mineru.readthedocs.io/en/latest/
Python 3.10 is required to get the latest version of magic_pdf.
Ensure that the env_magic_pdf virtual environment is activated.
pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
Initial download of the model files:
pip install huggingface_hub markdown2
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python3 download_models_hf.py
If you have CUDA and a GPU with more than 8 GB of VRAM, change device-mode to cuda in magic-pdf.json (see the MinerU documentation).
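The device-mode change can also be scripted. The sketch below is a hypothetical helper that rewrites the `device-mode` key and leaves the other keys of the file untouched. The location of magic-pdf.json is an assumption here (the usage comment uses the home directory); check the MinerU documentation for where the file lives on your system.

```python
import json
from pathlib import Path


def set_device_mode(config_path, mode="cuda") -> dict:
    """Set the "device-mode" key in a magic-pdf.json file and return the config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["device-mode"] = mode  # key name as given in the instructions above
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Example usage (assumed location, adjust to your setup):
# set_device_mode(Path.home() / "magic-pdf.json", "cuda")
```

Passing `"cpu"` as the mode reverts the change if the GPU run fails.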
Execute this command to test the MinerU script on your pdf documents:
python3 scripts/pdf_to_html_MinerU.py
Docling should detect automatically whether a GPU is available.
deactivate
sudo apt install python3.11 python3.11-venv python3.11-dev
python3.11 -m venv env_docling
source env_docling/bin/activate
pip install docling loguru
pip uninstall tesserocr
pip install --no-binary :all: tesserocr
python3 scripts/pdf_to_html_docling.py
Check the MegaParse documentation. These instructions were written for megaparse 0.0.48.
deactivate
python3.11 -m venv env_megaparse
source env_megaparse/bin/activate
pip install megaparse markdown2
python3 -m nltk.downloader punkt averaged_perceptron_tagger averaged_perceptron_tagger_eng -d /root/nltk_data
python3 scripts/pdf_to_html_megaparse.py
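Since each library lives in its own virtual environment, the three scripts can be launched one after another without activating and deactivating environments by hand. The sketch below assumes the environment and script names used in this README, a POSIX layout (`bin/python` inside each environment), and that it is given the repository root; `build_command` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess
from pathlib import Path

# Environment directory and script for each library, as named in this README.
RUNS = {
    "MinerU": ("env_magic_pdf", "scripts/pdf_to_html_MinerU.py"),
    "Docling": ("env_docling", "scripts/pdf_to_html_docling.py"),
    "MegaParse": ("env_megaparse", "scripts/pdf_to_html_megaparse.py"),
}


def build_command(library: str, repo_root=".") -> list:
    """Build the command that runs a library's script with its own interpreter."""
    env_dir, script = RUNS[library]
    root = Path(repo_root).resolve()  # absolute paths, safe to combine with cwd
    return [str(root / env_dir / "bin" / "python"), str(root / script)]


def run_all(repo_root=".") -> dict:
    """Run every script in turn and return each library's exit code.

    Raises FileNotFoundError if an environment has not been created yet.
    """
    results = {}
    for library in RUNS:
        completed = subprocess.run(
            build_command(library, repo_root), cwd=repo_root, check=False
        )
        results[library] = completed.returncode
    return results
```

Calling an environment's interpreter directly has the same effect as activating the environment first, so each script still sees only the packages installed for its library.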
This README is incomplete and will be updated soon.