PDF to HTML Benchmarking Project

This repository benchmarks various libraries for converting PDF files into HTML. The primary focus is on assessing the quality of the conversion, particularly for complex document structures such as tables, hierarchical sections, and text formatting.

Project Overview

The aim of this project is to:

Compare multiple libraries for PDF to HTML conversion.
Evaluate performance based on fidelity, structure preservation, and ease of use.
Generate outputs compatible with the TipTap editor used in a React-based application.

Setup

Prerequisites

Ensure you have the following installed:

Python 3.10 for MinerU (magic_pdf), which is needed to get its latest release, and Python 3.11 for the other libraries. Use a separate virtual environment for each library (this documentation will be updated).
Any required dependencies for the libraries being tested (see below).
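Since each library expects a specific interpreter, a quick check inside an activated virtual environment can catch a mismatch early. The sketch below is only an illustration: the dictionary keys and the function name `python_matches` are hypothetical, and the strict equality test is an assumption (other minor versions may also work).

```python
import sys
from typing import Optional, Tuple

# Versions taken from the prerequisites above; the keys are illustrative labels.
REQUIRED_PYTHON = {"mineru": (3, 10), "docling": (3, 11), "megaparse": (3, 11)}


def python_matches(library: str, version: Optional[Tuple[int, int]] = None) -> bool:
    """Return True if (major, minor) equals the version listed for the library."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = sys.version_info[:2]
    return tuple(version) == REQUIRED_PYTHON[library]


for name in REQUIRED_PYTHON:
    status = "ok" if python_matches(name) else "use a different virtual env"
    print(f"{name}: {status}")
```

Run it with the interpreter of the environment you are about to use, for example `python3 check_python.py` after `source env_docling/bin/activate` (the file name is arbitrary).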

Run these commands:

sudo apt update
sudo apt install libgl1-mesa-glx -y
sudo apt install poppler-utils -y

Clone the Repository

git clone git@github.com:TomQuez/Benchmarking_libs_pdf_to_HTML.git
cd Benchmarking_libs_pdf_to_HTML

Create a .env file from the example:

cp .env.example .env

Update the environment variables in .env with the values you need.
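How the scripts read the .env file is not stated here. As a rough sketch of the KEY=VALUE format such a file usually follows, the standard-library snippet below parses one. The function name `parse_env` and the variable names in the sample are hypothetical, not taken from .env.example.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Hypothetical variable names, for illustration only.
sample = """
# input and output folders
PDF_INPUT_DIR=pdf_samples
HTML_OUTPUT_DIR="outputs"
"""
print(parse_env(sample))
```

In practice a dedicated package such as python-dotenv handles more edge cases (export prefixes, multi-line values); the sketch only covers the simple case.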

Using MinerU

Create a virtual environment for MinerU magic_pdf

sudo apt install python3.10-venv
python3.10 -m venv env_magic_pdf
source env_magic_pdf/bin/activate

Install MinerU

MinerU documentation: https://mineru.readthedocs.io/en/latest/. Python 3.10 is required to get the latest version of magic_pdf.

Ensure that the env_magic_pdf virtual environment is activated.

pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com

Download the model files (first run only):

pip install huggingface_hub markdown2
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python3 download_models_hf.py

If you have a CUDA-capable GPU with more than 8 GB of VRAM, change device-mode to cuda in magic-pdf.json (see the MinerU documentation).
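The device-mode change can be made by hand in a text editor. As a minimal sketch, assuming magic-pdf.json is a flat JSON object with a "device-mode" key, the helper below rewrites that key and leaves the other settings untouched. The function name `set_device_mode` is hypothetical and not part of MinerU.

```python
import json
from pathlib import Path


def set_device_mode(config_path: Path, mode: str = "cuda") -> dict:
    """Set the "device-mode" key in a magic-pdf.json style file.

    Returns the updated configuration as a dictionary.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    # Write the file back with the other keys preserved.
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

Assuming the file sits in your home directory (check the MinerU documentation for the actual location on your system), a call such as `set_device_mode(Path.home() / "magic-pdf.json")` would switch it to cuda, and passing `mode="cpu"` would switch it back.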

Run this command to test the MinerU script on your PDF documents:

python3 scripts/pdf_to_html_MinerU.py

Using Docling

Docling should detect automatically whether a GPU is available.

deactivate
sudo apt install python3.11 python3.11-venv python3.11-dev
python3.11 -m venv env_docling
source env_docling/bin/activate
pip install docling loguru
pip uninstall tesserocr
pip install --no-binary :all: tesserocr
python3 scripts/pdf_to_html_docling.py

Using Megaparse

Check the Megaparse documentation. These instructions were written for megaparse 0.0.48.

deactivate
python3.11 -m venv env_megaparse
source env_megaparse/bin/activate
pip install megaparse markdown2
python3 -m nltk.downloader punkt averaged_perceptron_tagger averaged_perceptron_tagger_eng -d /root/nltk_data
python3 scripts/pdf_to_html_megaparse.py

This README needs updating, which should be done soon.
