Please follow the documentation to install Huridocs, and then run
$ curl -X POST -F 'file=@path-to-your-pdf' localhost:5060
to analyze the layout of a document (e.g., identify headers).
$ conda env create --file environment.yaml
Python version used in our experiment: 3.10.18
Create a .env file under the repository root and add your API credentials. For example, if you use Azure:
AZURE_ENDPOINT=xxx
AZURE_API_KEY=xxx
AZURE_API_VERSION=xxx
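As a quick sanity check on the .env file, the following minimal sketch parses KEY=VALUE lines and reports which of the three Azure variables above are missing. The function names `parse_env` and `missing_azure_keys` are hypothetical and are not part of this repository; the code that actually reads the .env file may behave differently.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_azure_keys(values: dict) -> list:
    """Return the Azure variable names from this README that are unset or empty."""
    required = ("AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_API_VERSION")
    return [key for key in required if not values.get(key)]
```

For example, `missing_azure_keys(parse_env(open(".env").read()))` should return an empty list once all three variables are set.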
Ensure that you have already identified the headers of a PDF (e.g., via Huridocs).
Visual pattern extraction: Extract visual patterns of the identified headers. A visual pattern includes:
- font_size (rounded to 2 decimal places)
- font_name
- font_color
- is_all_cap (alphabetic characters only)
- is_centered (|mid_bbox - mid_page| <= 2)
- list_type
- is_underlined
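The visual pattern described above can be sketched as a small immutable record. This is an illustration only: the class and function names, the argument layout (left and right bounding-box edges plus page width), and the choice to treat text without letters as not all-caps are assumptions, not the repository's implementation. Rounding font_size to two decimal places is also an assumed reading of "rounded to 2".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VisualPattern:
    font_size: float
    font_name: str
    font_color: str
    is_all_cap: bool
    is_centered: bool
    list_type: Optional[str]
    is_underlined: bool


def is_all_cap(text: str) -> bool:
    """Check case on alphabetic characters only; digits and punctuation are ignored."""
    letters = [c for c in text if c.isalpha()]
    # Assumption: text with no letters at all is not considered all-caps.
    return bool(letters) and all(c.isupper() for c in letters)


def is_centered(bbox_left: float, bbox_right: float, page_width: float,
                tolerance: float = 2.0) -> bool:
    """Implement |mid_bbox - mid_page| <= 2 on horizontal midpoints."""
    mid_bbox = (bbox_left + bbox_right) / 2
    mid_page = page_width / 2
    return abs(mid_bbox - mid_page) <= tolerance


def make_pattern(text, font_size, font_name, font_color,
                 bbox_left, bbox_right, page_width,
                 list_type=None, is_underlined=False) -> VisualPattern:
    """Build the pattern for one header from already-extracted layout attributes."""
    return VisualPattern(
        font_size=round(font_size, 2),  # assumed: two decimal places
        font_name=font_name,
        font_color=font_color,
        is_all_cap=is_all_cap(text),
        is_centered=is_centered(bbox_left, bbox_right, page_width),
        list_type=list_type,
        is_underlined=is_underlined,
    )
```

Making the record frozen keeps it hashable, so two headers with the same appearance produce equal patterns that can be used as dictionary keys.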
Node clustering: Cluster nodes based on their visual patterns.
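One simple reading of this step is that headers with identical visual patterns fall into the same cluster. The sketch below groups header indices by an exact pattern match; the function name `cluster_by_pattern` and the (text, pattern) input shape are hypothetical, and the clustering actually used in this repository may be more involved.

```python
from collections import defaultdict


def cluster_by_pattern(headers):
    """Group header indices by identical pattern.

    headers: list of (header_text, pattern) pairs, where pattern is any
    hashable value (for example a tuple of visual attributes).
    Returns a list of clusters in order of first appearance.
    """
    clusters = defaultdict(list)
    for index, (_text, pattern) in enumerate(headers):
        clusters[pattern].append(index)
    return list(clusters.values())
```

For instance, with headers whose patterns are `(14.0, "Bold")`, `(12.0, "Bold")`, and `(14.0, "Bold")`, the first and third headers end up in one cluster and the second in another.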
SHT assembly: Infer an SHT using a local-first approach.
Our SHT-based agent is provided with an SHT in its user prompt, and uses the read_section tool to retrieve relevant sections on demand.
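To illustrate the idea, here is a minimal sketch of a tree that can be rendered as an outline for the prompt and queried section by section. Only the tool name read_section comes from this README; the `Section` class, its fields, the `render_outline` helper, and the signature of `read_section` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    section_id: str                 # hypothetical identifier shown to the agent
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)


def render_outline(node: Section, depth: int = 0) -> List[str]:
    """Render the tree as indented lines that could be placed in a user prompt."""
    lines = ["  " * depth + f"[{node.section_id}] {node.title}"]
    for child in node.children:
        lines.extend(render_outline(child, depth + 1))
    return lines


def read_section(root: Section, section_id: str) -> Optional[str]:
    """Return the text of the section with the given id, or None if absent."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.section_id == section_id:
            return node.text
        stack.extend(node.children)
    return None
```

With this shape, the agent sees only the outline up front and pays the token cost of a section's text when it decides to read that section.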
The four datasets for evaluation are stored under data/ with the following structure:
data
└── civic                        <= dataset name
    ├── pdf                      <= the documents
    ├── heading_identification   <= results of layout analysis (e.g., via Huridocs)
    ├── node_clustering          <= node clusters
    └── queries.json             <= the QA set
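The layout above maps directly to paths that a script can build. The helper below is a hypothetical sketch (`dataset_paths` is not a function in this repository) that only assembles the paths; it assumes nothing about the schema of queries.json.

```python
from pathlib import Path


def dataset_paths(root: str, name: str) -> dict:
    """Return the expected locations for one dataset under the data directory."""
    base = Path(root) / name
    return {
        "pdf": base / "pdf",
        "heading_identification": base / "heading_identification",
        "node_clustering": base / "node_clustering",
        "queries": base / "queries.json",
    }
```

For example, `dataset_paths("data", "civic")["queries"]` points to data/civic/queries.json, which can then be loaded with `json.load`.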
The agentic document QA results contain the agent's answers and the LLM-as-judge results (for Finance and Papers).