SHED

Prerequisites

Install Huridocs for document layout analysis

Follow the Huridocs documentation to install it, then run

$ curl -X POST -F 'file=@path-to-your-pdf' localhost:5060

to analyze the layout of a document (e.g., identify headers).

Install dependencies

$ conda env create --file environment.yaml

Python version used in our experiment: 3.10.18

Set up API credentials

Create a .env file in the repository root and add your API credentials to it. For example, if you use Azure:

AZURE_ENDPOINT=xxx
AZURE_API_KEY=xxx
AZURE_API_VERSION=xxx
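
To check that the credentials are picked up before running anything expensive, a small loader can help. The following is a minimal sketch using only the standard library; the function names `load_env_file` and `missing_azure_settings` are hypothetical and not part of this repository, which may load the .env file differently (for example through config.py).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ.

    Hypothetical helper for illustration; not the repository's own loader.
    """
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    os.environ.update(values)
    return values


def missing_azure_settings(env: dict[str, str]) -> list[str]:
    """Return the Azure variable names from the example above that are unset."""
    required = ("AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_API_VERSION")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_azure_settings(load_env_file(".env"))
    if missing:
        print("Missing settings:", ", ".join(missing))
```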

SHT Inference: Local-First

Ensure that you have already identified the headers of a PDF (e.g., via Huridocs).

Node Clustering

Visual pattern extraction: Extract the visual patterns of the identified headers. A visual pattern includes:
- font_size (rounded to 2 decimal places)
- font_name
- font_color
- is_all_cap (alphabetic characters only)
- is_centered (|mid_bbox - mid_page| <= 2)
- list_type
- is_underlined
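
The pattern fields can be sketched as a small record plus two helper checks. This is an illustrative sketch, not the repository's implementation: the input dictionary keys (`text`, `left`, `right`, and so on), the `page_width` parameter, and the choice to treat text without any letters as not all-caps are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VisualPattern:
    font_size: float
    font_name: str
    font_color: str
    is_all_cap: bool
    is_centered: bool
    list_type: Optional[str]
    is_underlined: bool


def is_all_cap(text: str) -> bool:
    """True if every alphabetic character is uppercase (digits and symbols ignored)."""
    letters = [ch for ch in text if ch.isalpha()]
    # Assumption: text with no letters at all does not count as all-caps.
    return bool(letters) and all(ch.isupper() for ch in letters)


def is_centered(bbox_left: float, bbox_right: float, page_width: float,
                tolerance: float = 2.0) -> bool:
    """True if |mid_bbox - mid_page| <= tolerance."""
    mid_bbox = (bbox_left + bbox_right) / 2
    mid_page = page_width / 2
    return abs(mid_bbox - mid_page) <= tolerance


def extract_pattern(header: dict, page_width: float) -> VisualPattern:
    """Build a VisualPattern from one header record (hypothetical input schema)."""
    return VisualPattern(
        font_size=round(header["font_size"], 2),
        font_name=header["font_name"],
        font_color=header["font_color"],
        is_all_cap=is_all_cap(header["text"]),
        is_centered=is_centered(header["left"], header["right"], page_width),
        list_type=header.get("list_type"),
        is_underlined=bool(header.get("is_underlined", False)),
    )
```

Making the record frozen keeps it hashable, so identical patterns can be compared or used as dictionary keys directly.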

Node clustering: Cluster nodes based on their visual patterns.
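
One simple way to realize this step is to group headers whose visual patterns match exactly. The sketch below assumes exact-match grouping and a hypothetical `cluster_by_pattern` helper operating on plain dictionaries; the clustering used in this repository may apply a different similarity criterion.

```python
from collections import defaultdict


def cluster_by_pattern(nodes: list[dict], keys: tuple[str, ...]) -> list[list[dict]]:
    """Group nodes that share identical values for the given pattern keys.

    Clusters are returned in order of first appearance in the document.
    """
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for node in nodes:
        # The signature is the tuple of pattern values used as the cluster key.
        signature = tuple(node[key] for key in keys)
        clusters[signature].append(node)
    return list(clusters.values())


if __name__ == "__main__":
    headers = [
        {"text": "1 Intro", "font_size": 14.0, "font_name": "A"},
        {"text": "1.1 Background", "font_size": 12.0, "font_name": "A"},
        {"text": "2 Method", "font_size": 14.0, "font_name": "A"},
    ]
    for cluster in cluster_by_pattern(headers, ("font_size", "font_name")):
        print([node["text"] for node in cluster])
```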

SHT assembly: Infer an SHT using the local-first approach.

Application: Agentic Document QA

Our SHT-based agent receives an SHT in its user prompt and uses the read_section tool to retrieve relevant sections on demand.
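
The interaction can be sketched as a tree of sections, an outline rendered into the prompt, and a lookup function behind the tool. Only the tool name read_section comes from this README; the `Section` class, its fields, the `render_outline` helper, and the tool's signature are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    section_id: str
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def render_outline(node: Section, depth: int = 0) -> str:
    """Render the tree as an indented outline suitable for a user prompt."""
    lines = [f"{'  ' * depth}[{node.section_id}] {node.title}"]
    for child in node.children:
        lines.append(render_outline(child, depth + 1))
    return "\n".join(lines)


def read_section(root: Section, section_id: str) -> str:
    """Return the text of the section with the given id (hypothetical signature)."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.section_id == section_id:
            return node.text
        stack.extend(node.children)
    # Returning a message instead of raising lets the agent recover and retry.
    return f"Section {section_id!r} not found."
```

The agent sees only titles and ids up front, so the prompt stays short, and it pays for section text only when it decides a section is relevant.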

Datasets

The four datasets for evaluation are stored under data/ with the following structure:

data
└── civic                   <= dataset name
    ├── pdf                     <= the documents
    ├── heading_identification  <= results of layout analysis (e.g., via Huridocs)
    ├── node_clustering         <= node clusters
    └── queries.json            <= the QA set
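
Given this layout, the paths for one dataset can be resolved with a small helper. This is a sketch under stated assumptions: `dataset_paths` and `load_queries` are hypothetical names, and the schema of queries.json is not described here, so the loader returns whatever JSON the file contains.

```python
import json
from pathlib import Path


def dataset_paths(name: str, data_root: str = "data") -> dict[str, Path]:
    """Map each component of a dataset (e.g., "civic") to its path under data/."""
    base = Path(data_root) / name
    return {
        "pdf": base / "pdf",
        "heading_identification": base / "heading_identification",
        "node_clustering": base / "node_clustering",
        "queries": base / "queries.json",
    }


def load_queries(name: str, data_root: str = "data"):
    """Load the QA set for a dataset without assuming its internal schema."""
    with open(dataset_paths(name, data_root)["queries"], encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    for component, path in dataset_paths("civic").items():
        print(f"{component}: {path}")
```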

Results

Agentic document QA results store agent answers and LLM-as-judge results (for Finance and Papers).

