PDFAnalyzer

This project is a tool to analyze PDF documents for mentions of specific technological enablers, combining keyword search and AI-powered analysis.

How to use

Method 1: Using the batch processor

Place the main.py and your keyword JSON file (e.g., REF.json) in the PDFAnalyzer folder.
Place your PDF files in a separate folder (e.g., ResearchPapers).
Run the process_batch.bat file in the Windows command prompt (CMD).
```
process_batch.bat
```
This will process all PDF files in batches of 3, using the specified keyword JSON file and OpenRouter API.

Method 2: Using the manual processor

Place the main.py, run.bat, and your keyword JSON file (e.g., 6G.json) in the same folder where the PDF files you want to analyze are located.
The PDF files must be named numerically as p0.pdf, p1.pdf, p2.pdf, and so on.
Run the install.ps1 PowerShell script to set up a Python virtual environment and install dependencies.
Run the run.bat file in the Windows command prompt (CMD) with the following arguments:
```
run.bat [source_folder] [start_index] [end_index] [min_representative_matches] [keywords_path] [model] [prompt_approval]
```
Required parameters:
- source_folder: Path to the folder containing PDF files
- start_index: Starting index (0-based) of PDF files to process
- end_index: Ending index (0-based) of PDF files to process (inclusive)
- min_representative_matches: Minimum keyword matches to consider source representative
Optional parameters:
- keywords_path: Path to JSON file with keywords (default: 6G.json)
- model: LLM model to use (default: gpt-4.1-mini-2025-04-14)
- prompt_approval: Enable/disable prompt approval (true/false, default: false)
For example:
```
run.bat C:\Users\alberti\Documents\Artigos 0 42 100
```
This will process PDF files from p0.pdf to p42.pdf in the specified folder, generating text files with the analysis results for each PDF.

Another example with custom parameters:
```
run.bat C:\Users\alberti\Documents\Artigos 0 42 100 REF.json gpt-4.1-mini-2025-04-14 false
```

What the script does

The main.py script reads each PDF, extracts the text from each page, and searches for keywords related to different categories of technological enablers.
For each occurrence found, the script prints the page, the keyword, and a snippet of context from the text.
The script classifies and counts keyword occurrences by enabler category.
If the total matches exceed the minimum representative threshold, it uses an AI language model (via the OpenRouter API with OpenAI Python library) to generate an advanced analysis based on the keywords and significant paragraphs extracted from the paper.
The llm_query.py has been updated to properly use the OpenAI library with OpenRouter API for enhanced compatibility.
The results are saved in .txt files corresponding to each analyzed PDF, with category-specific analysis files for each enabler category.

Requirements

Python 3.x
PyPDF2, pdfplumber, and openai libraries (install via pip install -r requirements.txt)
An OpenRouter API key set as ROUTER_API_KEY in a .env file
The .env file should be placed in the PDFAnalyzer directory

Notes

The script excludes the references section of the PDFs to avoid false positives.
The range of files processed in run.bat can be adjusted as needed.
The install.ps1 script sets up a Python virtual environment and installs all dependencies.

Contact

For questions or suggestions, please contact the developer.

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
.gitignore		.gitignore
6G.json		6G.json
Old prompts.md		Old prompts.md
README_pdf_keyword_ranker.md		README_pdf_keyword_ranker.md
final_prompt.txt		final_prompt.txt
get_models.py		get_models.py
install.ps1		install.ps1
keyword_occurrence_prompt.txt		keyword_occurrence_prompt.txt
keyword_search.py		keyword_search.py
llm_query.py		llm_query.py
main.py		main.py
models.txt		models.txt
pdf_keyword_ranker.py		pdf_keyword_ranker.py
pdf_keyword_searcher.py		pdf_keyword_searcher.py
pdf_renamer.py		pdf_renamer.py
process_batch.bat		process_batch.bat
readme.md		readme.md
requirements.txt		requirements.txt
run.bat		run.bat
run_keyword_search.py		run_keyword_search.py
summary_prompt.txt		summary_prompt.txt
test_web_search_tools.py		test_web_search_tools.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDFAnalyzer

How to use

Method 1: Using the batch processor

Method 2: Using the manual processor

What the script does

Requirements

Notes

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PDFAnalyzer

How to use

Method 1: Using the batch processor

Method 2: Using the manual processor

What the script does

Requirements

Notes

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages