This project is a tool to analyze PDF documents for mentions of specific technological enablers, combining keyword search and AI-powered analysis.
- Place the
main.pyand your keyword JSON file (e.g.,REF.json) in the PDFAnalyzer folder. - Place your PDF files in a separate folder (e.g.,
ResearchPapers). - Run the
process_batch.batfile in the Windows command prompt (CMD).This will process all PDF files in batches of 3, using the specified keyword JSON file and OpenRouter API.process_batch.bat
-
Place the
main.py,run.bat, and your keyword JSON file (e.g.,6G.json) in the same folder where the PDF files you want to analyze are located. -
The PDF files must be named numerically as
p0.pdf,p1.pdf,p2.pdf, and so on. -
Run the
install.ps1PowerShell script to set up a Python virtual environment and install dependencies. -
Run the
run.batfile in the Windows command prompt (CMD) with the following arguments:run.bat [source_folder] [start_index] [end_index] [min_representative_matches] [keywords_path] [model] [prompt_approval]Required parameters:
source_folder: Path to the folder containing PDF filesstart_index: Starting index (0-based) of PDF files to processend_index: Ending index (0-based) of PDF files to process (inclusive)min_representative_matches: Minimum keyword matches to consider source representative
Optional parameters:
keywords_path: Path to JSON file with keywords (default: 6G.json)model: LLM model to use (default: gpt-4.1-mini-2025-04-14)prompt_approval: Enable/disable prompt approval (true/false, default: false)
For example:
run.bat C:\Users\alberti\Documents\Artigos 0 42 100This will process PDF files from
p0.pdftop42.pdfin the specified folder, generating text files with the analysis results for each PDF.Another example with custom parameters:
run.bat C:\Users\alberti\Documents\Artigos 0 42 100 REF.json gpt-4.1-mini-2025-04-14 false
- The
main.pyscript reads each PDF, extracts the text from each page, and searches for keywords related to different categories of technological enablers. - For each occurrence found, the script prints the page, the keyword, and a snippet of context from the text.
- The script classifies and counts keyword occurrences by enabler category.
- If the total matches exceed the minimum representative threshold, it uses an AI language model (via the OpenRouter API with OpenAI Python library) to generate an advanced analysis based on the keywords and significant paragraphs extracted from the paper.
- The
llm_query.pyhas been updated to properly use the OpenAI library with OpenRouter API for enhanced compatibility. - The results are saved in
.txtfiles corresponding to each analyzed PDF, with category-specific analysis files for each enabler category.
- Python 3.x
- PyPDF2, pdfplumber, and openai libraries (install via
pip install -r requirements.txt) - An OpenRouter API key set as
ROUTER_API_KEYin a.envfile - The
.envfile should be placed in the PDFAnalyzer directory
- The script excludes the references section of the PDFs to avoid false positives.
- The range of files processed in
run.batcan be adjusted as needed. - The
install.ps1script sets up a Python virtual environment and installs all dependencies.
For questions or suggestions, please contact the developer.