PDF Question Reader is a Python script designed to extract all questions from a PDF file and save them into a text file. The extracted questions are sentences that end with a question mark ('?').
- Extracts questions from multi-page PDF files.
- Saves extracted questions to a text file with the same base name as the input PDF file.
- Handles different sentence boundaries like periods (.), exclamation marks (!), semicolons (;), and newlines.
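
As a rough illustration of the boundary rule described above, the sketch below applies it to a plain string instead of a PDF page. The helper name questions_in_text and the BOUNDARIES constant are hypothetical and are not part of the script; the sketch only assumes the same boundary characters (period, question mark, exclamation mark, semicolon, newline).

```python
BOUNDARIES = ".?!;\n"  # assumed to match the boundary characters the script checks


def questions_in_text(text: str) -> list[str]:
    """Return the '?'-terminated sentences found in a plain string."""
    found = []
    start = 0  # index just after the most recent boundary character
    for ind, char in enumerate(text):
        if char in BOUNDARIES:
            if char == "?":
                candidate = text[start : ind + 1].strip()
                if candidate != "?":  # ignore a question mark with no text before it
                    found.append(candidate)
            start = ind + 1
    return found


sample = "Welcome! What is your name? I am fine; how are you?\nGoodbye."
print(questions_in_text(sample))  # ['What is your name?', 'how are you?']
```

Note that the semicolon counts as a boundary, so only the part after "I am fine;" is reported as the second question.
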
- Python 3.9 or later (the list[str] return annotation used by the script requires it)
- pypdf library
- Install Python 3 from python.org.
- Install the pypdf library using pip:
  pip install pypdf
- Place the PDF file from which you want to extract questions in the same directory as the script.
- Run the script by passing the PDF file as an argument to the extract_questions function.
from pathlib import Path

from pypdf import PdfReader


def extract_questions(pdf_file: str) -> list[str]:
    """
    Extracts all questions from a PDF file and saves them to a text file.

    Parameters:
        pdf_file (str): The path to the PDF file from which questions need to be extracted.

    Returns:
        list[str]: A list of strings, where each string is a question extracted from the PDF.

    Description:
        This function reads the text content of each page of the provided PDF file, extracts sentences
        that are identified as questions (ending with a question mark '?'), and compiles them into a list.
        It also saves these questions to a text file named after the input PDF file but with a .txt extension.

    Steps:
        1. The function initializes a PdfReader object with the provided PDF file.
        2. It iterates through each page of the PDF to extract text content.
        3. For each page, it examines each character in the extracted text.
        4. When a question mark '?' is encountered, it collects characters in reverse until a sentence boundary
           (one of ".?!;", or a newline) or the start of the page text is found, indicating the start of the question.
        5. The extracted question is then reversed, cleaned, and added to the list of questions.
        6. The list of questions is written to a text file with the same base name as the input PDF but with a .txt extension.
        7. Finally, the function returns the list of questions.

    Example Usage:
        >>> questions = extract_questions("example.pdf")
        >>> print(questions)
        ['What is your name?', 'How are you?', ...]

    Notes:
        - The function assumes that the questions are properly punctuated and that each question ends with a question mark.
        - The function handles multi-page PDFs and compiles questions from all pages.
        - The function saves the questions to a text file in the same directory as the input PDF file.
    """
    reader = PdfReader(pdf_file)
    questions = []
    for page in reader.pages:
        txt = page.extract_text()
        for ind, char in enumerate(txt):
            if char == "?":
                question = char
                # Walk backwards to the previous sentence boundary. The range ends at
                # index 0 so a question at the very start of the page is not lost.
                for i in range(ind - 1, -1, -1):
                    if txt[i] in ".?!;\n":
                        break
                    question += txt[i]
                question = question[::-1].strip()
                if question != "?":  # skip a question mark with no text before it
                    questions.append(question)
    # Replace the extension instead of calling str.strip(".pdf"), which removes
    # any of the characters '.', 'p', 'd', 'f' from both ends of the name.
    file_name = Path(pdf_file).with_suffix(".txt")
    with open(file_name, "w", encoding="utf-8") as file:
        file.write("\n".join(questions))
    return questions

- Example of running the script in a Python environment:
questions = extract_questions("example.pdf")
print(questions)

This will extract all the questions from example.pdf and print them. Additionally, it will create a file named example.txt containing all the extracted questions.
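
The output name in the example (example.pdf becoming example.txt) can be derived in several ways. The sketch below shows one option using pathlib; the helper name output_path_for is hypothetical. It also shows why str.strip(".pdf") is not a safe way to drop the extension: strip removes a set of characters from both ends of the string, not a suffix.

```python
from pathlib import Path


def output_path_for(pdf_file: str) -> Path:
    """Return the input path with its final extension replaced by .txt."""
    return Path(pdf_file).with_suffix(".txt")


print(output_path_for("example.pdf"))  # example.txt
print("draft.pdf".strip(".pdf"))       # raft (the leading 'd' is stripped too)
```
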
- The function assumes that the questions are properly punctuated and that each question ends with a question mark.
- The function saves the questions to a text file in the same directory as the input PDF file.
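
To make the punctuation assumption concrete, the sketch below shows inputs where it breaks down. It uses a regular expression as a compact approximation of the boundary rule (a run of non-boundary characters ending in a question mark), not the script's own character-by-character loop; the name find_questions is hypothetical.

```python
import re

# Approximation of the boundary rule: one or more characters that are not
# '.', '?', '!', ';' or a newline, followed by a question mark.
QUESTION_PATTERN = re.compile(r"[^.?!;\n]+\?")


def find_questions(text: str) -> list[str]:
    """Return '?'-terminated fragments, trimmed of surrounding whitespace."""
    return [match.strip() for match in QUESTION_PATTERN.findall(text)]


print(find_questions("How are you?"))                    # ['How are you?']
print(find_questions("What does e.g. mean?"))            # ['mean?']
print(find_questions("Is this\nreally one question?"))   # ['really one question?']
```

Periods inside abbreviations and line breaks inside a sentence are both treated as boundaries, so questions containing them are truncated. PDF text extraction often inserts line breaks where the page wraps, so the second case is likely to be common in practice.
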
This project is licensed under the MIT License. See the LICENSE file for details.
- Fork the repository.
- Create a new branch.
- Make your changes.
- Submit a pull request.
For any questions or suggestions, please open an issue or contact us at [your-email@example.com].