PDF Question Reader

Overview

PDF Question Reader is a Python script that extracts questions from a PDF file and saves them to a text file. A question is any run of text that ends with a question mark ('?') and starts after the previous sentence boundary.

Features

Extracts questions from multi-page PDF files.
Saves extracted questions to a text file with the same base name as the input PDF file.
Handles different sentence boundaries like periods (.), exclamation marks (!), semicolons (;), and newlines.
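
The boundary handling can be tried on a plain string without a PDF. The following is a minimal sketch, not the script itself: the helper name find_questions and the BOUNDARIES constant are hypothetical, and the sketch assumes the same boundary characters listed above.

```python
# Hypothetical helper for illustration; it works on a plain string, not a PDF.
BOUNDARIES = ".?!;\n"


def find_questions(text: str) -> list[str]:
    """Return every '?'-terminated sentence found in text."""
    questions = []
    for end, char in enumerate(text):
        if char != "?":
            continue
        # Walk backwards to the previous boundary or the start of the text.
        start = end
        while start > 0 and text[start - 1] not in BOUNDARIES:
            start -= 1
        question = text[start:end + 1].strip()
        if len(question) > 1:  # skip a lone '?'
            questions.append(question)
    return questions


sample = "Welcome! What is your name? I am fine; how are you?\nDone."
print(find_questions(sample))
# ['What is your name?', 'how are you?']
```

The exclamation mark and the semicolon each end the preceding sentence, so only the text after them is kept as the question.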

Requirements

Python 3.9 or later (the function signature uses the built-in generic list[str])
pypdf library

Installation

Install Python 3 from python.org.
Install the pypdf library using pip:
```
pip install pypdf
```

Usage

Place the PDF file anywhere the script can read it; a file in the same directory as the script can be referenced by name alone.
Import extract_questions from pdf_question_reader.py and call it with the path to the PDF file.

Example

```python
from pathlib import Path

from pypdf import PdfReader


def extract_questions(pdf_file: str) -> list[str]:
    """
    Extracts all questions from a PDF file and saves them to a text file.

    Parameters:
    pdf_file (str): The path to the PDF file from which questions are extracted.

    Returns:
    list[str]: A list of strings, where each string is a question extracted from the PDF.

    Description:
    This function reads the text content of each page of the provided PDF file, extracts sentences
    that are identified as questions (ending with a question mark '?'), and compiles them into a list.
    It also saves these questions to a text file named after the input PDF file but with a .txt extension.

    Steps:
    1. The function initializes a PdfReader object with the provided PDF file.
    2. It iterates through each page of the PDF to extract text content.
    3. For each page, it examines each character in the extracted text.
    4. When a question mark '?' is encountered, it scans backwards until a sentence boundary
       (one of ".?!;", or a newline) or the start of the page text is found, which marks the start of the question.
    5. The text from that point up to the question mark is stripped of surrounding whitespace and added
       to the list of questions. A lone '?' with no preceding text is skipped.
    6. The list of questions is written to a text file with the same base name as the input PDF but with a .txt extension.
    7. Finally, the function returns the list of questions.

    Example Usage:
    >>> questions = extract_questions("example.pdf")
    >>> print(questions)
    ['What is your name?', 'How are you?', ...]

    Notes:
    - The function assumes that the questions are properly punctuated and that each question ends with a question mark.
    - The function handles multi-page PDFs and compiles questions from all pages.
    - The function saves the questions to a text file in the same directory as the input PDF file.

    """

    reader = PdfReader(pdf_file)
    questions = []

    for page in reader.pages:
        # Guard against pages that yield no extractable text.
        txt = page.extract_text() or ""

        for ind, char in enumerate(txt):
            if char != "?":
                continue

            # Walk backwards to the previous sentence boundary or the start of the text.
            start = ind
            while start > 0 and txt[start - 1] not in ".?!;\n":
                start -= 1

            question = txt[start:ind + 1].strip()
            if len(question) > 1:
                questions.append(question)

    # Swap only the final extension; str.strip(".pdf") would remove any leading or
    # trailing '.', 'p', 'd', 'f' characters from the name instead.
    file_name = Path(pdf_file).with_suffix(".txt")

    with open(file_name, "w", encoding="utf-8") as file:
        file.write("\n".join(questions))

    return questions
```

Example of running the script in a Python environment:

```python
questions = extract_questions("example.pdf")
print(questions)
```

This will extract all the questions from example.pdf and print them. Additionally, it will create a file named example.txt containing all the extracted questions.
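
The mapping from example.pdf to example.txt can be sketched with pathlib. This is one possible approach under stated assumptions: the helper name output_path is hypothetical. It also shows why str.strip(".pdf") is a poor fit for this job: strip removes any of the characters '.', 'p', 'd', 'f' from both ends of the string rather than removing the suffix.

```python
from pathlib import Path


def output_path(pdf_file: str) -> Path:
    """Hypothetical helper: same location and base name, .txt extension."""
    return Path(pdf_file).with_suffix(".txt")


print(output_path("example.pdf"))         # example.txt
print(output_path("fdp_report.pdf"))      # fdp_report.txt

# str.strip treats its argument as a set of characters, not a suffix:
print("fdp_report.pdf".strip(".pdf"))     # _report
```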

Notes

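The function assumes that the questions are properly punctuated and that each question ends with a question mark.
Because a newline counts as a sentence boundary, a question that wraps across several lines in the extracted text is captured only from the start of its last line. Periods inside abbreviations (for example "e.g.") also cut a question short.
The function saves the questions to a text file in the same directory as the input PDF file, overwriting any existing file with that name.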
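
Since a newline is treated as a sentence boundary, questions that wrap onto a second line in the extracted text lose their first part. The sketch below shows one possible alternative, which is not what the script does: it collapses all whitespace first and then splits only on punctuation. The function name find_questions_across_lines is hypothetical, and the trade-off is that a heading without closing punctuation gets merged into the question that follows it.

```python
def find_questions_across_lines(text: str) -> list[str]:
    """Hypothetical variant: newlines are treated as spaces, not boundaries."""
    # Collapse every run of whitespace (including newlines) into one space.
    flat = " ".join(text.split())
    questions = []
    start = 0
    for end, char in enumerate(flat):
        if char in ".!;":
            # A non-question sentence ended; the next one starts after it.
            start = end + 1
        elif char == "?":
            question = flat[start:end + 1].strip()
            if len(question) > 1:  # skip a lone '?'
                questions.append(question)
            start = end + 1
    return questions


wrapped = "Intro done. What happens when a question\nwraps onto the next line? Nothing breaks."
print(find_questions_across_lines(wrapped))
# ['What happens when a question wraps onto the next line?']
```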

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Fork the repository.
Create a new branch.
Make your changes.
Submit a pull request.

Contact

For any questions or suggestions, please open an issue or contact us at [your-email@example.com].
