YulianaHrynda/PDF-Question-Reader

PDF Question Reader

Overview

PDF Question Reader is a Python script designed to extract all questions from a PDF file and save them into a text file. The extracted questions are sentences that end with a question mark ('?').

Features

  • Extracts questions from multi-page PDF files.
  • Saves extracted questions to a text file with the same base name as the input PDF file.
  • Treats periods (.), question marks (?), exclamation marks (!), semicolons (;), and newlines as sentence boundaries when locating the start of a question.
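
The boundary rule can be tried on a plain string without opening a PDF. The following is a minimal sketch, not the script's own implementation: it uses a regular expression instead of a character-by-character scan, and the names QUESTION_RE and questions_in_text are hypothetical.

```python
import re

# A question is a run of non-boundary characters followed by "?".
# Boundaries assumed here: period, question mark, exclamation mark,
# semicolon, and newline.
QUESTION_RE = re.compile(r"[^.?!;\n]*\?")


def questions_in_text(text: str) -> list[str]:
    """Return the questions found in already-extracted page text (hypothetical helper)."""
    found = (match.strip() for match in QUESTION_RE.findall(text))
    # Drop a bare "?" that directly follows another boundary,
    # for example the second "?" in "??".
    return [q for q in found if q != "?"]


sample = "Hello there. What is your name? Fine; how are you?\nNext line! Is it?"
print(questions_in_text(sample))
# ['What is your name?', 'how are you?', 'Is it?']
```

Because a semicolon counts as a boundary, "Fine; how are you?" yields only "how are you?".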

Requirements

  • Python 3.9 or later (the example uses the list[str] type hint)
  • pypdf library

Installation

  1. Install Python 3 from python.org.
  2. Install the pypdf library using pip:
    pip install pypdf

Usage

  1. Place the PDF file from which you want to extract questions in the same directory as the script, or note its path.
  2. Call the extract_questions function with the path to the PDF file as its argument.

Example

from pypdf import PdfReader

def extract_questions(pdf_file: str) -> list[str]:
    """
    Extracts all questions from a PDF file and saves them to a text file.

    Parameters:
    pdf_file (str): The path to the PDF file from which questions need to be extracted.

    Returns:
    list[str]: A list of strings, where each string is a question extracted from the PDF.
    
    Description:
    This function reads the text content of each page of the provided PDF file, extracts sentences
    that are identified as questions (ending with a question mark '?'), and compiles them into a list.
    It also saves these questions to a text file named after the input PDF file but with a .txt extension.

    Steps:
    1. The function initializes a PdfReader object with the provided PDF file.
    2. It iterates through each page of the PDF to extract text content.
    3. For each page, it examines each character in the extracted text.
    4. When a question mark '?' is encountered, it collects characters in reverse until a sentence boundary
       (one of ".?!;", or a newline) is found, indicating the start of the question.
    5. The extracted question is then reversed, cleaned, and added to the list of questions.
    6. The list of questions is written to a text file with the same base name as the input PDF but with a .txt extension.
    7. Finally, the function returns the list of questions.

    Example Usage:
    >>> questions = extract_questions("example.pdf")
    >>> print(questions)
    ['What is your name?', 'How are you?', ...]

    Notes:
    - The function assumes that the questions are properly punctuated and that each question ends with a question mark.
    - The function handles multi-page PDFs and compiles questions from all pages.
    - The function saves the questions to a text file in the same directory as the input PDF file.

    """
    
    reader = PdfReader(pdf_file)
    questions = []

    for page in reader.pages:

        txt = page.extract_text()

        for ind, char in enumerate(txt):

            if char == "?":
                question = char

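                # Walk backwards to the previous sentence boundary,
                # or to the start of the page text if there is none.
                for i in range(ind - 1, -1, -1):
                    if txt[i] in ".?!;\n":
                        break
                    question += txt[i]

                # The characters were collected in reverse order.
                questions.append(question[::-1].strip())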

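    # str.strip(".pdf") would remove any of the characters ".", "p", "d", "f"
    # from both ends of the name, so drop the extension explicitly instead.
    file_name = pdf_file.removesuffix(".pdf") + ".txt"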

    with open(file_name, "w", encoding="utf-8") as file:

        questions_ = "\n".join(questions)
        file.write(questions_)

    return questions
  1. Example of running the script in a Python environment:
questions = extract_questions("example.pdf")
print(questions)

This will extract all the questions from example.pdf and print them. Additionally, it will create a file named example.txt containing all the extracted questions.
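
The output file name is derived from the input path. The following is a minimal sketch of one way to do that with pathlib; the helper name output_path_for is hypothetical. It also shows why str.strip is not suitable for removing an extension: it strips characters, not a suffix.

```python
from pathlib import Path


def output_path_for(pdf_file: str) -> Path:
    """Return the .txt path that sits next to pdf_file (hypothetical helper)."""
    # with_suffix swaps only the final extension and keeps the directory part.
    return Path(pdf_file).with_suffix(".txt")


print(output_path_for("example.pdf"))  # example.txt
print("paper.pdf".strip(".pdf"))       # aper (characters stripped, not the suffix)
```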

Notes

  • The function assumes that the questions are properly punctuated and that each question ends with a question mark.
  • The function saves the questions to a text file in the same directory as the input PDF file.
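
To see what the punctuation assumption means in practice, the backward scan can be reproduced on plain strings. This is a sketch under stated assumptions: it uses slicing instead of collecting characters in reverse, and the names BOUNDARIES and question_ending_at are hypothetical. An abbreviation containing a period, or a line break inside a question, cuts the question short.

```python
# Characters treated as sentence boundaries (assumed to match the script).
BOUNDARIES = ".?!;\n"


def question_ending_at(text: str, end: int) -> str:
    """Return the question whose '?' is at index end, scanning backwards."""
    start = end
    # Move left until the previous boundary or the start of the text.
    while start > 0 and text[start - 1] not in BOUNDARIES:
        start -= 1
    return text[start:end + 1].strip()


abbreviated = "Do you prefer fruit, e.g. apples?"
wrapped = "What is the capital\nof France?"

print(question_ending_at(abbreviated, abbreviated.index("?")))  # apples?
print(question_ending_at(wrapped, wrapped.index("?")))          # of France?
```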

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

  1. Fork the repository.
  2. Create a new branch.
  3. Make your changes.
  4. Submit a pull request.

Contact

For any questions or suggestions, please open an issue or contact us at [your-email@example.com].

About

A program that reads a PDF file and extracts only the questions in it.
