web-file-analyzer

Web scraper & PDF parser tool

CLI tool developed for internal use at a company to download PDF files from a URL and parse their tables into a JSON structure. The tool is therefore very specific in its current form and won't work on arbitrary PDF files, as it looks for certain key:value pairs in the tables it parses.

Software Architecture & Design Patterns

This project applies several Gang of Four (GoF) design patterns within a pipeline architecture.

Pipeline Architecture

The application follows a two-stage sequential pipeline:

```mermaid
graph LR
    A[URL Input] --> B[Web Scraper]
    B --> C[PDF Files]
    C --> D[PDF Parser]
    D --> E[JSON Output]
```

Each stage is encapsulated in its own module with a consistent lifecycle, making the pipeline extensible — new stages can be added without modifying existing ones.
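The two-stage flow above can be sketched as plain functions; the stage names follow the diagram, while the wiring and return shapes shown here are illustrative assumptions, not the project's actual API:

```python
def scrape(url):
    # Stage 1: would download the PDF files linked from the page at `url`
    return [f"{url}/doc.pdf"]

def parse(pdf_files):
    # Stage 2: would extract tables from each PDF into JSON-ready dicts
    return [{"source": f, "tables": []} for f in pdf_files]

def pipeline(url):
    # Sequential composition: the output of one stage feeds the next
    return parse(scrape(url))
```

Because each stage only consumes the previous stage's output, a new stage (say, a validator between parsing and writing) slots in without touching the existing ones.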

Template Method Pattern

Location: src/coinverscrapy/model/template/ModuleTemplate.py

The ModuleTemplate abstract class defines a fixed algorithm skeleton with three steps, deferring implementation to subclasses:

```mermaid
classDiagram
    class ModuleTemplate {
        <<abstract>>
        +start()
        +initialize()*
        +run()*
        +finalize()*
    }
    class ScraperModule {
        +initialize()
        +run()
        +finalize()
    }
    class ParserModule {
        +initialize()
        +run()
        +finalize()
    }
    ModuleTemplate <|-- ScraperModule : manages download lifecycle
    ModuleTemplate <|-- ParserModule : manages parsing lifecycle
```

The start() method enforces the execution order, ensuring all modules follow a consistent lifecycle regardless of their specific behavior.
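A minimal sketch of that skeleton; class and method names mirror the diagram, while the bodies of the concrete steps are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class ModuleTemplate(ABC):
    def start(self):
        # Fixed algorithm skeleton: the order is enforced here,
        # subclasses can only fill in the individual steps
        self.initialize()
        self.run()
        self.finalize()

    @abstractmethod
    def initialize(self): ...

    @abstractmethod
    def run(self): ...

    @abstractmethod
    def finalize(self): ...

class ScraperModule(ModuleTemplate):
    def __init__(self):
        self.steps = []  # records the lifecycle order for demonstration

    def initialize(self):
        self.steps.append("initialize")

    def run(self):
        self.steps.append("run")

    def finalize(self):
        self.steps.append("finalize")
```

Calling `ScraperModule().start()` always yields the same `initialize → run → finalize` sequence, no matter what each step does internally.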

Strategy Pattern

Location: src/coinverscrapy/model/proxy/ModuleExecutor.py

The ModuleExecutor abstract class defines an interchangeable execute() interface. Concrete strategies are injected into ModuleTemplate subclasses at runtime:

```mermaid
classDiagram
    class ModuleExecutor {
        <<abstract>>
        +execute()*
    }
    class Scraper {
        +execute()
    }
    class Parser {
        +execute()
    }
    class ModuleTemplate {
        <<abstract>>
        -executor: ModuleExecutor
    }
    ModuleExecutor <|-- Scraper : extracts PDF URLs via BeautifulSoup
    ModuleExecutor <|-- Parser : extracts table data via Camelot
    ModuleTemplate o-- ModuleExecutor : injects
```

This decouples the what (business logic) from the when (lifecycle orchestration), allowing either to vary independently.
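The injection can be sketched like this; the real `Scraper` uses BeautifulSoup (and `Parser` uses Camelot), which are stubbed out here as assumptions:

```python
from abc import ABC, abstractmethod

class ModuleExecutor(ABC):
    @abstractmethod
    def execute(self):
        """Interchangeable unit of business logic."""

class Scraper(ModuleExecutor):
    def execute(self):
        # Would extract PDF URLs from a page via BeautifulSoup;
        # stubbed with a fixed result for illustration
        return ["report.pdf"]

class ScraperModule:
    def __init__(self, executor: ModuleExecutor):
        self.executor = executor  # strategy injected at construction time

    def run(self):
        # Lifecycle code decides *when*; the strategy decides *what*
        return self.executor.execute()
```

Swapping `Scraper()` for `Parser()` (or a test double) changes the behavior without touching the lifecycle code.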

Chain of Responsibility Pattern

Location: src/coinverscrapy/model/formatting_handlers/

Text extracted from PDF tables requires multi-step cleaning. Each formatting concern is isolated in its own handler, linked in a chain:

```mermaid
classDiagram
    class Handler {
        <<ABC>>
        +set_next(handler)*
        +handle(request)*
    }
    class AbstractHandler {
        -next_handler: Handler
        +set_next(handler)
        +handle(request)
    }
    Handler <|-- AbstractHandler
    AbstractHandler <|-- GenericListHandler
    AbstractHandler <|-- CompetenceNewlineHandler
    AbstractHandler <|-- SubTaskHandler
    AbstractHandler <|-- NumericListHandler
    AbstractHandler <|-- NumericListEdgecaseHandler
    AbstractHandler <|-- ExcessNewlineHandler
    AbstractHandler <|-- ExcessWhitespaceHandler
```

Each handler decides whether to process the input or pass it to the next handler in the chain. New formatting rules can be added by creating a new handler and inserting it into the chain — no existing handlers need modification (Open/Closed Principle).
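A minimal sketch of the chain; the handler names come from the diagram, but the specific cleaning rules shown are illustrative assumptions:

```python
import re
from abc import ABC, abstractmethod

class Handler(ABC):
    @abstractmethod
    def set_next(self, handler): ...

    @abstractmethod
    def handle(self, request): ...

class AbstractHandler(Handler):
    def __init__(self):
        self._next = None

    def set_next(self, handler):
        self._next = handler
        return handler  # returning the handler allows fluent chaining

    def handle(self, request):
        # Default behavior: pass the (possibly transformed) text onward
        if self._next:
            return self._next.handle(request)
        return request

class ExcessNewlineHandler(AbstractHandler):
    def handle(self, request):
        # Collapse doubled newlines, then continue down the chain
        return super().handle(request.replace("\n\n", "\n"))

class ExcessWhitespaceHandler(AbstractHandler):
    def handle(self, request):
        # Collapse runs of spaces, then continue down the chain
        return super().handle(re.sub(r" {2,}", " ", request))

chain = ExcessNewlineHandler()
chain.set_next(ExcessWhitespaceHandler())
```

Adding a new rule means writing one more `AbstractHandler` subclass and one more `set_next` call; none of the existing handlers change.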

Data Transfer Objects (DTOs)

Location: src/coinverscrapy/model/json_container/

Structured output models separate data representation from business logic:

```mermaid
classDiagram
    class JsonContainer {
        +titel: str
        +omschrijving: str
        +leerdoelen: List~Leerdoel~
    }
    class Leerdoel {
        +titel: str
        +omschrijving: str
        +onderdelen: List~str~
    }
    JsonContainer "1" *-- "many" Leerdoel
```
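The field names (Dutch: *titel* = title, *omschrijving* = description, *leerdoelen* = learning goals, *onderdelen* = parts) match the diagram above; expressing them as dataclasses, as sketched here, is an assumption about the implementation:

```python
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Leerdoel:
    titel: str
    omschrijving: str
    onderdelen: List[str] = field(default_factory=list)

@dataclass
class JsonContainer:
    titel: str
    omschrijving: str
    leerdoelen: List[Leerdoel] = field(default_factory=list)

doc = JsonContainer(
    titel="Course",
    omschrijving="Overview",
    leerdoelen=[Leerdoel("Goal 1", "Description", ["part a"])],
)
# asdict(doc) produces a nested dict ready for json.dumps
```

Keeping the output shape in plain data objects means the parser can change freely without affecting the JSON structure consumers rely on.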

Component Diagram

```mermaid
graph TD
    A[CLI Entry Point<br/>main.py] --> B[ScraperModule<br/>Template Method]
    A --> C[ParserModule<br/>Template Method]

    B -->|injects| D[Scraper<br/>Strategy]
    C -->|injects| E[Parser<br/>Strategy]

    D -->|uses| F[requests + BeautifulSoup]
    E -->|uses| G[Camelot PDF extraction]

    E --> H[Handler Chain<br/>Chain of Responsibility]
    H --> I[GenericListHandler]
    I --> J[CompetenceNewlineHandler]
    J --> K[SubTaskHandler]
    K --> L[NumericListHandler]
    L --> M[ExcessNewlineHandler]
    M --> N[ExcessWhitespaceHandler]

    E --> O[JsonContainer + Leerdoel<br/>DTOs]
    O --> P[JSON Output Files]
```
