Web scraper & PDF parser tool
A CLI tool developed for internal use at a company to download PDF files from a URL and parse their tables into a JSON structure. The tool is deliberately specific in its current form: it will not work on arbitrary PDF files, as it looks for certain key:value pairs in the tables it parses.
This project applies several Gang of Four (GoF) design patterns within a pipeline architecture.
The application follows a two-stage sequential pipeline:
graph LR
A[URL Input] --> B[Web Scraper]
B --> C[PDF Files]
C --> D[PDF Parser]
D --> E[JSON Output]
Each stage is encapsulated in its own module with a consistent lifecycle, making the pipeline extensible: new stages can be added without modifying existing ones.
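The two-stage flow above can be sketched as a sequential runner in which each stage's output feeds the next stage's input. This is a minimal illustration, not the actual implementation; `run_pipeline`, `Scrape`, and `Parse` are hypothetical stand-ins for the real modules.

```python
def run_pipeline(stages, payload):
    # Each stage exposes a uniform start(payload) entry point;
    # the output of one stage becomes the input of the next.
    for stage in stages:
        payload = stage.start(payload)
    return payload


class Scrape:
    def start(self, url):
        # Stand-in: the real scraper would download PDFs linked from the URL.
        return [url + "/doc.pdf"]


class Parse:
    def start(self, pdfs):
        # Stand-in: the real parser would extract table data into JSON structures.
        return [{"source": p} for p in pdfs]
```

Because every stage shares the same entry point, adding a third stage is just a matter of appending it to the list passed to `run_pipeline`.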
Location: src/coinverscrapy/model/template/ModuleTemplate.py
The ModuleTemplate abstract class defines a fixed algorithm skeleton with three steps, deferring implementation to subclasses:
classDiagram
class ModuleTemplate {
<<abstract>>
+start()
+initialize()*
+run()*
+finalize()*
}
class ScraperModule {
+initialize()
+run()
+finalize()
}
class ParserModule {
+initialize()
+run()
+finalize()
}
ModuleTemplate <|-- ScraperModule : manages download lifecycle
ModuleTemplate <|-- ParserModule : manages parsing lifecycle
The start() method enforces the execution order, ensuring all modules follow a consistent lifecycle regardless of their specific behavior.
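The skeleton described above can be sketched as follows. This is a simplified illustration of the pattern, assuming hypothetical method bodies; the real `ScraperModule` performs downloads rather than logging.

```python
from abc import ABC, abstractmethod


class ModuleTemplate(ABC):
    """Template Method: start() fixes the execution order."""

    def start(self):
        # The skeleton enforces initialize -> run -> finalize,
        # regardless of what the subclass does in each step.
        self.initialize()
        result = self.run()
        self.finalize()
        return result

    @abstractmethod
    def initialize(self): ...

    @abstractmethod
    def run(self): ...

    @abstractmethod
    def finalize(self): ...


class ScraperModule(ModuleTemplate):
    def __init__(self):
        self.log = []

    def initialize(self):
        self.log.append("init")   # e.g. validate the URL, prepare output dirs

    def run(self):
        self.log.append("run")    # e.g. download the PDF files
        return "pdfs"

    def finalize(self):
        self.log.append("final")  # e.g. close sessions, report results
```

Subclasses cannot reorder the steps; they can only fill them in.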
Location: src/coinverscrapy/model/proxy/ModuleExecutor.py
The ModuleExecutor abstract class defines an interchangeable execute() interface. Concrete strategies are injected into ModuleTemplate subclasses at runtime:
classDiagram
class ModuleExecutor {
<<abstract>>
+execute()*
}
class Scraper {
+execute()
}
class Parser {
+execute()
}
class ModuleTemplate {
<<abstract>>
-executor: ModuleExecutor
}
ModuleExecutor <|-- Scraper : extracts PDF URLs via BeautifulSoup
ModuleExecutor <|-- Parser : extracts table data via Camelot
ModuleTemplate o-- ModuleExecutor : injects
This decouples the what (business logic) from the when (lifecycle orchestration), allowing either to vary independently.
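A minimal sketch of this injection, under the assumption that the lifecycle host receives its executor via the constructor (`Module` and the filtering logic here are illustrative, not the project's actual code):

```python
from abc import ABC, abstractmethod


class ModuleExecutor(ABC):
    """Interchangeable business-logic interface."""

    @abstractmethod
    def execute(self, payload): ...


class Scraper(ModuleExecutor):
    def execute(self, payload):
        # Stand-in for the real BeautifulSoup-based URL extraction.
        return [u for u in payload if u.endswith(".pdf")]


class Module:
    """Lifecycle host: owns the 'when', delegates the 'what'."""

    def __init__(self, executor: ModuleExecutor):
        self.executor = executor

    def run(self, payload):
        return self.executor.execute(payload)
```

Swapping `Scraper` for a `Parser` executor changes the behavior without touching the lifecycle code.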
Location: src/coinverscrapy/model/formatting_handlers/
Text extracted from PDF tables requires multi-step cleaning. Each formatting concern is isolated in its own handler, linked in a chain:
classDiagram
class Handler {
<<abstract>>
+set_next(handler)*
+handle(request)*
}
class AbstractHandler {
-next_handler: Handler
+set_next(handler)
+handle(request)
}
Handler <|-- AbstractHandler
AbstractHandler <|-- GenericListHandler
AbstractHandler <|-- CompetenceNewlineHandler
AbstractHandler <|-- SubTaskHandler
AbstractHandler <|-- NumericListHandler
AbstractHandler <|-- NumericListEdgecaseHandler
AbstractHandler <|-- ExcessNewlineHandler
AbstractHandler <|-- ExcessWhitespaceHandler
Each handler decides whether to process the input or pass it to the next handler in the chain. New formatting rules can be added by creating a new handler and inserting it into the chain; no existing handlers need modification (Open/Closed Principle).
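The chaining mechanism can be sketched like this. The two concrete handlers and their regexes are simplified stand-ins for the real cleaning rules:

```python
import re
from abc import ABC, abstractmethod


class Handler(ABC):
    @abstractmethod
    def set_next(self, handler): ...

    @abstractmethod
    def handle(self, text): ...


class AbstractHandler(Handler):
    def __init__(self):
        self._next = None

    def set_next(self, handler):
        self._next = handler
        return handler  # returning the handler enables a.set_next(b).set_next(c)

    def handle(self, text):
        # Default behavior: pass the (possibly transformed) text onward.
        if self._next:
            return self._next.handle(text)
        return text


class ExcessNewlineHandler(AbstractHandler):
    def handle(self, text):
        # Collapse runs of blank lines, then delegate to the next handler.
        return super().handle(re.sub(r"\n{2,}", "\n", text))


class ExcessWhitespaceHandler(AbstractHandler):
    def handle(self, text):
        # Collapse runs of spaces/tabs and trim the edges.
        return super().handle(re.sub(r"[ \t]{2,}", " ", text).strip())
```

Each handler transforms the text and hands it to its successor, so the full chain is a composable cleaning pipeline.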
Location: src/coinverscrapy/model/json_container/
Structured output models separate data representation from business logic:
classDiagram
class JsonContainer {
+titel: str
+omschrijving: str
+leerdoelen: List~Leerdoel~
}
class Leerdoel {
+titel: str
+omschrijving: str
+onderdelen: List~str~
}
JsonContainer "1" *-- "many" Leerdoel
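The two DTOs above map naturally onto Python dataclasses. This is a sketch assuming the real classes are plain data holders; the `to_json` helper is a hypothetical addition for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Leerdoel:
    titel: str
    omschrijving: str
    onderdelen: List[str] = field(default_factory=list)


@dataclass
class JsonContainer:
    titel: str
    omschrijving: str
    leerdoelen: List[Leerdoel] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Keeping these models free of parsing logic means the output format can evolve independently of the extraction code.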
graph TD
A[CLI Entry Point<br/>main.py] --> B[ScraperModule<br/>Template Method]
A --> C[ParserModule<br/>Template Method]
B -->|injects| D[Scraper<br/>Strategy]
C -->|injects| E[Parser<br/>Strategy]
D -->|uses| F[requests + BeautifulSoup]
E -->|uses| G[Camelot PDF extraction]
E --> H[Handler Chain<br/>Chain of Responsibility]
H --> I[GenericListHandler]
I --> J[CompetenceNewlineHandler]
J --> K[SubTaskHandler]
K --> L[NumericListHandler]
L --> M[ExcessNewlineHandler]
M --> N[ExcessWhitespaceHandler]
E --> O[JsonContainer + Leerdoel<br/>DTOs]
O --> P[JSON Output Files]