vncsmnl/DoCo


Document Converter - DoCo

A document conversion application with a Tkinter GUI, built on Clean Architecture and SOLID principles. It supports two conversion engines, Docling and MarkItDown, so the engine can be matched to the use case.

Architecture Overview

This application follows Clean Architecture principles with clear separation of concerns and dependency inversion. The structure is organized into distinct layers:

πŸ“ Project Structure

docling/
├── main.py                          # Entry point with dependency injection
├── models/                          # Data models and domain entities
│   ├── __init__.py
│   └── document_format.py           # Enums, dataclasses for formats, jobs and converter types
├── interfaces/                      # Abstract interfaces (contracts)
│   ├── __init__.py
│   └── interfaces.py                # Abstract base classes and protocols
├── services/                        # Business logic and external integrations
│   ├── __init__.py
│   ├── conversion_service.py        # Core conversion logic with multi-engine support
│   └── file_service.py              # File operations service
├── presenters/                      # Application logic (MVP pattern)
│   ├── __init__.py
│   └── converter_presenter.py       # Presenter coordinating view and services
└── views/                           # User interface layer
    ├── __init__.py
    └── converter_view.py            # Tkinter GUI with converter selection

SOLID Principles Implementation

🔒 Single Responsibility Principle (SRP)

  • Models: Only handle data structures and domain entities
  • Services: Each service has one specific responsibility (file operations, conversion logic)
  • Views: Only responsible for UI presentation
  • Presenters: Only handle coordination between view and business logic
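As a rough illustration of the Models bullet, a module like document_format.py could contain nothing but enums and dataclasses. The names below (ConverterType, OutputFormat, ConversionJob, and their fields) are assumptions made for this sketch, not the repository's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class ConverterType(Enum):
    """Which engine performs the conversion (hypothetical names)."""
    DOCLING = "docling"
    MARKITDOWN = "markitdown"


class OutputFormat(Enum):
    """Target format; in this sketch the value doubles as the file extension."""
    HTML = "html"
    MARKDOWN = "md"
    JSON = "json"
    TEXT = "txt"
    DOCTAGS = "doctags"


@dataclass(frozen=True)
class ConversionJob:
    """Pure data: no file I/O, no conversion logic, no UI code."""
    source: Path
    output_dir: Path
    output_format: OutputFormat
    converter: ConverterType = ConverterType.DOCLING

    @property
    def target(self) -> Path:
        # Derived path only; nothing is written to disk here.
        return self.output_dir / f"{self.source.stem}.{self.output_format.value}"


job = ConversionJob(Path("report.pdf"), Path("out"), OutputFormat.MARKDOWN)
print(job.target)
```

Keeping the job immutable (frozen=True) means services can pass it around without worrying that another layer changes it.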

🚪 Open/Closed Principle (OCP)

  • Format Extensibility: New input/output formats can be added by extending enums
  • Service Extensibility: New services can be added without modifying existing code
  • UI Extensibility: New UI implementations can be created by implementing interfaces
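One way to read "adding formats by extending enums" is sketched below: each enum member carries its file suffixes, so supporting a new format means adding a member while the lookup function stays untouched. InputFormat and detect_format are hypothetical names, and the members are only a subset chosen for illustration.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class InputFormat(Enum):
    """Each member lists the file suffixes it covers (illustrative subset)."""
    PDF = (".pdf",)
    DOCX = (".docx",)
    HTML = (".html", ".htm")
    IMAGE = (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp")
    # Supporting another format means adding one member here;
    # detect_format() below does not change.

    @property
    def suffixes(self) -> Tuple[str, ...]:
        return self.value


def detect_format(path: Path) -> Optional[InputFormat]:
    """Return the matching format, or None for unsupported files."""
    suffix = path.suffix.lower()
    for fmt in InputFormat:
        if suffix in fmt.suffixes:
            return fmt
    return None


print(detect_format(Path("scan.JPEG")))
```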

🔄 Liskov Substitution Principle (LSP)

  • Service Interfaces: All service implementations can be substituted without breaking functionality
  • Document Converter: Different converter implementations are interchangeable
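The interchangeability described above can be sketched with an abstract base class and two stand-in implementations. The interface and class names here are assumed for illustration. The stand-ins only transform the file name, whereas real wrappers would call Docling or MarkItDown.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class DocumentConverter(ABC):
    """Contract every engine wrapper honours (hypothetical interface)."""

    @abstractmethod
    def convert(self, source: Path) -> str:
        """Return the converted document as text."""


class UpperCaseConverter(DocumentConverter):
    # Stand-in for one engine; a real wrapper would call Docling here.
    def convert(self, source: Path) -> str:
        return source.stem.upper()


class TitleCaseConverter(DocumentConverter):
    # Stand-in for another engine; a real wrapper would call MarkItDown.
    def convert(self, source: Path) -> str:
        return source.stem.title()


def convert_all(converter: DocumentConverter, sources: List[Path]) -> List[str]:
    # The caller relies only on the contract, so any subclass fits.
    return [converter.convert(s) for s in sources]


files = [Path("annual report.pdf")]
for engine in (UpperCaseConverter(), TitleCaseConverter()):
    print(type(engine).__name__, convert_all(engine, files))
```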

🎯 Interface Segregation Principle (ISP)

  • Focused Interfaces: Each interface serves a specific purpose
  • UIEventHandler: Separated UI events from other responsibilities
  • Service Interfaces: Each service interface is focused on its domain
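A minimal sketch of what segregated interfaces might look like, using typing.Protocol. UIEventHandler is named in the list above, but its methods, as well as ProgressListener and ConsoleProgress, are assumptions for this example.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class UIEventHandler(Protocol):
    """Only the events a view can raise (method names are assumed)."""
    def on_files_selected(self, paths: List[str]) -> None: ...
    def on_convert_clicked(self) -> None: ...


@runtime_checkable
class ProgressListener(Protocol):
    """Only progress reporting; kept apart from UI events."""
    def on_progress(self, done: int, total: int) -> None: ...


class ConsoleProgress:
    # Implements ProgressListener and nothing else.
    def __init__(self) -> None:
        self.lines: List[str] = []

    def on_progress(self, done: int, total: int) -> None:
        self.lines.append(f"{done}/{total}")


reporter = ConsoleProgress()
reporter.on_progress(1, 3)
print(isinstance(reporter, ProgressListener))
print(isinstance(reporter, UIEventHandler))
```

Because the two protocols are separate, a progress reporter is not forced to provide click handlers it would never use.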

⬇️ Dependency Inversion Principle (DIP)

  • Abstract Dependencies: High-level modules depend on abstractions, not concrete implementations
  • Dependency Injection: Dependencies are injected through constructors
  • Interface-Based Design: All major components depend on interfaces

Key Design Patterns

🎭 Model-View-Presenter (MVP)

  • View: Pure UI logic (DocumentConverterView)
  • Presenter: Application logic and coordination (DocumentConverterPresenter)
  • Model: Data structures and business entities (document_format.py)
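The MVP split can be sketched without any Tkinter code: the presenter talks to the view through a narrow interface, so a headless view can stand in for the GUI. The class and method names below are simplified assumptions, not the repository's DocumentConverterView or DocumentConverterPresenter.

```python
from typing import Callable, List, Protocol


class ConverterView(Protocol):
    """What the presenter needs from any view (assumed method name)."""
    def show_status(self, message: str) -> None: ...


class ConverterPresenter:
    """Coordinates a view and a conversion callable; holds no widgets."""

    def __init__(self, view: ConverterView, convert: Callable[[str], str]) -> None:
        self._view = view
        self._convert = convert

    def on_convert_clicked(self, paths: List[str]) -> None:
        total = len(paths)
        for index, path in enumerate(paths, start=1):
            try:
                self._convert(path)
                self._view.show_status(f"{index}/{total} converted: {path}")
            except Exception as exc:  # report the failure, keep the batch going
                self._view.show_status(f"{index}/{total} failed: {path} ({exc})")


class RecordingView:
    # Headless stand-in for the Tkinter view.
    def __init__(self) -> None:
        self.messages: List[str] = []

    def show_status(self, message: str) -> None:
        self.messages.append(message)


view = RecordingView()
ConverterPresenter(view, convert=str.upper).on_convert_clicked(["a.pdf", "b.docx"])
print(view.messages)
```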

🏭 Dependency Injection

# main.py
docling_converter = DoclingDocumentConverter()
markitdown_converter = MarkItDownDocumentConverter()
file_service = FileService()
conversion_service = ConversionService(docling_converter, markitdown_converter, file_service)
presenter = DocumentConverterPresenter(conversion_service, file_service)

🎨 Strategy Pattern

  • Different export formats implemented as strategies in _export_document()
  • File type handling through format enums
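A sketch of format-based strategy dispatch follows. The repository's _export_document() is not shown in this README, so the enum, the exporter functions, and the plain dict used as a document are illustrative assumptions.

```python
import json
from enum import Enum
from typing import Callable, Dict


class OutputFormat(Enum):
    MARKDOWN = "md"
    TEXT = "txt"
    JSON = "json"


def to_markdown(doc: dict) -> str:
    return f"# {doc['title']}\n\n{doc['body']}"


def to_text(doc: dict) -> str:
    return f"{doc['title']}\n{doc['body']}"


def to_json(doc: dict) -> str:
    return json.dumps(doc, sort_keys=True)


# One strategy per format; the dispatcher never branches on the format.
EXPORTERS: Dict[OutputFormat, Callable[[dict], str]] = {
    OutputFormat.MARKDOWN: to_markdown,
    OutputFormat.TEXT: to_text,
    OutputFormat.JSON: to_json,
}


def export_document(doc: dict, fmt: OutputFormat) -> str:
    return EXPORTERS[fmt](doc)


doc = {"title": "Report", "body": "Hello"}
print(export_document(doc, OutputFormat.MARKDOWN))
```

Adding an output format then amounts to one new function and one new dictionary entry.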

Features

🚀 Multiple Conversion Engines

  • Docling: Advanced document converter with rich formatting support and comprehensive output options
  • MarkItDown: Lightweight converter focused on markdown output, ideal for simple conversions

Conversion Engine Comparison

| Feature | Docling | MarkItDown |
| --- | --- | --- |
| Best For | Complex documents with rich formatting | Simple text extraction and markdown conversion |
| Output Quality | High-fidelity with layout preservation | Clean, readable markdown |
| Performance | More resource intensive | Lightweight and fast |
| Output Formats | HTML, Markdown, JSON, Text, Doctags | Primarily Markdown (with HTML/Text export) |
| Use Cases | Professional document processing, archival | Quick content extraction, documentation |
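For orientation, the sketch below shows the basic Markdown call each library exposes and a small name-based lookup between them. The wrapper functions, the ENGINES mapping, and convert() are hypothetical helpers, not the application's code. The imports are deferred so the module loads even when only one engine is installed.

```python
from pathlib import Path
from typing import Callable, Dict


def convert_with_docling(source: Path) -> str:
    # Imported lazily so this module loads without docling installed.
    from docling.document_converter import DocumentConverter
    result = DocumentConverter().convert(str(source))
    return result.document.export_to_markdown()


def convert_with_markitdown(source: Path) -> str:
    # Imported lazily so this module loads without markitdown installed.
    from markitdown import MarkItDown
    result = MarkItDown().convert(str(source))
    return result.text_content


# Hypothetical lookup keyed by the engine name chosen in the UI.
ENGINES: Dict[str, Callable[[Path], str]] = {
    "docling": convert_with_docling,
    "markitdown": convert_with_markitdown,
}


def convert(source: Path, engine: str) -> str:
    try:
        run = ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    return run(source)
```

Calling convert(Path("report.pdf"), "markitdown") would return the document as Markdown text, provided the markitdown package is installed and the file exists.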

Supported Input Formats

  • Documents: PDF, DOCX, XLSX, PPTX
  • Text: Markdown, AsciiDoc, HTML, XHTML, CSV
  • Images: PNG, JPEG, TIFF, BMP, WEBP
  • Specialized: USPTO XML, JATS XML, Docling JSON

Supported Output Formats

  • HTML (with image embedding/referencing)
  • Markdown
  • JSON (lossless serialization)
  • Text (plain text)
  • Doctags

Application Features

  • ✅ Multi-engine support: Choose between Docling and MarkItDown converters
  • ✅ Multi-file batch processing
  • ✅ Folder scanning with recursive file discovery
  • ✅ Progress tracking with real-time updates
  • ✅ Configurable output directory
  • ✅ Folder structure preservation
  • ✅ Duplicate file handling
  • ✅ Comprehensive error reporting
  • ✅ Clean, responsive GUI with converter selection
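Recursive discovery, folder structure preservation, and duplicate handling can be sketched with pathlib alone. The function names, the SUPPORTED subset, and the numeric-suffix rule for duplicates are assumptions; the application may resolve name clashes differently.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Set

SUPPORTED = {".pdf", ".docx", ".html", ".md"}  # illustrative subset


def discover(root: Path) -> Iterator[Path]:
    """Yield supported files under root, recursing into subfolders."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def target_path(source: Path, root: Path, out_dir: Path, ext: str,
                taken: Set[Path]) -> Path:
    """Mirror the source's position under root inside out_dir.

    If two sources map to the same target (a.pdf and a.docx both become
    a.md), a numeric suffix keeps the later one from overwriting the first.
    """
    relative = source.relative_to(root).with_suffix(ext)
    candidate = out_dir / relative
    counter = 1
    while candidate in taken or candidate.exists():
        candidate = out_dir / relative.with_name(f"{relative.stem}_{counter}{ext}")
        counter += 1
    taken.add(candidate)
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root, out = Path(tmp, "in"), Path(tmp, "out")
    (root / "sub").mkdir(parents=True)
    for name in ("sub/a.pdf", "sub/a.docx", "notes.xyz"):
        (root / name).touch()
    taken: Set[Path] = set()
    targets = [target_path(p, root, out, ".md", taken) for p in discover(root)]
    names = [t.relative_to(out).as_posix() for t in targets]
print(names)
```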

Benefits of This Architecture

🧪 Testability

  • Each component can be unit tested in isolation
  • Dependencies can be easily mocked
  • Clear boundaries between layers
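To make the mocking claim concrete, here is a small test sketch using unittest.mock. The ConversionService below is a simplified stand-in with assumed method names (convert, write_text), not the class from conversion_service.py.

```python
from pathlib import Path
from unittest.mock import Mock


class ConversionService:
    """Tiny stand-in for the real service; depends on injected collaborators."""

    def __init__(self, converter, file_service) -> None:
        self._converter = converter
        self._files = file_service

    def convert_file(self, source: Path, target: Path) -> None:
        text = self._converter.convert(source)
        self._files.write_text(target, text)


def test_convert_file_writes_converter_output() -> None:
    converter = Mock()
    converter.convert.return_value = "# Title"
    files = Mock()

    ConversionService(converter, files).convert_file(Path("in.pdf"), Path("out.md"))

    # No real engine ran and no file was written; only the calls are checked.
    converter.convert.assert_called_once_with(Path("in.pdf"))
    files.write_text.assert_called_once_with(Path("out.md"), "# Title")


test_convert_file_writes_converter_output()
```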

🔧 Maintainability

  • Changes in one layer don't affect others
  • Easy to add new features or modify existing ones
  • Clear code organization and responsibility separation

🔄 Extensibility

  • New UI frameworks can be added (e.g., web interface, CLI)
  • New conversion libraries can be integrated
  • New file formats can be supported easily

🔄 Flexibility

  • Services can be swapped without affecting the UI
  • Different storage backends can be implemented
  • Configuration and settings can be externalized

Installation

Requirements

  • Python 3.8 or higher
  • pip (Python package installer)

Setup

  1. Clone or download the repository
  2. Install the required dependencies:
pip install -r requirements.txt

Alternative Installation

If you prefer to install packages individually:

pip install docling markitdown

Running the Application

python main.py

The application will start with the GUI interface, allowing you to:

  1. Choose your conversion engine: Select between Docling (advanced) or MarkItDown (lightweight)
  2. Select files or folders for conversion
  3. Choose output format
  4. Configure output settings
  5. Start batch conversion with progress tracking

Dependencies

  • docling: Advanced document conversion library
  • markitdown: Lightweight document-to-Markdown converter
  • tkinter: GUI framework (Python standard library; some Linux distributions package it separately)
  • pathlib: Path handling (Python standard library)
  • typing: Type hints (Python standard library)
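A quick way to confirm that the packages above are importable before launching the GUI is sketched here. check_dependencies is a hypothetical helper and not part of the repository.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

REQUIRED = ("docling", "markitdown", "tkinter")


def check_dependencies(names: Iterable[str] = REQUIRED) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_dependencies().items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```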

This architecture keeps the application maintainable and extensible while following SOLID principles. The dual-converter approach covers use cases ranging from simple Markdown conversion to complex document processing that preserves rich formatting.
