Word to Outline (WTO)

Convert Word documents to Outline knowledge base documents with full formatting preservation, image extraction, and metadata handling.

🎯 Project Status: Phase 2 Complete! ✅

Word to Outline now provides complete end-to-end conversion from Word documents to Outline, including an interactive upload workflow.

What's Working Now:

  • Word Document Extraction: Full content extraction using python-docx and mammoth
  • Format Preservation: Bold, italic, headers, lists, tables
  • Image Support: Complete image extraction and upload to Outline as attachments
  • Original Document Preservation: Source Word document automatically attached for reference
  • Metadata Extraction: Title, author, word count, creation dates
  • Multiple Output Formats: HTML, Markdown, and structured JSON
  • Batch Processing: Process entire directories of Word documents
  • CLI Interface: Simple command-line operations
  • Interactive Upload: Collection selection and conflict resolution with detailed comparison
  • API Integration: Complete Outline API integration, built on the proven architecture of the Confluence to Outline (CTO) project
  • Attachment Handling: Images uploaded as proper Outline attachments
  • Enhanced Conflict Resolution: 4-option workflow (overwrite/details/skip/cancel)
  • Force Mode: Overwrite existing documents when needed
  • Batch Upload: Upload multiple documents efficiently
  • Reset Functionality: Clean slate command to remove extracted files
  • Smart Defaults: Extract defaults to ./input, upload defaults to interactive mode
  • Environment Configuration: Robust .env file support for API credentials

🚀 Quick Start

Installation

  1. Clone the repository, then set up a virtual environment and install dependencies:

    cd WordToOutline
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  2. Configure API access (required for upload):

    cp .env.example .env
    # Edit .env with your Outline API credentials

Basic Usage

  1. Extract Word documents (defaults to ./input directory):

    # Extract from default input directory
    python main.py --extract
    
    # Or specify a specific file/directory
    python main.py --extract path/to/document.docx
    python main.py --extract path/to/documents/
  2. List extracted documents:

    python main.py --list-extracted
  3. Upload to Outline (defaults to interactive mode):

    # Interactive upload (default - select document and collection)
    python main.py --upload
    
    # Upload specific document by ID
    python main.py --upload document_id
    
    # Batch upload all documents
    python main.py --upload batch
  4. Reset to clean state:

    # Remove all extracted files and start fresh
    python main.py --reset

📁 How It Works

Input → Processing → Upload

Word Documents          JSON Holding Files        Outline Documents
    (.docx)         →      (extracted/)         →      (Live in Outline)
      ↓                        ↓                         ↓
[document.docx]    →    [uuid.json]           →    [Outline Document]
      ↓                        ↓                         ↓
   Content              • Document metadata             • Full document created
   Images          →    • HTML content           →      • Collection assignment
   Metadata             • Markdown content              • Image attachments
   Formatting           • Extracted images              • Original document attached
                                                        • Format preservation

Extraction and Upload Process

  1. Document Analysis: Uses python-docx to extract metadata and structure
  2. Content Conversion: Uses mammoth for superior HTML conversion
  3. Format Processing: Converts HTML to clean Markdown
  4. Image Extraction: Extracts and saves embedded images
  5. JSON Storage: Creates structured holding files for future upload
  6. Upload Workflow: Creates stub documents, uploads all attachments (images + original), then updates with final content
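
The ordering in step 6 (stub first, attachments next, final content last) can be sketched as follows. This is a minimal illustration, not the project's code: `FakeOutlineClient` and its three methods are hypothetical stand-ins for whatever API wrapper is actually used, and the step that rewrites local image paths to attachment URLs is an assumption about why the final update comes last.

```python
from dataclasses import dataclass, field


@dataclass
class FakeOutlineClient:
    """Hypothetical in-memory stand-in for an Outline client (not the real API)."""
    calls: list = field(default_factory=list)
    text: str = ""

    def create_document(self, title, collection_id):
        self.calls.append(("create", title))
        return "doc-1"

    def upload_attachment(self, document_id, file_path):
        self.calls.append(("attach", file_path))
        # Pretend the server returns a URL for the stored file.
        return "/attachments/" + file_path.rsplit("/", 1)[-1]

    def update_document(self, document_id, text):
        self.calls.append(("update", document_id))
        self.text = text


def upload_holding_file(client, holding, collection_id, original_path):
    # 1. Create an empty stub so attachments have a document to belong to.
    doc_id = client.create_document(holding["metadata"]["title"], collection_id)

    # 2. Upload every extracted image, then the original Word file.
    markdown = holding["content_markdown"]
    for image in holding["images"]:
        url = client.upload_attachment(doc_id, image["file_path"])
        # Assumption: image links in the Markdown point at the local path.
        markdown = markdown.replace(image["file_path"], url)
    client.upload_attachment(doc_id, original_path)

    # 3. Replace the stub body with the final content.
    client.update_document(doc_id, markdown)
    return doc_id
```

Creating the stub first means each attachment can be tied to a document ID before the body that references it is written.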

📂 Project Structure

WordToOutline/
├── input/                  # Place Word documents here
├── extracted/              # Generated JSON holding files
├── images/                 # Extracted images (organized by document ID)
├── libs/
│   ├── word_extractor.py   # Core Word document processing
│   ├── config.py           # Configuration management
│   ├── logger.py           # Logging and progress tracking
│   └── __pycache__/
├── main.py                 # CLI interface
├── requirements.txt        # Python dependencies
├── PLAN.md                 # Detailed implementation plan
└── README.md               # This file

🔧 Configuration

API Configuration (Required for Upload)

  1. Get Outline API credentials:

    • Log into your Outline instance
    • Go to Settings > API Tokens
    • Create a new token with appropriate permissions
  2. Configure using .env file (recommended):

    cp .env.example .env
    # Edit .env with your credentials:
    # OUTLINE_API_TOKEN=your_outline_api_token
    # OUTLINE_API_URL=https://your-outline-instance.com
  3. Or set environment variables directly:

    export OUTLINE_API_TOKEN="your_outline_api_token"
    export OUTLINE_API_URL="https://your-outline-instance.com"
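
To check that both settings are visible to Python before attempting an upload, a small validation helper along these lines may help. It is a sketch only: the function name `load_outline_config` is hypothetical, and it reads plain environment variables rather than parsing the .env file itself.

```python
import os


def load_outline_config(env=None):
    """Return the Outline token and URL, or raise if either is missing."""
    env = os.environ if env is None else env
    token = env.get("OUTLINE_API_TOKEN", "").strip()
    # Drop a trailing slash so later path joins do not produce "//".
    url = env.get("OUTLINE_API_URL", "").strip().rstrip("/")

    pairs = (("OUTLINE_API_TOKEN", token), ("OUTLINE_API_URL", url))
    missing = [name for name, value in pairs if not value]
    if missing:
        raise RuntimeError("Missing Outline settings: " + ", ".join(missing))
    return {"token": token, "url": url}


if __name__ == "__main__":
    print(load_outline_config()["url"])
```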

Command Line Options

# Main operations (choose one)
--extract [PATH]            # Extract from file/directory (default: ./input)
--list-extracted            # List all extracted documents  
--upload [DOCUMENT_ID]      # Upload to Outline (default: interactive)
--reset                     # Reset local directories to clean state

# Extraction options
--no-images                 # Skip image extraction
--no-formatting             # Skip formatting preservation

# Directory configuration
--input-dir DIR             # Input directory (default: input)
--extracted-dir DIR         # Output directory (default: extracted)
--images-dir DIR            # Images directory (default: images)

# API configuration (alternative to .env file)
--api-key TOKEN             # Outline API token
--api-url URL               # Outline API URL

# Logging
--log-level LEVEL           # DEBUG, INFO, WARNING, ERROR (default: INFO)
--log-file FILE             # Log to file instead of console
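
The "optional value with a default" behaviour of --extract and --upload can be reproduced with argparse as shown below. This is an illustrative sketch, not the contents of main.py: only a subset of flags is included, and the "interactive" sentinel value is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")

    # Main operations: exactly one must be given.
    ops = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" lets the flag appear alone; const supplies the default then.
    ops.add_argument("--extract", nargs="?", const="input", metavar="PATH")
    ops.add_argument("--list-extracted", action="store_true")
    ops.add_argument("--upload", nargs="?", const="interactive",
                     metavar="DOCUMENT_ID")
    ops.add_argument("--reset", action="store_true")

    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, a bare `--extract` resolves to `input`, while `--extract path/to/document.docx` keeps the given path.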

🚀 Performance Tips & Rate Limiting

For Large Document Uploads:

  • The system includes automatic rate limiting protection with exponential backoff
  • Standard 1-second delays between image uploads help prevent rate limiting
  • Batch delays of 3 seconds every 10 uploads provide additional protection
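
To estimate how much time the pacing adds for a given document, the schedule can be modelled as below. The function and parameter names are hypothetical, and the sketch assumes the 3-second batch pause is added on top of the regular 1-second delay; the real implementation may combine them differently.

```python
def pacing_delays(upload_count, regular_delay=1.0, batch_delay=3.0, batch_size=10):
    """Return the pause in seconds taken after each upload."""
    delays = []
    for index in range(1, upload_count + 1):
        pause = regular_delay
        if index % batch_size == 0:
            # Assumption: the batch pause is extra, not a replacement.
            pause += batch_delay
        delays.append(pause)
    return delays


if __name__ == "__main__":
    # Total throttling time for a document with 25 images.
    print(sum(pacing_delays(25)))
```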

Optional Performance Optimization: If you run a private Outline instance and want faster uploads, you can temporarily reduce the client-side throttling by lowering the delay values in libs/api_upload_manager.py:

  • Set regular delay to 0.1 seconds for faster uploads
  • Set batch delay to 1.0 seconds for minimal throttling
  • ⚠️ Warning: Only recommended for private instances to avoid overwhelming public servers

Upload Strategies:

  • Interactive Mode: Best for selective document uploads with control
  • Batch Mode: Efficient for uploading many documents at once
  • Individual Uploads: Use specific document IDs for targeted uploads

⚠️ Known Issues

1. Rate Limiting During Large Uploads

Issue: Large documents with many images may experience significant delays due to API rate limiting.

Impact: Each rate-limited request is retried with exponential backoff (waits of 5s, 10s, 20s, then 40s between retries), so upload times can grow considerably.
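
The doubling schedule described here can be sketched as a generic retry wrapper. The names are hypothetical: `RateLimited` stands in for whatever error the upload code raises on a rate-limit response, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the server reports a rate limit."""


def backoff_schedule(base=5.0, retries=4):
    # Doubling waits: 5, 10, 20, 40 seconds with the defaults.
    return [base * (2 ** attempt) for attempt in range(retries)]


def call_with_backoff(func, base=5.0, retries=4, sleep=time.sleep):
    """Call func, waiting longer after each rate-limit failure."""
    for delay in backoff_schedule(base, retries):
        try:
            return func()
        except RateLimited:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return func()
```

With the defaults, a request that keeps failing waits 75 seconds in total before the last attempt, which is why image-heavy documents can take minutes.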

Administrator Solution:

  • Temporarily disable rate limiting on your Outline instance during bulk upload sessions
  • This can reduce upload times from minutes to seconds for image-heavy documents
  • Remember to re-enable rate limiting after bulk operations complete

User Workaround:

  • Upload documents during low-traffic periods
  • Consider breaking very large documents into smaller sections

2. Modified Images in Word Documents

Issue: Images that have been edited or modified within Microsoft Word (such as adding highlights, annotations, or effects) may not display exactly as they appeared in Word.

Behavior:

  • The original unmodified image will be extracted and uploaded
  • Word's modifications (highlights, annotations, effects) will appear as separate overlay images
  • This results in multiple image attachments in Outline instead of a single modified image

Workaround:

  • For critical visual fidelity, edit images in external image editing software before inserting into Word
  • Alternatively, take screenshots of modified images in Word and replace them manually in Outline

Interactive Features

Enhanced Upload Workflow:

  • 📋 Collection Selection: Browse and select target collections
  • 🔍 Document Comparison: Detailed metadata comparison for conflicts
  • Conflict Resolution: 4 options - overwrite, view details, skip, or cancel
  • 📦 Batch Processing: Upload multiple documents with progress tracking
  • 🛡️ Safe Operations: Confirmation prompts for destructive actions

📝 Extracted Content Format

Each extracted document creates a JSON file with:

{
  "document_id": "uuid-string",
  "metadata": {
    "filename": "document.docx",
    "title": "Document Title", 
    "author": "Author Name",
    "word_count": 150,
    "paragraph_count": 12,
    "created": "2024-01-01T12:00:00",
    "modified": "2024-01-02T12:00:00"
  },
  "content_html": "<h1>Title</h1><p>Content...</p>",
  "content_markdown": "# Title\n\nContent...",
  "images": [
    {
      "image_id": "uuid",
      "original_filename": "image.png", 
      "extracted_filename": "uuid.png",
      "file_path": "images/doc-uuid/uuid.png",
      "content_type": "image/png",
      "size_bytes": 12345,
      "width": 800,
      "height": 600
    }
  ],
  "extraction_timestamp": "2024-01-01T12:00:00"
}
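
A holding file in this shape can be inspected with the standard library alone. The helper below is a sketch for illustration; `summarize_holding_file` is a hypothetical name, and it relies only on the keys shown in the example above.

```python
import json
from pathlib import Path


def summarize_holding_file(path):
    """Read one extracted JSON file and return a short summary dict."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    meta = data.get("metadata", {})
    images = data.get("images", [])
    return {
        "document_id": data["document_id"],
        # Fall back to the filename when the document has no title.
        "title": meta.get("title") or meta.get("filename", ""),
        "word_count": meta.get("word_count", 0),
        "image_count": len(images),
        "image_bytes": sum(img.get("size_bytes", 0) for img in images),
    }


if __name__ == "__main__":
    # Summarize every holding file in the default output directory.
    for json_path in sorted(Path("extracted").glob("*.json")):
        print(summarize_holding_file(json_path))
```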

🔮 Recent Updates & Bug Fixes

Latest Enhancements ✨

  • Original Document Attachment: Source Word documents now automatically attached to Outline pages
  • Complete Document Preservation: Users get converted content AND access to original source
  • Smart Defaults: --extract now defaults to ./input directory
  • Interactive Default: --upload now defaults to interactive mode
  • Environment Configuration: Fixed .env file parsing for seamless API setup
  • Enhanced Documentation: Complete feature coverage and usage examples
  • Improved User Experience: Streamlined workflows with sensible defaults

Real-World Testing Fixes 🐛

  • Image Upload Fix: Resolved 'None' filename errors causing 400 Bad Request failures
  • Rate Limiting: Added retry logic with exponential backoff for API rate limits
  • Interactive Batch Upload: New workflow for processing documents one-at-a-time with c/s/e options
  • Title Override: Added document title customization step in interactive upload
  • Clean Collection Selection: Simplified display showing only collection titles with NEW option

Phase 3: Advanced Features (Future)

  • Word Template Processing: Handle complex document templates
  • Collaboration Features: Multi-user document processing
  • Advanced Formatting: Enhanced support for complex layouts
  • Integration Options: Web interface, desktop app
  • Workflow Automation: Watch folders, scheduled processing

🏗️ Built With

  • python-docx: Word document parsing and metadata extraction
  • mammoth: Superior HTML conversion from Word documents
  • Pillow: Image processing and optimization
  • requests: HTTP client for Outline API integration

🤝 Development

Architecture Notes

This project leverages the proven architecture from the Confluence to Outline (CTO) project, reusing:

  • ✅ Configuration management system
  • ✅ Logging and progress tracking
  • ✅ Error handling patterns
  • ✅ API integration components (60-70% code reuse)

Testing

# Create test document (if needed)
python create_test_doc.py

# Extract from default input directory
python main.py --extract

# View extracted documents
python main.py --list-extracted

# Test interactive upload (requires API setup)
python main.py --upload

# Reset and start fresh
python main.py --reset

Common Usage Patterns

# Quick workflow: extract and upload
python main.py --extract                    # Extract all from ./input
python main.py --upload                     # Interactive upload

# Batch processing workflow  
python main.py --extract path/to/documents/ # Extract directory
python main.py --upload batch               # Upload all documents

# Clean slate workflow
python main.py --reset                      # Start fresh
python main.py --extract                    # Extract again

📄 License

[Your License Here]

🙋‍♂️ Support

For questions about:

  • Phase 1 (Complete): Word document extraction and processing
  • Phase 2 (Complete): Outline API integration and upload functionality

Word to Outline - Making knowledge transfer from Word documents to Outline simple and reliable.
