CloneWiper

CloneWiper is a high-performance, modern duplicate file detection tool built with Python and PySide6 (Qt). It follows Material Design 3 principles to provide a premium, seamless experience for managing your file library.

✨ Features

Core Functionality

  • Smart Duplicate Detection: Five hash modes for flexible duplicate detection
    • MD5 Only: Fast exact duplicate detection using MD5 checksums (best for identical files)
    • Single Perceptual Hash: Detects visually similar images using the phash algorithm
    • Multi-Algorithm Perceptual Hashing (Default): Combines four algorithms (average_hash, phash, dhash, whash) with a voting mechanism for higher accuracy than a single hash
      • Uses Hamming distance comparison with voting (requires 3/4 algorithms to agree)
      • Detects duplicates even when images are resized, compressed, or slightly modified
      • Tier-1 pHash screening skips full multi-hash work for images with a unique perceptual hash
      • Optimized with parallel hash calculation, batched size-group prefiltering, and Union-Find similarity grouping
    • Single P-Hash + ORB: Single perceptual hash with ORB feature verification for higher-confidence image matches
    • Multi-Algo P-Hash + ORB: Multi-algorithm voting plus ORB verification (most accurate, slowest)
    • Image Support: Works with common formats (JPEG, PNG, GIF, BMP, TIFF, WebP) and RAW files (CR2, NEF, ARW, etc.)
    • Video Support: Perceptual hashing for video files using keyframe extraction
  • Cross-Platform Support: Runs on Windows 10/11; macOS is supported only when running from source
  • High Performance:
    • Asynchronous processing with multi-threaded file scanning
    • Fast Scanning: Uses os.scandir for efficient file system enumeration (up to 20x faster than traditional scanning)
    • Dynamic CPU optimization for hybrid architectures (P-cores/E-cores detection)
    • Adaptive I/O strategy (preloads small files for MD5, chunks large files)
    • Similarity grouping tuned for throughput (tiered pHash screening, LSH-style candidate search, Union-Find clustering, priority-ordered pair comparison)
    • Batch cache writes to reduce database lock contention
  • Persistent Caching:
    • Hash Cache: SQLite-backed cache (p-hash and MD5) for fast re-scans
    • Thumbnail Cache: Local SQLite database persists only expensive previews (documents, videos, and audio artwork); regular images use memory cache only
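The MD5 path described above (os.scandir enumeration, size-group prefiltering, chunked reads for large files) can be sketched with the standard library alone. This is a minimal illustration, not CloneWiper's actual engine: the function names and the 1 MiB chunk size are assumptions made for the example, and the sketch is single-threaded.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1024 * 1024  # assumed 1 MiB read size; the real value is not documented


def iter_files(root):
    """Yield (path, size) for every regular file under root using os.scandir."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path, entry.stat(follow_symlinks=False).st_size


def md5_of(path):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths that share both file size and MD5 digest."""
    by_size = defaultdict(list)
    for path, size in iter_files(root):
        by_size[size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means that files with a unique size are never opened, which is where most of the saving on large libraries would come from.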

User Interface

  • Material Design 3 UI: Clean, modern dark-themed interface with rounded corners (when not maximized)
  • Custom Title Bar: Frameless window with native Windows resize, drag-to-snap, and Windows 11 Snap Layout support
  • Smart Thumbnails:
    • Images: Fast previews, including RAW support (.arw, .cr2, .nef, etc.)
    • Video: Frame extraction for common video formats
    • Documents: High-quality PDF, EPUB, MOBI, and AZW3 thumbnails using pypdfium2 and PyMuPDF
    • Music: Album art extraction and rich metadata display using mutagen
  • Interactive File Cards: Hover effects, scrolling text for long filenames, selection management, and visually aligned rounded thumbnail cards
  • Pagination: Efficient handling of large result sets with 100 groups per page and clickable page indicator dropdown
  • Drag & Drop: Drag and drop folders onto the results area for easy folder selection; remove folders with the inline x, Delete, Backspace, or context menu
  • Real-Time Progress: Centered progress indicator with phase detail and percentage; adaptive update intervals for large scans (prefilter, pHash index, and hash phases)
  • Quick Selection Strategies:
    • Keep Newest: Keeps the most recently modified file
    • Keep Oldest: Keeps the oldest file by modification time
    • Keep Best: Keeps the highest resolution image (exact width × height); if multiple match, keeps the largest file size; marks sidecars and non-image files in the group for deletion
    • Keep Smallest: Keeps the highest resolution image; if multiple match, keeps the smallest file size
    • Keep RAW: Prefers RAW files over JPEG when both exist in the same group
  • Quick Actions: Delete Selected, Clear Selection (with scope: Current Page or All Pages)
    • Footer quick-action bar scrolls horizontally on narrow windows and stays hidden during scans
    • Selected quick-selection strategy remains highlighted after use
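As a rough illustration of the Keep Newest and Keep Oldest strategies listed above, the sketch below picks one keeper per duplicate group by modification time and returns the rest. The function name `select_for_deletion` and its `strategy` argument are hypothetical and are not taken from the CloneWiper source.

```python
import os


def select_for_deletion(group, strategy="newest"):
    """Return the paths in a duplicate group that should be marked for deletion.

    strategy="newest" keeps the most recently modified file;
    strategy="oldest" keeps the file with the earliest modification time.
    """
    if strategy not in ("newest", "oldest"):
        raise ValueError(f"unknown strategy: {strategy}")
    if len(group) < 2:
        return []  # nothing to delete when the group has a single file
    pick = max if strategy == "newest" else min
    keeper = pick(group, key=lambda path: os.stat(path).st_mtime)
    return [path for path in group if path != keeper]
```

The other strategies (Keep Best, Keep Smallest, Keep RAW) would need image dimensions and file-type checks, which this sketch leaves out.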

Advanced Features

  • Multi-Algorithm Perceptual Hashing:
    • Combines four hash algorithms (average, perceptual, difference, wavelet) with parallel calculation
    • Uses Hamming distance comparison with a voting mechanism (requires 3/4 algorithms to agree)
    • Two-phase filtering: quick filter with average_hash, then detailed multi-algorithm comparison
    • Detects similar images and videos even if they're slightly modified, resized, or have different compression
  • Hybrid CPU Optimization: Automatically detects and optimizes for hybrid CPU architectures (Intel 12th/13th gen, AMD Ryzen)
    • Dynamically adjusts worker threads based on P-cores and E-cores
    • Optimized thread pool sizes for I/O-intensive and CPU-intensive tasks
  • File Type Grouping: Organize duplicates by file type
  • Multiple Sorting Options: Sort by count, size, name, or date (ascending/descending)
  • Scope Control: Apply actions to current page or all pages
  • Safe Deletion: Uses send2trash to move files to recycle bin/trash
    • Batch recycle-bin operations improve delete speed for large selections
    • Deleted files are removed from memory and thumbnail caches
  • Persistent Cache:
    • Hash Cache: Stores calculated hashes (p-hash and MD5)
    • Thumbnail Cache: Offloads expensive document/video/audio-art thumbnail generation to a local database (thumbnails.db) while keeping regular image thumbnails memory-only
    • Cache persists across sessions for costly media previews without storing thumbnails for every image
    • Automatic cache cleanup removes stale entries and prunes formats that are no longer persistently cached
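One way a SQLite-backed hash cache with batched writes could work is sketched below. The schema, the class name `HashCache`, and the use of file size plus modification time to detect stale entries are assumptions for illustration; the README does not document the real table layout or invalidation rule.

```python
import os
import sqlite3


class HashCache:
    """Small SQLite cache; an entry is stale when the file's size or mtime changes."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT, algo TEXT, size INTEGER, mtime_ns INTEGER, digest TEXT, "
            "PRIMARY KEY (path, algo))"
        )
        self.pending = []  # rows queued by put() until the next flush()

    def get(self, path, algo):
        """Return the cached digest, or None if missing or the file has changed."""
        stat = os.stat(path)
        row = self.conn.execute(
            "SELECT size, mtime_ns, digest FROM hashes WHERE path = ? AND algo = ?",
            (path, algo),
        ).fetchone()
        if row and row[0] == stat.st_size and row[1] == stat.st_mtime_ns:
            return row[2]
        return None

    def put(self, path, algo, digest):
        """Queue a write; nothing touches the database until flush()."""
        stat = os.stat(path)
        self.pending.append((path, algo, stat.st_size, stat.st_mtime_ns, digest))

    def flush(self):
        """Write all queued rows in one transaction to limit lock contention."""
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO hashes (path, algo, size, mtime_ns, digest) "
                "VALUES (?, ?, ?, ?, ?)",
                self.pending,
            )
        self.pending.clear()
```

Queuing rows and committing them together keeps the number of write transactions low, which is the idea behind the batch cache writes mentioned under High Performance.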

📋 Prerequisites

  • Python 3.8+
  • Windows 10/11 (macOS: run from source code only, executable build not currently supported)

🚀 Installation

From Source

  1. Clone the repository:

    git clone https://github.com/markyip/CloneWiper.git
    cd CloneWiper
  2. Install dependencies:

    pip install -r requirements.txt

Optional Dependencies (Recommended)

For full feature support, install optional dependencies:

# Quote each requirement so the shell does not treat ">" as output redirection

# Video thumbnails
pip install "opencv-python>=4.8.0"

# PDF/EPUB/MOBI/AZW3 thumbnails
pip install "PyMuPDF>=1.23.0"
pip install "pypdfium2>=0.20.0"

# Music metadata and album art
pip install "mutagen>=1.47.0"

💻 Usage

Windows

# Using launch script
launch.bat

# Or directly
python main.py

Verbose logging (optional)

By default the app stays quiet on the console. To enable detailed engine and UI debug logs:

set CLONEWIPER_DEBUG=1
python main.py

(On PowerShell: $env:CLONEWIPER_DEBUG=1 then python main.py.)
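For readers curious how an environment switch like this is typically wired up, here is a minimal sketch using the standard logging module. It is an assumption about the general pattern, not CloneWiper's actual code: the function name `configure_logging` and the choice of WARNING as the quiet level are hypothetical.

```python
import logging
import os


def configure_logging(environ=os.environ):
    """Enable DEBUG output only when CLONEWIPER_DEBUG is set to "1"."""
    debug = environ.get("CLONEWIPER_DEBUG") == "1"
    level = logging.DEBUG if debug else logging.WARNING
    # force=True (Python 3.8+) replaces any handlers installed earlier
    logging.basicConfig(
        level=level,
        format="%(levelname)s %(name)s: %(message)s",
        force=True,
    )
    return level
```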

macOS (Source Code Only)

Note: macOS executable build is currently not supported. You can run from source code:

# Install dependencies
pip3 install -r requirements.txt

# Run directly
python3 main.py

Hash Mode Selection

CloneWiper offers five hash modes for different use cases:

  1. MD5 Only (Fastest)

    • Best for: Finding exact duplicate files
    • Uses: MD5 checksum comparison
    • Pros: Very fast, low CPU usage
    • Cons: Only detects identical files (byte-for-byte)
  2. Single Perceptual Hash (Balanced)

    • Best for: Finding visually similar images with moderate accuracy
    • Uses: phash algorithm
    • Pros: Faster than multi-algorithm, detects resized/compressed images
    • Cons: Less accurate than multi-algorithm mode
  3. Multi-Algorithm Perceptual Hash (Most Accurate Hash-Only Mode, Default)

    • Best for: Finding visually similar images with highest accuracy without ORB overhead
    • Uses: Four algorithms (average, perceptual, difference, wavelet) with voting
    • Pros: Highest hash-only accuracy, tier-1 pHash screening speeds up large libraries
    • Cons: Slower than single-hash or MD5 modes
  4. Single P-Hash + ORB (Accurate with verification)

    • Best for: Image libraries where false positives must be minimized
    • Uses: phash plus ORB feature matching on candidate pairs
    • Pros: Strong visual verification on top of perceptual hashing
    • Cons: Slower than hash-only modes
  5. Multi-Algo P-Hash + ORB (Maximum accuracy)

    • Best for: Critical deduplication where accuracy matters more than speed
    • Uses: Multi-algorithm voting plus ORB verification
    • Pros: Strictest matching
    • Cons: Slowest mode

Recommendation: Use Multi-Algorithm Perceptual Hash for most cases. Use an ORB mode when you need extra verification on near-duplicate images.
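The voting rule behind the multi-algorithm modes (at least 3 of 4 hashes must agree, measured by Hamming distance, with matches merged by Union-Find) can be sketched in plain Python. Hashes are represented here as integers so the example has no third-party dependencies. The threshold `MAX_DISTANCE = 10` and all function names are assumptions; the README does not state the real threshold. The sketch also compares every pair, whereas the tool describes tiered screening and LSH-style candidate search to avoid that cost.

```python
from itertools import combinations

ALGORITHMS = ("average", "perceptual", "difference", "wavelet")
MAX_DISTANCE = 10  # assumed per-algorithm Hamming threshold, in bits
MIN_VOTES = 3      # from the README: 3 of 4 algorithms must agree


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_match(hashes_a, hashes_b):
    """True when at least MIN_VOTES algorithms put the pair within MAX_DISTANCE."""
    votes = sum(
        hamming(hashes_a[name], hashes_b[name]) <= MAX_DISTANCE
        for name in ALGORITHMS
    )
    return votes >= MIN_VOTES


def group_similar(items):
    """Cluster items (name -> dict of hashes) with Union-Find over matching pairs."""
    parent = {name: name for name in items}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(items, 2):
        if is_match(items[a], items[b]):
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in items:
        clusters.setdefault(find(name), []).append(name)
    return [sorted(members) for members in clusters.values() if len(members) > 1]
```

Because Union-Find merges transitively, two images that do not match each other directly can still land in the same group when both match a third image.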

🔨 Building Executables

Windows (EXE)

  1. Install PyInstaller:

    pip install pyinstaller
  2. Run the build script:

    build_windows.bat

    This build script will:

    • Check and install PyInstaller if needed
    • Build an optimized executable with all features
    • Exclude unnecessary modules to minimize file size

    Notes:

    • Python 3.12+: The script must not exclude distutils (PyInstaller 6’s distutils hook conflicts with --exclude-module=distutils). The provided build_windows.bat follows this.
    • Application icon: the build bundles icons\favicon.ico into an icons/ folder inside the executable so the taskbar and title bar show the correct icon.
    • If your executable is larger than expected (>300MB), create a clean virtual environment with only the dependencies you need before building.

    Or manually:

    pyinstaller --onefile --windowed --icon=favicon.ico --name=CloneWiper main.py

    The executable will be in dist/CloneWiper.exe

macOS Build

Note: macOS executable build is currently not supported. Please run from source code using python3 main.py.

πŸ“ Project Structure

CloneWiper/
├── core/
│   ├── __init__.py
│   ├── engine.py           # Core scanning and hashing engine
│   └── thumbnail_cache.py  # Persistent SQLite thumbnail cache
├── icons/
│   ├── favicon.ico         # Multi-size application icon (Windows)
│   └── README.md           # Icon resources documentation
├── main.py                 # Application entry point
├── qt_app.py               # PySide6 UI implementation
├── verify_thumbnail_cache.py  # Optional utility to inspect thumbnail cache
├── requirements.txt        # Python dependencies
├── favicon.ico             # Application icon (Windows)
├── launch.bat              # Windows launch script
├── build_windows.bat       # Windows PyInstaller build script
├── RELEASE_NOTES_v1.1.md   # Release notes for v1.1
├── RELEASE_NOTES_v1.2.md   # Release notes for v1.2
├── RELEASE_NOTES_v1.3.md   # Release notes for v1.3
├── README.md               # This file
└── LICENSE                 # License file

📦 Releases

See RELEASE_NOTES_v1.3.md for the latest changes. Older notes live in RELEASE_NOTES_v1.2.md, RELEASE_NOTES_v1.1.md, and on the GitHub Releases page.

Development

Running Tests

# Add tests when available
python -m pytest

Code Style

This project follows PEP 8 style guidelines.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • PySide6 - Qt for Python
  • Pillow - Image processing
  • ImageHash - Perceptual hashing
  • PyMuPDF - PDF/EPUB rendering
  • pypdfium2 - High-quality PDF rendering
  • rawpy - RAW image processing
  • OpenCV - Video processing
  • mutagen - Audio metadata
  • psutil - CPU architecture detection
  • Material Design 3 - Design guidelines

📧 Contact

For issues, questions, or suggestions, please open an issue on GitHub.

