dhasterok/WordCloud

PDF Word Cloud Generator

A desktop application that extracts text from PDF files and generates customizable word clouds with interactive frequency analysis. It is a PyQt6-based GUI wrapper around amueller's word_cloud library.

Screenshot

Features

  • PDF Text Extraction: Extract and analyze text from multi-page PDF documents
  • Interactive Word Cloud: Generate stunning visualizations with multiple shape options and color schemes
  • Frequency Analysis: View and edit word frequencies in an interactive table
  • Advanced Filtering:
    • Filter by word length
    • Toggle number inclusion
    • Automatic stop word removal (articles, prepositions, pronouns, etc.)
    • Custom word exclusion
  • Customizable Visualization:
    • Multiple shape options (Rectangle, Circle, Heart, Star, Diamond, Hexagon, Triangle)
    • 50+ built-in colormaps plus custom color palette support
    • Adjustable dimensions, font sizes, and background colors
    • Custom image shapes
  • Save & Export:
    • Save word lists as CSV for future editing
    • Export word clouds as PNG or SVG
  • Persistent Settings: Application remembers your preferences between sessions

Requirements

  • Python 3.8 or higher
  • See requirements.txt for Python package dependencies

Installation

1. Clone the Repository

git clone <repository-url>
cd WordCloud

2. Create Virtual Environment (Recommended)

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Usage

Running the Application

python src/pdf_wordcloud.py

Workflow

  1. Open a PDF File

    • Click "Open PDF" or use Ctrl+O
    • Select any PDF file from your computer
    • The application automatically extracts text from all pages
  2. Generate Word Cloud

    • Adjust settings in the left control panel (optional)
    • Click "Generate" (Ctrl+G) to build the word cloud
    • The frequency table updates automatically
  3. Refine the Results

    • View the "Word Frequency" tab to see all extracted words
    • Double-click frequency values to edit them manually
    • Check the "Exclude" checkbox to remove specific words
    • Click "Update" to regenerate with your changes
  4. Save Your Work

    • Save the word list as CSV: Click "Save List" (Ctrl+S)
    • Export the image: Click "Save Cloud" (Ctrl+Shift+S)
    • Both PNG and SVG formats are supported

Advanced Features

Reset to Original

If you want to clear all manual edits and regenerate from the source:

  • Go to Cloud menu → Reset List (Ctrl+Shift+L)
  • This clears all exclusions and custom frequencies

Load a Previously Saved Word List

  • Click "Open List" or use Ctrl+Shift+O
  • Select a CSV file previously saved by the application
  • The word list and all settings are restored
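The save/load round trip can be sketched as follows. This is only an illustration: the README does not document the CSV layout, so the column names `word`, `frequency`, and `exclude` and all three helper functions are hypothetical, and the settings that the application also restores are left out.

```python
import csv
import io

# Hypothetical column names; the application's real CSV layout may differ.
FIELDS = ["word", "frequency", "exclude"]


def write_word_list(rows, csv_file):
    """Write rows (dicts with word, frequency, exclude) to an open CSV file."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "word": row["word"],
            "frequency": row["frequency"],
            "exclude": int(row["exclude"]),  # store the checkbox as 0 or 1
        })


def read_word_list(csv_file):
    """Read rows back, converting frequency to int and exclude to bool."""
    rows = []
    for record in csv.DictReader(csv_file):
        rows.append({
            "word": record["word"],
            "frequency": int(record["frequency"]),
            "exclude": record["exclude"] == "1",
        })
    return rows


def active_frequencies(rows):
    """Map each non-excluded word to its (possibly hand-edited) frequency."""
    return {row["word"]: row["frequency"] for row in rows if not row["exclude"]}


rows = [
    {"word": "granite", "frequency": 12, "exclude": False},
    {"word": "figure", "frequency": 30, "exclude": True},
]
buffer = io.StringIO(newline="")
write_word_list(rows, buffer)
buffer.seek(0)
restored = read_word_list(buffer)
```

A dictionary like the one returned by `active_frequencies` is the kind of input that word_cloud's `generate_from_frequencies` method accepts, which is presumably how edited lists are turned back into an image.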

Randomize Layout

  • Toggle the "Randomize" button to enable/disable random layout generation
  • When enabled: generates a different layout each time you update
  • When disabled: uses a fixed seed for consistent layouts
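The difference between the two modes can be sketched with Python's standard `random` module. The names `FIXED_SEED`, `layout_seed`, and `sample_positions` are hypothetical and stand in for the real layout step; the seed value the application actually uses is not documented.

```python
import random

FIXED_SEED = 42  # hypothetical value; the application's real seed is not documented


def layout_seed(randomize):
    """Return None for a fresh layout each run, or a fixed seed for a repeatable one."""
    return None if randomize else FIXED_SEED


def sample_positions(seed, count=3, width=800, height=600):
    """Stand-in for a layout step: draw `count` pseudo-random (x, y) positions."""
    rng = random.Random(seed)  # seed=None seeds from OS entropy or the clock
    return [(rng.randrange(width), rng.randrange(height)) for _ in range(count)]


# With "Randomize" disabled, two runs give the same positions.
first = sample_positions(layout_seed(randomize=False))
second = sample_positions(layout_seed(randomize=False))
```

word_cloud exposes a `random_state` argument on its `WordCloud` constructor, so a value chosen this way could plausibly be passed there, although the README does not say how the application wires it up.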

Configuration

Word Cloud Settings

  • Size: Set width and height of the generated word cloud (default: 800x600)
  • Background Color: Choose the background color
  • Colormap: Select from 50+ matplotlib colormaps or custom palettes
  • Font Size: Set minimum and maximum font sizes (default: 10-150 pt)

Text Filtering

  • Min Word Length: Filter out words shorter than the specified length (default: 1)
  • Max Words: Limit the number of words displayed (default: 200)
  • Include Numbers: Toggle whether numeric words are included
  • Ignore Stop Words: Automatically remove common words (articles, prepositions, etc.)
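The four filters above can be sketched as a single counting function. This is a minimal illustration, not the application's code: `count_words`, its parameter names, the tokenizing pattern, and the tiny `STOP_WORDS` set are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "it", "is"}


def count_words(text, min_length=1, max_words=200,
                include_numbers=True, ignore_stop_words=True):
    """Return the most frequent words in `text` as (word, count) pairs."""
    # Lowercase, then split into runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for token in tokens:
        if len(token) < min_length:
            continue  # Min Word Length
        if not include_numbers and token.isdigit():
            continue  # Include Numbers switched off
        if ignore_stop_words and token in STOP_WORDS:
            continue  # Ignore Stop Words switched on
        counts[token] += 1
    return counts.most_common(max_words)  # Max Words


text = "The cat and the hat in 2024: cat cat hat 2024"
top_words = count_words(text, include_numbers=False)
```

Applying the length, number, and stop-word checks before counting keeps the Max Words limit meaningful, since filtered words never compete for a place in the top list.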

Shape Options

  • Rectangle (default)
  • Circle
  • Heart
  • Star
  • Diamond
  • Hexagon
  • Triangle
  • Custom Image (load your own mask image)
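A shape such as Circle can be expressed as a mask, as in this sketch. `circle_mask` is a hypothetical helper written in plain Python; how the application builds its own shapes is not documented. The 0/255 values follow word_cloud's convention that pure white mask pixels are masked out and other pixels are free to draw on.

```python
def circle_mask(width, height):
    """Build a mask as rows of 0 (drawable) and 255 (white, masked out)."""
    cx, cy = (width - 1) / 2, (height - 1) / 2  # center of the pixel grid
    radius = min(width, height) / 2
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            row.append(0 if inside else 255)
        mask.append(row)
    return mask


mask = circle_mask(9, 9)
```

Before use with word_cloud, the nested list would need converting to an array, for example with `numpy.array(mask, dtype=numpy.uint8)`, and passing as the `mask` argument of `WordCloud`. A custom image would take the same route, loaded from a file instead of computed.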

File Structure

WordCloud/
├── README.md                      # This file
├── SPEC.md                        # Detailed specification
├── requirements.txt               # Python dependencies
├── src/
│   └── pdf_wordcloud.py          # Main application
├── resources/
│   ├── custom_colormaps.csv      # Custom color palettes
│   ├── settings.json             # Application settings
│   └── icons/                    # UI icons
└── saved/
    └── GDA_wordlist.csv          # Example saved word list
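Persisting preferences in a file such as resources/settings.json can be sketched as below. The default values are the ones listed under Configuration, but the key names and both helper functions are hypothetical; the real file may be organized differently.

```python
import json
from pathlib import Path

# Defaults taken from the Configuration section; the key names are hypothetical.
DEFAULTS = {
    "width": 800,
    "height": 600,
    "min_font_size": 10,
    "max_font_size": 150,
    "min_word_length": 1,
    "max_words": 200,
}


def load_settings(path):
    """Return saved settings merged over the defaults."""
    settings = dict(DEFAULTS)
    path = Path(path)
    if path.exists():
        try:
            saved = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            saved = {}  # fall back to defaults if the file is corrupt
        if isinstance(saved, dict):
            settings.update(saved)
    return settings


def save_settings(path, settings):
    """Write settings as indented JSON, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

Merging saved values over a defaults dictionary means a missing or partial file still yields a complete set of settings on first launch.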

Keyboard Shortcuts

Shortcut        Action
Ctrl+O          Open PDF
Ctrl+Shift+O    Open Word List
Ctrl+S          Save List
Ctrl+Shift+S    Save Cloud
Ctrl+G          Generate Word Cloud
Ctrl+Shift+L    Reset List (menu only)

Creating Custom Color Palettes

Edit resources/custom_colormaps.csv to create custom color schemes:

name,color1,color2,color3,...
ocean,#1a5276,#2874a6,#5dade2
sunset,#d04526,#f39c12,#f9e79f

Each row defines a colormap with a name and 2-5 hex colors.
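A file in this format could be read and validated as in the following sketch. `load_palettes` and `HEX_COLOR` are hypothetical names, and the code assumes that the first row is a header and that colors are six-digit hex values, as in the example rows.

```python
import csv
import io
import re

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def load_palettes(csv_file):
    """Read palettes from an open CSV file into {name: [hex colors]}."""
    palettes = {}
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (name,color1,color2,...)
    for row in reader:
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue  # ignore blank lines
        name, colors = cells[0], cells[1:]
        if not 2 <= len(colors) <= 5:
            raise ValueError(f"{name}: expected 2-5 colors, got {len(colors)}")
        for color in colors:
            if not HEX_COLOR.match(color):
                raise ValueError(f"{name}: invalid hex color {color!r}")
        palettes[name] = colors
    return palettes


sample = io.StringIO(
    "name,color1,color2,color3\n"
    "ocean,#1a5276,#2874a6,#5dade2\n"
    "sunset,#d04526,#f39c12,#f9e79f\n"
)
palettes = load_palettes(sample)
```

Each color list could then be turned into a matplotlib colormap with `matplotlib.colors.LinearSegmentedColormap.from_list(name, colors)`, which is one plausible way to make a palette selectable alongside the built-in colormaps.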

Troubleshooting

Application won't start

  • Ensure Python 3.8+ is installed: python --version
  • Verify all dependencies are installed: pip install -r requirements.txt
  • Try reinstalling PyQt6: pip install --upgrade PyQt6

PDF text extraction fails

  • Ensure the PDF is not encrypted or corrupted
  • Try opening the PDF with a PDF reader first to confirm it's readable
  • Scanned PDFs that contain only page images require OCR, which this application does not support

Word cloud is empty

  • Check your filter settings (minimum word length, stop words, etc.)
  • Ensure the PDF contains extractable text (not an image-only scan)
  • Try disabling stop word filtering to see if that helps

Author

Derrick Hasterok

License

MIT License - see LICENSE file for details

Note: This project uses PyQt6 for the GUI. PyQt6 is licensed under GPLv3 unless a commercial license is purchased, so redistribution of this application may be subject to PyQt6 licensing restrictions.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.

Future Enhancements

  • Real-time preview updates
  • Batch processing for multiple PDFs
  • Word frequency statistics and analytics
  • Theme customization
  • Additional input file types (txt, docx, etc.)
  • Color picker for backgrounds and custom colormap generation
