A desktop application that extracts text from PDF files and generates beautiful, customizable word clouds with interactive frequency analysis. It is a PyQt6-based GUI wrapper around amueller's word_cloud.
- PDF Text Extraction: Extract and analyze text from multi-page PDF documents
- Interactive Word Cloud: Generate stunning visualizations with multiple shape options and color schemes
- Frequency Analysis: View and edit word frequencies in an interactive table
- Advanced Filtering:
  - Filter by word length
  - Toggle number inclusion
  - Automatic stop word removal (articles, prepositions, pronouns, etc.)
  - Custom word exclusion
- Customizable Visualization:
  - Multiple shape options (Rectangle, Circle, Heart, Star, Diamond, Hexagon, Triangle)
  - 50+ built-in colormaps plus custom color palette support
  - Adjustable dimensions, font sizes, and background colors
  - Custom image shapes
- Save & Export:
  - Save word lists as CSV for future editing
  - Export word clouds as PNG or SVG
- Persistent Settings: Application remembers your preferences between sessions
- Python 3.8 or higher
- See requirements.txt for Python package dependencies
git clone <repository-url>
cd WordCloud
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
python src/pdf_wordcloud.py
- Open a PDF File
  - Click "Open PDF" or use Ctrl+O
  - Select any PDF file from your computer
  - The application automatically extracts text from all pages
- Generate Word Cloud
  - Adjust settings in the left control panel (optional)
  - Click "Generate" (Ctrl+G) to create the word cloud
  - The frequency table updates automatically
- Refine the Results
  - View the "Word Frequency" tab to see all extracted words
  - Double-click frequency values to edit them manually
  - Check the "Exclude" checkbox to remove specific words
  - Click "Update" to regenerate with your changes
- Save Your Work
  - Save the word list as CSV: Click "Save List" (Ctrl+S)
  - Export the image: Click "Save Cloud" (Ctrl+Shift+S)
  - Both PNG and SVG formats are supported
If you want to clear all manual edits and regenerate from the source:
- Go to Cloud menu → Reset List (Ctrl+Shift+L)
- This clears all exclusions and custom frequencies
- Click "Open List" or use Ctrl+Shift+O
- Select a CSV file previously saved by the application
- The word list and all settings are restored
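The exact CSV layout written by the application is not documented here. As a rough sketch only, assuming a hypothetical three-column layout (word, frequency, exclude), a word list could be saved and read back with Python's standard csv module like this:

```python
import csv
import io

# Hypothetical column layout; the application's real CSV format may differ.
FIELDS = ["word", "frequency", "exclude"]


def write_word_list(rows, stream):
    """Write (word, frequency, exclude) tuples as CSV with a header row."""
    writer = csv.writer(stream)
    writer.writerow(FIELDS)
    for word, frequency, exclude in rows:
        writer.writerow([word, frequency, int(exclude)])


def read_word_list(stream):
    """Read the CSV back into a list of (word, frequency, exclude) tuples."""
    reader = csv.DictReader(stream)
    return [
        (row["word"], int(row["frequency"]), row["exclude"] == "1")
        for row in reader
    ]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_word_list([("cloud", 12, False), ("the", 40, True)], buffer)
buffer.seek(0)
restored = read_word_list(buffer)
```

Keeping the exclusion flag next to each frequency is one way manual edits could survive a save and reload; how the application actually stores them is an assumption here.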
- Toggle the "Randomize" button to enable/disable random layout generation
- When enabled: generates a different layout each time you update
- When disabled: uses a fixed seed for consistent layouts
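The effect of the toggle can be illustrated with a small seeded-layout sketch. The names `FIXED_SEED`, `layout_seed`, and `place_words` are hypothetical, and `place_words` is a toy stand-in for a real layout engine. In word_cloud itself, the `WordCloud` constructor accepts a `random_state` argument that could receive such a seed; whether the application does exactly this is an assumption.

```python
import random

# Hypothetical value; the seed the application actually uses is not documented.
FIXED_SEED = 42


def layout_seed(randomize):
    """Return a fresh seed when randomizing, otherwise the fixed seed."""
    if randomize:
        return random.randrange(2**32)
    return FIXED_SEED


def place_words(words, seed, width=800, height=600):
    """Toy stand-in for a layout engine: seeded pseudo-random positions."""
    rng = random.Random(seed)
    return [
        (word, rng.randint(0, width - 1), rng.randint(0, height - 1))
        for word in words
    ]


words = ["cloud", "word", "pdf"]
# With the toggle off, the same seed reproduces the same layout.
first = place_words(words, layout_seed(False))
second = place_words(words, layout_seed(False))
```

Here `first` and `second` are identical because both use the fixed seed. With `layout_seed(True)` each call draws a new seed, so the positions would usually differ between updates.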
- Size: Set width and height of the generated word cloud (default: 800x600)
- Background Color: Choose the background color
- Colormap: Select from 50+ matplotlib colormaps or custom palettes
- Font Size: Set minimum and maximum font sizes (default: 10-150 pt)
- Min Word Length: Filter out words shorter than specified length (default: 1)
- Max Words: Limit the number of words displayed (default: 200)
- Include Numbers: Toggle whether numeric words are included
- Ignore Stop Words: Automatically remove common words (articles, prepositions, etc.)
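To make the filter settings concrete, here is a minimal sketch of how they might combine when counting words. The function name `count_words`, its parameters, and the tiny `STOP_WORDS` set are illustrative assumptions, not the application's actual code or stop list.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the application's real list is assumed to be larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "it", "is"}


def count_words(text, min_length=1, max_words=200,
                include_numbers=False, ignore_stop_words=True):
    """Tokenize text and return the most frequent words after filtering."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for token in tokens:
        if len(token) < min_length:
            continue  # Min Word Length
        if not include_numbers and token.isdigit():
            continue  # Include Numbers is off
        if ignore_stop_words and token in STOP_WORDS:
            continue  # Ignore Stop Words is on
        counts[token] += 1
    # Max Words: keep only the most frequent entries.
    return counts.most_common(max_words)


sample = "The cloud of words: the word cloud has 3 clouds in 2024."
frequencies = dict(count_words(sample, min_length=3))
```

A dictionary like `frequencies` is the kind of input that word_cloud's `generate_from_frequencies` method accepts, which is one plausible way the frequency table and the rendered cloud could stay in sync.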
- Rectangle (default)
- Circle
- Heart
- Star
- Diamond
- Hexagon
- Triangle
- Custom Image (load your own mask image)
WordCloud/
├── README.md # This file
├── SPEC.md # Detailed specification
├── requirements.txt # Python dependencies
├── src/
│ └── pdf_wordcloud.py # Main application
├── resources/
│ ├── custom_colormaps.csv # Custom color palettes
│ ├── settings.json # Application settings
│ └── icons/ # UI icons
└── saved/
└── GDA_wordlist.csv # Example saved word list
| Shortcut | Action |
|---|---|
| Ctrl+O | Open PDF |
| Ctrl+Shift+O | Open Word List |
| Ctrl+S | Save List |
| Ctrl+Shift+S | Save Cloud |
| Ctrl+G | Generate Word Cloud |
| Ctrl+Shift+L | Reset List (menu only) |
Edit resources/custom_colormaps.csv to create custom color schemes:
name,color1,color2,color3,...
ocean,#1a5276,#2874a6,#5dade2
sunset,#d04526,#f39c12,#f9e79f
Each row defines a colormap with a name and 2-5 hex colors.
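As a sketch of how such a file could be validated before use, the following parser reads rows in the format above and enforces the 2-5 color rule. The function name `load_palettes` and the error messages are hypothetical; the application's own loader may behave differently.

```python
import csv
import io
import re

HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def load_palettes(stream):
    """Parse rows of 'name,color1,color2,...' into {name: [hex colors]}."""
    palettes = {}
    for row in csv.reader(stream):
        if not row or row[0].strip().lower() == "name":
            continue  # skip blank lines and the header row
        name = row[0].strip()
        colors = [cell.strip() for cell in row[1:] if cell.strip()]
        if not 2 <= len(colors) <= 5:
            raise ValueError(f"{name}: expected 2-5 colors, got {len(colors)}")
        for color in colors:
            if not HEX_COLOR.fullmatch(color):
                raise ValueError(f"{name}: invalid hex color {color!r}")
        palettes[name] = colors
    return palettes


# In-memory stand-in for resources/custom_colormaps.csv.
example = io.StringIO(
    "name,color1,color2,color3\n"
    "ocean,#1a5276,#2874a6,#5dade2\n"
    "sunset,#d04526,#f39c12,#f9e79f\n"
)
palettes = load_palettes(example)
```

A color list such as `palettes["ocean"]` could then be turned into a matplotlib colormap, for example with `LinearSegmentedColormap.from_list`; whether the application builds its custom colormaps that way is an assumption.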
- Ensure Python 3.8+ is installed: python --version
- Verify all dependencies are installed: pip install -r requirements.txt
- Try reinstalling PyQt6: pip install --upgrade PyQt6
- Ensure the PDF is not encrypted or corrupted
- Try opening the PDF with a PDF reader first to confirm it's readable
- Some scanned PDFs require OCR (not supported by this application)
- Check your filter settings (minimum word length, stop words, etc.)
- Ensure the PDF contains extractable text (not an image-only scan)
- Try disabling stop word filtering to see if that helps
Derrick Hasterok
MIT License - see LICENSE file for details
Note: This project uses PyQt6 for the GUI. PyQt6 is licensed under GPLv3 unless a commercial license is purchased, so redistribution of this application may be subject to PyQt6 licensing restrictions.
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
- Real-time preview updates
- Batch processing for multiple PDFs
- Word frequency statistics and analytics
- Theme customization
- Additional input file types (txt, docx, etc.)
- Color picker for backgrounds and custom colormap generation
