Tired of using `sort input.txt | uniq > output.txt`, I wanted to create a cross-OS script that could read any possible file, take each word once, and list them all in a word list.
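As a rough illustration of the "take each word once" idea, here is a minimal sketch in plain Python. The function name `unique_words` and the `\w+` tokenisation rule are assumptions made for this example; they are not taken from Pukeko's source.

```python
import re


def unique_words(text: str) -> list[str]:
    """Return each distinct word in `text` once, sorted alphabetically."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware).
    # Assumption: this is one simple way to split words, not Pukeko's rule.
    return sorted(set(re.findall(r"\w+", text)))


if __name__ == "__main__":
    for word in unique_words("the cat and the hat"):
        print(word)
```

Unlike `sort | uniq`, which deduplicates whole lines, this works word by word, so it does not need the input to be pre-split into one word per line.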
| Package | Purpose | Install |
|---|---|---|
| python-magic | Detect plain text files by content | `pip install python-magic` |
| pyxtxt | Extract text from all document and image formats | `pip install pyxtxt` |
| openai-whisper | Transcribe audio and video locally | `pip install openai-whisper` |
| colorama | Colour terminal output on all platforms | `pip install colorama` |
Install all at once:
pip install python-magic openai-whisper colorama pyxtxt
For broader format coverage install pyxtxt with optional extras:
pip install "pyxtxt[pdf,docx,presentation,spreadsheet,html,ocr]"
| Tool | Purpose | Without it |
|---|---|---|
| Tesseract OCR | Extract text from images (.jpg, .png, .gif, .tif) | Images will be skipped |
| ffmpeg | Decode audio and video for Whisper | Audio/video will not work |
Pukeko will warn you at startup if any system tool is missing.
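A startup check of this kind can be sketched with the standard library's `shutil.which`. The `SYSTEM_TOOLS` mapping, the `missing_tools` helper, and the warning text are hypothetical names chosen for this example; Pukeko's own check may be written differently.

```python
import shutil

# Hypothetical mapping: executable name -> what is lost without it
# (wording taken from the table above).
SYSTEM_TOOLS = {
    "tesseract": "images will be skipped",
    "ffmpeg": "audio/video will not work",
}


def missing_tools(tools=SYSTEM_TOOLS, which=shutil.which):
    """Return the subset of `tools` whose executable is not found on PATH."""
    # `which` is injectable so the lookup can be replaced in tests.
    return {name: effect for name, effect in tools.items() if which(name) is None}


if __name__ == "__main__":
    for name, effect in missing_tools().items():
        print(f"Warning: {name} not found on PATH, {effect}")
```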
- `pip install python-magic openai-whisper colorama pyxtxt`
- Install Tesseract OCR for image support
- Install ffmpeg for audio/video support
- python-magic: as if the world wasn't complicated enough, there are two 'Magic' libraries. The right one is available on GitHub and on pypi.python.org.
- openai-whisper: runs fully locally — no API key, no internet, no token limits. See the GitHub repo for details. A GPU is optional but speeds up transcription significantly.
- pyxtxt: supports many formats out of the box; install with extras (`pyxtxt[ocr]`, `pyxtxt[pdf]`, etc.) for broader coverage. See the pyxtxt PyPI page for the full list.
If the situation gets tragic, open an issue and I will help you troubleshoot.
Pukeko can currently parse:
Documents & images (via pyxtxt): '.csv', '.doc', '.docx', '.eml', '.epub', '.gif', '.htm', '.html', '.jpeg', '.jpg', '.json', '.log', '.msg', '.odt', '.pdf', '.png', '.pptx', '.ps', '.psv', '.rtf', '.tff', '.tif', '.tiff', '.tsv', '.txt', '.xls', '.xlsx'
Audio & video (via openai-whisper, transcribed locally): '.mp3', '.wav', '.ogg', '.flac', '.m4a', '.aac', '.wma', '.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv', '.webm', '.m4v'
Plain text files (via python-magic): any file identified as plain text by the system, e.g. .py, .js, .xml, shell scripts, config files, and other text-based formats.
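The three categories above suggest a dispatch on file extension, with content detection as the fallback. The sketch below only illustrates that routing idea: the `classify` function, its return labels, and the shortened extension sets are assumptions for this example, and the fallback branch stands in for the python-magic check rather than calling it.

```python
from pathlib import Path

# Shortened, illustrative subsets of the extension lists above.
DOCUMENT_EXTS = {".csv", ".docx", ".pdf", ".png", ".txt", ".xlsx"}
MEDIA_EXTS = {".mp3", ".wav", ".flac", ".mp4", ".mkv", ".webm"}


def classify(path: str) -> str:
    """Pick a (hypothetical) handler label for `path` from its extension."""
    ext = Path(path).suffix.lower()  # case-insensitive: ".PDF" == ".pdf"
    if ext in DOCUMENT_EXTS:
        return "pyxtxt"
    if ext in MEDIA_EXTS:
        return "whisper"
    # Anything else would be handed to content-based detection.
    return "magic"


if __name__ == "__main__":
    for name in ("Report.PDF", "talk.mp4", "script.py"):
        print(name, "->", classify(name))
```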
The -model flag lets you choose the Whisper model size for audio/video transcription:
| Model | Speed | Accuracy |
|---|---|---|
| tiny | fastest | lowest |
| base | fast | low |
| small | balanced | good (default) |
| medium | slow | very good |
| large | slowest | best |
Example: python Pukeko.py -input /path/to/files -output wordlist.txt -model medium
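For readers curious how single-dash flags like these can be declared, here is a minimal `argparse` sketch. The flag names, the model choices, and the `small` default come from this README; the `build_parser` helper, the help strings, and marking `-input` and `-output` as required are assumptions, not Pukeko's actual code.

```python
import argparse

MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the example above."""
    parser = argparse.ArgumentParser(prog="Pukeko.py")
    # Assumption: both paths are mandatory.
    parser.add_argument("-input", required=True, help="folder or file to read")
    parser.add_argument("-output", required=True, help="wordlist file to write")
    parser.add_argument("-model", choices=MODELS, default="small",
                        help="Whisper model size for audio/video")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.model)
```

Restricting `-model` with `choices` means a typo such as `-model midium` is rejected at startup rather than after the documents have already been processed.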
Have a look at my YouTube presentation:
In my spare time, my TODO list is:
- add option `-URL` to create wordlists from a target web page like CeWL
- add option `-site` to create wordlists from a target website
- add option `Leet` (or `1337`), also known as `eleet` or `leetspeak` (so many passwords are weak because of leetspeak)
- add multilanguage (`pip install alphabet-detector`)
- add highlight HotWords in string
- add e-mail to HotWords
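To give an idea of what the planned leetspeak option might produce, here is a small sketch. The substitution table and the `to_leet` function are purely illustrative assumptions; the feature is not implemented yet and may end up using a different mapping or generating several variants per word.

```python
# Hypothetical substitution table; real leetspeak has many more variants.
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def to_leet(word: str) -> str:
    """Return one leetspeak variant of `word` (lower-cased first)."""
    return word.lower().translate(LEET_MAP)


if __name__ == "__main__":
    print(to_leet("password"))  # p455w0rd
```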