This project contains tools and extracted text from the JFK assassination records released by the National Archives. It includes scripts for downloading the original PDF files and converting them to searchable text format. The unorthodox directory structure of the extracted text mirrors the source to make it easier to reference or link files back to the original on archives.gov.
.
├── downloader_scripts/ # Scripts for downloading files from the National Archives
│ ├── csv/ # csv files listing direct download URLs for all files
│ └── xlsx/ # xlsx files from the National Archives with additional details about each file
├── extraction_scripts/ # Scripts for converting PDF to text
│ ├── linux/ # Linux-specific extraction tools
│ ├── macOS/ # macOS-specific extraction tools
│ └── find_missing.py # Utility to find missing conversions
└── extracted_text/ # Extracted text content
    └── releases/ # 2017 release
        ├── additional/ # 2017 release
        ├── 2018/ # 2018 release
        ├── 2021/ # 2021 release
        ├── 2022/ # 2022 release
        ├── 2023/ # 2023 release
        └── 2025/0318/ # 2025 release
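Because extracted_text/ mirrors the source layout, the relative path of a text file can be turned back into a candidate link to the original PDF. The sketch below is illustrative only: the base URL, the `.txt` extension, the helper name `source_url_for`, and the example file name are all assumptions, not something this repository defines.

```python
from pathlib import PurePosixPath

# Assumption: the source PDFs are served under this prefix on archives.gov.
ASSUMED_BASE_URL = "https://www.archives.gov/files/research/jfk/"


def source_url_for(text_path, root="extracted_text", base_url=ASSUMED_BASE_URL):
    """Map a path under extracted_text/ to the candidate URL of its source PDF.

    Hypothetical helper: assumes each text file keeps the PDF's name and
    differs only in its extension.
    """
    relative = PurePosixPath(text_path).relative_to(root)  # ValueError if outside root
    return base_url + relative.with_suffix(".pdf").as_posix()


# "example-file" is a made-up name used only to show the mapping.
url = source_url_for("extracted_text/releases/2025/0318/example-file.txt")
print(url)
```

Keeping the mapping a pure string transformation means it can be checked without network access; whether the resulting URL actually resolves still has to be verified against archives.gov.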
| Release Year | Status | Extraction Method | Files Downloaded | Size | Total Files Listed |
|---|---|---|---|---|---|
| 2025 | ✅ Complete | Apple Vision OCR | 2,566 | 8.12GB | 2,566 |
| 2023 | ✅ Complete | Apple Vision OCR | 2,693 | 6.20GB | 2,693 |
| 2022 | ✅ Complete | Apple Vision OCR | 13,199 | 14.15GB | 13,199 |
| 2021 | ✅ Complete | Apple Vision OCR | 1,484 | 1.36GB | 1,484 |
| 2017-2018 | ✅ Complete | Apple Vision OCR | 53,543 | 57.18GB | 53,547 |
Note: 34 files in the 2022 release and 5 files in the 2021 release are tied to multiple record numbers in the .xlsx files, so those files have more rows than unique file names (13,263 and 1,491 rows, respectively). The 2017-2018 release .xlsx file contains 6 bad links, and the 2017-2018 release website lists two files in the /additional path that are not included in the .xlsx file. The 2017-2018 release also contains 19 audio files (17 .wav, 2 .mp3). Transcripts of the two .mp3 files are included. The 17 .wav files are of very poor quality with lots of blank space (they may be added later).
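The gap between row counts and unique file names described in the note can be checked with a few lines of Python. This is a minimal sketch, assuming the spreadsheet has been exported to CSV and that the file names sit in a column called `File Name`; the column name, the helper name, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_file_names(csv_file, column="File Name"):
    """Return (row_count, unique_count, names_listed_more_than_once)."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[column].strip() for row in reader)
    repeated = sorted(name for name, n in counts.items() if n > 1)
    return sum(counts.values()), len(counts), repeated


# Made-up sample: three rows that refer to only two distinct files.
sample = io.StringIO(
    "File Name,Record Number\n"
    "a.pdf,100-1\n"
    "a.pdf,100-2\n"
    "b.pdf,100-3\n"
)
rows, unique, repeated = summarize_file_names(sample)
print(rows, unique, repeated)  # 3 2 ['a.pdf']
```

Run against a real export, the difference `rows - unique` gives the number of extra rows, and `repeated` lists the files tied to more than one record number.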
- Python 3.6 or later
- System-specific dependencies (see individual script READMEs)
- Clone the repository:
git clone https://github.com/yourusername/jfk-files-text.git
cd jfk-files-text
- Install Python dependencies:
pip install -r requirements.txt
- Install system dependencies as needed (see individual script READMEs)
Use the appropriate downloader script from downloader_scripts/ based on the release year:
python downloader_scripts/jfk-2025-pdf-downloader.py
Choose the appropriate extraction method based on your operating system:
python extraction_scripts/macOS/apple_vision_ocr/apple_vision_pdf_to_text.py
python extraction_scripts/linux/linux_pdf_to_text.py
To check for any missing conversions:
python extraction_scripts/find_missing.py
- Release Format Variations
  - The release page formats are inconsistent
  - No .xlsx file is available for the 2025 release
  - Previous releases have .xlsx files with inconsistent formats
- Duplicate Files
  - The 2017-2018 release contains duplicate file names on the website and in the .xlsx file
  - The 2017-2018 .xlsx file contains 54,636 line items, including some duplicate file names
- Missing Files
  - The 2017-2018 .xlsx file has 54,604 line items (6 bad links)
  - The 2017-2018 website lists 54,601 line items (3 bad links, two additional files not referenced in the .xlsx file)
- OCR Errors
  - The extracted text contains a substantial number of OCR errors due to the low quality of many of the input files.
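Given the bad links and missing files listed above, it can help to confirm that every downloaded PDF produced a text file. The sketch below is not the code in extraction_scripts/find_missing.py; it is a minimal illustration that assumes the text tree mirrors the PDF tree and that text files use a `.txt` extension. The function name and directory arguments are hypothetical.

```python
from pathlib import Path


def find_missing_conversions(pdf_dir, text_dir, text_suffix=".txt"):
    """Return relative paths of PDFs that have no matching text file.

    Assumes text_dir mirrors pdf_dir, with only the file extension changed.
    """
    pdf_root, text_root = Path(pdf_dir), Path(text_dir)
    missing = []
    # Note: the "*.pdf" pattern is case-sensitive on most Linux filesystems.
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        relative = pdf.relative_to(pdf_root)
        expected = (text_root / relative).with_suffix(text_suffix)
        if not expected.is_file():
            missing.append(relative.as_posix())
    return missing
```

A call such as `find_missing_conversions("pdfs", "extracted_text")` (both directory names are placeholders) returns a sorted list that can be fed back into the extraction step.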
- Total archive size: 87 GB
- Total files: 73,485
- Extracted text available at: jfk-files-text
- Available as a data set on Hugging Face: https://huggingface.co/datasets/mysocratesnote/jfk-files-text/
A simple WebUI to query the archive using the DeepSeek R1 Distill Llama 70B LLM is available at https://jfkfiles.app.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- The National Archives for providing the JFK assassination records
- Contributors to the various open-source tools used in this project