JFK Files Text Extraction Project

This project contains tools and extracted text from the JFK assassination records released by the National Archives. It includes scripts for downloading the original PDF files and converting them to searchable text format. The unorthodox directory structure of the extracted text mirrors the source to make it easier to reference or link files back to the original on archives.gov.

Project Structure

.
├── downloader_scripts/        # Scripts for downloading files from the National Archives
│   ├── csv/                   # csv files listing direct download URLs for all files
│   └── xlsx/                  # xlsx files from the National Archives with additional details about each file
├── extraction_scripts/        # Scripts for converting PDF to text
│   ├── linux/                 # Linux-specific extraction tools
│   ├── macOS/                 # macOS-specific extraction tools
│   └── find_missing.py        # Utility to find missing conversions
└── extracted_text/            # Extracted text content               
    └── releases/              # 2017 release 
        ├── additional/        # 2017 release 
        ├── 2018/              # 2018 release 
        ├── 2021/              # 2021 release
        ├── 2022/              # 2022 release
        ├── 2023/              # 2023 release
        └── 2025/0318/         # 2025 release
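
Because the layout under extracted_text/ mirrors the source, a text file's path can in principle be turned back into a link to its original PDF. The sketch below illustrates the idea. The base URL, the .txt suffix, and the file name example-record are assumptions made for illustration, not values taken from this repository, so check them against a real link before relying on the result.

```python
from pathlib import PurePosixPath

# Assumption: PDFs are served under this prefix on archives.gov.
ASSUMED_BASE_URL = "https://www.archives.gov/files/research/jfk"


def source_url_for(text_path: str, base_url: str = ASSUMED_BASE_URL) -> str:
    """Map an extracted text path back to the PDF URL it presumably came from."""
    parts = PurePosixPath(text_path).parts
    if "releases" not in parts:
        raise ValueError(f"no 'releases' directory in {text_path!r}")
    # Keep everything from 'releases/' onward, e.g. releases/2025/0318/<name>.txt
    relative = PurePosixPath(*parts[parts.index("releases"):])
    # Swap the text suffix for .pdf and append to the assumed base URL.
    return f"{base_url}/{relative.with_suffix('.pdf')}"


if __name__ == "__main__":
    # 'example-record' is a hypothetical file name.
    print(source_url_for("extracted_text/releases/2025/0318/example-record.txt"))
```

The function only rewrites the path. It does not check that the resulting URL exists.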

Current Status

Release Year	Status	Extraction Method	Files Downloaded	Size	Total Files Listed
2025	✅ Complete	Apple Vision OCR	2,566	8.12GB	2,566
2023	✅ Complete	Apple Vision OCR	2,693	6.20GB	2,693
2022	✅ Complete	Apple Vision OCR	13,199	14.15GB	13,199
2021	✅ Complete	Apple Vision OCR	1,484	1.36GB	1,484
2017-2018	✅ Complete	Apple Vision OCR	53,543	57.18GB	53,547

Note: 34 files in the 2022 release and 5 files in the 2021 release are tied to multiple record numbers, so the corresponding .xlsx files have more rows than unique file names (13,263 and 1,491 rows respectively). The 2017-2018 release .xlsx file contains 6 bad links, while the 2017-2018 release website lists two files under the /additional path that are not included in the .xlsx. The 2017-2018 release also contains 19 audio files (17 .wav, 2 .mp3). Transcripts of the two .mp3 files are included. The 17 .wav files are of very poor quality with lots of blank space (they may be added later).
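
The gap between row counts and unique file names described above can be checked with a few lines of Python. This is a minimal sketch, not one of the project's scripts. It assumes a CSV listing with a column named url (a hypothetical column name) whose last path segment is the file name.

```python
import csv
from collections import Counter


def summarize_listing(handle, url_column="url"):
    """Return (row_count, unique_file_count, repeated_names) for a URL listing.

    `handle` is an open text file or any iterable of CSV lines.
    `url_column` is an assumed column name; adjust it to the real header.
    """
    # The file name is taken to be the last segment of each URL.
    names = [row[url_column].rsplit("/", 1)[-1] for row in csv.DictReader(handle)]
    counts = Counter(names)
    # File names that appear on more than one row, with their row counts.
    repeated = {name: n for name, n in counts.items() if n > 1}
    return len(names), len(counts), repeated


if __name__ == "__main__":
    import io

    # Tiny in-memory listing with hypothetical file names.
    sample = io.StringIO(
        "url\n"
        "https://example.org/releases/a.pdf\n"
        "https://example.org/releases/b.pdf\n"
        "https://example.org/releases/a.pdf\n"
    )
    print(summarize_listing(sample))
```

Pass a real listing with `open(path, newline="", encoding="utf-8")`. A row count larger than the unique count signals files that are tied to several record numbers.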

Getting Started

Prerequisites

Python 3.6 or later
System-specific dependencies (see individual script READMEs)

Installation

Clone the repository:

git clone https://github.com/yourusername/jfk-files-text.git
cd jfk-files-text

Install Python dependencies:

pip install -r requirements.txt

Install system dependencies as needed (see individual script READMEs)

Usage

Downloading Files

Use the appropriate downloader script from downloader_scripts/ based on the release year:

python downloader_scripts/jfk-2025-pdf-downloader.py
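
For orientation, a CSV-driven download loop that can be re-run without fetching files twice could look like the sketch below. It is not the code of the bundled downloader scripts. The url column name, the destination layout, and the injectable fetch parameter are all assumptions made for illustration.

```python
import csv
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def planned_path(url, dest_root):
    """Local path that mirrors the URL path under dest_root."""
    return Path(dest_root) / urlparse(url).path.lstrip("/")


def download_listed(csv_path, dest_root, url_column="url",
                    fetch=urllib.request.urlretrieve):
    """Download every listed URL that is not on disk yet; return the new paths.

    `url_column` is an assumed header name. `fetch(url, target)` defaults to
    urllib's urlretrieve and can be replaced, e.g. for testing.
    """
    fetched = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            target = planned_path(row[url_column], dest_root)
            if target.exists():
                continue  # already downloaded on a previous run
            target.parent.mkdir(parents=True, exist_ok=True)
            fetch(row[url_column], target)
            fetched.append(target)
    return fetched
```

An interrupted transfer can leave a partial file that a later run would skip, so a more careful version would download to a temporary name and rename on success.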

Extracting Text

Choose the appropriate extraction method based on your operating system:

macOS

python extraction_scripts/macOS/apple_vision_ocr/apple_vision_pdf_to_text.py

Linux

python extraction_scripts/linux/linux_pdf_to_text.py

Finding Missing Files

To check for any missing conversions:

python extraction_scripts/find_missing.py
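
The idea behind such a check can be sketched as follows: walk the PDF tree and report every PDF that has no text file at the mirrored location. This is an illustration under stated assumptions, not the contents of find_missing.py. The function name, the two root directories, and the .txt suffix are hypothetical.

```python
from pathlib import Path


def find_missing(pdf_root, text_root, text_suffix=".txt"):
    """Return PDFs under pdf_root with no matching text file under text_root.

    Assumes the text tree mirrors the PDF tree and that only the file
    suffix differs (.pdf versus text_suffix).
    """
    pdf_root, text_root = Path(pdf_root), Path(text_root)
    missing = []
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        # Same relative path in the text tree, with the suffix swapped.
        expected = (text_root / pdf.relative_to(pdf_root)).with_suffix(text_suffix)
        if not expected.is_file():
            missing.append(pdf)
    return missing


if __name__ == "__main__":
    # Hypothetical directory names; adjust to the local layout.
    for path in find_missing("pdfs/releases", "extracted_text/releases"):
        print(path)
```

The glob pattern matches a lowercase .pdf suffix, so on a case-sensitive file system files ending in .PDF would need a second pass.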

Documentation

Known Issues

Data Inconsistencies

Release Format Variations
- The release page formats are inconsistent
- No .xlsx file is available for the 2025 release
- Previous releases have .xlsx files with inconsistent formats
Duplicate Files
- 2017-2018 release contains duplicate file names on the website and in the .xlsx file
- 2017-2018 .xlsx file contains 54,636 line items including some duplicate filenames
Missing Files
- 2017-2018 xlsx has 54,604 line items (6 bad links)
- 2017-2018 website lists 54,601 line items (3 bad links, two additional files not referenced in the xlsx)
OCR Errors
- The extracted text contains a substantial number of OCR errors due to the low quality of many of the input files.

Archive Statistics

Total archive size: 87 GB
Total files: 73,485
Extracted text available at: jfk-files-text
Available as a data set on Hugging Face: https://huggingface.co/datasets/mysocratesnote/jfk-files-text/

WebUI

A simple WebUI to query the archive using the DeepSeek R1 Distill Llama 70B LLM is available at https://jfkfiles.app.

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

The National Archives for providing the JFK assassination records
Contributors to the various open-source tools used in this project
