A catalog-only index of a personal digital preservation store of 278 reference works (~854 MiB, 20 collections) covering computer architecture, operating systems, exploitation, cryptography, malware analysis, and Unix/Linux history.
The underlying files are not included in this repository. What is published here is the descriptive and technical metadata that a digital archivist would produce for a held collection — exactly the artifacts an inheriting institution, an auditor, or a transfer partner would need to verify a copy without the repository owner having to redistribute potentially copyrighted material.
This is the catalog and fixity counterpart to:
- warc-portfolio: WARC acquisition + BagIt packaging pipeline
- archival-dives: archival research methodology and dossiers
catalog/
fixity/
manifest-sha256.tsv # one SHA-256 + size per file, 278 rows
formats/
siegfried.csv # raw Siegfried/DROID format identification output
format-profile.md # PUID and MIME summary
finding-aids/
summary.json # machine-readable collection index
<category>/<collection>.md # human-readable per-collection finding aids
scripts/
build_catalog.py # regenerates everything from the local store
- 20 collections organized into 7 categories: architecture, culture, exploitation, foundations, intelligence, malware, unix-linux
- 278 files, 854.4 MiB
- 180 PDFs, 28 JPEGs, 18 plain-text, 12 XML, 6 markdown, plus bzip2/gzip/zip/rar/7z/sqlite/epub/json/png
- PDF version spread: 1.2 → 1.7, plus PDF/A and PDF/X variants — useful surface for testing normalization workflows
The PUID breakdown is the interesting bit for preservation planning: it tells you immediately which formats have stable long-term support (PDF/A, PNG, plain text, markdown) and which would need migration policies (older PDF versions, proprietary archives).
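As a rough illustration of how such a breakdown can be derived, the sketch below tallies PUIDs and MIME types from Siegfried-style CSV text. It is not the code in build_catalog.py. The column names `id` and `mime` are assumptions about the CSV header, and the sample rows are invented for the example rather than taken from siegfried.csv.

```python
import csv
import io
from collections import Counter


def summarize_formats(csv_text: str) -> tuple[Counter, Counter]:
    """Count PUIDs and MIME types in Siegfried-style CSV text."""
    puids: Counter = Counter()
    mimes: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed column names: "id" holds the PUID, "mime" the MIME type.
        # Empty or missing values are grouped under a placeholder key.
        puids[row.get("id") or "UNKNOWN"] += 1
        mimes[row.get("mime") or "unknown"] += 1
    return puids, mimes


# Invented sample rows for illustration only.
SAMPLE = """filename,filesize,id,format,mime
a/one.pdf,1024,fmt/18,Acrobat PDF 1.4,application/pdf
a/two.pdf,2048,fmt/18,Acrobat PDF 1.4,application/pdf
b/notes.txt,12,x-fmt/111,Plain Text File,text/plain
"""

puids, mimes = summarize_formats(SAMPLE)
for puid, count in puids.most_common():
    print(f"{puid}\t{count}")
```

Sorting by count puts the dominant formats first, which is usually where migration planning starts.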
A preservation portfolio has to demonstrate two things:
1. You can describe a collection: provenance, scope, file types, extent, fixity.
2. You respect rights: most of the source material here is copyrighted to its original authors and publishers.
Publishing the catalog satisfies (1) without violating (2). It also mirrors how real preservation institutions handle dark or restricted collections: the finding aid is public, the bits are not.
python scripts/build_catalog.py \
--store <path-to-local-store> \
--out <path-to-this-repo>
Requirements:
- Python 3.10+
- Siegfried 1.x with the DROID signature file installed (path to sf executable is --sf <path>; defaults to a Windows-local install)
The script:
- SHA-256s every file in the store and writes catalog/fixity/manifest-sha256.tsv
- Runs Siegfried with -csv over the entire tree → catalog/formats/siegfried.csv
- Summarizes PUIDs and MIME types → catalog/formats/format-profile.md
- Writes one finding aid per collection with file inventory and truncated SHA-256s
Re-running the script after store changes regenerates a fresh catalog, so the manifest stays the single source of truth for fixity.
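The fixity step can be sketched roughly as follows. This is a minimal illustration, not the actual build_catalog.py: the function names are hypothetical, and the row layout (digest, size in bytes, relative path, tab-separated, no header) is an assumption about manifest-sha256.tsv.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_rows(store: Path) -> list[tuple[str, int, str]]:
    """Return (sha256, size, relative POSIX path) for every file under store."""
    rows = []
    for path in sorted(p for p in store.rglob("*") if p.is_file()):
        rel = path.relative_to(store).as_posix()
        rows.append((sha256_file(path), path.stat().st_size, rel))
    return rows


def write_manifest(store: Path, out_file: Path) -> int:
    """Write one tab-separated row per file; assumed layout, no header row."""
    rows = manifest_rows(store)
    with out_file.open("w", encoding="utf-8", newline="\n") as fh:
        for digest, size, rel in rows:
            fh.write(f"{digest}\t{size}\t{rel}\n")
    return len(rows)
```

Sorting the paths before writing keeps the output deterministic, so an unchanged store yields a byte-identical manifest on every run.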
- Fixity: SHA-256 (BagIt manifest-compatible)
- Format identification: PRONOM PUIDs via Siegfried 1.11.4 (DROID v122 signature file)
- Timestamps: ISO 8601 UTC
- Paths: POSIX-style, relative to the store root, in every artifact
- OAIS mapping: the catalog corresponds to OAIS Descriptive Information + Fixity Information sub-objects of the AIP; the held bits constitute the Content Information
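Given these conventions, someone holding their own copy of the works could check it against the published manifest along these lines. This is a sketch under assumptions, not a tool shipped in this repository: `verify_copy` is a hypothetical name, and the manifest is assumed to hold tab-separated rows of digest, size in bytes, and store-relative POSIX path, with no header.

```python
import hashlib
from pathlib import Path


def verify_copy(manifest: Path, copy_root: Path) -> dict[str, list[str]]:
    """Compare a local copy against manifest rows and bucket each path."""
    report: dict[str, list[str]] = {
        "ok": [], "missing": [], "size_mismatch": [], "hash_mismatch": [],
    }
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed row layout: sha256 <TAB> size <TAB> relative/posix/path
        digest, size, rel = line.split("\t", 2)
        target = copy_root / rel
        if not target.is_file():
            report["missing"].append(rel)
        elif target.stat().st_size != int(size):
            # A size check is cheap and avoids hashing obviously wrong files.
            report["size_mismatch"].append(rel)
        else:
            h = hashlib.sha256()
            with target.open("rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            key = "ok" if h.hexdigest() == digest.lower() else "hash_mismatch"
            report[key].append(rel)
    return report
```

A copy passes when every path lands in the "ok" bucket. Files present in the copy but absent from the manifest are not reported by this sketch.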
The catalog itself (manifests, finding aids, profile, scripts) is CC0. The underlying preserved works retain their original copyright; this repository makes no claim on them and does not distribute them.