From ed1038580f9acfbbe437b19d070740aa2cb5f95b Mon Sep 17 00:00:00 2001
From: Damian Silbergleith <14797221+ds17f@users.noreply.github.com>
Date: Thu, 4 Jun 2026 11:24:27 -0700
Subject: [PATCH 1/2] feat: add make safe lyric-strip pass and deploy the annotation-only site
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

David Dodd licensed the annotations/essays but not the underlying song lyrics.
Split the build into two:

- `make dist` — the full site (annotations + lyrics), for local self-hosters
  with a lawful source. Unchanged.
- `make safe` — runs the build, then a new standalone post-pass
  (scripts/safe_build.py) that strips each song's verbatim lyric block and
  links out to dead.net/songs in its place.

The strip pass removes only the lyric <blockquote> between a song's credit
line and the first annotation anchor; essays, and the public-domain poems /
dictionary entries / reader emails quoted within them, are preserved. It is
byte-preserving (latin-1 round-trip, matching build_site.py), idempotent (via
a marker), and clamps every removal at the annotation seam so malformed 1990s
markup (e.g. eleven.html's unclosed blockquote) never loses annotation content.

CI and the Pages deploy now run `make safe`, so the public site at
annotated.thedeadly.app keeps the annotations but omits the lyrics.

Verified: 100 song pages stripped, 0 new broken links, all annotation content
preserved. The mirror/ history rewrite is intentionally deferred (see
PLAN.md); mirror/ remains the committed build source until then.

Co-Authored-By: Claude Opus 4.8
---
 .github/workflows/ci.yml           |   9 +-
 .github/workflows/deploy-pages.yml |  10 ++-
 Makefile                           |   6 +-
 PLAN.md                            |  96 ++++++++++++++++++++
 README.md                          |  36 +++++++-
 scripts/safe_build.py              | 135 +++++++++++++++++++++++++++++
 6 files changed, 281 insertions(+), 11 deletions(-)
 create mode 100644 PLAN.md
 create mode 100644 scripts/safe_build.py

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 776ee6e..882eaf0 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,7 +1,8 @@
 name: CI
 
-# Builds the site from the committed mirror/ and audits link health.
-# This is the required status check for pull requests into main.
+# Builds the safe public site from the committed mirror/ (full build + lyric
+# strip) and audits its link health. This is the required status check for
+# pull requests into main, and it gates exactly what deploy-pages.yml ships.
 on:
   pull_request:
     branches: [main]
@@ -25,8 +26,8 @@ jobs:
       - name: Install dependencies
         run: make install
 
-      - name: Build dist/
-        run: make dist
+      - name: Build the safe public site (build + strip lyrics)
+        run: make safe
 
       - name: Audit link health
         run: make audit
diff --git a/.github/workflows/deploy-pages.yml b/.github/workflows/deploy-pages.yml
index dcebe37..d69e818 100644
--- a/.github/workflows/deploy-pages.yml
+++ b/.github/workflows/deploy-pages.yml
@@ -1,8 +1,10 @@
 name: Deploy to GitHub Pages
 
 # On every push to main (i.e. after a PR merges), build the site from the
-# committed mirror/ and deploy dist/ to GitHub Pages. No archive.org crawl is
-# needed — mirror/ is the committed source of truth.
+# committed mirror/, strip the copyrighted lyrics, and deploy dist/ to GitHub
+# Pages. No archive.org crawl is needed — mirror/ is the committed source of
+# truth. The public deploy runs `make safe` (build + lyric strip), so the
+# hosted site keeps David Dodd's annotations but omits the song lyrics.
 on:
   push:
     branches: [main]
@@ -32,8 +34,8 @@ jobs:
       - name: Install dependencies
         run: make install
 
-      - name: Build dist/
-        run: make dist
+      - name: Build the safe public site (build + strip lyrics)
+        run: make safe
 
       - name: Set custom domain (persists across deploys)
         run: echo 'annotated.thedeadly.app' > dist/CNAME
diff --git a/Makefile b/Makefile
index aeb3888..0a030a5 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,4 @@
-.PHONY: install mirror mirror-retry dist audit serve-dist all release release-dryrun clean help
+.PHONY: install mirror mirror-retry dist safe audit serve-dist all release release-dryrun clean help
 
 help: ## Show this help message
 	@echo 'Usage: make [target]'
@@ -32,6 +32,10 @@ dist: ## Build the browsable, link-fixed static site into dist/ from mirror/
 	@test -d mirror || { echo "mirror/ not found — run 'make mirror' first."; exit 1; }
 	uv run python scripts/build_site.py
 
+safe: ## Build dist/, then strip copyrighted lyrics for safe public hosting
+	@$(MAKE) dist
+	uv run python scripts/safe_build.py
+
 audit: ## Audit link health of the built dist/ site
 	@test -d dist || { echo "dist/ not found — run 'make dist' first."; exit 1; }
 	uv run python scripts/audit_links.py dist
diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 0000000..96dbd69
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,96 @@
+# Project Legal-Cleanup & Safe-Hosting Plan
+
+## Context
+- The repository currently ships a full copy of *The Annotated Grateful Dead Lyrics* in `mirror/` (308 resources), which `build_site.py` turns into the full browsable site in `dist/`.
+- David Dodd gave permission for the **annotations/essays** but **not** the underlying song lyrics (the lyrics are separately copyrighted, mostly "Ice Nine Publishing; used by permission").
+- We want to keep the tooling (crawler, builder, auditor) so anyone can self-host the **full** site **if they obtain a lawful source** (e.g., the Wayback snapshot).
+- For **our** public deployment we publish a **safe** version: the annotations are kept, the copyrighted lyric blocks are removed, and each song links out to the official lyrics source.
+
+## Guiding principle (the shape of the design)
+**Local = full. Publish = safe.**
+- `make dist` always builds the complete site into `dist/`, unchanged. This is the local source of truth a self-hoster gets.
+- `make safe` runs a **standalone post-pass** (`scripts/safe_build.py`) that strips the copyrighted lyric blocks **from `dist/` in place** and inserts a link to the official source. CI runs `make safe` before deploying.
+- The strip pass is pure HTML→HTML; it has no coupling to the mirror crawl, so it can be reasoned about and tested on its own.
+
+## Goals
+| Goal | Desired outcome |
+|------|----------------|
+| **A. Standalone safe pass** | `scripts/safe_build.py` + a `make safe` target that strips lyric blocks from the built `dist/` (in place) and links each song to <https://www.dead.net/songs>. Keeps **all** annotation markup. |
+| **B. CI publishes the safe site** | The existing GitHub-Actions workflow runs `make safe` (build → strip) before deploying to Pages. |
+| **C. Placeholder when mirror is absent** | `build_site.py` emits a minimal placeholder `dist/index.html` when `mirror/` is empty/missing, so a fresh clone (post-purge) still builds something coherent. |
+| **D. Documentation** | A "Status & Self-hosting" section in `README.md` covering full vs. safe builds and how to obtain a lawful mirror. |
+| **E. Ignore the raw mirror** | `.gitignore` ignores `mirror/`. |
+| **F. (LAST) Purge mirror from history** | After everything above works and is verified, back up `mirror/` outside the repo, then rewrite history to remove it, force-push, and notify collaborators. **This is the final, irreversible step — done deliberately at the end, not up front.** |
+
+## How the lyric stripping actually works (corrected)
+> The previous draft assumed lyrics lived in `
`. **That element does not exist anywhere** (0 pages). The real markup is 1990s-era and lyrics sit in a `<blockquote>` — but so do many *non-lyric* annotation quotes (public-domain poems like Lovelace, OED entries, reader emails, nursery rhymes). A blanket blockquote strip would destroy annotation content we are allowed to publish.
+
+The dependable structural seam, verified across all 124 song pages:
+
+```
+[byline: David Dodd, UCSC]
+"Song Title"
+Words by … ; music by …
+Copyright Ice Nine Publishing; used by permission. ← song-credit line
+<blockquote>…the copyrighted lyrics…</blockquote> ← THE block to remove
+
+ ← annotations begin here; everything from here on STAYS
+ …essays, which may themselves contain <blockquote>s we keep…
+```
+
+Rule: **remove the lyric `<blockquote>`(s) that sit between the song-credit line and the first `` annotation anchor**, and replace with a short notice linking to <https://www.dead.net/songs>. Never touch blockquotes that appear after the first annotation anchor.
+
+Edge cases the pass must handle:
+- **`darkstar.html`** — has lyrics but its first annotation anchor is ``, not `title`. So key on the *first* `` anchor, not the literal string "title".
+- **`appl.html` and similar** — only annotate a *title phrase*; there is **no lyric blockquote**. These must be left untouched (the pass finds no block and skips).
+- **`Copyright Steve Silberman. Used by permission.`** — an essay credit, not a song lyric. The "blockquote before the first annotation anchor" rule (not a bare "used by permission" text match) avoids mis-stripping here; add a guard/whitelist if needed.
+- The pass should be a no-op on pages with no lyric block, and should **report counts** (pages processed, lyric blocks removed) the way `build_site.py` reports its passes.
+
+## Reuse of existing code
+- `scripts/mirror.py` — unchanged; still crawls the Wayback snapshot for self-hosters.
+- `scripts/build_site.py` — unchanged for the full build; gains only the empty-mirror **placeholder** behavior (Goal C).
+- `scripts/safe_build.py` — **new** standalone post-pass (Goal A). Reads `dist/`, strips lyric blocks in place, injects the dead.net link.
+- `scripts/audit_links.py` — unchanged; audits whatever is in `dist/` (full or safe).
+- `Makefile` — gains a `safe` target: `make dist` then `python scripts/safe_build.py`.
+- `.gitignore` — add `mirror/`.
+
+## Files to Modify / Add
+| Path | Reason |
+|------|--------|
+| `scripts/safe_build.py` | New standalone strip-and-relink pass over `dist/`. |
+| `Makefile` | New `safe` target (build → strip). |
+| `scripts/build_site.py` | Emit placeholder `index.html` when `mirror/` is empty. |
+| `.github/workflows/…` | Deploy step runs `make safe` instead of `make dist`. |
+| `README.md` | "Status & Self-hosting" section (full vs. safe). |
+| `.gitignore` | Ignore `mirror/`. |
+| `PLAN.md` (this file) | Living plan. |
+
+## Implementation Steps (ordered; destructive step is LAST)
+1. **Write `scripts/safe_build.py`** implementing the boundary rule above; make it idempotent and have it print a count report. Verify on `althea.html`, `darkstar.html` (anchor != "title"), and `appl.html` (no lyric block → untouched).
+2. **Add the `make safe` target** (`make dist` then `uv run python scripts/safe_build.py`).
+3. **Add the placeholder** to `build_site.py` for the empty-`mirror/` case (Goal C).
+4. **Point CI at `make safe`** so Pages serves the safe site; confirm `make audit` passes against the stripped `dist/`.
+5. **Docs + `.gitignore`**: add the README section; add `mirror/` to `.gitignore`.
+6. **Verify the whole pipeline locally** end to end (full build, safe strip, audit).
+7. **(LAST) Purge `mirror/` from history** — only after 1–6 are merged and the safe site is confirmed:
+   ```bash
+   cp -r mirror ../annotated-lyrics-mirror-backup # backup OUTSIDE the repo first
+   git checkout main
+   git filter-repo --path mirror/ --invert-paths
+   git rev-list --objects --all | grep mirror/ # must return nothing
+   git push --force-with-lease
+   ```
+   Then notify collaborators to re-clone (every commit hash changes).
+
+## Verification Checklist
+- [ ] `safe_build.py` strips the lyric blockquote on `althea.html` and `darkstar.html`, leaves `appl.html` untouched, and preserves all annotation-internal blockquotes.
+- [ ] `safe_build.py` prints a count report and is idempotent (safe to re-run on an already-stripped `dist/`).
+- [ ] `make safe` produces a `dist/` whose song pages link to <https://www.dead.net/songs> in place of lyrics.
+- [ ] `make audit` passes against the stripped `dist/`.
+- [ ] `build_site.py` emits a placeholder `index.html` when `mirror/` is empty.
+- [ ] CI deploys the safe site to Pages without errors.
+- [ ] `README.md` documents full vs. safe builds; `.gitignore` ignores `mirror/`.
+- [ ] **(LAST)** Backup exists at `../annotated-lyrics-mirror-backup`; history contains no `mirror/` objects; force-push done; collaborators notified.
+
+---
+*Living planning artifact. Step 7 (history rewrite) is intentionally deferred to the very end.*
diff --git a/README.md b/README.md
index ae5ea0a..4bef5d1 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,8 @@ Lyrics*](https://www.simonandschuster.com/books/The-Complete-Annotated-Grateful-
 ```bash
 make install # install deps (uv)
 make mirror # download the raw archive into mirror/ (~30-45 min, one time)
-make dist # build the browsable, link-fixed site into dist/
+make dist # build the full browsable, link-fixed site into dist/
+make safe # build dist/, then strip lyrics for safe public hosting
 make serve-dist # serve dist/ at http://localhost:8000
 make audit # report link health of dist/
 ```
@@ -57,6 +58,35 @@ without ever re-hitting archive.org.
 
 ---
 
+## Status & self-hosting: full vs. safe builds
+
+David Dodd generously gave permission to host his **annotations and essays**.
+He did **not** (and could not) license the underlying **song lyrics**, which are
+separately copyrighted. So this project distinguishes two builds:
+
+| Build | Command | Contains | Who it's for |
+|-------|---------|----------|--------------|
+| **Full** | `make dist` | annotations **and** lyrics | local self-hosters with a lawful source |
+| **Safe** | `make safe` | annotations only; lyrics replaced with a link to [dead.net/songs](https://www.dead.net/songs) | the public site we deploy |
+
+**`make safe`** runs the normal build and then a standalone pass
+(`scripts/safe_build.py`) that strips each song's verbatim lyric block — the
+`<blockquote>` between the song's credit line and the first annotation anchor —
+and drops a link to the official lyrics source in its place. **Everything else
+is preserved**: the essays, and the public-domain poems, dictionary entries, and
+reader correspondence quoted *within* the annotations. Short lyric fragments
+quoted inline for commentary in the essays are left intact (permitted annotation
+/ fair use); only the full per-song lyric reproductions are removed. The pass is
+idempotent and byte-preserving, and it never deletes a byte of the annotation
+section even on pages with malformed 1990s markup.
+
+The public site at **https://annotated.thedeadly.app/** is the **safe** build —
+CI runs `make safe` before deploying. If you have obtained a lawful copy of the
+original HTML (e.g. via the Internet Archive, see `make mirror`), `make dist`
+gives you the complete site locally.
+
+---
+
 ## The pipeline, pass by pass (with real results)
 
 ### 1. `make mirror` — the raw crawl (`scripts/mirror.py`)
@@ -184,6 +214,7 @@ dist/ # built, link-fixed site (gitignored; regenerate with `make
 scripts/
   mirror.py # the raw crawler (make mirror / mirror-retry)
   build_site.py # the cleanup build (make dist)
+  safe_build.py # the lyric-strip pass for public hosting (make safe)
   audit_links.py # the link auditor (make audit)
   release.sh # tag a semver release (make release)
 .github/workflows/ # CI, Pages deploy, release automation
@@ -200,7 +231,8 @@ Run `make help` for the full target list.
 The site is hosted on **GitHub Pages** and deploys automatically:
 
 - **Every merge to `main`** runs CI (build + link audit) and, on success,
-  publishes the site to Pages (`.github/workflows/deploy-pages.yml`). The build
+  publishes the site to Pages (`.github/workflows/deploy-pages.yml`). The deploy
+  runs `make safe`, so the **annotation-only** site is what goes live; the build
   uses the committed `mirror/`, so no archive.org crawl happens in CI.
 - **Releases are semver-tagged.** `make release` reads
   [Conventional Commits](https://www.conventionalcommits.org/) since the last
diff --git a/scripts/safe_build.py b/scripts/safe_build.py
new file mode 100644
index 0000000..05689e7
--- /dev/null
+++ b/scripts/safe_build.py
@@ -0,0 +1,135 @@
+#!/usr/bin/env python3
+"""
+Safe-publish pass: strip copyrighted song-lyric blocks from the built dist/.
+
+Runs AFTER build_site.py, transforming dist/ IN PLACE so CI can deploy a public
+site that keeps David Dodd's annotations but omits the lyrics he was not able to
+license. Local self-hosters who never run this keep the full site.
+
+Pure HTML->HTML and byte-preserving: edits go through the same latin-1 round-trip
+build_site.py uses, so untouched bytes of the 1990s markup stay exactly as they
+were. The pass is idempotent -- a stripped page carries NOTICE_MARK and is skipped
+on re-run.
+
+What it removes (and, deliberately, what it does NOT):
+
+A song page is laid out as
+
+    [byline]  <- kept
+    "Song"
+    Words by ... ; music by ...  <- the song-credit line
+    Copyright ...; used by permission.
+    <blockquote> ...the lyrics... </blockquote>  <- REMOVED
+
+    ...annotations...  <- kept (incl. the public-domain
+                          poems, OED entries and reader
+                          emails they quote)
+
+So we excise only the <blockquote>(s) between the song-credit line and the first
+ annotation anchor, and drop a link to the official lyrics source in
+their place. The credit line is what tells a real song page apart from a thematic
+essay (e.g. goose.html / nonsense.html have no such line) -- those are skipped
+untouched. Pages with no annotation anchor at all (bios, bibliographies) are
+skipped too. Where a song shows more than one lyric blockquote (alternate verses,
+e.g. clem.html), each is removed but any editorial note Dodd wrote between them is
+preserved.
+"""
+
+import re
+from collections import Counter
+from pathlib import Path
+
+DIST = Path(__file__).parent.parent / "dist"
+SONGS_URL = "https://www.dead.net/songs"
+
+# First annotation anchor: marks where the essay begins. Lyrics
+# live before it; everything from it onward is annotation we keep.
+SEAM_RE = re.compile(r".*?
", re.I | re.S)
+
+# Marker left in stripped pages so re-running the pass is a no-op.
+NOTICE_MARK = ""
+NOTICE = (
+    NOTICE_MARK + "\n
\n"
+    "

Lyrics omitted. The annotations below are reproduced by "
+    "permission of David Dodd; the song lyrics themselves are copyrighted and "
+    "are not reproduced here. Read them at the official source: "
+    f'dead.net/songs.

\n'
+    "
"
+)
+
+
+def strip_page(text):
+    """Return (new_text, n_blocks_removed) if this is a song page with lyrics to
+    strip, else None to leave the page untouched."""
+    seam_m = SEAM_RE.search(text)
+    if not seam_m:
+        return None  # no annotation anchor -> not a song page
+    seam = seam_m.start()
+
+    credit_m = CREDIT_RE.search(text[:seam])
+    if not credit_m:
+        return None  # no song-credit line -> essay/bio, skip
+    lo = credit_m.start()
+
+    # Lyric blockquotes are those starting between the credit line and the seam.
+    targets = [m for m in BLOCKQUOTE_RE.finditer(text) if lo <= m.start() < seam]
+    if not targets:
+        return None  # song title page with no reproduced lyrics
+
+    # Replace right-to-left so earlier match offsets stay valid. The first lyric
+    # block (in document order) becomes the notice; any others are dropped, while
+    # the prose between them -- Dodd's editorial notes -- is left in place.
+    #
+    # Some 1990s pages (e.g. eleven.html) never close the lyric <blockquote>
+    # before the annotations, so the match runs past the seam and would engulf
+    # the anchor. Clamp every removal at the seam: never delete a
+    # byte of the annotation section, even at the cost of leaving a stray,
+    # browser-ignored </blockquote> behind.
+    first = targets[0]
+    out = text
+    for m in reversed(targets):
+        end = min(m.end(), seam)
+        repl = NOTICE if m is first else ""
+        out = out[: m.start()] + repl + out[end:]
+    return out, len(targets)
+
+
+def main():
+    if not DIST.exists():
+        raise SystemExit("dist/ not found -- run 'make dist' first.")
+
+    report = Counter()
+    for p in sorted(DIST.rglob("*.html")):
+        text = p.read_bytes().decode("latin-1")
+        report["scanned"] += 1
+        if NOTICE_MARK in text:
+            report["already"] += 1
+            continue
+        result = strip_page(text)
+        if result is None:
+            continue
+        out, n = result
+        p.write_bytes(out.encode("latin-1"))
+        report["stripped"] += 1
+        report["blocks"] += n
+
+    print(f"safe_build: scanned {report['scanned']} html pages")
+    print(f" {report['stripped']:4d} song pages stripped of lyrics")
+    print(f" {report['blocks']:4d} lyric blocks removed -> link to {SONGS_URL}")
+    if report["already"]:
+        print(f" {report['already']:4d} already stripped (idempotent skip)")
+
+
+if __name__ == "__main__":
+    main()

From 2cb1c73c3fb6384593f7e21e1777be405caed64d Mon Sep 17 00:00:00 2001
From: Damian Silbergleith <14797221+ds17f@users.noreply.github.com>
Date: Thu, 4 Jun 2026 12:04:06 -0700
Subject: [PATCH 2/2] =?UTF-8?q?docs:=20clarify=20Quick=20start=20=E2=80=94?=
 =?UTF-8?q?=20clone=20is=20build-ready,=20make=20all=20is=20one-shot?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

mirror/ is committed, so a fresh clone needs no crawl; lead with `make all`
as the single command to build + audit + serve the full site, and demote
`make mirror` to a refresh-only step. Note how to preview the safe build and
that a running server reflects `make safe` on refresh.

Co-Authored-By: Claude Opus 4.8
---
 README.md | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 4bef5d1..21004db 100644
--- a/README.md
+++ b/README.md
@@ -20,18 +20,35 @@ Lyrics*](https://www.simonandschuster.com/books/The-Complete-Annotated-Grateful-
 
 ## Quick start
 
+The raw archive (`mirror/`) is **committed**, so a fresh clone is ready to build
+— no crawl needed. The only prerequisite is [uv](https://docs.astral.sh/uv/).
+
 ```bash
-make install # install deps (uv)
-make mirror # download the raw archive into mirror/ (~30-45 min, one time)
-make dist # build the full browsable, link-fixed site into dist/
+make all # build the full site, audit it, and serve at localhost:8000
+```
+
+That single command sees `mirror/` is already present (so it **skips** the
+~40-minute crawl), builds `dist/`, audits link health, and serves the full site
+at **http://localhost:8000**. `uv run` provisions dependencies on first use, so
+a separate `make install` isn't required.
+
+Individual targets:
+
+```bash
+make dist # build the full site (annotations + lyrics) into dist/
 make safe # build dist/, then strip lyrics for safe public hosting
 make serve-dist # serve dist/ at http://localhost:8000
 make audit # report link health of dist/
+make mirror # re-crawl the archive into mirror/ (~30-45 min; only to refresh)
 ```
 
-You don't strictly need a server — `dist/` is plain static HTML. After
-`make dist` you can just open `dist/index.html` (or `dist/gdhome.html`) with a
-`file://` URL in a browser and click through.
+To preview the **safe** (annotation-only) build locally, run `make safe` then
+`make serve-dist`. If a server is already running, `make safe` rewrites `dist/`
+in place and a browser refresh shows the stripped version — no restart needed.
+
+You don't strictly need a server — `dist/` is plain static HTML. After a build
+you can just open `dist/index.html` (or `dist/gdhome.html`) with a `file://` URL
+in a browser and click through.
 
 ---
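
The clamp-at-seam removal that `strip_page` describes in patch 1/2 can be sketched in isolation as below. This is an illustration under stated assumptions, not the patch's code: the three regexes and the notice string are hypothetical stand-ins (the patch's own `SEAM_RE`, `CREDIT_RE`, `BLOCKQUOTE_RE` and `NOTICE` may well differ), and the sample page is invented to mimic an unclosed lyric blockquote of the kind reported for eleven.html.

```python
import re

# Hypothetical stand-ins for illustration only; the patterns used by
# scripts/safe_build.py are assumed to differ in detail.
SEAM_RE = re.compile(r"<a\s+name\s*=", re.I)          # first annotation anchor
CREDIT_RE = re.compile(r"used by permission", re.I)   # song-credit line
BLOCKQUOTE_RE = re.compile(r"<blockquote\b[^>]*>.*?</blockquote>", re.I | re.S)
NOTICE = "<!-- lyrics-omitted --><p>Lyrics omitted; see https://www.dead.net/songs</p>"


def strip_lyrics(text):
    """Return (new_text, n_blocks) or None when the page should be left alone."""
    seam_m = SEAM_RE.search(text)
    if not seam_m:
        return None  # no annotation anchor: not a song page
    seam = seam_m.start()

    credit_m = CREDIT_RE.search(text[:seam])
    if not credit_m:
        return None  # no credit line before the seam: essay or bio
    lo = credit_m.start()

    # Only blockquotes that START between the credit line and the seam count.
    targets = [m for m in BLOCKQUOTE_RE.finditer(text) if lo <= m.start() < seam]
    if not targets:
        return None

    out = text
    # Right-to-left, so the offsets of earlier matches remain valid.
    for m in reversed(targets):
        end = min(m.end(), seam)  # clamp: never cut past the annotation seam
        out = out[: m.start()] + (NOTICE if m is targets[0] else "") + out[end:]
    return out, len(targets)


# Invented sample: the lyric blockquote is never closed, so the lazy match
# runs on to the closing tag of a blockquote inside the annotations.
page = (
    "Words by X; music by Y\n"
    "Copyright Ice Nine Publishing; used by permission.\n"
    "<blockquote>la la la\n<hr>\n"
    '<a name="title">Title</a> essay <blockquote>a poem we keep</blockquote>'
)
stripped, n = strip_lyrics(page)
```

Because each removal is clamped at the seam, the poem quoted inside the annotation survives even though the regex match extended to its closing tag; the cost is that the annotation's `</blockquote>` may end up closing a quote whose opening tag was removed.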