Merged
9 changes: 5 additions & 4 deletions .github/workflows/ci.yml
@@ -1,7 +1,8 @@
 name: CI
 
-# Builds the site from the committed mirror/ and audits link health.
-# This is the required status check for pull requests into main.
+# Builds the safe public site from the committed mirror/ (full build + lyric
+# strip) and audits its link health. This is the required status check for
+# pull requests into main, and it gates exactly what deploy-pages.yml ships.
 on:
   pull_request:
     branches: [main]
@@ -25,8 +26,8 @@ jobs:
       - name: Install dependencies
         run: make install
 
-      - name: Build dist/
-        run: make dist
+      - name: Build the safe public site (build + strip lyrics)
+        run: make safe
 
       - name: Audit link health
         run: make audit
10 changes: 6 additions & 4 deletions .github/workflows/deploy-pages.yml
@@ -1,8 +1,10 @@
 name: Deploy to GitHub Pages
 
 # On every push to main (i.e. after a PR merges), build the site from the
-# committed mirror/ and deploy dist/ to GitHub Pages. No archive.org crawl is
-# needed — mirror/ is the committed source of truth.
+# committed mirror/, strip the copyrighted lyrics, and deploy dist/ to GitHub
+# Pages. No archive.org crawl is needed — mirror/ is the committed source of
+# truth. The public deploy runs `make safe` (build + lyric strip), so the
+# hosted site keeps David Dodd's annotations but omits the song lyrics.
 on:
   push:
     branches: [main]
@@ -32,8 +34,8 @@ jobs:
       - name: Install dependencies
         run: make install
 
-      - name: Build dist/
-        run: make dist
+      - name: Build the safe public site (build + strip lyrics)
+        run: make safe
 
       - name: Set custom domain (persists across deploys)
         run: echo 'annotated.thedeadly.app' > dist/CNAME
6 changes: 5 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
-.PHONY: install mirror mirror-retry dist audit serve-dist all release release-dryrun clean help
+.PHONY: install mirror mirror-retry dist safe audit serve-dist all release release-dryrun clean help
 
 help: ## Show this help message
 	@echo 'Usage: make [target]'
@@ -32,6 +32,10 @@ dist: ## Build the browsable, link-fixed static site into dist/ from mirror/
 	@test -d mirror || { echo "mirror/ not found — run 'make mirror' first."; exit 1; }
 	uv run python scripts/build_site.py
 
+safe: ## Build dist/, then strip copyrighted lyrics for safe public hosting
+	@$(MAKE) dist
+	uv run python scripts/safe_build.py
+
 audit: ## Audit link health of the built dist/ site
 	@test -d dist || { echo "dist/ not found — run 'make dist' first."; exit 1; }
 	uv run python scripts/audit_links.py dist
96 changes: 96 additions & 0 deletions PLAN.md
@@ -0,0 +1,96 @@
# Project Legal-Cleanup & Safe-Hosting Plan

## Context
- The repository currently ships a full copy of *The Annotated Grateful Dead Lyrics* in `mirror/` (308 resources), which `build_site.py` turns into the full browsable site in `dist/`.
- David Dodd gave permission for the **annotations/essays** but **not** the underlying song lyrics (the lyrics are separately copyrighted, mostly "Ice Nine Publishing; used by permission").
- We want to keep the tooling (crawler, builder, auditor) so anyone can self-host the **full** site **if they obtain a lawful source** (e.g., the Wayback snapshot).
- For **our** public deployment we publish a **safe** version: the annotations are kept, the copyrighted lyric blocks are removed, and each song links out to the official lyrics source.

## Guiding principle (the shape of the design)
**Local = full. Publish = safe.**
- `make dist` always builds the complete site into `dist/`, unchanged. This is the local source of truth a self-hoster gets.
- `make safe` runs a **standalone post-pass** (`scripts/safe_build.py`) that strips the copyrighted lyric blocks **from `dist/` in place** and inserts a link to the official source. CI runs `make safe` before deploying.
- The strip pass is pure HTML→HTML; it has no coupling to the mirror crawl, so it can be reasoned about and tested on its own.

## Goals
| Goal | Desired outcome |
|------|----------------|
| **A. Standalone safe pass** | `scripts/safe_build.py` + a `make safe` target that strips lyric blocks from the built `dist/` (in place) and links each song to <https://www.dead.net/songs>. Keeps **all** annotation markup. |
| **B. CI publishes the safe site** | The existing GitHub-Actions workflow runs `make safe` (build → strip) before deploying to Pages. |
| **C. Placeholder when mirror is absent** | `build_site.py` emits a minimal placeholder `dist/index.html` when `mirror/` is empty/missing, so a fresh clone (post-purge) still builds something coherent. |
| **D. Documentation** | A "Status & Self-hosting" section in `README.md` covering full vs. safe builds and how to obtain a lawful mirror. |
| **E. Ignore the raw mirror** | `.gitignore` ignores `mirror/`. |
| **F. (LAST) Purge mirror from history** | After everything above works and is verified, back up `mirror/` outside the repo, then rewrite history to remove it, force-push, and notify collaborators. **This is the final, irreversible step — done deliberately at the end, not up front.** |
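
Goal C can be pictured with a short sketch. This is only an illustration under stated assumptions, not the code in `build_site.py`: the function name `write_placeholder_if_empty`, the `PLACEHOLDER` text, and the "empty means no directory entries" test are all hypothetical.

```python
from pathlib import Path

# Hypothetical placeholder page; the real wording is up to build_site.py.
PLACEHOLDER = """<!doctype html>
<html><head><meta charset="utf-8"><title>Mirror not present</title></head>
<body><p>No <code>mirror/</code> content was found, so there is nothing to build yet.
Run <code>make mirror</code> to fetch a copy, then <code>make dist</code>.</p></body></html>
"""


def write_placeholder_if_empty(mirror: Path, dist: Path) -> bool:
    """Write dist/index.html when mirror/ is missing or has no entries.

    Returns True if the placeholder was written, False if mirror/ has content
    (in which case the normal build is expected to run instead).
    """
    has_content = mirror.is_dir() and any(mirror.iterdir())
    if has_content:
        return False
    dist.mkdir(parents=True, exist_ok=True)
    (dist / "index.html").write_text(PLACEHOLDER, encoding="utf-8")
    return True
```

One thing to watch: the Makefile's `dist` target currently exits early when `mirror/` is absent (`test -d mirror || ... exit 1`), so a placeholder written by the build script would only be reachable through `make dist` if that guard is relaxed or the script is invoked directly.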

## How the lyric stripping actually works (corrected)
> The previous draft assumed lyrics lived in `<pre class="lyrics">…</pre>`. **That element does not exist anywhere** (0 pages). The real markup is 1990s-era and lyrics sit in a `<blockquote>` — but so do many *non-lyric* annotation quotes (public-domain poems like Lovelace, OED entries, reader emails, nursery rhymes). A blanket blockquote strip would destroy annotation content we are allowed to publish.

The dependable structural seam, verified across all 124 song pages:

```
[byline: David Dodd, UCSC]
<a href="#title"><b>"Song Title"</b></a>
Words by … ; music by …
Copyright Ice Nine Publishing; used by permission. ← song-credit line
<blockquote> …the copyrighted lyrics… </blockquote> ← THE block to remove
<hr>
<a name="title"> ← annotations begin here; everything from here on STAYS
…essays, which may themselves contain <blockquote>s we keep…
```

Rule: **remove the lyric `<blockquote>`(s) that sit between the song-credit line and the first `<a name="…">` annotation anchor**, and replace with a short notice linking to <https://www.dead.net/songs>. Never touch blockquotes that appear after the first annotation anchor.
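
The rule above could be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/safe_build.py`: the names `strip_lyrics`, `NOTICE`, `CREDIT`, `FIRST_ANCHOR`, and `BLOCKQUOTE` are hypothetical, the notice wording is invented, and the sketch assumes the credit line contains the text "Copyright Ice Nine Publishing" and that lyric blockquotes are not nested.

```python
import re

# Hypothetical replacement notice; only the URL comes from the plan.
NOTICE = (
    '<p><i>Lyrics omitted from this hosted copy. See the official source: '
    '<a href="https://www.dead.net/songs">dead.net/songs</a>.</i></p>'
)

# First annotation anchor of ANY name (darkstar.html uses name="dark").
FIRST_ANCHOR = re.compile(r'<a\s+name\s*=', re.IGNORECASE)
# Assumed wording of the song-credit line that precedes the lyric block.
CREDIT = re.compile(r'Copyright\s+Ice\s+Nine\s+Publishing', re.IGNORECASE)
# Non-greedy match; assumes lyric blockquotes are not nested.
BLOCKQUOTE = re.compile(
    r'<blockquote\b[^>]*>.*?</blockquote\s*>', re.IGNORECASE | re.DOTALL
)


def strip_lyrics(html: str) -> tuple[str, int]:
    """Return (new_html, blocks_removed) for one page.

    Only blockquotes between the credit line and the first <a name=...>
    anchor are touched; everything from that anchor onward is copied verbatim.
    """
    anchor = FIRST_ANCHOR.search(html)
    credit = CREDIT.search(html)
    if not anchor or not credit or credit.end() > anchor.start():
        return html, 0  # no recognizable seam: leave the page untouched

    head = html[credit.end():anchor.start()]
    removed = 0

    def replace(_match: re.Match) -> str:
        nonlocal removed
        removed += 1
        # The first removed block becomes the notice; later ones vanish.
        return NOTICE if removed == 1 else ""

    new_head = BLOCKQUOTE.sub(replace, head)
    return html[:credit.end()] + new_head + html[anchor.start():], removed
```

Because the notice contains no `<blockquote>`, a second call on already-stripped output finds nothing to remove and returns the input unchanged, which is one way to get the idempotence the plan asks for.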

Edge cases the pass must handle:
- **`darkstar.html`** — has lyrics but its first annotation anchor is `<a name="dark">`, not `title`. So key on the *first* `<a name=…>` anchor, not the literal string "title".
- **`appl.html` and similar** — only annotate a *title phrase*; there is **no lyric blockquote**. These must be left untouched (the pass finds no block and skips).
- **`Copyright Steve Silberman. Used by permission.`** — an essay credit, not a song lyric. The "blockquote before the first annotation anchor" rule (not a bare "used by permission" text match) avoids mis-stripping here; add a guard/whitelist if needed.
- The pass should be a no-op on pages with no lyric block, and should **report counts** (pages processed, lyric blocks removed) the way `build_site.py` reports its passes.
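
The no-op and count-report requirements suggest a small driver loop around whatever per-page transform is used. The sketch below is an assumption-laden illustration: `run_pass` and its counter names are hypothetical, it only looks at `*.html` files, and it decodes as Latin-1 purely so that every byte round-trips unchanged (which in turn assumes the inserted notice is plain ASCII).

```python
from pathlib import Path
from typing import Callable


def run_pass(
    dist: Path, transform: Callable[[str], tuple[str, int]]
) -> dict[str, int]:
    """Apply `transform` to every HTML page under dist/ and tally the results.

    `transform` takes page text and returns (new_text, blocks_removed).
    Pages where nothing was removed are not rewritten at all.
    """
    counts = {"pages": 0, "changed": 0, "blocks_removed": 0}
    for path in sorted(dist.rglob("*.html")):
        # Latin-1 maps bytes 0-255 one-to-one, so untouched text round-trips.
        text = path.read_bytes().decode("latin-1")
        new_text, removed = transform(text)
        counts["pages"] += 1
        if removed:
            path.write_bytes(new_text.encode("latin-1"))
            counts["changed"] += 1
            counts["blocks_removed"] += removed
    return counts
```

Skipping the write when `removed` is zero keeps title-only pages such as `appl.html` byte-identical and makes a second run over an already-stripped `dist/` report zero changes.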

## Reuse of existing code
- `scripts/mirror.py` — unchanged; still crawls the Wayback snapshot for self-hosters.
- `scripts/build_site.py` — unchanged for the full build; gains only the empty-mirror **placeholder** behavior (Goal C).
- `scripts/safe_build.py` — **new** standalone post-pass (Goal A). Reads `dist/`, strips lyric blocks in place, injects the dead.net link.
- `scripts/audit_links.py` — unchanged; audits whatever is in `dist/` (full or safe).
- `Makefile` — gains a `safe` target: `make dist` then `uv run python scripts/safe_build.py`.
- `.gitignore` — add `mirror/`.

## Files to Modify / Add
| Path | Reason |
|------|--------|
| `scripts/safe_build.py` | New standalone strip-and-relink pass over `dist/`. |
| `Makefile` | New `safe` target (build → strip). |
| `scripts/build_site.py` | Emit placeholder `index.html` when `mirror/` is empty. |
| `.github/workflows/…` | The CI check and the Pages deploy both run `make safe` instead of `make dist`. |
| `README.md` | "Status & Self-hosting" section (full vs. safe). |
| `.gitignore` | Ignore `mirror/`. |
| `PLAN.md` (this file) | Living plan. |

## Implementation Steps (ordered; destructive step is LAST)
1. **Write `scripts/safe_build.py`** implementing the boundary rule above; make it idempotent and have it print a count report. Verify on `althea.html`, `darkstar.html` (anchor != "title"), and `appl.html` (no lyric block → untouched).
2. **Add the `make safe` target** (`make dist` then `uv run python scripts/safe_build.py`).
3. **Add the placeholder** to `build_site.py` for the empty-`mirror/` case (Goal C).
4. **Point CI at `make safe`** so Pages serves the safe site; confirm `make audit` passes against the stripped `dist/`.
5. **Docs + `.gitignore`**: add the README section; add `mirror/` to `.gitignore`.
6. **Verify the whole pipeline locally** end to end (full build, safe strip, audit).
7. **(LAST) Purge `mirror/` from history** — only after 1–6 are merged and the safe site is confirmed:
```bash
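cp -r mirror ../annotated-lyrics-mirror-backup   # backup OUTSIDE the repo first
git checkout main
# filter-repo expects a fresh clone (or --force) and removes the `origin` remote by default
git filter-repo --path mirror/ --invert-paths
git rev-list --objects --all | grep mirror/      # must return nothing
# re-add the `origin` remote before pushing, since filter-repo dropped it
git push --force-with-lease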
```
Then notify collaborators to re-clone (every commit hash changes).
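
The "must return nothing" check in step 7 can also be scripted. The following is a hedged sketch, not part of the repository: `leftover_paths` and `check_repo` are hypothetical helpers, and unlike the `grep mirror/` one-liner they match only paths at or under a top-level `mirror` directory.

```python
import subprocess


def leftover_paths(rev_list_output: str, prefix: str = "mirror/") -> list[str]:
    """Return object paths at or under `prefix` found in rev-list output.

    Each line of `git rev-list --objects --all` is "<sha>" for commits
    or "<sha> <path>" for trees and blobs.
    """
    root = prefix.rstrip("/")
    hits = []
    for line in rev_list_output.splitlines():
        _sha, _sep, path = line.partition(" ")
        if path == root or path.startswith(root + "/"):
            hits.append(path)
    return hits


def check_repo(repo: str = ".") -> list[str]:
    """Run git in `repo` and list any surviving mirror/ objects."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True,
        capture_output=True,
        text=True,
    )
    return leftover_paths(result.stdout)
```

An empty list from `check_repo()` corresponds to the `grep` producing no output; anything else means the rewrite has not yet removed every `mirror/` object reachable from a ref.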

## Verification Checklist
- [ ] `safe_build.py` strips the lyric blockquote on `althea.html` and `darkstar.html`, leaves `appl.html` untouched, and preserves all annotation-internal blockquotes.
- [ ] `safe_build.py` prints a count report and is idempotent (safe to re-run on an already-stripped `dist/`).
- [ ] `make safe` produces a `dist/` whose song pages link to <https://www.dead.net/songs> in place of lyrics.
- [ ] `make audit` passes against the stripped `dist/`.
- [ ] `build_site.py` emits a placeholder `index.html` when `mirror/` is empty.
- [ ] CI deploys the safe site to Pages without errors.
- [ ] `README.md` documents full vs. safe builds; `.gitignore` ignores `mirror/`.
- [ ] **(LAST)** Backup exists at `../annotated-lyrics-mirror-backup`; history contains no `mirror/` objects; force-push done; collaborators notified.

---
*Living planning artifact. Step 7 (history rewrite) is intentionally deferred to the very end.*
63 changes: 56 additions & 7 deletions README.md
@@ -20,17 +20,35 @@ Lyrics*](https://www.simonandschuster.com/books/The-Complete-Annotated-Grateful-

## Quick start

The raw archive (`mirror/`) is **committed**, so a fresh clone is ready to build
— no crawl needed. The only prerequisite is [uv](https://docs.astral.sh/uv/).

```bash
make all # build the full site, audit it, and serve at localhost:8000
```

That single command sees `mirror/` is already present (so it **skips** the
~40-minute crawl), builds `dist/`, audits link health, and serves the full site
at **http://localhost:8000**. `uv run` provisions dependencies on first use, so
a separate `make install` isn't required.

Individual targets:

```bash
make install # install deps (uv)
make mirror # download the raw archive into mirror/ (~30-45 min, one time)
make dist # build the browsable, link-fixed site into dist/
make dist # build the full site (annotations + lyrics) into dist/
make safe # build dist/, then strip lyrics for safe public hosting
make serve-dist # serve dist/ at http://localhost:8000
make audit # report link health of dist/
make mirror # re-crawl the archive into mirror/ (~30-45 min; only to refresh)
```

You don't strictly need a server — `dist/` is plain static HTML. After
`make dist` you can just open `dist/index.html` (or `dist/gdhome.html`) with a
`file://` URL in a browser and click through.
To preview the **safe** (annotation-only) build locally, run `make safe` then
`make serve-dist`. If a server is already running, `make safe` rewrites `dist/`
in place and a browser refresh shows the stripped version — no restart needed.

You don't strictly need a server — `dist/` is plain static HTML. After a build
you can just open `dist/index.html` (or `dist/gdhome.html`) with a `file://` URL
in a browser and click through.

---

@@ -57,6 +75,35 @@ without ever re-hitting archive.org.

---

## Status & self-hosting: full vs. safe builds

David Dodd generously gave permission to host his **annotations and essays**.
He did **not** (and could not) license the underlying **song lyrics**, which are
separately copyrighted. So this project distinguishes two builds:

| Build | Command | Contains | Who it's for |
|-------|---------|----------|--------------|
| **Full** | `make dist` | annotations **and** lyrics | local self-hosters with a lawful source |
| **Safe** | `make safe` | annotations only; lyrics replaced with a link to [dead.net/songs](https://www.dead.net/songs) | the public site we deploy |

**`make safe`** runs the normal build and then a standalone pass
(`scripts/safe_build.py`) that strips each song's verbatim lyric block — the
`<blockquote>` between the song's credit line and the first annotation anchor —
and drops a link to the official lyrics source in its place. **Everything else
is preserved**: the essays, and the public-domain poems, dictionary entries, and
reader correspondence quoted *within* the annotations. Short lyric fragments
quoted inline for commentary in the essays are left intact (permitted annotation
/ fair use); only the full per-song lyric reproductions are removed. The pass is
idempotent and byte-preserving, and it never deletes a byte of the annotation
section even on pages with malformed 1990s markup.
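
The byte-preservation claim is the kind of property that is easy to spot-check. Below is a minimal sketch of such a check, assuming (as the plan describes) that the annotation section starts at the first `<a name=...>` anchor; the helper names `annotation_section` and `annotations_preserved` are hypothetical and not part of `safe_build.py`.

```python
import re

# First annotation anchor of any name, matched on raw bytes.
FIRST_ANCHOR = re.compile(rb'<a\s+name\s*=', re.IGNORECASE)


def annotation_section(page: bytes) -> bytes:
    """Return everything from the first <a name=...> anchor to end of file."""
    match = FIRST_ANCHOR.search(page)
    return page[match.start():] if match else b""


def annotations_preserved(before: bytes, after: bytes) -> bool:
    """True when the annotation section is byte-identical in both versions."""
    return annotation_section(before) == annotation_section(after)
```

A check like this could be run on each page's bytes from the full build and from the safe build. Note that it is vacuous for pages with no anchor at all, so those would need a whole-file comparison instead.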

The public site at **https://annotated.thedeadly.app/** is the **safe** build —
CI runs `make safe` before deploying. If you have obtained a lawful copy of the
original HTML (e.g. via the Internet Archive, see `make mirror`), `make dist`
gives you the complete site locally.

---

## The pipeline, pass by pass (with real results)

### 1. `make mirror` — the raw crawl (`scripts/mirror.py`)
@@ -184,6 +231,7 @@ dist/ # built, link-fixed site (gitignored; regenerate with `make
scripts/
mirror.py # the raw crawler (make mirror / mirror-retry)
build_site.py # the cleanup build (make dist)
safe_build.py # the lyric-strip pass for public hosting (make safe)
audit_links.py # the link auditor (make audit)
release.sh # tag a semver release (make release)
.github/workflows/ # CI, Pages deploy, release automation
@@ -200,7 +248,8 @@ Run `make help` for the full target list.
 The site is hosted on **GitHub Pages** and deploys automatically:
 
 - **Every merge to `main`** runs CI (build + link audit) and, on success,
-  publishes the site to Pages (`.github/workflows/deploy-pages.yml`). The build
+  publishes the site to Pages (`.github/workflows/deploy-pages.yml`). The deploy
+  runs `make safe`, so the **annotation-only** site is what goes live; the build
   uses the committed `mirror/`, so no archive.org crawl happens in CI.
 - **Releases are semver-tagged.** `make release` reads
   [Conventional Commits](https://www.conventionalcommits.org/) since the last