stophammer-crawler

Unified feed crawler for the stophammer V4V music index. It fetches RSS feeds, hashes their content, parses them with the stophammer-parser Rust library, and submits the results to a stophammer node's /ingest/feed endpoint.

Four subcommands cover every discovery path:

  • feed — one-shot URL list (file, args, env, or stdin)
  • import — batch scan a PodcastIndex SQLite snapshot with resume cursor
  • ndjson — replay cached feed NDJSON into stophammer without re-fetching
  • gossip — long-running gossip-listener SSE consumer with optional archive replay

Requirements

  • Rust 1.85+ (edition 2024)
  • For import mode: a PodcastIndex database snapshot

Installation

Published artifacts

The recommended install artifacts for operators are published from the main stophammer release pipeline:

  • stophammer-crawler-<version>.tar.gz
  • ghcr.io/<owner>/stophammer-crawler
  • the Arch stophammer-crawler package built from the main repo packaging assets

The crawler release bundle includes:

  • stophammer-crawler
  • systemd units for long-running gossip and optional one-shot feed / import / ndjson runs
  • example env files
  • sysusers.d / tmpfiles.d snippets

Build from source

Source builds require a sibling stophammer-parser checkout because the crawler depends on it via a local Cargo path dependency:

git clone https://github.com/inthemorning/stophammer-parser
git clone https://github.com/inthemorning/stophammer-crawler

cd stophammer-crawler
cargo build --release --bins

The container build does not need the sibling checkout in the Docker context. The Dockerfile clones stophammer-parser from Git during the builder stage and accepts optional build args:

docker build \
  --build-arg STOPHAMMER_PARSER_REF=main \
  -t stophammer-crawler .

The cargo build above produces target/release/stophammer-crawler; the container build produces an image instead.

Quick checks from this checkout:

cargo run -- feed --help
cargo run -- import --help
cargo run -- ndjson --help
cargo run -- gossip --help

Usage

The examples below assume stophammer-crawler is on your PATH. After a local source build, use ./target/release/stophammer-crawler instead.

feed

Fetch and ingest a list of feed URLs:

# From arguments
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed https://example.com/feed.xml

# From a file
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed feeds.txt

# From env
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
FEED_URLS="https://a.com/feed,https://b.com/feed" \
stophammer-crawler feed

# From stdin
cat urls.txt | CRAWL_TOKEN=secret \
  INGEST_URL=http://127.0.0.1:8008/ingest/feed \
  stophammer-crawler feed

Force re-ingestion even if feed content is unchanged:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler --force feed https://example.com/feed.xml

feed options

Flag                   Env                  Default             Description
--concurrency          CONCURRENCY          5                   Parallel fetch+ingest workers
--host-delay-ms        HOST_DELAY_MS        1500                Minimum ms between fetches to the same host
--failed-feeds-output  FAILED_FEEDS_OUTPUT  ./failed_feeds.txt  Plain-text output file for retryable feed URLs
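
The per-host delay can be pictured as a small throttle keyed by hostname: remember when each host was last fetched and sleep off the remainder before the next request to it. The sketch below is illustrative only; HostThrottle and its fields are invented names, not types from crawl.rs.

use std::collections::HashMap;
use tokio::time::{sleep, Duration, Instant};

struct HostThrottle {
    delay: Duration,                       // e.g. 1500 ms from --host-delay-ms
    last_fetch: HashMap<String, Instant>,  // hostname -> time of last request
}

impl HostThrottle {
    async fn wait_for(&mut self, host: &str) {
        if let Some(prev) = self.last_fetch.get(host) {
            let elapsed = prev.elapsed();
            if elapsed < self.delay {
                sleep(self.delay - elapsed).await; // wait out the remainder
            }
        }
        self.last_fetch.insert(host.to_string(), Instant::now());
    }
}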

import

Batch-scan a PodcastIndex snapshot for music feeds:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --batch 100 --concurrency 5

Normal snapshot import excludes Wavlake-hosted rows. Use --wavlake-only for the Wavlake-specific pass. The default all_feeds scope now jumps to the music-first PodcastIndex lower bound instead of replaying the full pre-music corpus from 0.

Restart or resume from an explicit PodcastIndex id:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --cursor 5000000 \
  --batch 100 --concurrency 5

Wavlake-only import from the same snapshot:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --wavlake-only \
  --cursor 0

If --db does not exist yet, the importer downloads the latest PodcastIndex snapshot archive from https://public.podcastindex.org/podcastindex_feeds.db.tgz, extracts the .db directly into place, and does not keep the .tgz on disk. Use --refresh-db to check the remote snapshot with If-Modified-Since and download only when it changed. Downloads are streamed through gzip/tar extraction, so RAM usage stays low even for multi-gigabyte snapshots.
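
The streamed extraction can be sketched with the flate2 and tar crates; whether the importer actually uses those crates is an assumption, and download_snapshot is an invented name:

use std::{fs, io, path::Path};

fn download_snapshot(url: &str, dest: &Path) -> anyhow::Result<()> {
    // Pipe the HTTP body through gzip and tar decoding, so the .tgz is never
    // buffered in RAM or written to disk.
    let resp = reqwest::blocking::get(url)?.error_for_status()?;
    let mut archive = tar::Archive::new(flate2::read::GzDecoder::new(resp));
    for entry in archive.entries()? {
        let mut entry = entry?;
        // Extract only the .db entry, directly into place.
        if entry.path()?.extension().map_or(false, |ext| ext == "db") {
            io::copy(&mut entry, &mut fs::File::create(dest)?)?;
            return Ok(());
        }
    }
    anyhow::bail!("no .db entry found in snapshot archive")
}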

Import options

Flag                    Default                  Description
--db <path>             ./podcastindex_feeds.db  PodcastIndex snapshot path
--db-url <url>          (public PI URL)          Snapshot archive URL
--refresh-db            off                      Conditionally refresh the snapshot if the remote archive changed
--state <path>          ./import_state.db        Progress cursor database
--skip-db <path>        ./feed_skip.db           Shared cross-mode skip database
--batch <n>             100                      Feeds per DB query batch
--concurrency <n>       5                        Parallel fetch+ingest workers
--audit-output <path>   off                      Optional cached-feed NDJSON dump of successfully ingested 200 OK feeds
--audit-replace         off                      Replace --audit-output instead of appending to it
--dry-run               off                      Log without fetching/ingesting
--skip-known-non-music  off                      Skip rows already known to fail the music/publisher medium gate, including non-music mediums and absent podcast:medium
--skip-known-success    off                      Skip rows already known to have reached accepted, no_change, or skipped_known_success in importer memory
--wavlake-only          off                      Restrict snapshot import to wavlake.com / www.wavlake.com feeds; without this flag, normal import excludes Wavlake rows
--cursor <id>           stored cursor            Start from an explicit PodcastIndex id instead of the stored cursor

Progress is stored in --state. If the process is interrupted, the next run resumes from the last completed batch. A crash mid-batch re-processes that batch -- safe because stophammer deduplicates on content hash. --cursor <id> overrides the stored starting point for that run; in non-dry runs the override is also persisted to the state DB before the batch loop starts. Cursor state is scoped by import mode inside the state DB, so normal import and --wavlake-only do not overwrite each other's resume positions.

--state now stores both the batch cursor and durable per-row importer memory in import_feed_memory, including the latest fetch status, outcome, raw_medium, attempt_duration_ms, and attempt counter for each attempted PodcastIndex row.
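
A plausible shape for that state DB, sketched with rusqlite; other than import_feed_memory and the fields just listed, every table and column name here is an assumption:

use rusqlite::Connection;

fn open_state(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS import_cursor (
             mode   TEXT PRIMARY KEY,   -- cursors are scoped per import mode
             cursor INTEGER NOT NULL    -- last completed PodcastIndex id
         );
         CREATE TABLE IF NOT EXISTS import_feed_memory (
             feed_id             INTEGER PRIMARY KEY,  -- PodcastIndex row id
             fetch_status        INTEGER,              -- latest HTTP status
             outcome             TEXT,                 -- latest ingest outcome
             raw_medium          TEXT,                 -- podcast:medium as fetched
             attempt_duration_ms INTEGER,
             attempts            INTEGER NOT NULL DEFAULT 0
         );",
    )?;
    Ok(conn)
}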

Import mode now also guards each feed crawl with a hard deadline. If a feed gets stuck below the normal HTTP timeout layer, the importer logs an explicit timeout error for that row, records a retryable failure in import_feed_memory, and continues. Import mode does not retry failed rows within the same run; the state DB is the retry/memory mechanism. Slow batches emit heartbeat lines with the currently pending row IDs and URLs instead of going silent.
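
That guard maps naturally onto tokio::time::timeout. A minimal sketch, assuming a 120 s budget (the actual deadline value is not documented here):

use std::future::Future;
use std::time::Duration;

async fn with_deadline<T>(
    url: &str,
    crawl: impl Future<Output = anyhow::Result<T>>,
) -> anyhow::Result<T> {
    match tokio::time::timeout(Duration::from_secs(120), crawl).await {
        Ok(result) => result, // finished under the deadline
        // The timeout arm becomes the explicit, retryable row failure.
        Err(_) => anyhow::bail!("hard deadline exceeded for {url}"),
    }
}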

--wavlake-only is intentionally slower than normal import mode. Normal import skips Wavlake-hosted snapshot rows entirely. The Wavlake-specific mode queries only those rows, forces single-flight fetches, applies a small delay between requests, and if Wavlake returns 429 Too Many Requests it backs off using the server's Retry-After value before attempting the next feed. Each Wavlake 429 also increases the ongoing inter-fetch delay by 1 second for the rest of that run. Wavlake scope does not imply any skip policy; use --skip-known-success and/or --skip-known-non-music if you want to trade freshness for speed on reruns.
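
The 429 handling reads roughly like the sketch below, assuming Retry-After arrives as a delay in seconds (the header may also be an HTTP date, which this sketch ignores):

use std::time::Duration;

// How long to sleep before the next Wavlake fetch after a 429. Each 429 also
// permanently widens this run's inter-fetch delay by 1 s, as described above.
fn backoff_after_429(retry_after: Option<&str>, inter_fetch_delay: &mut Duration) -> Duration {
    *inter_fetch_delay += Duration::from_secs(1);
    retry_after
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(*inter_fetch_delay) // no usable header: use the widened delay
}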

If --audit-output is set, import mode writes newly accepted feeds and no_change re-submissions of already-ingested feeds to cached-feed NDJSON. Rejected feeds, parse errors, missing or unaccepted podcast:medium values, and non-200 fetches are not written. By default, --audit-output appends and the writer de-dupes by source_db.feed_guid plus fetch.content_sha256 so reruns do not append the same fetched body again. Use --audit-replace to truncate and rewrite the file instead. Import mode also creates an adjacent lock file while writing the NDJSON, so only one importer can target a given output path at a time.
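
The append-mode de-dupe can be pictured as a key set loaded from the existing file before writing; the JSON field paths mirror the README, the rest is illustrative:

use std::collections::HashSet;

// Collect the (source_db.feed_guid, fetch.content_sha256) pairs already
// present in an audit file, so an appending rerun can skip bodies it wrote before.
fn seen_keys(existing_ndjson: &str) -> HashSet<(String, String)> {
    existing_ndjson
        .lines()
        .filter_map(|line| serde_json::from_str::<serde_json::Value>(line).ok())
        .filter_map(|row| {
            let guid = row["source_db"]["feed_guid"].as_str()?.to_string();
            let hash = row["fetch"]["content_sha256"].as_str()?.to_string();
            Some((guid, hash))
        })
        .collect()
}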

Operational note:

  • initial auto-download only needs space for the extracted .db
  • --refresh-db temporarily needs room for both the existing .db and the new replacement .db while the importer swaps them safely

ndjson

Replay cached feed NDJSON rows into stophammer without re-fetching feeds. Parses each row's raw_xml, posts it to /ingest/feed, and tracks progress with a resume cursor.

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --concurrency 5

Force re-ingestion of all rows:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --force \
  --reset

Dry-run to preview what would be ingested:

stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --dry-run

ndjson options

Flag               Env             Default                Description
--input <path>                     ./stored-feeds.ndjson  Path to cached feed NDJSON file
--state <path>                     ./ndjson_state.db      Resume-cursor state database
--batch <n>                        100                    Rows per processing batch
--limit <n>                        off                    Maximum NDJSON rows to process this run
--concurrency <n>  CONCURRENCY     5                      Parallel parse+ingest workers
--force            FORCE_REINGEST  off                    Force re-ingestion even if content has not changed
--dry-run                          off                    Log candidates without posting to stophammer
--reset                            off                    Clear resume cursor and start from the first row

Progress is stored in --state. If the process is interrupted, the next run resumes from the last completed batch. --reset clears the cursor.

Rows without raw_xml (e.g. fetch errors captured in the NDJSON) are automatically skipped. Transient ingest failures (429, 5xx) are retried up to 3 times with exponential backoff before counting as an error.
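
Only the "up to 3 retries on 429/5xx" behaviour comes from the text above; the base delay and doubling factor in this sketch are assumptions:

use std::time::Duration;

async fn post_with_retries(
    client: &reqwest::Client,
    url: &str,
    body: String,
) -> reqwest::Result<reqwest::Response> {
    let mut delay = Duration::from_millis(500); // illustrative base delay
    let mut retries = 0;
    loop {
        let resp = client.post(url).body(body.clone()).send().await?;
        let status = resp.status();
        // Only 429 and 5xx count as transient; anything else returns as-is.
        if (status.as_u16() == 429 || status.is_server_error()) && retries < 3 {
            retries += 1;
            tokio::time::sleep(delay).await;
            delay *= 2; // exponential backoff between attempts
            continue;
        }
        return Ok(resp);
    }
}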

gossip

Listen to a local gossip-listener SSE stream for live podping notifications.

Prerequisites

  • A running gossip-listener instance from podping.alpha (provides the SSE stream and archive)
  • The crawler process must have read access to the archive database
  • CRAWL_TOKEN and INGEST_URL environment variables set

Typical host setup:

  1. Install and start podping.alpha / gossip-listener
  2. Enable archive writing in that service
  3. Confirm the resulting archive.db path on disk
  4. Point --archive-db at that path

If the archive does not live at the default path, pass the actual path:

stophammer-crawler gossip --archive-db /some/other/path/archive.db

For packaged systemd installs, the usual pattern is to add the stophammer-crawler service user to the podping group so it can read the archive written by gossip-listener.

For Docker installs, the root repo's reference docker-compose.yml runs podping.alpha's gossip-listener as a sibling podping-listener service and shares its archive.db with the crawler over a named Docker volume. The defaults inside the crawler container are:

  • GOSSIP_ARCHIVE_DB=/podping/archive.db
  • GOSSIP_SSE_URL=http://podping-listener:8089/events

Archive-backed mode (recommended)

Uses gossip-listener's archive.db as a durable podping backlog. Survives restarts, SSE disconnects, and process crashes without losing notifications.

First-time bootstrap (catches up on the last 24 hours of podpings):

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
  --archive-db /path/to/archive.db \
  --since-hours 24 \
  --skip-known-non-music \
  --skip-ttl-days 30

Subsequent runs resume from the stored cursor automatically:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
  --archive-db /path/to/archive.db \
  --skip-known-non-music \
  --skip-ttl-days 30

On startup, the crawler validates the archive schema, checks cursor continuity (exits fatally if notifications were lost), replays missed notifications in batches of ~500 with backpressure, then connects to SSE for live events. A background reconciliation task queries the archive periodically (10 s, then 60 s steady-state, backing off to 5 min when idle) to catch anything SSE missed.
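
The reconciliation cadence reduces to a small interval rule. This sketch only encodes the intervals quoted above (60 s steady-state, 5 min idle cap); the real scheduling, including the 10 s first poll, is more involved than this:

use std::time::Duration;

// Next archive-poll interval: hold at 60 s while notifications keep arriving,
// double toward a 5 min cap while idle.
fn next_poll(prev: Duration, saw_new_rows: bool) -> Duration {
    if saw_new_rows {
        Duration::from_secs(60)
    } else {
        (prev * 2).min(Duration::from_secs(300))
    }
}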

If the stored cursor is older than the oldest retained archive row, the crawler exits with a fatal error. Delete gossip_state.db to re-bootstrap.

Live-only mode (best-effort)

Without --archive-db, gossip mode streams SSE events only. Not restart-safe — any events that arrive while the process is down are lost.

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip

Feed memory and skip logic

gossip_state.db records the latest crawl result for each feed URL (HTTP status, outcome, medium, attempt count). With --skip-known-non-music, feeds proven irrelevant by a prior crawl are skipped on future notifications:

  • HTTP 200 with a non-music, non-publisher raw_medium, or
  • a [medium_music] rejection (including absent podcast:medium)

Fetch errors (404, 429, timeouts), parse errors, and prior successful feeds are never skipped. --skip-ttl-days <n> expires skip decisions after N days so feeds are periodically re-evaluated.
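
The TTL check is a one-line predicate over the stored decision time; how that time is stored is an assumption:

use std::time::{Duration, SystemTime};

// Is a stored skip decision still in force? With --skip-ttl-days n, decisions
// older than n days expire and the feed is re-evaluated on its next podping.
fn skip_still_valid(decided_at: SystemTime, ttl_days: Option<u64>) -> bool {
    match ttl_days {
        None => true, // no TTL configured: skip decisions never expire
        Some(days) => decided_at
            .elapsed()
            .map(|age| age < Duration::from_secs(days * 86_400))
            .unwrap_or(false), // clock skew: treat as expired
    }
}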

Gossip options

Flag                    Default                       Description
--state <path>          ./gossip_state.db             Cursor and feed memory database
--skip-db <path>        ./feed_skip.db                Shared cross-mode skip database
--sse-url <url>         http://localhost:8089/events  SSE endpoint URL
--archive-db <path>     off                           gossip-listener archive database path
--since-hours <n>       off                           Bootstrap from N hours ago (requires --archive-db)
--concurrency <n>       3                             Parallel fetch+ingest workers
--skip-known-non-music  off                           Skip feeds proven non-music by prior crawl
--skip-ttl-days <n>     off                           Re-evaluate skip decisions after N days
-q, --quiet             off                           Hide non-music medium rejections

Environment variables

  • CRAWL_TOKEN (required) -- Shared secret for stophammer ingest auth.
  • INGEST_URL -- Stophammer ingest endpoint. Default: http://localhost:8008/ingest/feed
  • FORCE_REINGEST -- Force re-ingestion of feeds even if content has not changed. Set to 1 to enable. Can also be passed as the top-level --force flag.
  • CONCURRENCY -- Worker pool size. Default: 5 (feed/import) / 3 (gossip)
  • FEED_URLS -- Comma- or newline-separated URLs (feed mode only).
  • PODCASTINDEX_DB_URL -- Override the PodcastIndex snapshot archive URL for import mode.

Architecture

stophammer-crawler
  src/
    main.rs           CLI dispatcher (clap subcommands)
    crawl.rs          Shared pipeline: fetch → SHA-256 → parse → POST
    pool.rs           Bounded concurrency pool (tokio semaphore)
    dedup.rs          In-memory cooldown map (gossip mode)
    feed_skip.rs      Shared skip-memory database for proven irrelevant feeds
    modes/
      batch.rs        Load URLs from file/env/stdin, run pool
      import.rs       PodcastIndex DB batches, resume cursor, fallback GUID
      ndjson.rs       Replay cached feed NDJSON without re-fetching
      gossip.rs       SSE listener and optional archive replay for gossip-listener

The core pipeline in crawl.rs calls stophammer-parser as a Rust library — no subprocess spawning. Every mode feeds URLs into the same crawl_feed() function:

  1. Fetch the RSS feed via reqwest
  2. Hash the raw response body with SHA-256
  3. Parse the XML with stophammer-parser::profile::stophammer() (or stophammer_with_fallback() for import mode with PodcastIndex GUIDs)
  4. POST the result as JSON to the stophammer /ingest/feed endpoint
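
Condensed into code, those four steps look roughly like the sketch below. parse_feed() stands in for the stophammer-parser call, whose real signature and output type are not shown in this README; the bearer-token auth and payload shape are likewise assumptions:

use sha2::{Digest, Sha256};

// Stand-in for stophammer_parser::profile::stophammer().
fn parse_feed(xml: &str) -> anyhow::Result<serde_json::Value> {
    Ok(serde_json::json!({ "item_count": xml.matches("<item>").count() }))
}

async fn crawl_feed(
    client: &reqwest::Client,
    feed_url: &str,
    ingest_url: &str,
    token: &str,
) -> anyhow::Result<()> {
    // 1. Fetch the RSS feed
    let body = client.get(feed_url).send().await?.error_for_status()?.bytes().await?;
    // 2. Hash the raw response body
    let content_sha256 = hex::encode(Sha256::digest(&body));
    // 3. Parse the XML
    let parsed = parse_feed(std::str::from_utf8(&body)?)?;
    // 4. POST the result as JSON to /ingest/feed (auth scheme is illustrative)
    client
        .post(ingest_url)
        .bearer_auth(token)
        .json(&serde_json::json!({ "content_sha256": content_sha256, "feed": parsed }))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}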

Concurrency is bounded by a tokio semaphore: feed and import modes drain a fixed task list, while gossip mode runs an unbounded stream with a permit-based cap.
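
For the fixed-list case, the permit pattern looks like this minimal sketch (not pool.rs itself):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn run_bounded(urls: Vec<String>, limit: usize) {
    let sem = Arc::new(Semaphore::new(limit));
    let mut tasks = Vec::with_capacity(urls.len());
    for url in urls {
        // Acquiring before spawning caps in-flight crawls at `limit`.
        let permit = sem.clone().acquire_owned().await.expect("semaphore closed");
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // released when this task finishes
            println!("would crawl {url} here");
        }));
    }
    for task in tasks {
        let _ = task.await; // drain the fixed task list
    }
}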

Docker

Published releases also push a crawler runtime image:

  • ghcr.io/<owner>/stophammer-crawler

The main stophammer repo's docker-compose.yml includes multi-mode crawler support with a shared crawler-data volume:

# One-shot feed crawl
docker compose run --rm stophammer-crawler feed https://example.com/feed.xml

# Force re-ingestion
docker compose run --rm stophammer-crawler --force feed https://example.com/feed.xml

# Replay NDJSON corpus
docker compose run --rm stophammer-crawler ndjson --input /data/stored-feeds.ndjson

# Long-running gossip listener
docker compose up -d gossip

# Batch import from PodcastIndex
docker compose run --rm import

The stophammer-crawler service requires CRAWL_TOKEN and INGEST_URL in packaging/env/crawler-feed.compose.env. The gossip and import services use separate env files configured in the compose stack.

For day-to-day operation, run the binary directly or provide your own scheduler / compose deployment around stophammer-crawler gossip, stophammer-crawler import, or stophammer-crawler feed.

License

AGPL-3.0-only — see LICENSE.
