Unified feed crawler for the
stophammer V4V music index.
Fetches RSS feeds, hashes their content, parses them with the stophammer-parser
Rust library, and submits the results to a stophammer node's
/ingest/feed endpoint.
Four subcommands cover every discovery path:
- feed — one-shot URL list (file, args, env, or stdin)
- import — batch scan a PodcastIndex SQLite snapshot with resume cursor
- ndjson — replay cached feed NDJSON into stophammer without re-fetching
- gossip — long-running gossip-listener SSE consumer with optional archive replay
Requirements:
- Rust 1.85+ (edition 2024)
- For import mode: a PodcastIndex database snapshot
The recommended operator install surfaces are published from the main
stophammer release pipeline:
- the stophammer-crawler-<version>.tar.gz release tarball
- the ghcr.io/<owner>/stophammer-crawler container image
- the Arch stophammer-crawler package built from the main repo packaging assets
The crawler release bundle includes:
- the stophammer-crawler binary
- systemd units for long-running gossip and optional one-shot feed/import/ndjson runs
- example env files
- sysusers.d/tmpfiles.d snippets
Source builds require a sibling stophammer-parser checkout because the crawler
depends on it via a local Cargo path dependency:
git clone https://github.com/inthemorning/stophammer-parser
git clone https://github.com/inthemorning/stophammer-crawler
cd stophammer-crawler
cargo build --release --bins

That produces target/release/stophammer-crawler.

The container build does not require that sibling checkout in the Docker
context. The Dockerfile clones stophammer-parser from Git during
the builder stage and accepts optional build args:

docker build \
  --build-arg STOPHAMMER_PARSER_REF=main \
  -t stophammer-crawler .
Quick checks from this checkout:
cargo run -- feed --help
cargo run -- import --help
cargo run -- ndjson --help
cargo run -- gossip --help

The examples below assume stophammer-crawler is on your PATH. After a local
source build, use ./target/release/stophammer-crawler instead.
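If you are running several of the examples below interactively, you can export the two required settings once per shell instead of prefixing every command (the values here are placeholders):

```sh
# Export the shared ingest settings once; later commands inherit them.
export CRAWL_TOKEN=secret                             # shared ingest secret (placeholder)
export INGEST_URL=http://127.0.0.1:8008/ingest/feed   # stophammer ingest endpoint

# The per-command CRAWL_TOKEN=... / INGEST_URL=... prefixes are then optional:
stophammer-crawler feed https://example.com/feed.xml
```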
Fetch and ingest a list of feed URLs:
# From arguments
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed https://example.com/feed.xml
# From a file
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed feeds.txt
# From env
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
FEED_URLS="https://a.com/feed,https://b.com/feed" \
stophammer-crawler feed
# From stdin
cat urls.txt | CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed

Force re-ingestion even if feed content is unchanged:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler --force feed https://example.com/feed.xml

| Flag | Env | Default | Description |
|---|---|---|---|
| --concurrency | CONCURRENCY | 5 | Parallel fetch+ingest workers |
| --host-delay-ms | HOST_DELAY_MS | 1500 | Minimum ms between fetches to the same host |
| --failed-feeds-output | FAILED_FEEDS_OUTPUT | ./failed_feeds.txt | Plain-text output file for retryable feed URLs |
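Combining these flags, a gentler pass over a large URL list could look like the following sketch. It assumes the file written by --failed-feeds-output is a plain newline-separated URL list, as its description suggests, so it can be fed straight back into a retry run:

```sh
# Slow, polite pass over a big list: fewer workers, longer per-host delay,
# and retryable failures collected for a second pass.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed \
  --concurrency 2 \
  --host-delay-ms 5000 \
  --failed-feeds-output ./failed_feeds.txt \
  feeds.txt

# Later, retry only the feeds that failed with retryable errors.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed ./failed_feeds.txt
```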
Batch-scan a PodcastIndex snapshot for music feeds:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
--batch 100 --concurrency 5

Normal snapshot import excludes Wavlake-hosted rows. Use --wavlake-only for
the Wavlake-specific pass. The default all_feeds scope now jumps to the
music-first PodcastIndex lower bound instead of replaying the full pre-music
corpus from 0.
Restart or resume from an explicit PodcastIndex id:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
--cursor 5000000 \
--batch 100 --concurrency 5

Wavlake-only import from the same snapshot:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
--wavlake-only \
--cursor 0

If --db does not exist yet, the importer downloads the latest PodcastIndex
snapshot archive from https://public.podcastindex.org/podcastindex_feeds.db.tgz,
extracts the .db directly into place, and does not keep the .tgz on disk.
Use --refresh-db to check the remote snapshot with If-Modified-Since and
download only when it changed.
Downloads are streamed through gzip/tar extraction, so RAM usage stays low even
for multi-gigabyte snapshots.
| Flag | Default | Description |
|---|---|---|
| --db <path> | ./podcastindex_feeds.db | PodcastIndex snapshot path |
| --db-url <url> | (public PI URL) | Snapshot archive URL |
| --refresh-db | off | Conditionally refresh the snapshot if the remote archive changed |
| --state <path> | ./import_state.db | Progress cursor database |
| --skip-db <path> | ./feed_skip.db | Shared cross-mode skip database |
| --batch <n> | 100 | Feeds per DB query batch |
| --concurrency <n> | 5 | Parallel fetch+ingest workers |
| --audit-output <path> | off | Optional cached-feed NDJSON dump of successfully ingested 200 OK feeds |
| --audit-replace | off | Replace --audit-output instead of appending to it |
| --dry-run | off | Log without fetching/ingesting |
| --skip-known-non-music | off | Skip rows already known to fail the music/publisher medium gate, including non-music mediums and absent podcast:medium |
| --skip-known-success | off | Skip rows already known to have reached accepted, no_change, or skipped_known_success in importer memory |
| --wavlake-only | off | Restrict snapshot import to wavlake.com / www.wavlake.com feeds; without this flag, normal import excludes Wavlake rows |
| --cursor <id> | stored cursor | Start from an explicit PodcastIndex id instead of the stored cursor |
Progress is stored in --state. If the process is interrupted,
the next run resumes from the last completed batch. A crash
mid-batch re-processes that batch -- safe because stophammer
deduplicates on content hash. --cursor <id> overrides the stored
starting point for that run; in non-dry runs the override is also
persisted to the state DB before the batch loop starts. Cursor state is
scoped by import mode inside the state DB, so normal import and
--wavlake-only do not overwrite each other's resume positions.
--state now stores both the batch cursor and durable per-row importer memory
in import_feed_memory, including the latest fetch status, outcome,
raw_medium, attempt_duration_ms, and attempt counter for each attempted
PodcastIndex row.
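Since the state file is an ordinary SQLite database, that memory can be inspected with the sqlite3 CLI. This is only a sketch: import_feed_memory is named above, but the exact column names (outcome here) are assumptions based on the fields listed, not a documented schema:

```sh
# Summarize importer memory by outcome; adjust column names to the real schema.
sqlite3 ./import_state.db \
  "SELECT outcome, COUNT(*) AS n FROM import_feed_memory GROUP BY outcome ORDER BY n DESC;"
```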
Import mode now also guards each feed crawl with a hard deadline. If a feed
gets stuck below the normal HTTP timeout layer, the importer logs an explicit
timeout error for that row, records a retryable failure in import_feed_memory,
and continues. Import mode does not retry failed rows within the same run; the
state DB is the retry/memory mechanism. Slow batches emit heartbeat lines with
the currently pending row IDs and URLs instead of going silent.
--wavlake-only is intentionally slower than normal import mode. Normal import
skips Wavlake-hosted snapshot rows entirely. The Wavlake-specific mode queries
only those rows, forces single-flight fetches, applies a small delay between
requests, and if Wavlake returns 429 Too Many Requests it backs off using the
server's Retry-After value before attempting the next feed. Each Wavlake
429 also increases the ongoing inter-fetch delay by 1 second for the rest of
that run. Wavlake scope does not imply any skip policy; use
--skip-known-success and/or --skip-known-non-music if you want to trade
freshness for speed on reruns.
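Put together, a Wavlake re-pass that trades freshness for speed might combine the scope flag with both skip flags, for example:

```sh
# Faster Wavlake re-pass: skip rows the state DB already knows about.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --wavlake-only \
  --skip-known-success \
  --skip-known-non-music
```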
If --audit-output is set, import mode writes newly accepted feeds and
no_change re-submissions of already-ingested feeds to cached-feed NDJSON.
Rejected feeds, parse errors, missing or unaccepted podcast:medium values,
and non-200 fetches are not written. By default, --audit-output appends and
the writer de-dupes by source_db.feed_guid plus fetch.content_sha256 so
reruns do not append the same fetched body again. Use --audit-replace to
truncate and rewrite the file instead. Import mode also creates an adjacent
lock file while writing the NDJSON, so only one importer can target a given
output path at a time.
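Because ndjson mode replays exactly this kind of cached-feed NDJSON, the audit file can double as a re-ingestion corpus. A sketch, with the second node's ingest URL as a placeholder:

```sh
# Import pass that also archives accepted feeds as cached-feed NDJSON.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --audit-output ./stored-feeds.ndjson

# Later: replay the archived corpus into another node without re-fetching.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8009/ingest/feed \
stophammer-crawler ndjson --input ./stored-feeds.ndjson
```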
Operational notes:
- The initial auto-download only needs space for the extracted .db.
- --refresh-db temporarily needs room for both the existing .db and the new
  replacement .db while the importer swaps them safely.
Replay cached feed NDJSON rows into stophammer without re-fetching feeds.
Parses each row's raw_xml, posts it to /ingest/feed, and tracks progress
with a resume cursor.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
--input ./stored-feeds.ndjson \
--concurrency 5

Force re-ingestion of all rows:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
--input ./stored-feeds.ndjson \
--force \
--reset

Dry-run to preview what would be ingested:
stophammer-crawler ndjson \
--input ./stored-feeds.ndjson \
--dry-run

| Flag | Env | Default | Description |
|---|---|---|---|
| --input <path> | | ./stored-feeds.ndjson | Path to cached feed NDJSON file |
| --state <path> | | ./ndjson_state.db | Resume-cursor state database |
| --batch <n> | | 100 | Rows per processing batch |
| --limit <n> | | off | Maximum NDJSON rows to process this run |
| --concurrency <n> | CONCURRENCY | 5 | Parallel parse+ingest workers |
| --force | FORCE_REINGEST | off | Force re-ingestion even if content has not changed |
| --dry-run | | off | Log candidates without posting to stophammer |
| --reset | | off | Clear resume cursor and start from the first row |
Progress is stored in --state. If the process is interrupted, the next run
resumes from the last completed batch. --reset clears the cursor.
Rows without raw_xml (e.g. fetch errors captured in the NDJSON) are
automatically skipped. Transient ingest failures (429, 5xx) are retried
up to 3 times with exponential backoff before counting as an error.
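Before a long replay it can be worth checking how much of the file is actually replayable. The jq sketch below assumes the row fields mentioned in this README (raw_xml, source_db.feed_guid); the full row schema is not spelled out here:

```sh
# Rows that carry a feed body to replay:
jq -c 'select(.raw_xml != null) | .source_db.feed_guid' ./stored-feeds.ndjson | wc -l
# Rows without raw_xml (e.g. captured fetch errors), which will be skipped:
jq -c 'select(.raw_xml == null)' ./stored-feeds.ndjson | wc -l
```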
Listen to a local gossip-listener SSE stream for live podping notifications.
Prerequisites:
- A running gossip-listener instance from podping.alpha (provides the SSE
  stream and archive)
- The crawler process must have read access to the archive database
- CRAWL_TOKEN and INGEST_URL environment variables set
Typical host setup:
- Install and start podping.alpha/gossip-listener
- Enable archive writing in that service
- Confirm the resulting archive.db path on disk
- Point --archive-db at that path
If the archive does not live at the default path, pass the actual path:
stophammer-crawler gossip --archive-db /some/other/path/archive.db

For packaged systemd installs, the usual pattern is to add the
stophammer-crawler service user to the podping group so it can read the
archive written by gossip-listener.
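On such a host this usually reduces to a group change plus a service restart. The user, group, and unit names below are illustrative; use the ones your packaged install actually created:

```sh
# Grant the crawler's service user read access to the gossip-listener archive.
sudo usermod -aG podping stophammer-crawler
# Restart the long-running gossip unit so the new group membership takes effect
# (unit name is illustrative; use the one shipped in the release bundle).
sudo systemctl restart stophammer-crawler-gossip.service
```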
For Docker installs, the root repo's reference docker-compose.yml runs
podping.alpha's gossip-listener as a sibling podping-listener service
and shares its archive.db with the crawler over a named Docker volume. The
resulting defaults inside the crawler container are:

GOSSIP_ARCHIVE_DB=/podping/archive.db
GOSSIP_SSE_URL=http://podping-listener:8089/events
With --archive-db, gossip mode uses gossip-listener's archive.db as a durable
podping backlog. It survives restarts, SSE disconnects, and process crashes
without losing notifications.
First-time bootstrap (catches up on the last 24 hours of podpings):
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
--archive-db /path/to/archive.db \
--since-hours 24 \
--skip-known-non-music \
--skip-ttl-days 30

Subsequent runs resume from the stored cursor automatically:
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
--archive-db /path/to/archive.db \
--skip-known-non-music \
--skip-ttl-days 30On startup, the crawler validates the archive schema, checks cursor continuity (exits fatally if notifications were lost), replays missed notifications in batches of ~500 with backpressure, then connects to SSE for live events. A background reconciliation task queries the archive periodically (10 s, then 60 s steady-state, backing off to 5 min when idle) to catch anything SSE missed.
If the stored cursor is older than the oldest retained archive row, the crawler
exits with a fatal error. Delete gossip_state.db to re-bootstrap.
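Recovery is a fresh bootstrap: remove the state database and replay a recent window from the archive, for example:

```sh
# The stored cursor is unrecoverable once the archive has pruned past it:
# drop the state and bootstrap again from a recent window.
rm ./gossip_state.db
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
  --archive-db /path/to/archive.db \
  --since-hours 24
```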
Without --archive-db, gossip mode streams SSE events only. Not restart-safe
— any events that arrive while the process is down are lost.
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip

gossip_state.db records the latest crawl result for each feed URL (HTTP
status, outcome, medium, attempt count). With --skip-known-non-music, feeds
proven irrelevant by a prior crawl are skipped on future notifications:
- HTTP 200 with a non-music, non-publisher raw_medium, or
- a [medium_music] rejection (including absent podcast:medium)
Fetch errors (404, 429, timeouts), parse errors, and prior successful feeds are
never skipped. --skip-ttl-days <n> expires skip decisions after N days so
feeds are periodically re-evaluated.
| Flag | Default | Description |
|---|---|---|
| --state <path> | ./gossip_state.db | Cursor and feed memory database |
| --skip-db <path> | ./feed_skip.db | Shared cross-mode skip database |
| --sse-url <url> | http://localhost:8089/events | SSE endpoint URL |
| --archive-db <path> | off | gossip-listener archive database path |
| --since-hours <n> | off | Bootstrap from N hours ago (requires --archive-db) |
| --concurrency <n> | 3 | Parallel fetch+ingest workers |
| --skip-known-non-music | off | Skip feeds proven non-music by prior crawl |
| --skip-ttl-days <n> | off | Re-evaluate skip decisions after N days |
| -q, --quiet | off | Hide non-music medium rejections |
Environment variables:
- CRAWL_TOKEN (required) -- Shared secret for stophammer ingest auth.
- INGEST_URL -- Stophammer ingest endpoint. Default: http://localhost:8008/ingest/feed
- FORCE_REINGEST -- Force re-ingestion of feeds even if content has not changed. Set to 1 to enable. Can also be passed as the --force flag at the top level.
- CONCURRENCY -- Worker pool size. Default: 5 (feed/import) / 3 (gossip)
- FEED_URLS -- Comma- or newline-separated URLs (feed mode only).
- PODCASTINDEX_DB_URL -- Override the PodcastIndex snapshot archive URL for import mode.
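The bundled example env files cover the same variables. A minimal hand-rolled equivalent might look like this (path and values are illustrative):

```sh
# /etc/stophammer-crawler/crawler.env (illustrative path)
CRAWL_TOKEN=change-me
INGEST_URL=http://127.0.0.1:8008/ingest/feed
CONCURRENCY=5
# FORCE_REINGEST=1                                   # force re-ingestion of unchanged feeds
# FEED_URLS=https://a.com/feed,https://b.com/feed    # feed mode only
```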
Repository layout:

stophammer-crawler
  src/
    main.rs        CLI dispatcher (clap subcommands)
    crawl.rs       Shared pipeline: fetch → SHA-256 → parse → POST
    pool.rs        Bounded concurrency pool (tokio semaphore)
    dedup.rs       In-memory cooldown map (gossip mode)
    feed_skip.rs   Shared skip-memory database for proven irrelevant feeds
    modes/
      batch.rs     Load URLs from file/env/stdin, run pool
      import.rs    PodcastIndex DB batches, resume cursor, fallback GUID
      ndjson.rs    Replay cached feed NDJSON without re-fetching
      gossip.rs    SSE listener and optional archive replay for gossip-listener
The core pipeline in crawl.rs calls stophammer-parser as a Rust library — no
subprocess spawning. Every mode feeds URLs into the same crawl_feed() function:
- Fetch the RSS feed via reqwest
- Hash the raw response body with SHA-256
- Parse the XML with stophammer-parser::profile::stophammer() (or
  stophammer_with_fallback() for import mode with PodcastIndex GUIDs)
- POST the result as JSON to the stophammer /ingest/feed endpoint
Concurrency is bounded by a tokio semaphore — crawl and import modes drain a fixed task list; gossip mode runs an unbounded stream with a permit-based cap.
Published releases also push a crawler runtime image:
ghcr.io/<owner>/stophammer-crawler
The main stophammer repo's docker-compose.yml includes
multi-mode crawler support with a shared crawler-data volume:
# One-shot feed crawl
docker compose run --rm stophammer-crawler feed https://example.com/feed.xml
# Force re-ingestion
docker compose run --rm stophammer-crawler --force feed https://example.com/feed.xml
# Replay NDJSON corpus
docker compose run --rm stophammer-crawler ndjson --input /data/stored-feeds.ndjson
# Long-running gossip listener
docker compose up -d gossip
# Batch import from PodcastIndex
docker compose run --rm import

The stophammer-crawler service requires CRAWL_TOKEN and INGEST_URL in
packaging/env/crawler-feed.compose.env.
The gossip and import services use separate env files configured in the compose stack.
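Routine operations against the compose stack use ordinary docker compose commands; the service names below come from the examples above, and the inline -e override is just an alternative to editing the env file:

```sh
# Follow the long-running gossip listener's logs.
docker compose logs -f gossip

# One-off feed crawl with the token supplied inline instead of via the env file.
docker compose run --rm -e CRAWL_TOKEN=secret \
  stophammer-crawler feed https://example.com/feed.xml
```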
For day-to-day operation, run the binary directly or provide your own scheduler / compose
deployment around stophammer-crawler gossip, stophammer-crawler import, or
stophammer-crawler feed.
AGPL-3.0-only — see LICENSE.