stophammer-crawler

Unified feed crawler for the stophammer V4V music index. It fetches RSS feeds, hashes their content, parses them with the stophammer-parser Rust library, and submits the results to a stophammer node's /ingest/feed endpoint.

Four subcommands cover every discovery path:

  • feed — one-shot URL list (file, args, env, or stdin)
  • import — batch scan a PodcastIndex SQLite snapshot with resume cursor
  • ndjson — replay cached feed NDJSON into stophammer without re-fetching
  • gossip — long-running gossip-listener SSE consumer with optional archive replay

Requirements

  • Rust 1.85+ (edition 2024)
  • For import mode: a PodcastIndex database snapshot

Installation

Published artifacts

The recommended install artifacts for operators are published from the main stophammer release pipeline:

  • stophammer-crawler-<version>.tar.gz
  • ghcr.io/<owner>/stophammer-crawler
  • the Arch stophammer-crawler package built from the main repo packaging assets

The crawler release bundle includes:

  • stophammer-crawler
  • systemd units for long-running gossip and optional one-shot feed / import / ndjson runs
  • example env files
  • sysusers.d / tmpfiles.d snippets

Build from source

Source builds require a sibling stophammer-parser checkout because the crawler depends on it via a local Cargo path dependency:

git clone https://github.com/inthemorning/stophammer-parser
git clone https://github.com/inthemorning/stophammer-crawler

cd stophammer-crawler
cargo build --release --bins

The container build does not need the sibling checkout in the Docker context. The Dockerfile clones stophammer-parser from Git during the builder stage and accepts optional build args:

docker build \
  --build-arg STOPHAMMER_PARSER_REF=main \
  -t stophammer-crawler .

The cargo build above produces target/release/stophammer-crawler; the container build produces an image instead.

Quick checks from this checkout:

cargo run -- feed --help
cargo run -- import --help
cargo run -- ndjson --help
cargo run -- gossip --help

Usage

The examples below assume stophammer-crawler is on your PATH. After a local source build, use ./target/release/stophammer-crawler instead.

feed

Fetch and ingest a list of feed URLs:

# From arguments
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed https://example.com/feed.xml

# From a file
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler feed feeds.txt

# From env
CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
FEED_URLS="https://a.com/feed,https://b.com/feed" \
stophammer-crawler feed

# From stdin
cat urls.txt | CRAWL_TOKEN=secret \
  INGEST_URL=http://127.0.0.1:8008/ingest/feed \
  stophammer-crawler feed

Force re-ingestion even if feed content is unchanged:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler --force feed https://example.com/feed.xml

feed options

Flag                   Env                  Default             Description
--concurrency          CONCURRENCY          5                   Parallel fetch+ingest workers
--host-delay-ms        HOST_DELAY_MS        1500                Minimum ms between fetches to the same host
--failed-feeds-output  FAILED_FEEDS_OUTPUT  ./failed_feeds.txt  Plain-text output file for retryable feed URLs
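
The per-host delay can be pictured as a small throttle keyed by hostname: remember when each host was last fetched and sleep off the remainder before the next request to it. The sketch below is illustrative only; HostThrottle and its fields are invented names, not types from crawl.rs.

use std::collections::HashMap;
use tokio::time::{sleep, Duration, Instant};

struct HostThrottle {
    delay: Duration,                       // e.g. 1500 ms from --host-delay-ms
    last_fetch: HashMap<String, Instant>,  // hostname -> time of last request
}

impl HostThrottle {
    async fn wait_for(&mut self, host: &str) {
        if let Some(prev) = self.last_fetch.get(host) {
            let elapsed = prev.elapsed();
            if elapsed < self.delay {
                sleep(self.delay - elapsed).await; // wait out the remainder
            }
        }
        self.last_fetch.insert(host.to_string(), Instant::now());
    }
}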

import

Batch-scan a PodcastIndex snapshot for music feeds:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --batch 100 --concurrency 5

Normal snapshot import excludes Wavlake-hosted rows. Use --wavlake-only for the Wavlake-specific pass. The default all_feeds scope now jumps to the music-first PodcastIndex lower bound instead of replaying the full pre-music corpus from 0.

Restart or resume from an explicit PodcastIndex id:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --cursor 5000000 \
  --batch 100 --concurrency 5

Wavlake-only import from the same snapshot:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler import \
  --wavlake-only \
  --cursor 0

If --db does not exist yet, the importer downloads the latest PodcastIndex snapshot archive from https://public.podcastindex.org/podcastindex_feeds.db.tgz, extracts the .db directly into place, and does not keep the .tgz on disk. Use --refresh-db to check the remote snapshot with If-Modified-Since and download only when it changed. Downloads are streamed through gzip/tar extraction, so RAM usage stays low even for multi-gigabyte snapshots.
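
The streamed extraction can be sketched with the flate2 and tar crates; whether the importer actually uses those crates is an assumption, and download_snapshot is an invented name:

use std::{fs, io, path::Path};

fn download_snapshot(url: &str, dest: &Path) -> anyhow::Result<()> {
    // Pipe the HTTP body through gzip and tar decoding, so the .tgz is never
    // buffered in RAM or written to disk.
    let resp = reqwest::blocking::get(url)?.error_for_status()?;
    let mut archive = tar::Archive::new(flate2::read::GzDecoder::new(resp));
    for entry in archive.entries()? {
        let mut entry = entry?;
        // Extract only the .db entry, directly into place.
        if entry.path()?.extension().map_or(false, |ext| ext == "db") {
            io::copy(&mut entry, &mut fs::File::create(dest)?)?;
            return Ok(());
        }
    }
    anyhow::bail!("no .db entry found in snapshot archive")
}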

Import options

Flag                    Default                  Description
--db <path>             ./podcastindex_feeds.db  PodcastIndex snapshot path
--db-url <url>          (public PI URL)          Snapshot archive URL
--refresh-db            off                      Conditionally refresh the snapshot if the remote archive changed
--state <path>          ./import_state.db        Progress cursor database
--skip-db <path>        ./feed_skip.db           Shared cross-mode skip database
--batch <n>             100                      Feeds per DB query batch
--concurrency <n>       5                        Parallel fetch+ingest workers
--audit-output <path>   off                      Optional cached-feed NDJSON dump of successfully ingested 200 OK feeds
--audit-replace         off                      Replace --audit-output instead of appending to it
--dry-run               off                      Log without fetching/ingesting
--skip-known-non-music  off                      Skip rows already known to fail the music/publisher medium gate, including non-music mediums and absent podcast:medium
--skip-known-success    off                      Skip rows already known to have reached accepted, no_change, or skipped_known_success in importer memory
--wavlake-only          off                      Restrict snapshot import to wavlake.com / www.wavlake.com feeds; without this flag, normal import excludes Wavlake rows
--cursor <id>           stored cursor            Start from an explicit PodcastIndex id instead of the stored cursor

Progress is stored in --state. If the process is interrupted, the next run resumes from the last completed batch. A crash mid-batch re-processes that batch -- safe because stophammer deduplicates on content hash. --cursor <id> overrides the stored starting point for that run; in non-dry runs the override is also persisted to the state DB before the batch loop starts. Cursor state is scoped by import mode inside the state DB, so normal import and --wavlake-only do not overwrite each other's resume positions.

--state now stores both the batch cursor and durable per-row importer memory in import_feed_memory, including the latest fetch status, outcome, raw_medium, attempt_duration_ms, and attempt counter for each attempted PodcastIndex row.
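
A plausible shape for that state DB, sketched with rusqlite; other than import_feed_memory and the fields just listed, every table and column name here is an assumption:

use rusqlite::Connection;

fn open_state(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS import_cursor (
             mode   TEXT PRIMARY KEY,   -- cursors are scoped per import mode
             cursor INTEGER NOT NULL    -- last completed PodcastIndex id
         );
         CREATE TABLE IF NOT EXISTS import_feed_memory (
             feed_id             INTEGER PRIMARY KEY,  -- PodcastIndex row id
             fetch_status        INTEGER,              -- latest HTTP status
             outcome             TEXT,                 -- latest ingest outcome
             raw_medium          TEXT,                 -- podcast:medium as fetched
             attempt_duration_ms INTEGER,
             attempts            INTEGER NOT NULL DEFAULT 0
         );",
    )?;
    Ok(conn)
}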

Import mode now also guards each feed crawl with a hard deadline. If a feed gets stuck below the normal HTTP timeout layer, the importer logs an explicit timeout error for that row, records a retryable failure in import_feed_memory, and continues. Import mode does not retry failed rows within the same run; the state DB is the retry/memory mechanism. Slow batches emit heartbeat lines with the currently pending row IDs and URLs instead of going silent.
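
That guard maps naturally onto tokio::time::timeout. A minimal sketch, assuming a 120 s budget (the actual deadline value is not documented here):

use std::future::Future;
use std::time::Duration;

async fn with_deadline<T>(
    url: &str,
    crawl: impl Future<Output = anyhow::Result<T>>,
) -> anyhow::Result<T> {
    match tokio::time::timeout(Duration::from_secs(120), crawl).await {
        Ok(result) => result, // finished under the deadline
        // The timeout arm becomes the explicit, retryable row failure.
        Err(_) => anyhow::bail!("hard deadline exceeded for {url}"),
    }
}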

--wavlake-only is intentionally slower than normal import mode. Normal import skips Wavlake-hosted snapshot rows entirely. The Wavlake-specific mode queries only those rows, forces single-flight fetches, applies a small delay between requests, and if Wavlake returns 429 Too Many Requests it backs off using the server's Retry-After value before attempting the next feed. Each Wavlake 429 also increases the ongoing inter-fetch delay by 1 second for the rest of that run. Wavlake scope does not imply any skip policy; use --skip-known-success and/or --skip-known-non-music if you want to trade freshness for speed on reruns.
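
The 429 handling reads roughly like the sketch below, assuming Retry-After arrives as a delay in seconds (the header may also be an HTTP date, which this sketch ignores):

use std::time::Duration;

// How long to sleep before the next Wavlake fetch after a 429. Each 429 also
// permanently widens this run's inter-fetch delay by 1 s, as described above.
fn backoff_after_429(retry_after: Option<&str>, inter_fetch_delay: &mut Duration) -> Duration {
    *inter_fetch_delay += Duration::from_secs(1);
    retry_after
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(*inter_fetch_delay) // no usable header: use the widened delay
}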

If --audit-output is set, import mode writes newly accepted feeds and no_change re-submissions of already-ingested feeds to cached-feed NDJSON. Rejected feeds, parse errors, missing or unaccepted podcast:medium values, and non-200 fetches are not written. By default, --audit-output appends and the writer de-dupes by source_db.feed_guid plus fetch.content_sha256 so reruns do not append the same fetched body again. Use --audit-replace to truncate and rewrite the file instead. Import mode also creates an adjacent lock file while writing the NDJSON, so only one importer can target a given output path at a time.
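
The append-mode de-dupe can be pictured as a key set loaded from the existing file before writing; the JSON field paths mirror the README, the rest is illustrative:

use std::collections::HashSet;

// Collect the (source_db.feed_guid, fetch.content_sha256) pairs already
// present in an audit file, so an appending rerun can skip bodies it wrote before.
fn seen_keys(existing_ndjson: &str) -> HashSet<(String, String)> {
    existing_ndjson
        .lines()
        .filter_map(|line| serde_json::from_str::<serde_json::Value>(line).ok())
        .filter_map(|row| {
            let guid = row["source_db"]["feed_guid"].as_str()?.to_string();
            let hash = row["fetch"]["content_sha256"].as_str()?.to_string();
            Some((guid, hash))
        })
        .collect()
}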

Operational note:

  • initial auto-download only needs space for the extracted .db
  • --refresh-db temporarily needs room for both the existing .db and the new replacement .db while the importer swaps them safely

ndjson

Replay cached feed NDJSON rows into stophammer without re-fetching feeds. Parses each row's raw_xml, posts it to /ingest/feed, and tracks progress with a resume cursor.

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --concurrency 5

Force re-ingestion of all rows:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --force \
  --reset

Dry-run to preview what would be ingested:

stophammer-crawler ndjson \
  --input ./stored-feeds.ndjson \
  --dry-run

ndjson options

Flag               Env             Default                Description
--input <path>                     ./stored-feeds.ndjson  Path to cached feed NDJSON file
--state <path>                     ./ndjson_state.db      Resume-cursor state database
--batch <n>                        100                    Rows per processing batch
--limit <n>                        off                    Maximum NDJSON rows to process this run
--concurrency <n>  CONCURRENCY     5                      Parallel parse+ingest workers
--force            FORCE_REINGEST  off                    Force re-ingestion even if content has not changed
--dry-run                          off                    Log candidates without posting to stophammer
--reset                            off                    Clear resume cursor and start from the first row

Progress is stored in --state. If the process is interrupted, the next run resumes from the last completed batch. --reset clears the cursor.

Rows without raw_xml (e.g. fetch errors captured in the NDJSON) are automatically skipped. Transient ingest failures (429, 5xx) are retried up to 3 times with exponential backoff before counting as an error.
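
Only the "up to 3 retries on 429/5xx" behaviour comes from the text above; the base delay and doubling factor in this sketch are assumptions:

use std::time::Duration;

async fn post_with_retries(
    client: &reqwest::Client,
    url: &str,
    body: String,
) -> reqwest::Result<reqwest::Response> {
    let mut delay = Duration::from_millis(500); // illustrative base delay
    let mut retries = 0;
    loop {
        let resp = client.post(url).body(body.clone()).send().await?;
        let status = resp.status();
        // Only 429 and 5xx count as transient; anything else returns as-is.
        if (status.as_u16() == 429 || status.is_server_error()) && retries < 3 {
            retries += 1;
            tokio::time::sleep(delay).await;
            delay *= 2; // exponential backoff between attempts
            continue;
        }
        return Ok(resp);
    }
}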

gossip

Listen to a local gossip-listener SSE stream for live podping notifications.

Prerequisites

  • A running gossip-listener instance from podping.alpha (provides the SSE stream and archive)
  • The crawler process must have read access to the archive database
  • CRAWL_TOKEN and INGEST_URL environment variables set

Typical host setup:

  1. Install and start podping.alpha / gossip-listener
  2. Enable archive writing in that service
  3. Confirm the resulting archive.db path on disk
  4. Point --archive-db at that path

If the archive does not live at the default path, pass the actual path:

stophammer-crawler gossip --archive-db /some/other/path/archive.db

For packaged systemd installs, the usual pattern is to add the stophammer-crawler service user to the podping group so it can read the archive written by gossip-listener.

For Docker installs, the root repo's reference docker-compose.yml runs podping.alpha's gossip-listener as a sibling podping-listener service and shares its archive.db with the crawler over a named Docker volume. The defaults inside the crawler container are:

  • GOSSIP_ARCHIVE_DB=/podping/archive.db
  • GOSSIP_SSE_URL=http://podping-listener:8089/events

Archive-backed mode (recommended)

Uses gossip-listener's archive.db as a durable podping backlog. Survives restarts, SSE disconnects, and process crashes without losing notifications.

First-time bootstrap (catches up on the last 24 hours of podpings):

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
  --archive-db /path/to/archive.db \
  --since-hours 24 \
  --skip-known-non-music \
  --skip-ttl-days 30

Subsequent runs resume from the stored cursor automatically:

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip \
  --archive-db /path/to/archive.db \
  --skip-known-non-music \
  --skip-ttl-days 30

On startup, the crawler validates the archive schema, checks cursor continuity (exits fatally if notifications were lost), replays missed notifications in batches of ~500 with backpressure, then connects to SSE for live events. A background reconciliation task queries the archive periodically (10 s, then 60 s steady-state, backing off to 5 min when idle) to catch anything SSE missed.
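
The reconciliation cadence reduces to a small interval rule. This sketch only encodes the intervals quoted above (60 s steady-state, 5 min idle cap); the real scheduling, including the 10 s first poll, is more involved than this:

use std::time::Duration;

// Next archive-poll interval: hold at 60 s while notifications keep arriving,
// double toward a 5 min cap while idle.
fn next_poll(prev: Duration, saw_new_rows: bool) -> Duration {
    if saw_new_rows {
        Duration::from_secs(60)
    } else {
        (prev * 2).min(Duration::from_secs(300))
    }
}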

If the stored cursor is older than the oldest retained archive row, the crawler exits with a fatal error. Delete gossip_state.db to re-bootstrap.

Live-only mode (best-effort)

Without --archive-db, gossip mode streams SSE events only. Not restart-safe — any events that arrive while the process is down are lost.

CRAWL_TOKEN=secret \
INGEST_URL=http://127.0.0.1:8008/ingest/feed \
stophammer-crawler gossip

Feed memory and skip logic

gossip_state.db records the latest crawl result for each feed URL (HTTP status, outcome, medium, attempt count). With --skip-known-non-music, feeds proven irrelevant by a prior crawl are skipped on future notifications:

  • HTTP 200 with a non-music, non-publisher raw_medium, or
  • a [medium_music] rejection (including absent podcast:medium)

Fetch errors (404, 429, timeouts), parse errors, and prior successful feeds are never skipped. --skip-ttl-days <n> expires skip decisions after N days so feeds are periodically re-evaluated.
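
The TTL check is a one-line predicate over the stored decision time; how that time is stored is an assumption:

use std::time::{Duration, SystemTime};

// Is a stored skip decision still in force? With --skip-ttl-days n, decisions
// older than n days expire and the feed is re-evaluated on its next podping.
fn skip_still_valid(decided_at: SystemTime, ttl_days: Option<u64>) -> bool {
    match ttl_days {
        None => true, // no TTL configured: skip decisions never expire
        Some(days) => decided_at
            .elapsed()
            .map(|age| age < Duration::from_secs(days * 86_400))
            .unwrap_or(false), // clock skew: treat as expired
    }
}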

Gossip options

Flag                    Default                       Description
--state <path>          ./gossip_state.db             Cursor and feed memory database
--skip-db <path>        ./feed_skip.db                Shared cross-mode skip database
--sse-url <url>         http://localhost:8089/events  SSE endpoint URL
--archive-db <path>     off                           gossip-listener archive database path
--since-hours <n>       off                           Bootstrap from N hours ago (requires --archive-db)
--concurrency <n>       3                             Parallel fetch+ingest workers
--skip-known-non-music  off                           Skip feeds proven non-music by prior crawl
--skip-ttl-days <n>     off                           Re-evaluate skip decisions after N days
-q, --quiet             off                           Hide non-music medium rejections

Environment variables

  • CRAWL_TOKEN (required) -- Shared secret for stophammer ingest auth.
  • INGEST_URL -- Stophammer ingest endpoint. Default: http://localhost:8008/ingest/feed
  • FORCE_REINGEST -- Force re-ingestion of feeds even if content has not changed. Set to 1 to enable. Can also be passed as the top-level --force flag.
  • CONCURRENCY -- Worker pool size. Default: 5 (feed/import) / 3 (gossip)
  • FEED_URLS -- Comma- or newline-separated URLs (feed mode only).
  • PODCASTINDEX_DB_URL -- Override the PodcastIndex snapshot archive URL for import mode.

Architecture

stophammer-crawler
  src/
    main.rs           CLI dispatcher (clap subcommands)
    crawl.rs          Shared pipeline: fetch → SHA-256 → parse → POST
    pool.rs           Bounded concurrency pool (tokio semaphore)
    dedup.rs          In-memory cooldown map (gossip mode)
    feed_skip.rs      Shared skip-memory database for proven irrelevant feeds
    modes/
      batch.rs        Load URLs from file/env/stdin, run pool
      import.rs       PodcastIndex DB batches, resume cursor, fallback GUID
      ndjson.rs       Replay cached feed NDJSON without re-fetching
      gossip.rs       SSE listener and optional archive replay for gossip-listener

The core pipeline in crawl.rs calls stophammer-parser as a Rust library — no subprocess spawning. Every mode feeds URLs into the same crawl_feed() function:

  1. Fetch the RSS feed via reqwest
  2. Hash the raw response body with SHA-256
  3. Parse the XML with stophammer-parser::profile::stophammer() (or stophammer_with_fallback() for import mode with PodcastIndex GUIDs)
  4. POST the result as JSON to the stophammer /ingest/feed endpoint
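
Condensed into code, those four steps look roughly like the sketch below. parse_feed() stands in for the stophammer-parser call, whose real signature and output type are not shown in this README; the bearer-token auth and payload shape are likewise assumptions:

use sha2::{Digest, Sha256};

// Stand-in for stophammer_parser::profile::stophammer().
fn parse_feed(xml: &str) -> anyhow::Result<serde_json::Value> {
    Ok(serde_json::json!({ "item_count": xml.matches("<item>").count() }))
}

async fn crawl_feed(
    client: &reqwest::Client,
    feed_url: &str,
    ingest_url: &str,
    token: &str,
) -> anyhow::Result<()> {
    // 1. Fetch the RSS feed
    let body = client.get(feed_url).send().await?.error_for_status()?.bytes().await?;
    // 2. Hash the raw response body
    let content_sha256 = hex::encode(Sha256::digest(&body));
    // 3. Parse the XML
    let parsed = parse_feed(std::str::from_utf8(&body)?)?;
    // 4. POST the result as JSON to /ingest/feed (auth scheme is illustrative)
    client
        .post(ingest_url)
        .bearer_auth(token)
        .json(&serde_json::json!({ "content_sha256": content_sha256, "feed": parsed }))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}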

Concurrency is bounded by a tokio semaphore: feed and import modes drain a fixed task list, while gossip mode runs an unbounded stream with a permit-based cap.
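
For the fixed-list case, the permit pattern looks like this minimal sketch (not pool.rs itself):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn run_bounded(urls: Vec<String>, limit: usize) {
    let sem = Arc::new(Semaphore::new(limit));
    let mut tasks = Vec::with_capacity(urls.len());
    for url in urls {
        // Acquiring before spawning caps in-flight crawls at `limit`.
        let permit = sem.clone().acquire_owned().await.expect("semaphore closed");
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // released when this task finishes
            println!("would crawl {url} here");
        }));
    }
    for task in tasks {
        let _ = task.await; // drain the fixed task list
    }
}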

Docker

Published releases also push a crawler runtime image:

  • ghcr.io/<owner>/stophammer-crawler

The main stophammer repo's docker-compose.yml includes multi-mode crawler support with a shared crawler-data volume:

# One-shot feed crawl
docker compose run --rm stophammer-crawler feed https://example.com/feed.xml

# Force re-ingestion
docker compose run --rm stophammer-crawler --force feed https://example.com/feed.xml

# Replay NDJSON corpus
docker compose run --rm stophammer-crawler ndjson --input /data/stored-feeds.ndjson

# Long-running gossip listener
docker compose up -d gossip

# Batch import from PodcastIndex
docker compose run --rm import

The stophammer-crawler service requires CRAWL_TOKEN and INGEST_URL in packaging/env/crawler-feed.compose.env. The gossip and import services use separate env files configured in the compose stack.

For day-to-day operation, run the binary directly or provide your own scheduler / compose deployment around stophammer-crawler gossip, stophammer-crawler import, or stophammer-crawler feed.

License

AGPL-3.0-only — see LICENSE.
