feat(p2p): relay fleet (3x DigitalOcean) + dnsaddr seed + adoption telemetry #23
Merged
…lemetry

Stands up an always-on libp2p relay + bootstrap fleet on DigitalOcean (NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single Railway-hosted bootnode. The April 23 outage that dropped the network from 19 peers to 1 was a Railway-side flake with no code change of ours; since then the network has been stuck at 1 peer because every user dialing in landed on the same proxy and had no other path.

What landed

* deploy/relay-fleet/main.tf (Terraform module): 3x s-1vcpu-1gb droplets ($6/mo each, $18/mo total, covered by DO's $200 new-account credit), shared firewall (TCP 22/9100/443, UDP 9101, ICMP), per-fleet SSH key. Driven by DIGITALOCEAN_TOKEN env var.
* deploy/relay-fleet/cloud-init.yaml (provisioning): rust toolchain, cargo build (with 2 GB swap so the 1 GB box doesn't OOM during the release build), service user, /var/lib/bitterbot persistent state, auto-generated genesis trust list with the new node's own pubkey, systemd unit running the orchestrator as --node-tier management --relay-mode server --bootnode-mode, ufw firewall, peer ID extraction from journald.
* deploy/relay-fleet/scripts/update-dnsaddr.sh (Cloudflare TXT publisher): SSHes to each relay, reads /var/lib/bitterbot/peer-id.txt, writes one TXT record per multiaddr under _dnsaddr.<BASE_DOMAIN> (e.g. _dnsaddr.p2p.bitterbot.ai). Idempotent: deletes stale dnsaddr= records before posting fresh ones. Cloudflare-token-driven, dry-run mode for safety.
* deploy/relay-fleet/scripts/fleet-stats.sh (adoption telemetry): SSHes to each relay, pulls /api/stats and /api/bootstrap/census from the orchestrator HTTP API on localhost:9847, deduplicates peer pubkeys across the fleet, prints lifetime unique peers, sum of connected, peak concurrent, hole-punch success rate, relay reservations served, and per-relay tear. Three output modes: default human summary, --json for scraping, --csv for time-series piping into a chart.
* deploy/relay-fleet/README.md (operator guide): prerequisites, end-to-end provision, dnsaddr publication, day-2 ops (replace region, rotate identity, tear down), what the fleet does NOT do (WSS:443, QUIC:9101, auto-trust-list-update, all explicitly noted), troubleshooting, source-of-truth pointer to memory entries.
* deploy/relay-fleet/.gitignore: excludes .terraform/ and terraform.tfstate (machine-local; switch to a remote backend if multiple operators ever need shared state).

What this enables

* Network reachability: every new user has 4 dial paths (3 DO relays + Railway fallback) instead of 1.
* Geographic coverage: NYC, FRA, SGP. Latency to any user globally is bounded so DCUtR succeeds more often.
* Real Circuit Relay v2 servers: NAT'd users can reserve relay slots, unblocking inbound reachability without router config changes. The Railway proxy could not do this.
* Default-on adoption telemetry: lifetime unique peers, deduplicated across the fleet, becomes a real metric. Until now we had 2 lifetime peers in our reputation table because only the local laptop counted.
* DNS-rotated bootstrap: changing the relay set never requires a client release. Edit Cloudflare TXT records.

Why DigitalOcean (vs Hetzner / OVH / Equinix Metal)

* CLI-driveability is the deciding factor. Hetzner CX22 was the cheapest fit and OVH had the best EU sovereignty story, but DO's doctl + Terraform provider are the cleanest for autonomous scripted provisioning. 9 regions vs Hetzner's 5 means better global coverage for a relay fleet whose job is to be near every user.
* OVH's 3-legged consumer-key auth disqualified it for our use case.
* Equinix Metal is what IPFS Foundation runs (~$36/mo per node); overkill until we have 10k+ users.
* Decision and rationale captured in memory at project_relay_fleet.md.
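
The fleet-wide deduplication behind the "lifetime unique peers" metric can be illustrated with a minimal Python sketch. It is not the logic of fleet-stats.sh itself: the RelaySnapshot fields are hypothetical stand-ins for whatever /api/stats and /api/bootstrap/census actually return, and taking the fleet peak as the maximum of the per-relay peaks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RelaySnapshot:
    """One relay's view of the network.

    Field names are hypothetical; they are not the orchestrator's schema.
    """
    region: str
    connected: int          # peers connected to this relay right now
    peak_concurrent: int    # highest concurrent count this relay has seen
    seen_pubkeys: frozenset  # every peer pubkey this relay has ever seen


def summarize(snapshots):
    """Merge per-relay snapshots into fleet-wide adoption numbers."""
    unique = set()
    for snap in snapshots:
        # A peer that dialed two relays is in both sets; the union counts it once.
        unique |= snap.seen_pubkeys
    return {
        "lifetime_unique_peers": len(unique),
        "sum_connected": sum(s.connected for s in snapshots),
        # Assumption: fleet peak is the largest per-relay peak.
        "peak_concurrent": max((s.peak_concurrent for s in snapshots), default=0),
    }


if __name__ == "__main__":
    fleet = [
        RelaySnapshot("nyc1", 4, 6, frozenset({"peerA", "peerB", "peerC"})),
        RelaySnapshot("fra1", 2, 3, frozenset({"peerB", "peerD"})),
        RelaySnapshot("sgp1", 1, 1, frozenset({"peerD"})),
    ]
    print(summarize(fleet))
```

Summing per-relay unique counts would count a peer once for every relay it has touched, which is why the union is taken before counting.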

What this does NOT do (explicitly out of scope)

* Track A: NAT-aware downgrade of management nodes that can't actually serve as relay (the user's WSL2 management node is the canonical example). Separate Rust change in orchestrator/src/swarm/mod.rs.
* Track C: client default config update consuming the new dnsaddr seed, and dashboard surface showing relay role + tier breakdown. Depends on Track A landing first.
* WSS:443 and QUIC:9101 listeners on the relays. Firewall ports reserved; orchestrator does not yet expose these transports.
* Auto-add new relay pubkeys to the *client* genesis trust list. Operator manually copies pubkeys into ~/.bitterbot/genesis-trust.txt after provisioning. Future: serve the trust list from a versioned HTTPS URL the orchestrator fetches at startup.

Verified

* terraform validate clean.
* bash -n on both scripts clean.
* python3 -c "import yaml; yaml.safe_load(open('cloud-init.yaml'))" clean.
* terraform apply provisioned all 3 droplets in 33 seconds; cloud-init finished and peer-id.txt populated on all three boxes.

Provisioned today (2026-04-28):

nyc1  142.93.113.64  12D3KooWRWqC9ha4zvFpLTWdKWr3B8Ea...
fra1  46.101.181.98  12D3KooWMnnCHGVtZxyAFaJoEzk2hT1e...
sgp1  139.59.233.83  12D3KooWNZdviN1579x6LrLQt78d6VRZ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
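
A rough Python sketch of how region IP + peer ID pairs like the ones above could become dnsaddr TXT records, with a reconcile step that keeps publication idempotent. It rests on several assumptions: the relays listen for libp2p on TCP 9100 (the port the firewall opens), the peer IDs and addresses are placeholders, and the function names are hypothetical. It diffs existing against desired values; update-dnsaddr.sh itself may simply delete and repost.

```python
def build_dnsaddr_records(base_domain, relays, tcp_port=9100):
    """Return the TXT record name and one 'dnsaddr=<multiaddr>' value per relay.

    relays is an iterable of (ipv4, peer_id) pairs. tcp_port=9100 is an
    assumption based on the firewall rule, not a confirmed listener port.
    """
    name = f"_dnsaddr.{base_domain}"
    values = [
        f"dnsaddr=/ip4/{ip}/tcp/{tcp_port}/p2p/{peer_id}"
        for ip, peer_id in relays
    ]
    return name, values


def plan_update(existing, desired):
    """Compute (to_delete, to_create) for the TXT values under one name.

    Only values starting with 'dnsaddr=' are managed; unrelated TXT
    values are left alone. Running this against an already-correct
    record set yields two empty lists, which is what makes a re-run safe.
    """
    wanted = set(desired)
    current = {v for v in existing if v.startswith("dnsaddr=")}
    return sorted(current - wanted), sorted(wanted - current)


if __name__ == "__main__":
    # Placeholder documentation addresses and peer IDs, not the live fleet.
    relays = [("203.0.113.10", "PEER_ID_A"), ("198.51.100.20", "PEER_ID_B")]
    name, desired = build_dnsaddr_records("p2p.example.org", relays)
    existing = ["dnsaddr=/ip4/192.0.2.1/tcp/9100/p2p/OLD_PEER", "v=spf1 -all"]
    to_delete, to_create = plan_update(existing, desired)
    print(name)
    print("delete:", to_delete)
    print("create:", to_create)
```

A dry-run mode falls out naturally from this shape: print the plan and skip the API calls.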
CI lint

Format check fails on the relay-fleet PR because oxfmt enforces single-space-before-inline-comment in YAML. One-line whitespace fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stands up a 3-region libp2p relay + bootstrap fleet on DigitalOcean (NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single Railway-hosted bootnode. The April 23 collapse from 19 peers to 1 was a Railway-side outage with no code change of ours; since then every user dialing in landed on the same proxy and had no other path.
* deploy/relay-fleet/main.tf (Terraform module): 3 droplets ($18/mo total, covered by DO's $200 new-account credit), shared firewall, per-fleet SSH key.
* cloud-init.yaml (first-boot provisioning): rust toolchain, cargo build with 2 GB swap to avoid OOM on the 1 GB box, service user, persistent state, auto-generated genesis trust list with the new node's pubkey, systemd unit, ufw, peer ID extraction.
* scripts/update-dnsaddr.sh (Cloudflare TXT publisher): SSHes to each relay, builds multiaddrs from peer IDs, writes one TXT per address under _dnsaddr.<base_domain>. Idempotent.
* scripts/fleet-stats.sh (adoption telemetry): SSHes to each relay, hits /api/stats and /api/bootstrap/census, deduplicates peer pubkeys across the fleet, prints lifetime unique peers + per-relay tear. JSON / CSV output modes.
* README.md (operator guide): provision, publish DNS, day-2 ops, troubleshooting, what's NOT done.

Provisioned and published live today (2026-04-28):
DNS verification:
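
One way such a check could be run is against Cloudflare's DNS-over-HTTPS JSON endpoint. This is a hedged Python sketch, not the command used for this PR: the function names are hypothetical, and it assumes each TXT value fits in a single quoted string.

```python
import json
import urllib.parse
import urllib.request

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"
TXT_TYPE = 16  # DNS record type number for TXT


def extract_dnsaddrs(doh_response):
    """Pull multiaddrs out of a parsed DoH JSON response."""
    addrs = []
    for answer in doh_response.get("Answer", []):
        if answer.get("type") != TXT_TYPE:
            continue
        # TXT data is wrapped in double quotes in the JSON form.
        value = answer.get("data", "").strip('"')
        if value.startswith("dnsaddr="):
            addrs.append(value[len("dnsaddr="):])
    return addrs


def query_dnsaddr(base_domain):
    """Resolve _dnsaddr.<base_domain> over DoH. Makes a network call."""
    query = urllib.parse.urlencode(
        {"name": f"_dnsaddr.{base_domain}", "type": "TXT"}
    )
    request = urllib.request.Request(
        f"{DOH_ENDPOINT}?{query}",
        headers={"accept": "application/dns-json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_dnsaddrs(json.load(response))


if __name__ == "__main__":
    # Offline example with a hand-written response and a placeholder peer ID.
    sample = {
        "Status": 0,
        "Answer": [
            {
                "name": "_dnsaddr.p2p.example.org",
                "type": 16,
                "TTL": 300,
                "data": '"dnsaddr=/ip4/203.0.113.10/tcp/9100/p2p/PEER_ID_A"',
            }
        ],
    }
    print(extract_dnsaddrs(sample))
```

Calling query_dnsaddr("p2p.bitterbot.ai") would be expected to return one multiaddr per published relay, assuming the records are live.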
Why DigitalOcean
CLI driveability is the deciding factor. Hetzner CX22 was the cheapest fit and OVH had the best EU sovereignty story, but DO's doctl + Terraform provider are the cleanest for autonomous scripted provisioning. 9 regions vs Hetzner's 5 means better global coverage for a relay fleet whose job is to be near every user. OVH's 3-legged consumer-key auth disqualified it. Equinix Metal (~$36/mo per node) is overkill until 10k+ users. Decision rationale captured in memory.
Out of scope (explicitly noted in README)
orchestrator/src/swarm/mod.rs).
Test plan
* terraform validate clean
* bash -n on both scripts clean
* terraform apply provisioned 3 droplets in 33s; cloud-init finished and peer-id.txt populated on all three
* update-dnsaddr.sh published 3 TXT records; verified via Cloudflare DoH
* fleet-stats.sh reading once relays accumulate organic traffic
* /dnsaddr/p2p.bitterbot.ai in default config picks up multiple peers within 30s

🤖 Generated with Claude Code