
feat(p2p): relay fleet (3x DigitalOcean) + dnsaddr seed + adoption telemetry #23

Merged
VGIL77 merged 2 commits into main from feat/relay-fleet
Apr 29, 2026

Conversation

@VGIL77 (Contributor) commented Apr 29, 2026

Summary

Stands up a 3-region libp2p relay + bootstrap fleet on DigitalOcean (NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single Railway-hosted bootnode. The April 23 collapse from 19 peers to 1 was a Railway-side outage with no code change on our side; since then, every user dialing in has landed on the same proxy with no other path.

  • deploy/relay-fleet/main.tf — Terraform module: 3 droplets ($18/mo total, covered by DO's $200 new-account credit), shared firewall, per-fleet SSH key.
  • cloud-init.yaml — first-boot provisioning: rust toolchain, cargo build with 2GB swap to avoid OOM on 1GB box, service user, persistent state, auto-generated genesis trust list with the new node's pubkey, systemd unit, ufw, peer ID extraction.
  • scripts/update-dnsaddr.sh — Cloudflare TXT publisher: SSHes to each relay, builds multiaddrs from peer IDs, writes one TXT per address under _dnsaddr.<base_domain>. Idempotent.
  • scripts/fleet-stats.sh — adoption telemetry: SSHes to each relay, hits /api/stats and /api/bootstrap/census, deduplicates peer pubkeys across the fleet, prints lifetime unique peers + a per-relay breakdown. JSON / CSV output modes.
  • README.md — operator guide (provision, publish DNS, day-2 ops, troubleshooting, what's NOT done).
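The core of update-dnsaddr.sh can be sketched roughly as below. This is a minimal, hypothetical version assuming a Cloudflare API token in CF_API_TOKEN, a zone ID in CF_ZONE_ID, and a BASE_DOMAIN variable; the real script additionally collects peer IDs over SSH and offers a dry-run mode:

```shell
#!/usr/bin/env bash
# Sketch of the idempotent dnsaddr TXT publish. Variable names are
# hypothetical; the real script reads peer IDs from each relay over SSH.
set -euo pipefail

# Build a dnsaddr TXT value from a relay's IP, TCP port, and peer ID.
mk_dnsaddr() {
  local ip="$1" port="$2" peer_id="$3"
  printf 'dnsaddr=/ip4/%s/tcp/%s/p2p/%s' "$ip" "$port" "$peer_id"
}

# Delete stale TXT records under _dnsaddr.<BASE_DOMAIN>, then post one
# record per multiaddr, so repeated runs converge to the same state.
publish() {
  local name="_dnsaddr.${BASE_DOMAIN}"
  local api="https://api.cloudflare.com/client/v4"
  curl -sS -H "Authorization: Bearer ${CF_API_TOKEN}" \
    "${api}/zones/${CF_ZONE_ID}/dns_records?type=TXT&name=${name}" |
    jq -r '.result[].id' | while read -r id; do
      curl -sS -X DELETE -H "Authorization: Bearer ${CF_API_TOKEN}" \
        "${api}/zones/${CF_ZONE_ID}/dns_records/${id}" > /dev/null
    done
  for addr in "$@"; do
    curl -sS -X POST -H "Authorization: Bearer ${CF_API_TOKEN}" \
      -H 'Content-Type: application/json' \
      --data "{\"type\":\"TXT\",\"name\":\"${name}\",\"content\":\"${addr}\"}" \
      "${api}/zones/${CF_ZONE_ID}/dns_records" > /dev/null
  done
}
```

The delete-then-post sequence is what makes the script safe to re-run after replacing a relay: the TXT set always ends up matching the current fleet.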

Provisioned and published live today (2026-04-28):

nyc1  142.93.113.64   12D3KooWRWqC9ha4zvFpLTWdKWr3B8EaiQnWqr2Mp3vyRSNQNPJN
fra1  46.101.181.98   12D3KooWMnnCHGVtZxyAFaJoEzk2hT1eD3SEvjLDiUNwiJsXdRty
sgp1  139.59.233.83   12D3KooWNZdviN1579x6LrLQt78d6VRZczLHbBWhyyXzoun2k2L3

DNS verification:

$ curl -sS 'https://cloudflare-dns.com/dns-query?name=_dnsaddr.p2p.bitterbot.ai&type=TXT' \
       -H 'accept: application/dns-json' | jq -r '.Answer[].data'
"dnsaddr=/ip4/139.59.233.83/tcp/9100/p2p/12D3KooWNZdviN1579x6LrLQt78d6VRZczLHbBWhyyXzoun2k2L3"
"dnsaddr=/ip4/142.93.113.64/tcp/9100/p2p/12D3KooWRWqC9ha4zvFpLTWdKWr3B8EaiQnWqr2Mp3vyRSNQNPJN"
"dnsaddr=/ip4/46.101.181.98/tcp/9100/p2p/12D3KooWMnnCHGVtZxyAFaJoEzk2hT1eD3SEvjLDiUNwiJsXdRty"

Why DigitalOcean

CLI driveability is the deciding factor. Hetzner CX22 was the cheapest fit and OVH had the best EU sovereignty story, but DO's doctl + Terraform provider are the cleanest for autonomous scripted provisioning. 9 regions vs Hetzner's 5 means better global coverage for a relay fleet whose job is to be near every user. OVH's 3-legged consumer-key auth disqualified it. Equinix Metal (~$36/mo per node) is overkill until 10k+ users. Decision rationale captured in memory.

Out of scope (explicitly noted in README)

  • Track A: NAT-aware downgrade for management nodes that can't actually serve as relay (Rust change in orchestrator/src/swarm/mod.rs).
  • Track C: client default config update consuming the new dnsaddr seed + dashboard surface for relay role.
  • WSS:443 and QUIC:9101 listeners on the relays.
  • Auto-add new relay pubkeys to client genesis trust list.

Test plan

  • terraform validate clean
  • bash -n on both scripts clean
  • terraform apply provisioned 3 droplets in 33s; cloud-init finished and peer-id.txt populated on all three
  • update-dnsaddr.sh published 3 TXT records; verified via Cloudflare DoH
  • First fleet-stats.sh reading once relays accumulate organic traffic
  • Smoke test: client cold start with /dnsaddr/p2p.bitterbot.ai in default config picks up multiple peers within 30s
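The cold-start smoke test in the last bullet could be scripted along these lines. This is a sketch only: it assumes the orchestrator's /api/stats on localhost:9847 exposes a connected_peers count, which is a guess at the field name, not confirmed by this PR:

```shell
#!/usr/bin/env bash
# Hypothetical smoke test: poll the local node's stats endpoint until it
# reports at least 2 connected peers, or give up after ~30 seconds.
set -euo pipefail

# True when the peer count meets the multi-peer threshold.
enough_peers() { [ "${1:-0}" -ge 2 ]; }

wait_for_peers() {
  for _ in $(seq 1 30); do
    # The .connected_peers field name is an assumption about the API.
    n=$(curl -sf localhost:9847/api/stats | jq -r '.connected_peers // 0' || echo 0)
    if enough_peers "$n"; then
      echo "ok: $n peers"
      return 0
    fi
    sleep 1
  done
  echo "timeout: still fewer than 2 peers" >&2
  return 1
}
```

Passing means the client found more than the single Railway path, i.e. the dnsaddr seed is actually being consumed.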

🤖 Generated with Claude Code

VGIL77 and others added 2 commits April 28, 2026 21:44
…lemetry

Stands up an always-on libp2p relay + bootstrap fleet on DigitalOcean
(NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single
Railway-hosted bootnode. The April 23 outage that dropped the network
from 19 peers to 1 was a Railway-side flake with no code change of ours;
since then the network has been stuck at 1 peer because every user
dialing in landed on the same proxy and had no other path.

What landed
* deploy/relay-fleet/main.tf — Terraform module: 3x s-1vcpu-1gb droplets
  ($6/mo each, $18/mo total, covered by DO's $200 new-account credit),
  shared firewall (TCP 22/9100/443, UDP 9101, ICMP), per-fleet SSH key.
  Driven by DIGITALOCEAN_TOKEN env var.
* deploy/relay-fleet/cloud-init.yaml — provisioning: rust toolchain,
  cargo build (with 2 GB swap so 1 GB box doesn't OOM during release
  build), service user, /var/lib/bitterbot persistent state, auto-
  generated genesis trust list with the new node's own pubkey,
  systemd unit running orchestrator as
  --node-tier management --relay-mode server --bootnode-mode,
  ufw firewall, peer ID extraction from journald.
* deploy/relay-fleet/scripts/update-dnsaddr.sh — Cloudflare TXT
  publisher: SSHes to each relay, reads /var/lib/bitterbot/peer-id.txt,
  writes one TXT record per multiaddr under
  _dnsaddr.<BASE_DOMAIN> (e.g. _dnsaddr.p2p.bitterbot.ai).
  Idempotent: deletes stale dnsaddr= records before posting fresh ones.
  Cloudflare-token-driven, dry-run mode for safety.
* deploy/relay-fleet/scripts/fleet-stats.sh — adoption telemetry.
  SSHes to each relay, pulls /api/stats and /api/bootstrap/census from
  the orchestrator HTTP API on localhost:9847, deduplicates peer
  pubkeys across the fleet, prints lifetime unique peers, sum of
  connected, peak concurrent, hole-punch success rate, relay
  reservations served, and a per-relay breakdown. Three output modes:
  default human summary, --json for scraping, --csv for time-series
  piping into a chart.
* deploy/relay-fleet/README.md — operator guide: prerequisites,
  end-to-end provision, dnsaddr publication, day-2 ops (replace
  region, rotate identity, tear down), what the fleet does NOT do
  (WSS:443, QUIC:9101, auto-trust-list-update — explicitly noted),
  troubleshooting, source-of-truth pointer to memory entries.
* deploy/relay-fleet/.gitignore — excludes .terraform/ and
  terraform.tfstate (machine-local; switch to remote backend if
  multiple operators ever need shared state).
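The cross-fleet dedup step in fleet-stats.sh amounts to something like the following sketch, which assumes each relay's /api/bootstrap/census returns a JSON array of peer records carrying a pubkey field (an assumption about the payload shape, not confirmed here):

```shell
#!/usr/bin/env bash
# Sketch: merge per-relay census JSON documents and count unique peer
# pubkeys. The {"pubkey": ...} record shape is assumed.
set -euo pipefail

# Reads one census JSON document per argument, emits distinct pubkeys.
unique_peers() {
  printf '%s\n' "$@" | jq -r '.[].pubkey' | sort -u
}

# Lifetime-unique-peers figure for the whole fleet.
count_unique_peers() {
  unique_peers "$@" | wc -l | tr -d ' '
}
```

In the real script each argument would come from an ssh + curl against localhost:9847 on the relay; deduplicating on pubkey rather than per-connection means a peer that dialed two relays is counted once.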

What this enables
* Network reachability: every new user has 4 dial paths (3 DO relays +
  Railway fallback) instead of 1.
* Geographic coverage: NYC, FRA, SGP. Latency to any user globally is
  bounded so DCUtR succeeds more often.
* Real Circuit Relay v2 servers: NAT'd users can reserve relay slots,
  unblocking inbound reachability without router config changes. The
  Railway proxy could not do this.
* Default-on adoption telemetry: lifetime unique peers, deduplicated
  across the fleet, becomes a real metric. Until now we had 2 lifetime
  peers in our reputation table because only the local laptop counted.
* DNS-rotated bootstrap: changing the relay set never requires a client
  release. Edit Cloudflare TXT records.

Why DigitalOcean (vs Hetzner / OVH / Equinix Metal)
* CLI-driveability is the deciding factor — Hetzner CX22 was the
  cheapest fit and OVH had the best EU sovereignty story, but DO's
  doctl + Terraform provider are the cleanest for autonomous scripted
  provisioning. 9 regions vs Hetzner's 5 means better global coverage
  for a relay fleet whose job is to be near every user.
* OVH's 3-legged consumer-key auth disqualified it for our use case.
* Equinix Metal is what IPFS Foundation runs (~$36/mo per node);
  overkill until we have 10k+ users.
* Decision and rationale captured in memory at project_relay_fleet.md.

What this does NOT do (explicitly out of scope)
* Track A: NAT-aware downgrade of management nodes that can't actually
  serve as relay (the user's WSL2 management node is the canonical
  example). Separate Rust change in orchestrator/src/swarm/mod.rs.
* Track C: client default config update consuming the new dnsaddr
  seed, and dashboard surface showing relay role + tier breakdown.
  Depends on Track A landing first.
* WSS:443 and QUIC:9101 listeners on the relays. Firewall ports
  reserved; orchestrator does not yet expose these transports.
* Auto-add new relay pubkeys to the *client* genesis trust list.
  Operator manually copies pubkeys into ~/.bitterbot/genesis-trust.txt
  after provisioning. Future: serve the trust list from a versioned
  HTTPS URL the orchestrator fetches at startup.
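Until that auto-trust item lands, the manual copy step can at least be made idempotent with something like this sketch (the trust-list path is from the text above; the dedupe logic is mine):

```shell
#!/usr/bin/env bash
# Sketch: append relay pubkeys to the client genesis trust list without
# creating duplicates, so re-running after a re-provision is safe.
set -euo pipefail

add_trusted_pubkeys() {
  local trust_file="$1"; shift
  touch "$trust_file"
  # Merge existing entries with the new pubkeys, one per line, deduped.
  printf '%s\n' "$@" | cat "$trust_file" - | sort -u > "${trust_file}.tmp"
  mv "${trust_file}.tmp" "$trust_file"
}
```

Invoked as, e.g., add_trusted_pubkeys ~/.bitterbot/genesis-trust.txt followed by the three relay pubkeys; running it twice leaves the file unchanged.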

Verified
* terraform validate clean.
* bash -n on both scripts clean.
* python3 -c "import yaml; yaml.safe_load(open('cloud-init.yaml'))" clean.
* terraform apply provisioned all 3 droplets in 33 seconds; cloud-init
  finished and peer-id.txt populated on all three boxes.

Provisioned today (2026-04-28):
  nyc1 142.93.113.64 12D3KooWRWqC9ha4zvFpLTWdKWr3B8Ea...
  fra1 46.101.181.98 12D3KooWMnnCHGVtZxyAFaJoEzk2hT1e...
  sgp1 139.59.233.83 12D3KooWNZdviN1579x6LrLQt78d6VRZ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix CI lint: the Format check fails on the relay-fleet PR because oxfmt
enforces a single space before inline comments in YAML. One-line whitespace fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VGIL77 merged commit 3a95c98 into main Apr 29, 2026
2 checks passed
VGIL77 deleted the feat/relay-fleet branch April 29, 2026 02:11
