
feat(p2p): relay fleet (3x DigitalOcean) + dnsaddr seed + adoption telemetry #23

Merged
VGIL77 merged 2 commits into main from feat/relay-fleet
Apr 29, 2026

Conversation

@VGIL77 (Contributor) commented Apr 29, 2026

Summary

Stands up a 3-region libp2p relay + bootstrap fleet on DigitalOcean (NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single Railway-hosted bootnode. The April 23 collapse from 19 peers to 1 was a Railway-side outage with no code change on our side; since then, every user dialing in has landed on the same proxy with no other path.

  • deploy/relay-fleet/main.tf — Terraform module: 3 droplets ($18/mo total, covered by DO's $200 new-account credit), shared firewall, per-fleet SSH key.
  • cloud-init.yaml — first-boot provisioning: rust toolchain, cargo build with 2GB swap to avoid OOM on 1GB box, service user, persistent state, auto-generated genesis trust list with the new node's pubkey, systemd unit, ufw, peer ID extraction.
  • scripts/update-dnsaddr.sh — Cloudflare TXT publisher: SSHes to each relay, builds multiaddrs from peer IDs, writes one TXT per address under _dnsaddr.<base_domain>. Idempotent.
  • scripts/fleet-stats.sh — adoption telemetry: SSHes to each relay, hits /api/stats and /api/bootstrap/census, deduplicates peer pubkeys across the fleet, prints lifetime unique peers + a per-relay breakdown. JSON / CSV output modes.
  • README.md — operator guide (provision, publish DNS, day-2 ops, troubleshooting, what's NOT done).
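The core of update-dnsaddr.sh can be sketched roughly as below. This is a minimal, hypothetical version assuming a Cloudflare API token in CF_API_TOKEN, a zone ID in CF_ZONE_ID, and a BASE_DOMAIN variable; the real script additionally collects peer IDs over SSH and offers a dry-run mode:

```shell
#!/usr/bin/env bash
# Sketch of the idempotent dnsaddr TXT publish. Variable names are
# hypothetical; the real script reads peer IDs from each relay over SSH.
set -euo pipefail

# Build a dnsaddr TXT value from a relay's IP, TCP port, and peer ID.
mk_dnsaddr() {
  local ip="$1" port="$2" peer_id="$3"
  printf 'dnsaddr=/ip4/%s/tcp/%s/p2p/%s' "$ip" "$port" "$peer_id"
}

# Delete stale TXT records under _dnsaddr.<BASE_DOMAIN>, then post one
# record per multiaddr, so repeated runs converge to the same state.
publish() {
  local name="_dnsaddr.${BASE_DOMAIN}"
  local api="https://api.cloudflare.com/client/v4"
  curl -sS -H "Authorization: Bearer ${CF_API_TOKEN}" \
    "${api}/zones/${CF_ZONE_ID}/dns_records?type=TXT&name=${name}" |
    jq -r '.result[].id' | while read -r id; do
      curl -sS -X DELETE -H "Authorization: Bearer ${CF_API_TOKEN}" \
        "${api}/zones/${CF_ZONE_ID}/dns_records/${id}" > /dev/null
    done
  for addr in "$@"; do
    curl -sS -X POST -H "Authorization: Bearer ${CF_API_TOKEN}" \
      -H 'Content-Type: application/json' \
      --data "{\"type\":\"TXT\",\"name\":\"${name}\",\"content\":\"${addr}\"}" \
      "${api}/zones/${CF_ZONE_ID}/dns_records" > /dev/null
  done
}
```

The delete-then-post sequence is what makes the script safe to re-run after replacing a relay: the TXT set always ends up matching the current fleet.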

Provisioned and published live today (2026-04-28):

nyc1  142.93.113.64   12D3KooWRWqC9ha4zvFpLTWdKWr3B8EaiQnWqr2Mp3vyRSNQNPJN
fra1  46.101.181.98   12D3KooWMnnCHGVtZxyAFaJoEzk2hT1eD3SEvjLDiUNwiJsXdRty
sgp1  139.59.233.83   12D3KooWNZdviN1579x6LrLQt78d6VRZczLHbBWhyyXzoun2k2L3

DNS verification:

$ curl -sS 'https://cloudflare-dns.com/dns-query?name=_dnsaddr.p2p.bitterbot.ai&type=TXT' \
       -H 'accept: application/dns-json' | jq -r '.Answer[].data'
"dnsaddr=/ip4/139.59.233.83/tcp/9100/p2p/12D3KooWNZdviN1579x6LrLQt78d6VRZczLHbBWhyyXzoun2k2L3"
"dnsaddr=/ip4/142.93.113.64/tcp/9100/p2p/12D3KooWRWqC9ha4zvFpLTWdKWr3B8EaiQnWqr2Mp3vyRSNQNPJN"
"dnsaddr=/ip4/46.101.181.98/tcp/9100/p2p/12D3KooWMnnCHGVtZxyAFaJoEzk2hT1eD3SEvjLDiUNwiJsXdRty"

Why DigitalOcean

CLI driveability is the deciding factor. Hetzner CX22 was the cheapest fit and OVH had the best EU sovereignty story, but DO's doctl + Terraform provider are the cleanest for autonomous scripted provisioning. 9 regions vs Hetzner's 5 means better global coverage for a relay fleet whose job is to be near every user. OVH's 3-legged consumer-key auth disqualified it. Equinix Metal (~$36/mo per node) is overkill until 10k+ users. Decision rationale captured in memory.

Out of scope (explicitly noted in README)

  • Track A: NAT-aware downgrade for management nodes that can't actually serve as relay (Rust change in orchestrator/src/swarm/mod.rs).
  • Track C: client default config update consuming the new dnsaddr seed + dashboard surface for relay role.
  • WSS:443 and QUIC:9101 listeners on the relays.
  • Auto-add new relay pubkeys to client genesis trust list.

Test plan

  • terraform validate clean
  • bash -n on both scripts clean
  • terraform apply provisioned 3 droplets in 33s; cloud-init finished and peer-id.txt populated on all three
  • update-dnsaddr.sh published 3 TXT records; verified via Cloudflare DoH
  • First fleet-stats.sh reading once relays accumulate organic traffic
  • Smoke test: client cold start with /dnsaddr/p2p.bitterbot.ai in default config picks up multiple peers within 30s
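The cold-start smoke test in the last bullet could be scripted along these lines. This is a sketch only: it assumes the orchestrator's /api/stats on localhost:9847 exposes a connected_peers count, which is a guess at the field name, not confirmed by this PR:

```shell
#!/usr/bin/env bash
# Hypothetical smoke test: poll the local node's stats endpoint until it
# reports at least 2 connected peers, or give up after ~30 seconds.
set -euo pipefail

# True when the peer count meets the multi-peer threshold.
enough_peers() { [ "${1:-0}" -ge 2 ]; }

wait_for_peers() {
  for _ in $(seq 1 30); do
    # The .connected_peers field name is an assumption about the API.
    n=$(curl -sf localhost:9847/api/stats | jq -r '.connected_peers // 0' || echo 0)
    if enough_peers "$n"; then
      echo "ok: $n peers"
      return 0
    fi
    sleep 1
  done
  echo "timeout: still fewer than 2 peers" >&2
  return 1
}
```

Passing means the client found more than the single Railway path, i.e. the dnsaddr seed is actually being consumed.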

🤖 Generated with Claude Code

VGIL77 and others added 2 commits April 28, 2026 21:44
…lemetry

Stands up an always-on libp2p relay + bootstrap fleet on DigitalOcean
(NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single
Railway-hosted bootnode. The April 23 outage that dropped the network
from 19 peers to 1 was a Railway-side flake with no code change of ours;
since then the network has been stuck at 1 peer because every user
dialing in landed on the same proxy and had no other path.

What landed
* deploy/relay-fleet/main.tf — Terraform module: 3x s-1vcpu-1gb droplets
  ($6/mo each, $18/mo total, covered by DO's $200 new-account credit),
  shared firewall (TCP 22/9100/443, UDP 9101, ICMP), per-fleet SSH key.
  Driven by DIGITALOCEAN_TOKEN env var.
* deploy/relay-fleet/cloud-init.yaml — provisioning: rust toolchain,
  cargo build (with 2 GB swap so 1 GB box doesn't OOM during release
  build), service user, /var/lib/bitterbot persistent state, auto-
  generated genesis trust list with the new node's own pubkey,
  systemd unit running orchestrator as
  --node-tier management --relay-mode server --bootnode-mode,
  ufw firewall, peer ID extraction from journald.
* deploy/relay-fleet/scripts/update-dnsaddr.sh — Cloudflare TXT
  publisher: SSHes to each relay, reads /var/lib/bitterbot/peer-id.txt,
  writes one TXT record per multiaddr under
  _dnsaddr.<BASE_DOMAIN> (e.g. _dnsaddr.p2p.bitterbot.ai).
  Idempotent: deletes stale dnsaddr= records before posting fresh ones.
  Cloudflare-token-driven, dry-run mode for safety.
* deploy/relay-fleet/scripts/fleet-stats.sh — adoption telemetry.
  SSHes to each relay, pulls /api/stats and /api/bootstrap/census from
  the orchestrator HTTP API on localhost:9847, deduplicates peer
  pubkeys across the fleet, prints lifetime unique peers, sum of
  connected, peak concurrent, hole-punch success rate, relay
  reservations served, and a per-relay breakdown. Three output modes:
  default human summary, --json for scraping, --csv for time-series
  piping into a chart.
* deploy/relay-fleet/README.md — operator guide: prerequisites,
  end-to-end provision, dnsaddr publication, day-2 ops (replace
  region, rotate identity, tear down), what the fleet does NOT do
  (WSS:443, QUIC:9101, auto-trust-list-update — explicitly noted),
  troubleshooting, source-of-truth pointer to memory entries.
* deploy/relay-fleet/.gitignore — excludes .terraform/ and
  terraform.tfstate (machine-local; switch to remote backend if
  multiple operators ever need shared state).
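The cross-fleet dedup step in fleet-stats.sh amounts to something like the following sketch, which assumes each relay's /api/bootstrap/census returns a JSON array of peer records carrying a pubkey field (an assumption about the payload shape, not confirmed here):

```shell
#!/usr/bin/env bash
# Sketch: merge per-relay census JSON documents and count unique peer
# pubkeys. The {"pubkey": ...} record shape is assumed.
set -euo pipefail

# Reads one census JSON document per argument, emits distinct pubkeys.
unique_peers() {
  printf '%s\n' "$@" | jq -r '.[].pubkey' | sort -u
}

# Lifetime-unique-peers figure for the whole fleet.
count_unique_peers() {
  unique_peers "$@" | wc -l | tr -d ' '
}
```

In the real script each argument would come from an ssh + curl against localhost:9847 on the relay; deduplicating on pubkey rather than per-connection means a peer that dialed two relays is counted once.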

What this enables
* Network reachability: every new user has 4 dial paths (3 DO relays +
  Railway fallback) instead of 1.
* Geographic coverage: NYC, FRA, SGP. Latency to any user globally is
  bounded so DCUtR succeeds more often.
* Real Circuit Relay v2 servers: NAT'd users can reserve relay slots,
  unblocking inbound reachability without router config changes. The
  Railway proxy could not do this.
* Default-on adoption telemetry: lifetime unique peers, deduplicated
  across the fleet, becomes a real metric. Until now we had 2 lifetime
  peers in our reputation table because only the local laptop counted.
* DNS-rotated bootstrap: changing the relay set never requires a client
  release. Edit Cloudflare TXT records.

Why DigitalOcean (vs Hetzner / OVH / Equinix Metal)
* CLI-driveability is the deciding factor — Hetzner CX22 was the
  cheapest fit and OVH had the best EU sovereignty story, but DO's
  doctl + Terraform provider are the cleanest for autonomous scripted
  provisioning. 9 regions vs Hetzner's 5 means better global coverage
  for a relay fleet whose job is to be near every user.
* OVH's 3-legged consumer-key auth disqualified it for our use case.
* Equinix Metal is what IPFS Foundation runs (~$36/mo per node);
  overkill until we have 10k+ users.
* Decision and rationale captured in memory at project_relay_fleet.md.

What this does NOT do (explicitly out of scope)
* Track A: NAT-aware downgrade of management nodes that can't actually
  serve as relay (the user's WSL2 management node is the canonical
  example). Separate Rust change in orchestrator/src/swarm/mod.rs.
* Track C: client default config update consuming the new dnsaddr
  seed, and dashboard surface showing relay role + tier breakdown.
  Depends on Track A landing first.
* WSS:443 and QUIC:9101 listeners on the relays. Firewall ports
  reserved; orchestrator does not yet expose these transports.
* Auto-add new relay pubkeys to the *client* genesis trust list.
  Operator manually copies pubkeys into ~/.bitterbot/genesis-trust.txt
  after provisioning. Future: serve the trust list from a versioned
  HTTPS URL the orchestrator fetches at startup.
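Until that auto-trust item lands, the manual copy step can at least be made idempotent with something like this sketch (the trust-list path is from the text above; the dedupe logic is mine):

```shell
#!/usr/bin/env bash
# Sketch: append relay pubkeys to the client genesis trust list without
# creating duplicates, so re-running after a re-provision is safe.
set -euo pipefail

add_trusted_pubkeys() {
  local trust_file="$1"; shift
  touch "$trust_file"
  # Merge existing entries with the new pubkeys, one per line, deduped.
  printf '%s\n' "$@" | cat "$trust_file" - | sort -u > "${trust_file}.tmp"
  mv "${trust_file}.tmp" "$trust_file"
}
```

Invoked as, e.g., add_trusted_pubkeys ~/.bitterbot/genesis-trust.txt followed by the three relay pubkeys; running it twice leaves the file unchanged.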

Verified
* terraform validate clean.
* bash -n on both scripts clean.
* python3 -c "import yaml; yaml.safe_load(open('cloud-init.yaml'))" clean.
* terraform apply provisioned all 3 droplets in 33 seconds; cloud-init
  finished and peer-id.txt populated on all three boxes.

Provisioned today (2026-04-28):
  nyc1 142.93.113.64 12D3KooWRWqC9ha4zvFpLTWdKWr3B8Ea...
  fra1 46.101.181.98 12D3KooWMnnCHGVtZxyAFaJoEzk2hT1e...
  sgp1 139.59.233.83 12D3KooWNZdviN1579x6LrLQt78d6VRZ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix CI lint: the Format check fails on the relay-fleet PR because oxfmt
enforces a single space before inline comments in YAML. One-line whitespace fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VGIL77 merged commit 3a95c98 into main Apr 29, 2026
2 checks passed
VGIL77 deleted the feat/relay-fleet branch April 29, 2026 02:11
