From ec248bf30ee2ebc8f434ad134da5b849ec72f867 Mon Sep 17 00:00:00 2001
From: VGIL77
Date: Tue, 28 Apr 2026 21:13:34 -0400
Subject: [PATCH 1/2] feat(p2p): relay fleet (3x DigitalOcean) + dnsaddr seed
 + adoption telemetry
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stands up an always-on libp2p relay + bootstrap fleet on DigitalOcean
(NYC1 / FRA1 / SGP1) so the network has redundancy beyond the single
Railway-hosted bootnode. The April 23 outage that dropped the network
from 19 peers to 1 was a Railway-side flake with no code change of ours;
since then the network has been stuck at 1 peer because every user
dialing in landed on the same proxy and had no other path.

What landed

* deploy/relay-fleet/main.tf — Terraform module: 3x s-1vcpu-1gb droplets
  ($6/mo each, $18/mo total, covered by DO's $200 new-account credit),
  shared firewall (TCP 22/9100/443, UDP 9101, ICMP), per-fleet SSH key.
  Driven by DIGITALOCEAN_TOKEN env var.
* deploy/relay-fleet/cloud-init.yaml — provisioning: rust toolchain,
  cargo build (with 2 GB swap so the 1 GB box doesn't OOM during the
  release build), service user, /var/lib/bitterbot persistent state,
  auto-generated genesis trust list with the new node's own pubkey,
  systemd unit running the orchestrator as --node-tier management
  --relay-mode server --bootnode-mode, ufw firewall, peer ID extraction
  from journald.
* deploy/relay-fleet/scripts/update-dnsaddr.sh — Cloudflare TXT
  publisher: SSHes to each relay, reads /var/lib/bitterbot/peer-id.txt,
  writes one TXT record per multiaddr under _dnsaddr
  (e.g. _dnsaddr.p2p.bitterbot.ai). Idempotent: deletes stale dnsaddr=
  records before posting fresh ones. Cloudflare-token-driven, dry-run
  mode for safety.
* deploy/relay-fleet/scripts/fleet-stats.sh — adoption telemetry. SSHes
  to each relay, pulls /api/stats and /api/bootstrap/census from the
  orchestrator HTTP API on localhost:9847, deduplicates peer pubkeys
  across the fleet, prints lifetime unique peers, sum of connected, peak
  concurrent, hole-punch success rate, relay reservations served, and a
  per-relay breakdown. Three output modes: default human summary, --json
  for scraping, --csv for time-series piping into a chart.
* deploy/relay-fleet/README.md — operator guide: prerequisites,
  end-to-end provision, dnsaddr publication, day-2 ops (replace region,
  rotate identity, tear down), what the fleet does NOT do (WSS:443,
  QUIC:9101, auto-trust-list-update — explicitly noted),
  troubleshooting, source-of-truth pointer to memory entries.
* deploy/relay-fleet/.gitignore — excludes .terraform/ and
  terraform.tfstate (machine-local; switch to remote backend if multiple
  operators ever need shared state).

What this enables

* Network reachability: every new user has 4 dial paths (3 DO relays +
  Railway fallback) instead of 1.
* Geographic coverage: NYC, FRA, SGP. Latency to any user globally is
  bounded so DCUtR succeeds more often.
* Real Circuit Relay v2 servers: NAT'd users can reserve relay slots,
  unblocking inbound reachability without router config changes. The
  Railway proxy could not do this.
* Default-on adoption telemetry: lifetime unique peers, deduplicated
  across the fleet, becomes a real metric. Until now we had 2 lifetime
  peers in our reputation table because only the local laptop counted.
* DNS-rotated bootstrap: changing the relay set never requires a client
  release. Edit Cloudflare TXT records.
Why DigitalOcean (vs Hetzner / OVH / Equinix Metal)

* CLI-drivability is the deciding factor — Hetzner CX22 was the cheapest
  fit and OVH had the best EU sovereignty story, but DO's doctl +
  Terraform provider are the cleanest for autonomous scripted
  provisioning. 9 regions vs Hetzner's 5 means better global coverage
  for a relay fleet whose job is to be near every user.
* OVH's 3-legged consumer-key auth disqualified it for our use case.
* Equinix Metal is what IPFS Foundation runs (~$36/mo per node);
  overkill until we have 10k+ users.
* Decision and rationale captured in memory at project_relay_fleet.md.

What this does NOT do (explicitly out of scope)

* Track A: NAT-aware downgrade of management nodes that can't actually
  serve as relay (the user's WSL2 management node is the canonical
  example). Separate Rust change in orchestrator/src/swarm/mod.rs.
* Track C: client default config update consuming the new dnsaddr seed,
  and dashboard surface showing relay role + tier breakdown. Depends on
  Track A landing first.
* WSS:443 and QUIC:9101 listeners on the relays. Firewall ports
  reserved; orchestrator does not yet expose these transports.
* Auto-add new relay pubkeys to the *client* genesis trust list.
  Operator manually copies pubkeys into ~/.bitterbot/genesis-trust.txt
  after provisioning. Future: serve the trust list from a versioned
  HTTPS URL the orchestrator fetches at startup.

Verified

* terraform validate clean.
* bash -n on both scripts clean.
* python3 -c "import yaml; yaml.safe_load(open('cloud-init.yaml'))" clean.
* terraform apply provisioned all 3 droplets in 33 seconds; cloud-init
  finished and peer-id.txt populated on all three boxes.

Provisioned today (2026-04-28):

  nyc1  142.93.113.64  12D3KooWRWqC9ha4zvFpLTWdKWr3B8Ea...
  fra1  46.101.181.98  12D3KooWMnnCHGVtZxyAFaJoEzk2hT1e...
  sgp1  139.59.233.83  12D3KooWNZdviN1579x6LrLQt78d6VRZ...

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 deploy/relay-fleet/.gitignore                |  16 ++
 deploy/relay-fleet/.terraform.lock.hcl       |  26 +++
 deploy/relay-fleet/README.md                 | 196 +++++++++++++++++++
 deploy/relay-fleet/cloud-init.yaml           | 187 ++++++++++++++++++
 deploy/relay-fleet/main.tf                   | 172 ++++++++++++++++
 deploy/relay-fleet/scripts/fleet-stats.sh    | 191 ++++++++++++++++++
 deploy/relay-fleet/scripts/update-dnsaddr.sh | 129 ++++++++++++
 7 files changed, 917 insertions(+)
 create mode 100644 deploy/relay-fleet/.gitignore
 create mode 100644 deploy/relay-fleet/.terraform.lock.hcl
 create mode 100644 deploy/relay-fleet/README.md
 create mode 100644 deploy/relay-fleet/cloud-init.yaml
 create mode 100644 deploy/relay-fleet/main.tf
 create mode 100644 deploy/relay-fleet/scripts/fleet-stats.sh
 create mode 100644 deploy/relay-fleet/scripts/update-dnsaddr.sh

diff --git a/deploy/relay-fleet/.gitignore b/deploy/relay-fleet/.gitignore
new file mode 100644
index 00000000..29784e26
--- /dev/null
+++ b/deploy/relay-fleet/.gitignore
@@ -0,0 +1,16 @@
+# Terraform plugin cache (large, regenerable via `terraform init`)
+.terraform/
+
+# Terraform state (machine-local; switch to remote backend if multiple
+# operators need shared state). Contains droplet IDs and public IPs.
+terraform.tfstate
+terraform.tfstate.backup
+*.tfvars
+crash.log
+
+# Local override that some operators use to point at a personal Cloudflare
+# zone or a different SSH key
+override.tf
+override.tf.json
+*_override.tf
+*_override.tf.json
diff --git a/deploy/relay-fleet/.terraform.lock.hcl b/deploy/relay-fleet/.terraform.lock.hcl
new file mode 100644
index 00000000..2ef653f4
--- /dev/null
+++ b/deploy/relay-fleet/.terraform.lock.hcl
@@ -0,0 +1,26 @@
+# This file is maintained automatically by "terraform init".
+# Manual edits may be lost in future updates.
+
+provider "registry.terraform.io/digitalocean/digitalocean" {
+  version     = "2.85.0"
+  constraints = "~> 2.40"
+  hashes = [
+    "h1:m+lRO+1mf1sZ9nIZup+xYQb3+42s5qfMRe2UmsEqPMk=",
+    "zh:10c8ed96911d7d2a4986a0216cf820ebcb92c4e6766e8ec837edf3c19b9ccff7",
+    "zh:2e789b8b44dd7f4b88e29ed3e448835df8b0c33957d9ddf6423b2d7c3a904ae5",
+    "zh:55346c35eb1ecf37ce0592f9d78db8bbe4bf023050103ee749c8e9e61dfdc50b",
+    "zh:5fb86602def4615e03fc1a01804f02d426b9d02925f3fc7d2c46decd9db381fc",
+    "zh:634740674ed73bbdba6ca02c429b0c4d5f40147aa85b45a1b77b9c43695e4331",
+    "zh:691bba8e07d8e52db5d6b75fbd2a3cb372ac746315b08ede4d6ccbe961572a5f",
+    "zh:7c7d00c5ba40798be07c6245fa80f648a6dda24872bab53245fb0a6fcc44bba2",
+    "zh:8a917a56a21bdb64271a22ecb7d4e7ec13488c110daecb2d772c7c9c52696493",
+    "zh:8b7c4920dd2dbf5d4214177e1c161b0d040aac235c409f3e1a935f23958eebe6",
+    "zh:9e0e22b6818872934a792538651cd5390c7d53e2c4898e20054564860ac1a9b9",
+    "zh:cfb26b9876ff5132d475d173c6fbd9c6530a32b935cfdd7dfdff9d0eff12253b",
+    "zh:d1c631ba672279aa9deb12a6384e719a4a48ed10b999f2ca2162f03f8550e2a4",
+    "zh:dc28c34566c0e938c20a523156c03950c03d458847589137d2395cafbd3d6788",
+    "zh:e17d8f1f987e7b4f358090afe98224d1460d5216498b7feb51e8b07250e5445a",
+    "zh:e5fac1a597cc1944e0f074482588eee56f74a13e37f215633a210660fa194294",
+    "zh:eb8a99fe9328258a51adf1e258a39353622431e13f4c17c21bd84c5f58946280",
+  ]
+}
diff --git a/deploy/relay-fleet/README.md b/deploy/relay-fleet/README.md
new file mode 100644
index 00000000..979f29eb
--- /dev/null
+++ b/deploy/relay-fleet/README.md
@@ -0,0 +1,196 @@
+# Bitterbot relay fleet
+
+Always-on libp2p relay + bootstrap fleet. Three DigitalOcean droplets in
+NYC1 / FRA1 / SGP1 running the orchestrator daemon as `--node-tier
+management --relay-mode server --bootnode-mode`. Each node serves as both
+a Kademlia bootstrap entry and a Circuit Relay v2 hop for NAT'd edge
+nodes.
+
+Cost: ~$18/mo total ($6/droplet × 3). New DigitalOcean accounts get $200
+credit valid 60 days, so the first three months are essentially free.
+
+## Why this exists
+
+The Railway-hosted bootnode at `metro.proxy.rlwy.net:12838` is a single
+point of failure and Railway's TCP proxy strips libp2p connection
+semantics (AutoNAT v2 servers can't dial back through it). The April 23
+outage that took the network from 19 peers to 1 was a Railway-side
+collapse with no code change of ours.
+
+This fleet provides:
+
+- Three real public peers, each on a clean L3 network with no proxy
+- Geographic diversity (3 continents) so DCUtR latency stays low for any
+  user
+- Genuine Circuit Relay v2 servers that NAT'd edge nodes can reserve
+  slots on
+- A stable dnsaddr seed (`_dnsaddr.p2p.bitterbot.ai`) so adding/removing
+  relays doesn't require a client release
+
+## Prerequisites
+
+1. **DigitalOcean account + Personal Access Token.** Sign up at
+   `https://cloud.digitalocean.com/`, then generate a Read+Write token at
+   `https://cloud.digitalocean.com/account/api/tokens`.
+2. **Cloudflare API token + Zone ID for bitterbot.ai.** Cloudflare
+   dashboard → My Profile → API Tokens → "Edit Zone DNS" template, scoped
+   to the bitterbot.ai zone. Zone ID is on the bitterbot.ai overview page.
+3. **`doctl`** (only needed if you want to inspect droplets manually;
+   Terraform doesn't require it). `brew install doctl` /
+   `snap install doctl` / `winget install DigitalOcean.doctl`.
+4. **`terraform`** ≥ 1.5 (already installed in this repo's WSL env).
+5. **An SSH key for fleet management** — generate fresh if you don't
+   already have one:
+   ```bash
+   ssh-keygen -t ed25519 -f ~/.ssh/bitterbot-relay -C bitterbot-relay-fleet -N ""
+   ```
+
+## Provision
+
+```bash
+cd deploy/relay-fleet
+
+export DIGITALOCEAN_TOKEN=dop_v1_...
+terraform init
+terraform apply -auto-approve
+```
+
+The `terraform apply` step finishes in ~30 seconds (just creating
+droplets), but cloud-init takes another **10-15 minutes** to compile the
+orchestrator from source on each droplet. Watch one of them with:
+
+```bash
+ip=$(terraform output -json relay_ipv4 | jq -r '.nyc1')
+ssh -i ~/.ssh/bitterbot-relay root@$ip 'tail -f /var/log/bitterbot-bootstrap.log'
+```
+
+The script prints `==> bitterbot-relay-bootstrap finished` and the peer
+ID when it's done.
+
+## Publish dnsaddr
+
+Once cloud-init has finished on all three droplets:
+
+```bash
+export CLOUDFLARE_API_TOKEN=...
+export CLOUDFLARE_ZONE_ID=...
+export BASE_DOMAIN=p2p.bitterbot.ai
+
+./scripts/update-dnsaddr.sh
+```
+
+This SSHes into each droplet, reads `/var/lib/bitterbot/peer-id.txt`,
+constructs `dnsaddr=/ip4/<ip>/tcp/9100/p2p/<peer-id>`, and writes it as a
+TXT record under `_dnsaddr.p2p.bitterbot.ai`. Verify:
+
+```bash
+dig +short TXT _dnsaddr.p2p.bitterbot.ai
+```
+
+You should see three records, one per relay.
+
+## Add to client default config
+
+Edit `src/config/defaults.ts` (or wherever the default p2p.bootstrap list
+lives) and add:
+
+```jsonc
+{
+  "p2p": {
+    "bootstrap": [
+      // Cloudflare-hosted seed: resolves to all currently-published relays
+      "/dnsaddr/p2p.bitterbot.ai",
+      // Hardcoded fallback: the Railway bootnode stays in the list as a
+      // belt-and-braces backup for the case where DNS is broken or the
+      // dnsaddr resolution misbehaves.
+      "/dns4/metro.proxy.rlwy.net/tcp/12838/p2p/12D3KooWCwCCFMHCVv8eXZnAGMTUjTDPPePfYRTJ1fZvRpqcQXKt",
+    ],
+  },
+}
+```
+
+Ship in the next desktop release. From that point on, any client (new or
+existing) bootstraps off whatever relays are currently in the dnsaddr TXT
+record — no client release is needed to rotate or add nodes.
+
+## Day-2 ops
+
+### Replace a relay
+
+```bash
+# Reprovision the droplet (Terraform will rebuild it):
+terraform apply -replace="digitalocean_droplet.relay[\"fra1\"]" -auto-approve
+# Wait for cloud-init, then:
+./scripts/update-dnsaddr.sh
+```
+
+### Add a fourth region
+
+Edit `main.tf`'s `regions` default, then `terraform apply`.
+
+### SSH into a node
+
+```bash
+ip=$(terraform output -json relay_ipv4 | jq -r '.nyc1')
+ssh -i ~/.ssh/bitterbot-relay root@$ip
+```
+
+### Tear it all down
+
+```bash
+terraform destroy -auto-approve
+```
+
+The dnsaddr records will become stale; remove them by hand from
+Cloudflare. Note that `./scripts/update-dnsaddr.sh` cannot be used for
+this cleanup: against an empty Terraform state it collects zero
+multiaddrs and deliberately bails without touching DNS.
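+
+### Check what clients will resolve
+
+A minimal sanity check from any machine, no SSH needed — useful right
+after publishing, or after a teardown cleanup. It assumes the records are
+exactly what `update-dnsaddr.sh` writes (`dnsaddr=<multiaddr>`);
+Cloudflare's DoH endpoint returns TXT data wrapped in quotes, hence the
+`tr`:
+
+```bash
+curl -s 'https://cloudflare-dns.com/dns-query?name=_dnsaddr.p2p.bitterbot.ai&type=TXT' \
+  -H 'accept: application/dns-json' \
+  | jq -r '.Answer[]?.data' | tr -d '"' | sed 's/^dnsaddr=//'
+```
+
+Each output line should be a dialable `/ip4/.../tcp/9100/p2p/12D3KooW...`
+multiaddr — one per live relay, and none after a full teardown.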
+
+## Resource shape per node
+
+- Droplet: `s-1vcpu-1gb`, Debian 12, IPv4 + IPv6
+- Disk: 25 GB SSD (orchestrator binary + `/data/keys` + journald)
+- 2 GB swap added during cloud-init so the cargo release build doesn't
+  OOM on a 1 GB box (libp2p + tokio + ed25519-dalek peak at ~2.5 GB)
+- Firewall: SSH 22, libp2p TCP 9100, future WSS 443, future QUIC UDP 9101
+- systemd unit: `bitterbot-orchestrator.service` (auto-restarts on
+  failure, persistent peer key in `/var/lib/bitterbot/`)
+- Unattended security upgrades enabled (no auto-reboot, so a kernel
+  update won't surprise-restart the relay)
+
+## What this fleet does NOT do (yet)
+
+- **WebSocket-secure (WSS) on 443.** Reserved firewall port; orchestrator
+  doesn't yet listen on WSS. Adding it requires acme/autotls + a real
+  certificate. Tracked separately.
+- **QUIC on UDP 9101.** Reserved firewall port; orchestrator doesn't yet
+  expose QUIC transport. Same status.
+- **Auto-add new relay pubkeys to the client genesis trust list.**
+  Currently each new relay's pubkey gets recorded in the Terraform output
+  but you must commit it to `~/.bitterbot/genesis-trust.txt` (or wherever
+  the client looks) by hand so the management-tier startup check passes
+  for these nodes when they're seen by clients. Future: serve the trust
+  list from a versioned URL the orchestrator fetches at startup.
+
+## Troubleshooting
+
+- **cloud-init seems stuck.** Check `tail -f /var/log/cloud-init-output.log`
+  on the droplet. Almost always it's the Rust build still running. Watch
+  `htop` to confirm cargo is making progress.
+- **Service won't start, exits with "pubkey not in genesis trust list".**
+  The cloud-init step that auto-generates the trust list with the local
+  pubkey didn't run cleanly. Re-run by hand:
+  ```bash
+  ssh root@<droplet-ip> 'sudo bash /usr/local/sbin/bitterbot-relay-bootstrap.sh'
+  ```
+- **Peer ID file empty.** The systemd unit hasn't started yet, or the
+  log line wasn't matched. Check `journalctl -u bitterbot-orchestrator
+  -n 50` and look for `Local peer ID: 12D3KooW...`.
+
+## Source-of-truth
+
+- Provider choice rationale: `memory/project_relay_fleet.md`
+- Why this isn't on Hetzner / OVH / Equinix Metal: same memory, "Why
+  not" section
+- Ground-truth on what management/edge tier actually mean in the
+  protocol: `memory/project_relay_fleet.md` (the audit corrections
+  section)
diff --git a/deploy/relay-fleet/cloud-init.yaml b/deploy/relay-fleet/cloud-init.yaml
new file mode 100644
index 00000000..fe6378f6
--- /dev/null
+++ b/deploy/relay-fleet/cloud-init.yaml
@@ -0,0 +1,187 @@
+#cloud-config
+# Bitterbot relay node first-boot provisioning.
+#
+# Goal: every droplet that boots from this config ends up running the
+# orchestrator daemon as a systemd service, with persistent libp2p
+# identity, exposing TCP/9100 and ready to serve Circuit Relay v2 hops.
+#
+# Build target: 1 vCPU / 1 GB RAM. We add 2 GB of swap before invoking
+# cargo because libp2p release builds peak around 2-3 GB of RAM during
+# parallel codegen and would OOM on a stock 1 GB box. Swap stays on after
+# build (harmless on small relays, useful for occasional traffic spikes).
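+#
+# Rendered through Terraform's templatefile() (see main.tf), which
+# substitutes the git_repo_url, git_branch and genesis_trust_list_url
+# template variables below. Anything that must land on the droplet as a
+# literal dollar-brace sequence has to be escaped by doubling the dollar —
+# see the unattended-upgrades stanza near the bottom of this file.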
+
+package_update: true
+package_upgrade: false  # don't slow first boot for security upgrades; unattended-upgrades will handle them later
+
+packages:
+  - build-essential
+  - pkg-config
+  - libssl-dev
+  - ca-certificates
+  - curl
+  - git
+  - ufw
+  - jq
+  - unattended-upgrades
+
+write_files:
+  - path: /etc/systemd/system/bitterbot-orchestrator.service
+    permissions: "0644"
+    content: |
+      [Unit]
+      Description=Bitterbot libp2p relay/bootstrap orchestrator
+      After=network-online.target
+      Wants=network-online.target
+
+      [Service]
+      Type=simple
+      User=bitterbot
+      Group=bitterbot
+      ExecStart=/usr/local/bin/bitterbot-orchestrator \
+        --ipc-path /run/bitterbot/orchestrator.sock \
+        --key-dir /var/lib/bitterbot \
+        --listen-addr /ip4/0.0.0.0/tcp/9100 \
+        --http-addr 127.0.0.1:9847 \
+        --node-tier management \
+        --relay-mode server \
+        --bootnode-mode \
+        --genesis-trust-list /var/lib/bitterbot/genesis_trust_list.txt
+      Restart=always
+      RestartSec=5
+      LimitNOFILE=65536
+      Environment=RUST_LOG=info
+      RuntimeDirectory=bitterbot
+      RuntimeDirectoryMode=0755
+      StateDirectory=bitterbot
+      StateDirectoryMode=0755
+
+      [Install]
+      WantedBy=multi-user.target
+
+  - path: /usr/local/sbin/bitterbot-relay-bootstrap.sh
+    permissions: "0755"
+    content: |
+      #!/bin/bash
+      # First-boot installer. Re-runnable: skips steps that are already done.
+      set -euo pipefail
+
+      LOG=/var/log/bitterbot-bootstrap.log
+      exec > >(tee -a "$LOG") 2>&1
+      echo "==> bitterbot-relay-bootstrap starting at $(date -Iseconds)"
+
+      # 1. Add a service user.
+      if ! id bitterbot >/dev/null 2>&1; then
+        useradd --system --create-home --home-dir /var/lib/bitterbot \
+          --shell /usr/sbin/nologin bitterbot
+      fi
+
+      # 2. Add 2 GB swap so cargo build won't OOM on 1GB droplets. Idempotent.
+      if [ ! -f /swapfile ]; then
+        fallocate -l 2G /swapfile
+        chmod 600 /swapfile
+        mkswap /swapfile
+        swapon /swapfile
+        echo "/swapfile none swap sw 0 0" >> /etc/fstab
+      fi
+
+      # 3. Install Rust toolchain into a system-wide path so root and
+      # `bitterbot` users both see it.
+      if ! command -v cargo >/dev/null 2>&1; then
+        export RUSTUP_HOME=/usr/local/rustup
+        export CARGO_HOME=/usr/local/cargo
+        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \
+          | sh -s -- -y --default-toolchain stable --no-modify-path
+        ln -sf /usr/local/cargo/bin/cargo /usr/local/bin/cargo
+        ln -sf /usr/local/cargo/bin/rustc /usr/local/bin/rustc
+      fi
+
+      # 4. Clone the repo (idempotent).
+      mkdir -p /opt/bitterbot
+      if [ ! -d /opt/bitterbot/src ]; then
+        git clone --depth=1 --branch="${git_branch}" "${git_repo_url}" /opt/bitterbot/src
+      fi
+      cd /opt/bitterbot/src
+
+      # 5. Build the orchestrator. -j 2 keeps memory usage in check on a
+      # 1 vCPU / 1 GB box. Release build takes ~10-15 min from cold cache.
+      cd orchestrator
+      RUSTUP_HOME=/usr/local/rustup CARGO_HOME=/usr/local/cargo \
+        cargo build --release -j 2
+
+      # 6. Install binary, ownership, dirs.
+      install -m 755 target/release/bitterbot-orchestrator /usr/local/bin/
+      mkdir -p /var/lib/bitterbot
+      chown -R bitterbot:bitterbot /var/lib/bitterbot
+
+      # 7. Genesis trust list: bootstrap with this node's own pubkey so
+      # --node-tier management passes the startup check. Operators rotate
+      # in additional pubkeys later by replacing this file (or by
+      # publishing a global trust list that gets fetched).
+      if [ ! -f /var/lib/bitterbot/genesis_trust_list.txt ]; then
+        # Briefly start the orchestrator just to generate the keypair, then
+        # kill it. orchestrator/src/crypto.rs writes node.key (raw 32-byte
+        # secret) and node.pub (raw 32-byte public key) on first boot.
+        sudo -u bitterbot timeout 5 /usr/local/bin/bitterbot-orchestrator \
+          --key-dir /var/lib/bitterbot \
+          --ipc-path /tmp/bitterbot-init-$$.sock \
+          --listen-addr /ip4/127.0.0.1/tcp/0 \
+          --http-addr 127.0.0.1:0 \
+          --node-tier edge 2>/dev/null || true
+        rm -f /tmp/bitterbot-init-$$.sock
+
+        if [ -f /var/lib/bitterbot/node.pub ]; then
+          PUBKEY=$(base64 -w0 /var/lib/bitterbot/node.pub)
+          {
+            echo "# auto-generated on first boot at $(date -Iseconds)"
+            echo "$PUBKEY"
+          } > /var/lib/bitterbot/genesis_trust_list.txt
+          chown bitterbot:bitterbot /var/lib/bitterbot/genesis_trust_list.txt
+          echo "==> generated genesis_trust_list.txt with pubkey $PUBKEY"
+        else
+          echo "ERROR: node.pub was not generated; orchestrator binary may be broken"
+          exit 1
+        fi
+      fi
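+
+      # For reference, the generated trust list looks like this
+      # (illustrative timestamp; the second line is the base64 of the raw
+      # 32-byte node.pub — additional trusted pubkeys are assumed to be
+      # appended one per line, with '#' marking comment lines):
+      #   # auto-generated on first boot at 2026-04-28T21:30:00-04:00
+      #   mJ9vR2c4... (44-char base64)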
+
+      # 8. Firewall.
+      ufw --force reset
+      ufw default deny incoming
+      ufw default allow outgoing
+      ufw allow 22/tcp comment "ssh"
+      ufw allow 9100/tcp comment "libp2p TCP"
+      ufw allow 443/tcp comment "future libp2p WSS"
+      ufw allow 9101/udp comment "future libp2p QUIC"
+      ufw --force enable
+
+      # 9. Start the orchestrator service.
+      systemctl daemon-reload
+      systemctl enable bitterbot-orchestrator
+      systemctl restart bitterbot-orchestrator
+
+      # 10. Persist peer ID via journal extraction. The orchestrator logs
+      # "Local peer ID: 12D3KooW..." once at startup
+      # (orchestrator/src/main.rs:101). Sleep briefly so it lands in journald.
+      sleep 6
+      journalctl -u bitterbot-orchestrator --no-pager -n 200 \
+        | grep -oE 'Local peer ID: [A-Za-z0-9]+' | head -1 \
+        | awk '{print $4}' > /var/lib/bitterbot/peer-id.txt || true
+      chown bitterbot:bitterbot /var/lib/bitterbot/peer-id.txt
+
+      echo "==> bitterbot-relay-bootstrap finished at $(date -Iseconds)"
+      echo "==> peer ID: $(cat /var/lib/bitterbot/peer-id.txt 2>/dev/null || echo unknown)"
+
+  - path: /etc/apt/apt.conf.d/52unattended-upgrades-bitterbot
+    permissions: "0644"
+    content: |
+      // $${...} escapes the dollar through Terraform's templatefile() so the
+      // literal $${distro_id} below reaches unattended-upgrades, which
+      // performs its own substitution.
+      Unattended-Upgrade::Allowed-Origins {
+        "$${distro_id}:$${distro_codename}-security";
+      };
+      Unattended-Upgrade::Automatic-Reboot "false";
+
+runcmd:
+  - bash /usr/local/sbin/bitterbot-relay-bootstrap.sh
+  - systemctl enable unattended-upgrades
+
+final_message: "Bitterbot relay node provisioned at $TIMESTAMP. Peer ID in /var/lib/bitterbot/peer-id.txt"
diff --git a/deploy/relay-fleet/main.tf b/deploy/relay-fleet/main.tf
new file mode 100644
index 00000000..43ae0424
--- /dev/null
+++ b/deploy/relay-fleet/main.tf
@@ -0,0 +1,172 @@
+# Bitterbot relay-fleet — Terraform module for DigitalOcean
+#
+# Provisions a 3-region fleet (NYC1 / FRA1 / SGP1) of always-on libp2p
+# relay+bootstrap nodes. Each runs the Bitterbot orchestrator daemon in
+# `--relay-mode server --node-tier management --bootnode-mode` so it serves
+# both as a Kademlia bootstrap entry AND as a Circuit Relay v2 hop for
+# NAT'd edge nodes.
+#
+# Required env vars:
+#   - DIGITALOCEAN_TOKEN   Personal Access Token (read+write) from
+#     https://cloud.digitalocean.com/account/api/tokens
+#
+# After `terraform apply`, capture the peer IDs from each droplet's first
+# boot log and feed them to scripts/update-dnsaddr.sh to publish under
+# `_dnsaddr.p2p.bitterbot.ai`.
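+#
+# Illustrative invocations (the extra region and the tag are examples, not
+# values pinned anywhere in this repo):
+#   terraform apply -var 'regions=["nyc1","fra1","sgp1","blr1"]'
+#   terraform apply -var 'git_branch=v1.2.3'   # pin the build to a tag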
+
+terraform {
+  required_providers {
+    digitalocean = {
+      source  = "digitalocean/digitalocean"
+      version = "~> 2.40"
+    }
+  }
+}
+
+provider "digitalocean" {
+  # Reads DIGITALOCEAN_TOKEN from env automatically.
+}
+
+variable "regions" {
+  description = "DigitalOcean region slugs to deploy a relay node into. Three is the recommended minimum for global coverage."
+  type        = list(string)
+  default     = ["nyc1", "fra1", "sgp1"]
+}
+
+variable "droplet_size" {
+  description = "DO droplet size slug. s-1vcpu-1gb ($6/mo, 25GB SSD, 1TB egress) is sufficient for a relay handling hundreds of peers."
+  type        = string
+  default     = "s-1vcpu-1gb"
+}
+
+variable "image" {
+  description = "Base image. Debian 12 keeps cloud-init simple and apt up to date."
+  type        = string
+  default     = "debian-12-x64"
+}
+
+variable "ssh_pubkey_path" {
+  description = "Path to the SSH public key used to admin the fleet. Generate with `ssh-keygen -t ed25519 -f ~/.ssh/bitterbot-relay -C bitterbot-relay-fleet`."
+  type        = string
+  default     = "~/.ssh/bitterbot-relay.pub"
+}
+
+variable "git_repo_url" {
+  description = "Git repo to clone on each droplet. Must contain orchestrator/ at the root."
+  type        = string
+  default     = "https://github.com/Bitterbot-AI/bitterbot-desktop.git"
+}
+
+variable "git_branch" {
+  description = "Branch to build from. Pin to a tag in production."
+  type        = string
+  default     = "main"
+}
+
+variable "genesis_trust_list_url" {
+  description = "URL of the genesis trust list these relays should bootstrap with. Each relay's pubkey gets added to this list automatically AFTER first boot via the post-provision step."
+  type        = string
+  default     = ""
+}
+
+resource "digitalocean_ssh_key" "fleet" {
+  name       = "bitterbot-relay-fleet"
+  public_key = file(pathexpand(var.ssh_pubkey_path))
+}
+
+# Single firewall used by every droplet in the fleet. Inbound: SSH from
+# anywhere (replace with your IP CIDR for tighter ops), libp2p TCP/9100,
+# and reserved ports for future WSS (443) and QUIC (9101 UDP) listeners.
+# The Bitterbot HTTP API on 9847 stays loopback-only (bound that way by
+# the systemd unit), so it needs no inbound rule here.
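+#
+# Illustrative tightening for the SSH rule below (203.0.113.0/24 is the
+# IPv4 documentation range, standing in for a real operator CIDR):
+#   source_addresses = ["203.0.113.0/24"]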
+resource "digitalocean_firewall" "fleet" {
+  name        = "bitterbot-relay-fleet"
+  droplet_ids = [for d in digitalocean_droplet.relay : d.id]
+
+  inbound_rule {
+    protocol         = "tcp"
+    port_range       = "22"
+    source_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  inbound_rule {
+    protocol         = "tcp"
+    port_range       = "9100"
+    source_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  inbound_rule {
+    protocol         = "tcp"
+    port_range       = "443"
+    source_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  inbound_rule {
+    protocol         = "udp"
+    port_range       = "9101"
+    source_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  inbound_rule {
+    protocol         = "icmp"
+    source_addresses = ["0.0.0.0/0", "::/0"]
+  }
+
+  outbound_rule {
+    protocol              = "tcp"
+    port_range            = "all"
+    destination_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  outbound_rule {
+    protocol              = "udp"
+    port_range            = "all"
+    destination_addresses = ["0.0.0.0/0", "::/0"]
+  }
+  outbound_rule {
+    protocol              = "icmp"
+    destination_addresses = ["0.0.0.0/0", "::/0"]
+  }
+}
+
+resource "digitalocean_droplet" "relay" {
+  for_each = toset(var.regions)
+
+  name       = "bitterbot-relay-${each.value}"
+  region     = each.value
+  size       = var.droplet_size
+  image      = var.image
+  ssh_keys   = [digitalocean_ssh_key.fleet.id]
+  ipv6       = true
+  monitoring = true
+
+  user_data = templatefile("${path.module}/cloud-init.yaml", {
+    git_repo_url           = var.git_repo_url
+    git_branch             = var.git_branch
+    genesis_trust_list_url = var.genesis_trust_list_url
+  })
+
+  tags = ["bitterbot", "relay", "node-tier-management"]
+}
+
+output "relay_ipv4" {
+  description = "Public IPv4 of each relay, keyed by region."
+  value       = { for region, d in digitalocean_droplet.relay : region => d.ipv4_address }
+}
+
+output "relay_ipv6" {
+  description = "Public IPv6 of each relay, keyed by region."
+  value       = { for region, d in digitalocean_droplet.relay : region => d.ipv6_address }
+}
+
+output "next_steps" {
+  description = "Commands to run after provisioning."
+  value       = <<-EOT
+    1. Wait ~10 min for cloud-init to finish (Rust build is the long pole on a 1GB droplet).
+    2. Capture peer IDs:
+         for region in ${join(" ", var.regions)}; do
+           ip=$(terraform output -json relay_ipv4 | jq -r ".[\"$region\"]")
+           peer_id=$(ssh -i ~/.ssh/bitterbot-relay -o StrictHostKeyChecking=accept-new \
+             root@$ip 'cat /var/lib/bitterbot/peer-id.txt')
+           echo "$region $ip $peer_id"
+         done
+    3. Publish dnsaddr TXT records:
+         ./scripts/update-dnsaddr.sh
+    4. Add each relay's pubkey to genesis-trust.txt and commit (so they pass
+       the management-tier startup check on next restart).
+  EOT
+}
diff --git a/deploy/relay-fleet/scripts/fleet-stats.sh b/deploy/relay-fleet/scripts/fleet-stats.sh
new file mode 100644
index 00000000..85fac36a
--- /dev/null
+++ b/deploy/relay-fleet/scripts/fleet-stats.sh
@@ -0,0 +1,191 @@
+#!/usr/bin/env bash
+# Aggregate adoption + health metrics across the relay fleet.
+#
+# Each relay exposes its orchestrator HTTP API on 127.0.0.1:9847 (loopback
+# only; tightened by the systemd unit). This script SSHes into every
+# relay listed in `terraform output -json relay_ipv4`, pulls the JSON
+# from /api/stats and /api/bootstrap/census, then aggregates locally to
+# produce a single network-wide view: lifetime unique peers, current
+# concurrent peers, hole-punch success rate, NAT status, top peers by
+# contribution, etc.
+#
+# Usage:
+#   ./scripts/fleet-stats.sh          # human-readable summary
+#   ./scripts/fleet-stats.sh --json   # raw aggregated JSON
+#   ./scripts/fleet-stats.sh --csv    # CSV time-series for piping
+#
+# Optional env:
+#   TERRAFORM_DIR   Path to relay-fleet TF module (default: parent dir).
+#   SSH_KEY         Path to ssh key (default: ~/.ssh/bitterbot-relay).
+#   SSH_USER        SSH user (default: root).
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+TERRAFORM_DIR="${TERRAFORM_DIR:-$(dirname "$SCRIPT_DIR")}"
+SSH_KEY="${SSH_KEY:-$HOME/.ssh/bitterbot-relay}"
+SSH_USER="${SSH_USER:-root}"
+MODE="${1:-summary}"
+
+if ! command -v jq >/dev/null 2>&1; then
+  echo "ERROR: jq is required (apt install jq / brew install jq)" >&2
+  exit 1
+fi
+
+# 1. Pull the relay IP map from Terraform state.
+cd "$TERRAFORM_DIR"
+RELAYS_JSON=$(terraform output -json relay_ipv4 2>/dev/null || echo '{}')
+if [ "$RELAYS_JSON" = "{}" ]; then
+  echo "ERROR: no relay IPs found. Run 'terraform apply' in $TERRAFORM_DIR first." >&2
+  exit 1
+fi
+
+# 2. SSH to each relay and pull stats + census in parallel. Captured to
+# /tmp so we can aggregate without holding multiple SSH sessions open.
+TMPDIR=$(mktemp -d)
+trap "rm -rf $TMPDIR" EXIT
+
+fetch_relay() {
+  local region="$1" ip="$2"
+  ssh -i "$SSH_KEY" \
+    -o StrictHostKeyChecking=accept-new \
+    -o ConnectTimeout=10 \
+    -o BatchMode=yes \
+    -o LogLevel=ERROR \
+    "$SSH_USER@$ip" \
+    "curl -fsS http://127.0.0.1:9847/api/stats; echo '<>'; curl -fsS http://127.0.0.1:9847/api/bootstrap/census" \
+    > "$TMPDIR/$region.raw" 2>"$TMPDIR/$region.err" \
+    && echo "$region $ip ok" \
+    || echo "$region $ip ERR ($(cat "$TMPDIR/$region.err" 2>/dev/null | head -1))"
+}
+
+while read -r region; do
+  ip=$(echo "$RELAYS_JSON" | jq -r ".[\"$region\"]")
+  fetch_relay "$region" "$ip" &
+done < <(echo "$RELAYS_JSON" | jq -r 'keys[]')
+
+# Wait for all background jobs.
+wait
+[ "$MODE" = "summary" ] && echo
+
+# 3. Parse each relay's response into stats.json + census.json. The
+#    redirection target is parenthesized so awk concatenates the pieces
+#    into one filename instead of mis-parsing the expression.
+for raw in "$TMPDIR"/*.raw; do
+  region=$(basename "$raw" .raw)
+  if [ ! -s "$raw" ]; then
+    continue
+  fi
+  awk 'BEGIN{out="stats"} /^<>$/{out="census"; next} {print > ("'$TMPDIR/$region.'" out ".json")}' "$raw"
+done
+
+# 4. Aggregate.
+# - Lifetime unique peers: union of all relays' bootstrap-census peer_pubkey.
+# - Active concurrent: max across relays. Each relay sees a slice of the
+#   network, so the true concurrent count lies between the max and the
+#   sum — close to the max when most peers hold connections to several
+#   relays at once.
+# - Per-relay breakdown: stats per relay for ops debugging.
+
+ACTIVE_PEERS_TXT=$(for f in "$TMPDIR"/*.census.json; do
+  jq -r '.peers[]?.peer_pubkey // empty' "$f" 2>/dev/null
+done | sort -u)
+LIFETIME_UNIQUE=$(echo -n "$ACTIVE_PEERS_TXT" | grep -c '' || true)
+
+declare -A REGION_STATS
+TOTAL_CONNECTED=0
+PEAK_CONCURRENT=0
+TOTAL_HP_SUCCEEDED=0
+TOTAL_HP_FAILED=0
+TOTAL_RELAY_RESERVATIONS=0
+TOTAL_RELAY_CIRCUITS=0
+
+for stats_json in "$TMPDIR"/*.stats.json; do
+  region=$(basename "$stats_json" .stats.json)
+  if [ ! -s "$stats_json" ]; then
-s "$stats_json" ]; then + REGION_STATS[$region]="UNREACHABLE" + continue + fi + cp_count=$(jq -r '.connected_peers // 0' "$stats_json") + peak=$(jq -r '.peak_concurrent_peers // 0' "$stats_json") + uptime_secs=$(jq -r '.uptime_secs // 0' "$stats_json") + hp_ok=$(jq -r '.hole_punches_succeeded // 0' "$stats_json") + hp_no=$(jq -r '.hole_punches_failed // 0' "$stats_json") + relay_res=$(jq -r '.relay_reservations_accepted // 0' "$stats_json") + relay_circ=$(jq -r '.relay_circuits_established // 0' "$stats_json") + nat=$(jq -r '.nat_status // "unknown"' "$stats_json") + peer_id=$(jq -r '.peer_id // "?"' "$stats_json") + + REGION_STATS[$region]="connected=$cp_count peak=$peak nat=$nat hp_ok=$hp_ok hp_no=$hp_no relay_res=$relay_res circ=$relay_circ uptime_h=$((uptime_secs/3600)) peer_id=${peer_id:0:18}…" + + TOTAL_CONNECTED=$((TOTAL_CONNECTED + cp_count)) + if [ "$peak" -gt "$PEAK_CONCURRENT" ]; then PEAK_CONCURRENT=$peak; fi + TOTAL_HP_SUCCEEDED=$((TOTAL_HP_SUCCEEDED + hp_ok)) + TOTAL_HP_FAILED=$((TOTAL_HP_FAILED + hp_no)) + TOTAL_RELAY_RESERVATIONS=$((TOTAL_RELAY_RESERVATIONS + relay_res)) + TOTAL_RELAY_CIRCUITS=$((TOTAL_RELAY_CIRCUITS + relay_circ)) +done + +HP_RATE="n/a" +if [ "$((TOTAL_HP_SUCCEEDED + TOTAL_HP_FAILED))" -gt 0 ]; then + HP_RATE=$(awk "BEGIN{printf \"%.1f%%\", 100*$TOTAL_HP_SUCCEEDED/($TOTAL_HP_SUCCEEDED+$TOTAL_HP_FAILED)}") +fi + +# 5. Render. +case "$MODE" in + --json) + jq -n \ + --argjson lifetime "$LIFETIME_UNIQUE" \ + --argjson connected "$TOTAL_CONNECTED" \ + --argjson peak "$PEAK_CONCURRENT" \ + --argjson hp_ok "$TOTAL_HP_SUCCEEDED" \ + --argjson hp_no "$TOTAL_HP_FAILED" \ + --argjson reservations "$TOTAL_RELAY_RESERVATIONS" \ + --argjson circuits "$TOTAL_RELAY_CIRCUITS" \ + '{ + lifetime_unique_peers: $lifetime, + sum_connected_across_relays: $connected, + peak_concurrent_per_relay_max: $peak, + hole_punches: {succeeded: $hp_ok, failed: $hp_no}, + relay_reservations_accepted: $reservations, + relay_circuits_established: $circuits + }' + ;; + --csv) + echo "timestamp,region,connected,peak,nat,hp_ok,hp_no,uptime_h,peer_id" + ts=$(date -u +%Y-%m-%dT%H:%M:%SZ) + for region in "${!REGION_STATS[@]}"; do + s="${REGION_STATS[$region]}" + [ "$s" = "UNREACHABLE" ] && continue + cp=$(echo "$s" | grep -oE 'connected=[0-9]+' | cut -d= -f2) + pk=$(echo "$s" | grep -oE 'peak=[0-9]+' | cut -d= -f2) + nat=$(echo "$s" | grep -oE 'nat=[a-z]+' | cut -d= -f2) + ok=$(echo "$s" | grep -oE 'hp_ok=[0-9]+' | cut -d= -f2) + no=$(echo "$s" | grep -oE 'hp_no=[0-9]+' | cut -d= -f2) + uh=$(echo "$s" | grep -oE 'uptime_h=[0-9]+' | cut -d= -f2) + pi=$(echo "$s" | grep -oE 'peer_id=[A-Za-z0-9…]+' | cut -d= -f2) + echo "$ts,$region,$cp,$pk,$nat,$ok,$no,$uh,$pi" + done + ;; + *) + cat <1" + echo "relay. 'lifetime unique' is the true deduped count across all relays." + echo "'peak concurrent per-relay' is the high-water mark for any one relay," + echo "useful for capacity planning." + ;; +esac diff --git a/deploy/relay-fleet/scripts/update-dnsaddr.sh b/deploy/relay-fleet/scripts/update-dnsaddr.sh new file mode 100644 index 00000000..2a82bc56 --- /dev/null +++ b/deploy/relay-fleet/scripts/update-dnsaddr.sh @@ -0,0 +1,129 @@ +#!/usr/bin/env bash +# Publish (or refresh) the bootstrap dnsaddr seed for the Bitterbot P2P +# network. Reads peer IDs and IPs from `terraform output -json` and writes +# one TXT record per multiaddr under `_dnsaddr.`. +# +# libp2p clients resolving `/dnsaddr/p2p.bitterbot.ai/p2p/...` will look up +# the TXT records and discover every published multiaddr. 
+#
+# Required env:
+#   CLOUDFLARE_API_TOKEN   Token with Edit Zone DNS scope on the target zone
+#   CLOUDFLARE_ZONE_ID     Zone ID (visible on the Cloudflare Overview page)
+#   BASE_DOMAIN            e.g. "p2p.bitterbot.ai" — the parent record we
+#                          publish under. Underscored prefix `_dnsaddr.` is
+#                          added automatically.
+#
+# Optional env:
+#   TERRAFORM_DIR   Path to the relay-fleet Terraform module
+#                   (default: parent dir of this script).
+#   DRY_RUN=1       Print the API calls without executing them.
+#
+# Usage:
+#   CLOUDFLARE_API_TOKEN=... CLOUDFLARE_ZONE_ID=... \
+#   BASE_DOMAIN=p2p.bitterbot.ai \
+#   ./scripts/update-dnsaddr.sh
+
+set -euo pipefail
+
+: "${CLOUDFLARE_API_TOKEN:?CLOUDFLARE_API_TOKEN is required}"
+: "${CLOUDFLARE_ZONE_ID:?CLOUDFLARE_ZONE_ID is required}"
+: "${BASE_DOMAIN:?BASE_DOMAIN is required (e.g. p2p.bitterbot.ai)}"
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+TERRAFORM_DIR="${TERRAFORM_DIR:-$(dirname "$SCRIPT_DIR")}"
+DNSADDR_NAME="_dnsaddr.${BASE_DOMAIN}"
+
+echo "==> Reading relay IPs from Terraform state in $TERRAFORM_DIR"
+cd "$TERRAFORM_DIR"
+RELAY_IPV4_JSON=$(terraform output -json relay_ipv4)
+
+# Compose multiaddrs by SSHing into each droplet to pull the persisted
+# peer ID. Skips any droplet whose peer ID file isn't present yet
+# (cloud-init still running).
+declare -a MULTIADDRS=()
+while read -r region; do
+  ip=$(echo "$RELAY_IPV4_JSON" | jq -r ".[\"$region\"]")
+  if [ -z "$ip" ] || [ "$ip" = "null" ]; then
+    echo "warn: no IPv4 for region $region, skipping" >&2
+    continue
+  fi
+  echo "==> $region @ $ip — fetching peer ID"
+  # -n is critical: without it, ssh reads from the parent's stdin
+  # (the `done < <(...)` feed) and consumes the next loop iteration's
+  # region name, silently dropping every relay after the first.
+  peer_id=$(ssh -n -i ~/.ssh/bitterbot-relay \
+    -o StrictHostKeyChecking=accept-new \
+    -o ConnectTimeout=10 \
+    -o BatchMode=yes \
+    "root@$ip" \
+    'cat /var/lib/bitterbot/peer-id.txt 2>/dev/null' || true)
+  if [ -z "$peer_id" ]; then
+    echo "warn: peer ID not yet persisted on $region (cloud-init still building?). Re-run later." >&2
+    continue
+  fi
+  echo "    peer ID: $peer_id"
+  MULTIADDRS+=("/ip4/$ip/tcp/9100/p2p/$peer_id")
+  # Future transports — uncomment once the orchestrator listens on these:
+  # MULTIADDRS+=("/ip4/$ip/tcp/443/wss/p2p/$peer_id")
+  # MULTIADDRS+=("/ip4/$ip/udp/9101/quic-v1/p2p/$peer_id")
+done < <(echo "$RELAY_IPV4_JSON" | jq -r 'keys[]')
+
+if [ ${#MULTIADDRS[@]} -eq 0 ]; then
+  echo "ERROR: no multiaddrs collected; bailing without touching DNS" >&2
+  exit 1
+fi
+
+echo
+echo "==> Will publish ${#MULTIADDRS[@]} TXT record(s) under $DNSADDR_NAME:"
+for addr in "${MULTIADDRS[@]}"; do
+  echo "    dnsaddr=$addr"
+done
+echo
+
+CF_API="https://api.cloudflare.com/client/v4"
+
+# Helper: POST/DELETE/GET with auth header
+cf() {
+  curl -fsSL \
+    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
+    -H "Content-Type: application/json" \
+    "$@"
+}
+
+# 1. List existing TXT records on the dnsaddr name and delete any that look
+#    like ours (start with "dnsaddr="). Other TXT content is left alone.
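+#    Cloudflare sometimes returns TXT content wrapped in literal quotes,
+#    e.g. (illustrative record shape, not a real API response):
+#      {"id":"372e6...","type":"TXT","content":"\"dnsaddr=/ip4/.../tcp/9100/p2p/12D3KooW...\""}
+#    which is why the filter below also matches startswith("\"dnsaddr=").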
+EXISTING=$(cf "$CF_API/zones/$CLOUDFLARE_ZONE_ID/dns_records?type=TXT&name=$DNSADDR_NAME")
+EXISTING_IDS=$(echo "$EXISTING" | jq -r '.result[] | select(.content | startswith("dnsaddr=") or startswith("\"dnsaddr=")) | .id')
+
+if [ -n "$EXISTING_IDS" ]; then
+  echo "==> Deleting $(echo "$EXISTING_IDS" | wc -l) stale TXT record(s)"
+  for id in $EXISTING_IDS; do
+    if [ "${DRY_RUN:-0}" = "1" ]; then
+      echo "    DRY: DELETE record $id"
+    else
+      cf -X DELETE "$CF_API/zones/$CLOUDFLARE_ZONE_ID/dns_records/$id" >/dev/null
+      echo "    deleted $id"
+    fi
+  done
+fi
+
+# 2. Create one TXT record per multiaddr.
+echo "==> Creating ${#MULTIADDRS[@]} new TXT record(s)"
+for addr in "${MULTIADDRS[@]}"; do
+  payload=$(jq -nc \
+    --arg name "$DNSADDR_NAME" \
+    --arg content "dnsaddr=$addr" \
+    '{type:"TXT", name:$name, content:$content, ttl:300}')
+  if [ "${DRY_RUN:-0}" = "1" ]; then
+    echo "    DRY: POST $payload"
+  else
+    cf -X POST "$CF_API/zones/$CLOUDFLARE_ZONE_ID/dns_records" \
+      --data "$payload" \
+      | jq -r '"    created " + .result.id + " — " + .result.content'
+  fi
+done
+
+echo
+echo "==> Done. Verify with:"
+echo "    dig +short TXT $DNSADDR_NAME"
+echo "    or: curl -s 'https://cloudflare-dns.com/dns-query?name=$DNSADDR_NAME&type=TXT' -H 'accept: application/dns-json' | jq"

From f30dabe6bfce0fefeab22bed84fb6ccde7f78a89 Mon Sep 17 00:00:00 2001
From: VGIL77
Date: Tue, 28 Apr 2026 21:57:58 -0400
Subject: [PATCH 2/2] chore(p2p): oxfmt cloud-init.yaml comment spacing CI lint

Format check fails on the relay-fleet PR because oxfmt enforces
single-space-before-inline-comment in YAML. One-line whitespace fix.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 deploy/relay-fleet/cloud-init.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/deploy/relay-fleet/cloud-init.yaml b/deploy/relay-fleet/cloud-init.yaml
index fe6378f6..e8313c20 100644
--- a/deploy/relay-fleet/cloud-init.yaml
+++ b/deploy/relay-fleet/cloud-init.yaml
@@ -11,7 +11,7 @@
 # build (harmless on small relays, useful for occasional traffic spikes).
 
 package_update: true
-package_upgrade: false  # don't slow first boot for security upgrades; unattended-upgrades will handle them later
+package_upgrade: false # don't slow first boot for security upgrades; unattended-upgrades will handle them later
 
 packages:
   - build-essential