Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements deploy/test/deploy.yml with plays for building a signed binary locally, deploying to all hosts, discovering bootstrap multiaddrs, deploying final configs with peer bootstraps, and verifying mesh formation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds R2 bucket, Worker custom domain, and DNS record for docs site to OpenTofu config. Worker code deployment stays with wrangler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocs infrastructure
Cloudflare provider v5 requires ttl explicitly. Value 1 = auto (managed by Cloudflare) for proxied records. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove docs DNS record (auto-created by Worker custom domain)
- Add ttl=1 for proxied DNS records (provider v5 requirement)
- Add dependency lock file

All 6 resources applied:

- bootstraps.opentela.ai: 2x A records + origin rule (port 8092) + SSL rule
- docs.opentela.ai: R2 bucket + Worker custom domain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR focuses on infrastructure/test improvements for OpenTela: adding Cloudflare OpenTofu IaC, new deployment automation (Ansible + Clariden scripts), expanding integration coverage for billing, and tightening protocol behaviors around peer trust/liveness.
Changes:
- Added Cloudflare OpenTofu configuration (DNS/origin rules/Worker custom domain + R2) plus accompanying docs/spec/plan.
- Added/expanded Go integration tests for billing (pipeline + Docker mesh) and improved peer discovery test robustness.
- Updated protocol code to (a) always trust the local node when signed-binary enforcement is enabled and (b) switch peer liveness checks to libp2p ping; removed the legacy tokens/Anchor project scaffold.
Reviewed changes
Copilot reviewed 36 out of 39 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tokens/yarn.lock | Removed legacy tokens JS dependency lockfile. |
| tokens/tsconfig.json | Removed legacy tokens TS config. |
| tokens/tests/tokens.ts | Removed legacy Anchor TS test scaffold. |
| tokens/tests/anchor.test.rs | Removed legacy Anchor/Playground test file. |
| tokens/programs/tokens/src/lib.rs | Removed Anchor program implementation. |
| tokens/programs/tokens/Xargo.toml | Removed Anchor/Xargo config. |
| tokens/programs/tokens/Cargo.toml | Removed Anchor program manifest. |
| tokens/package.json | Removed legacy tokens JS package manifest. |
| tokens/migrations/deploy.ts | Removed legacy Anchor migration script. |
| tokens/Cargo.toml | Removed legacy tokens Rust workspace. |
| tokens/Cargo.lock | Removed legacy tokens Rust lockfile. |
| tokens/Anchor.toml | Removed legacy Anchor workspace config. |
| tokens/.prettierignore | Removed legacy formatter ignore list. |
| tokens/.gitignore | Removed legacy tokens gitignore. |
| src/tests/integration/peer_discovery_test.go | Makes integration builds more hermetic and speeds peer ID discovery for bootstrapping. |
| src/tests/integration/billing_pipeline_test.go | Adds end-to-end (non-network) integration tests for billing pipeline components. |
| src/tests/integration/billing_mesh_test.go | Adds a multi-node Docker mesh integration test validating usage tracking through the proxy path. |
| src/internal/protocol/node_table_test.go | Adds coverage ensuring self is accepted even when signed-binary enforcement is enabled. |
| src/internal/protocol/node_table.go | Adjusts signed-binary enforcement to never reject the local node’s own CRDT entry. |
| src/internal/protocol/clock.go | Switches peer liveness verification from dial attempts to libp2p ping-based checks. |
| docs/superpowers/specs/2026-03-22-cloudflare-dns-opentofu-design.md | Design spec for managing Cloudflare infra with OpenTofu. |
| docs/superpowers/plans/2026-03-22-cloudflare-opentofu.md | Step-by-step implementation plan for the Cloudflare OpenTofu setup. |
| deploy/test/templates/otela.service.j2 | New systemd unit template for test node deployment. |
| deploy/test/templates/cfg.yaml.j2 | New templated node config for test deployments. |
| deploy/test/inventory.yml | New Ansible inventory for test nodes. |
| deploy/test/deploy.yml | New Ansible playbook for building, deploying, discovering bootstraps, and verifying mesh. |
| deploy/test/ansible.cfg | Ansible config for the test deployment workflow. |
| deploy/cloudflare/variables.tf | Declares Cloudflare OpenTofu input variables. |
| deploy/cloudflare/outputs.tf | Exposes useful URLs as OpenTofu outputs. |
| deploy/cloudflare/main.tf | Implements DNS + rulesets + R2 + Worker custom domain resources. |
| deploy/cloudflare/README.md | Operator docs for initializing/importing/applying the Cloudflare OpenTofu config. |
| deploy/cloudflare/.terraform.lock.hcl | Locks Cloudflare provider version/hashes for reproducible applies. |
| deploy/cloudflare/.gitignore | Ensures tfstate/tfvars don’t get committed. |
| deploy/clariden/worker.cfg.yaml | Clariden worker node config example. |
| deploy/clariden/start-relay.sh | Script to start a relay node with data dir in /tmp. |
| deploy/clariden/sglang.toml | EDF config for running sglang containerized workload. |
| deploy/clariden/setup.sh | Local helper to cross-build and transfer artifacts to Clariden. |
| deploy/clariden/relay.cfg.yaml | Clariden relay node config example. |
| deploy/clariden/job.sh | SLURM job script to run sglang + OpenTela worker together. |
Files not reviewed (1)
- deploy/cloudflare/.terraform.lock.hcl: Language not supported
Peers need to carry their own TCP port so bootstrap address construction uses the correct port per-peer (relays use 18905, heads use 43905). PublicPort is propagated in UpdateNodeTable alongside PublicAddress. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces 'role' config (worker/head/relay) so nodes can self-identify. Adds RoleRelay to SWIM metadata for relay membership propagation. Default role is 'worker' for backwards compatibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relays with public-addr automatically advertise as bootstraps. InitializeMyself sets role and PublicPort from config. MarkSelfAsBootstrap triggers for role=relay in addition to public-addr. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ConnectedBootstraps() was using the local node's tcpport for all peers, producing wrong multiaddrs for relays on different ports (e.g., 18905). Extracts BuildBootstrapAddr() as a testable pure function. Falls back to local tcpport for un-upgraded peers without PublicPort. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
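The extracted pure function can be sketched as below. This is a minimal illustration of the behavior the commit describes, not the real implementation; the function name, signature, and the IPv4-only multiaddr shape are assumptions.

```go
package main

import "fmt"

// buildBootstrapAddr sketches the extracted pure function: build a
// /ip4/.../tcp/.../p2p/... multiaddr string for a peer, preferring the
// peer's own advertised PublicPort and falling back to the local node's
// tcpport for un-upgraded peers that advertise no port (zero value).
func buildBootstrapAddr(publicAddr string, peerPort, localPort int, peerID string) string {
	port := peerPort
	if port == 0 { // un-upgraded peer: no PublicPort in its CRDT entry
		port = localPort
	}
	return fmt.Sprintf("/ip4/%s/tcp/%d/p2p/%s", publicAddr, port, peerID)
}

func main() {
	// A relay advertising its own port (18905) vs. a legacy peer falling back to 43905.
	fmt.Println(buildBootstrapAddr("203.0.113.7", 18905, 43905, "12D3KooWExample"))
	fmt.Println(buildBootstrapAddr("203.0.113.8", 0, 43905, "12D3KooWLegacy"))
}
```

Keeping this as a pure function (no node-table access) is what makes the per-peer port bug testable in isolation.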
The ping handler and ConnectedF callback were ignoring peers not yet in the node table (added in 2c069a2 to avoid creating entries without build attestation). This created a chicken-and-egg problem: fresh nodes could never learn about peers because the fast path (gossip messages) ignored them, and the slow path (CRDT DAG walk) could take minutes for deep history. Now, when we receive a gossip message from a peer that is actually connected at the libp2p level, or when a new libp2p connection is established, we create a minimal entry in the node table. The full record (with build attestation) will overwrite this when the CRDT PutHook fires. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…otstraps

peer.ID(stringValue) does not produce a valid peer.ID from a string representation — it just casts the raw bytes. Use peer.Decode(), which properly parses the base58/CID-encoded peer ID string. This was causing ConnectedBootstraps() to always see NotConnected for remote peers, making the endpoint only return the local node's own address.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 30-second verification ticker was writing Connected/LastSeen for every peer to the CRDT store, creating ~12,000 DAG entries per day. Fresh nodes joining the network had to walk this entire history via bitswap before they could discover any peers, taking 20+ minutes. Liveness updates (Connected, LastSeen, stale peer marking) now go to the in-memory node table only via UpdateNodeTableHook. Structural changes (join, leave, service registration, bootstrap) still go through CRDT store.Put for replication. Also removes the periodic ReannounceLocalServices() from the 2-minute maintenance ticker — it was writing to CRDT every 2 minutes even when nothing changed. Reannouncement still happens on actual reconnects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
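The write-routing policy described above can be summarized in a tiny sketch. The enum and function names here are purely illustrative, not the real code:

```go
package main

import "fmt"

// UpdateKind models the split described above: high-churn liveness updates
// stay in the in-memory node table, structural changes replicate via CRDT.
type UpdateKind int

const (
	LivenessUpdate   UpdateKind = iota // Connected, LastSeen, stale-peer marking
	StructuralChange                   // join, leave, service registration, bootstrap
)

// destination reports where an update is written under this policy.
func destination(k UpdateKind) string {
	if k == LivenessUpdate {
		return "in-memory node table (UpdateNodeTableHook)"
	}
	return "CRDT store.Put (replicated)"
}

func main() {
	fmt.Println(destination(LivenessUpdate))
	fmt.Println(destination(StructuralChange))
}
```

The key property is that only the rare structural events grow the replicated DAG; the 30-second ticker no longer touches it at all.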
All deploy configs now use only https://bootstraps.opentela.ai/v1/dnt/bootstraps as their bootstrap source. No hardcoded relay multiaddrs or peer IDs. Nodes discover each other dynamically:

- Head nodes register themselves via public-addr
- Relay nodes register via role=relay + public-addr
- Workers and relays fetch the aggregated list from the domain endpoint

This eliminates the need to update configs when relay IPs or peer IDs change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… reachable

When the head node forwards a request to a worker it's not directly connected to (e.g., a worker behind an HPC firewall), it now tries relay circuit addresses through any connected peer. This enables the flow:

client → head node → relay (circuit) → worker

EnsureConnected() first tries the peerstore's known addresses, then constructs /p2p/<relay>/p2p-circuit/p2p/<target> multiaddrs through each connected peer until one succeeds. The proxy handler calls EnsureConnected before forwarding, returning 502 Bad Gateway with a clear error if the worker is truly unreachable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
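The circuit-address construction step can be sketched as plain string assembly. The helper name and signature below are illustrative; the real EnsureConnected additionally dials each candidate until one succeeds:

```go
package main

import "fmt"

// circuitAddrs builds one /p2p/<relay>/p2p-circuit/p2p/<target> multiaddr
// per currently connected peer, as candidates for reaching an otherwise
// unreachable worker through that peer acting as a relay.
func circuitAddrs(connectedPeers []string, target string) []string {
	addrs := make([]string, 0, len(connectedPeers))
	for _, relay := range connectedPeers {
		addrs = append(addrs, fmt.Sprintf("/p2p/%s/p2p-circuit/p2p/%s", relay, target))
	}
	return addrs
}

func main() {
	for _, a := range circuitAddrs([]string{"12D3KooWRelayA", "12D3KooWRelayB"}, "12D3KooWWorker") {
		fmt.Println(a)
	}
}
```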
ForceReachabilityPublic() tells libp2p the node is publicly reachable, which prevents it from making relay reservations. Workers behind HPC firewalls need libp2p to detect they're private so it automatically reserves slots on relay nodes, enabling circuit relay connections. Now only nodes with public-addr set (head nodes, relays) force public reachability. Workers without public-addr let AutoNAT detect their reachability, enabling automatic relay reservation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Workers without public-addr are forced to ForceReachabilityPrivate and use EnableAutoRelayWithPeerSource to automatically find relay servers among connected peers and maintain active reservations. This is required for relay circuit routing: the relay returns NO_RESERVATION (204) when a head node tries to connect to a worker through it, unless the worker has previously reserved a slot. The peer source callback offers all connected peers as relay candidates. Peers running EnableRelayService() (all nodes) will accept reservations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AutoRelay wasn't making reservations due to timing issues (peer source callback fires before host has connections). Instead, workers now explicitly call client.Reserve() on every connected peer after bootstrap and on reconnects. This ensures the relay has an active reservation so head nodes can connect to workers via circuit relay. MakeRelayReservations() runs 10s after bootstrap (to let connections establish) and 5s after each reconnect. Only runs for workers (nodes without public-addr). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace relay circuit stream approach (which times out due to p2phttp over circuit limitations) with HTTP relay-hop routing: When the head node can't directly reach a worker, it forwards the request through a connected relay node's /v1/p2p/<worker>/ handler: client → head (HTTP) → relay (libp2p) → worker (libp2p) The relay receives the request via its P2P listener and forwards it to the worker using its own direct libp2p connection. No relay circuit or reservation needed for this path — just standard p2phttp between directly connected peers at each hop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
homedir.Dir() from github.com/mitchellh/go-homedir reads /etc/passwd and ignores the HOME env var. This caused relay nodes (which override HOME=/tmp/opentela-relay in start-relay.sh) to still read/write keys from the real home directory on Lustre, producing inconsistent peer IDs across sessions. os.UserHomeDir() (Go stdlib) respects the HOME environment variable, so the relay's key file is now correctly stored under the overridden home directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for workers behind relays:

1. GetAllProviders: Include all peers with matching services regardless of connectivity status. Workers behind relays appear as disconnected from the head's perspective but are reachable via relay-hop routing.
2. Stale peer cleanup: Skip peers with registered services when marking them as disconnected or removing them from the table. They're actively providing workloads and remain reachable through relays even if we can't ping them directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hardcoded GossipSub parameters (D=128, Dlo=16, Dhi=256) prevented mesh formation in networks with fewer than 16 peers. With only 3-5 nodes (2 heads + 1 relay + workers), the Dlo=16 threshold was never met, causing GossipSub to never form a proper mesh for the CRDT topic. This meant CRDT data from workers never propagated through the relay to cloud head nodes. Reverts to Go-libp2p default params (D=6, Dlo=4, Dhi=12) which work correctly for any network size. The scalability.crdt_tuned override is preserved for large deployments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The close-all-peers-and-reconnect pattern during CRDT initialization was disrupting GossipSub mesh topology. After closing all connections, the ping topic (ocf-crdt-net) would reform because it actively publishes every 20s, but the CRDT topic (ocf-crdt) mesh never reformed because it only publishes on data changes. This caused CRDT data from workers to never propagate through the relay to cloud head nodes. Removing the ClosePeer loop preserves the GossipSub mesh that was already established during host creation. The IPFS-lite bootstrap still runs to ensure bitswap connectivity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…formation

GossipSub exchanges topic subscriptions with already-connected peers. When topics were joined before bootstrap (peers not yet connected), the relay never saw cloud nodes as CRDT topic subscribers, creating a unidirectional mesh where CRDT data never propagated from relay to cloud.

Fix: bootstrap and wait 3s for connections to establish BEFORE joining GossipSub topics. This ensures peers are connected when topic subscriptions are exchanged, forming a proper bidirectional mesh.

Also adds diagnostic logging to PubSubBroadcaster (topic peer count on broadcast, received message source on receive).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FindRelayFor now prefers peers with role=relay in the node table, since relay nodes bridge network segments and are most likely connected to unreachable workers. Previously it picked any connected peer, which could route through a head node that also can't reach the worker. Also includes GossipSub mesh fix (bootstrap before topic join) and diagnostic logging in PubSubBroadcaster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Workers now store the relay peer ID in their CRDT entry (RelayPeer field) after successfully making a relay reservation. Head nodes use this to route requests through the correct relay: head → libp2p://<worker's relay>/v1/p2p/<worker>/<path> This solves the multi-relay problem: with Euler, Clariden, etc. relays, the head node needs to know which specific relay can reach each worker. The worker tells it by advertising RelayPeer in the CRDT. FindRelayFor checks the worker's RelayPeer first, falling back to any connected relay-role peer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds sections covering:

- Direct vs relay-hop routing flow diagrams
- How the head node detects unreachable workers and routes through relays
- The relay_peer CRDT field and automatic relay registration
- Zero-config worker bootstrap via domain endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expose the node's own Peer record via GetSelf(), provide a test helper SetMyselfForTest() for other packages, and add RegisterRemotePeer() to write a remote peer's entry into the CRDT store using the peer's own ID as the key (with Connected=false and a fresh LastSeen timestamp). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ConnectedBootstraps() now includes relay peers seen within the last 10 minutes (maxBootstrapAge), even when not currently P2P-connected. This ensures newly-registered relay nodes propagated via CRDT are usable as bootstrap addresses before a direct connection is established. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add challengePeer (GET /v1/dnt/challenge) and registerPeer (POST /v1/dnt/register) handlers with nonce-based challenge-response, RSA public key verification, build/identity attestation validation, and CRDT peer registration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
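The nonce half of the challenge-response flow can be sketched as a one-time-use store: challengePeer issues a random nonce, and registerPeer consumes it exactly once after the signature checks out. This is a minimal stdlib illustration, not the real handlers:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// nonceStore issues random nonces and lets each be consumed at most once,
// so a captured registration request cannot be replayed.
type nonceStore struct {
	mu     sync.Mutex
	nonces map[string]bool
}

func newNonceStore() *nonceStore { return &nonceStore{nonces: make(map[string]bool)} }

// issue generates a 128-bit random nonce and records it as outstanding.
func (s *nonceStore) issue() string {
	b := make([]byte, 16)
	rand.Read(b)
	n := hex.EncodeToString(b)
	s.mu.Lock()
	s.nonces[n] = true
	s.mu.Unlock()
	return n
}

// consume returns true only the first time a previously issued nonce is presented.
func (s *nonceStore) consume(n string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.nonces[n] {
		return false
	}
	delete(s.nonces, n)
	return true
}

func main() {
	s := newNonceStore()
	n := s.issue()
	fmt.Println(s.consume(n)) // first use accepted
	fmt.Println(s.consume(n)) // replay rejected
}
```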
Adds getSelf (GET /v1/self) returning this node's Peer struct and signData (POST /v1/sign) signing hex-encoded data with the node's libp2p private key, both restricted to loopback clients. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Register /v1/dnt/challenge and /v1/dnt/register under the crdtGroup gated by role=="head", and add /v1/self and /v1/sign to the v1 group. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a register_relay function and background registration loop to start-relay.sh. After the relay becomes healthy, it fetches a challenge nonce from the bootstrap service, signs it with the relay's libp2p key, and POSTs the full self-info plus challenge_response to register. Uses exponential backoff (5s → 120s cap) on failure and re-registers every 5 minutes on success. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… Clariden

All three deploy configs (ocf-1, ocf-2, euler worker) already had no hardcoded relay multiaddrs — only the bootstrap HTTP URL. Updated deploy/clariden/start-relay.sh to match deploy/euler/start-relay.sh: added IP detection, public-addr config patching, health-wait loop, register_relay() challenge-response function, and registration loop with exponential backoff. Adapted for Clariden's binary name (otela) and ports (18092/18905).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…metadata, and invalid ports

- Replace c.ClientIP() with c.Request.RemoteAddr in isLoopback() to prevent X-Forwarded-For spoofing on /v1/self and /v1/sign
- Add nil check for P2P host in signData() to return 503 when node is not ready
- Sanitize peer metadata before CRDT write: clear Owner/ProviderID when no identity attestation, restrict Role to exactly ["relay"], clear Service/Load/Hardware
- Validate public_port as a numeric TCP port (1-65535) instead of just non-empty
- Add TestGetSelf_SpoofedXForwardedFor to verify spoofed XFF is rejected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
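The hardened loopback check can be sketched with the stdlib alone. The point is that it inspects the transport-level RemoteAddr ("host:port" from the TCP connection), which headers like X-Forwarded-For cannot influence; the function name mirrors the commit but the body is illustrative:

```go
package main

import (
	"fmt"
	"net"
)

// isLoopback decides purely from the connection's remote address,
// never from client-supplied headers, whether the caller is local.
func isLoopback(remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	fmt.Println(isLoopback("127.0.0.1:54321")) // IPv4 loopback
	fmt.Println(isLoopback("[::1]:54321"))     // IPv6 loopback
	fmt.Println(isLoopback("203.0.113.9:80"))  // remote, regardless of any XFF header
}
```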
…tricted nodes

- Add dedicated `wsport` config for a separate WebSocket listener port (43906), avoiding conflict with raw TCP multistream on the main libp2p port
- Add `ws_domain` config (e.g., "p2p.opentela.ai") so ConnectedBootstraps() advertises WSS multiaddrs for all peers with a PublicAddress
- Add Cloudflare DNS + origin rules for p2p.opentela.ai proxying 443 → 43906
- Nodes behind restrictive firewalls (e.g., JSC/JUWELS) can now connect via wss://p2p.opentela.ai:443, eliminating the need for cloudflared tunnels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
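The advertised WSS multiaddr shape can be sketched as below. The helper name is illustrative, and the exact multiaddr layout (/dns4/host/tcp/443/wss/p2p/id) is an assumption consistent with standard libp2p WebSocket addressing:

```go
package main

import "fmt"

// wssBootstrapAddr builds the WSS multiaddr a firewall-restricted node
// would dial: DNS name of the proxied ws_domain, port 443, secure
// WebSocket transport, then the target peer ID.
func wssBootstrapAddr(wsDomain, peerID string) string {
	return fmt.Sprintf("/dns4/%s/tcp/443/wss/p2p/%s", wsDomain, peerID)
}

func main() {
	fmt.Println(wssBootstrapAddr("p2p.opentela.ai", "12D3KooWExample"))
}
```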
…oadcaster logging

- Add nil checks for GetP2PNode in MakeRelayReservations, IsDirectlyConnected, FindRelayFor to prevent panics when host is not ready
- Defer nonce consumption in registration until after signature verification so attackers cannot burn legitimate nonces
- Add myselfMu RWMutex to protect the global `myself` Peer from concurrent read/write across goroutines (ticker, reannounce, relay reservations)
- Remove per-call debug logging from CRDT PubSubBroadcaster (ListPeers on every Broadcast was unnecessary overhead at scale)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
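The myselfMu pattern is the standard RWMutex guard around a package-level record. A minimal sketch (the Peer struct here is a stand-in for the real node-table record):

```go
package main

import (
	"fmt"
	"sync"
)

// Peer stands in for the real node-table record.
type Peer struct{ RelayPeer string }

// A package-level record shared by several goroutines (ticker, reannounce,
// relay reservations), guarded by an RWMutex as described above.
var (
	myselfMu sync.RWMutex
	myself   Peer
)

// setRelayPeer takes the write lock for mutation.
func setRelayPeer(id string) {
	myselfMu.Lock()
	defer myselfMu.Unlock()
	myself.RelayPeer = id
}

// getSelf takes the read lock and returns a copy, so callers
// work on a snapshot rather than the shared struct.
func getSelf() Peer {
	myselfMu.RLock()
	defer myselfMu.RUnlock()
	return myself
}

func main() {
	setRelayPeer("12D3KooWRelay")
	fmt.Println(getSelf().RelayPeer)
}
```

Returning by value in the getter is the important detail: it keeps the lock's protection from leaking through aliased pointers.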