Plan: link in local OrbStack + GKE/AKS/EKS as first-class fabric clients #116

@stevei101

Link-in plan: connecting local OrbStack k8s + GKE/AKS/EKS to the data fabric

Drafted 2026-05-29 as a design proposal. Not a PR — a plan for review. The MVP shape (this repo) stays the system of record; this issue is about how K8s clusters become first-class fabric clients, and whether a v2 ("new, better DF") is warranted.


What we have today

  • Edge runtime: Rust on Cloudflare Workers — /v1/* API (ingest, query, memory, artifacts, MCP task loop, policy).
  • Storage planes: D1 (silver), R2 (artifacts), KV (hot cache), Vectorize (retrieval).
  • Coordination: Durable Objects (leases, idempotency), Queues (async enrichment).
  • Tenancy: tenant.rs / tenant_security.rs shapes already exist (#102 area).

What we don't have yet

  • A K8s-native client surface. Today, any K8s pod that wants to ingest or query must hand-roll an HTTP client, hand-roll auth, hand-roll retries, hand-roll trace propagation. Five separate clusters (orbstack, lornu-aks-hub, gke_…_lornu-gke-prod, steve-1-eks, plus dev clusters) means five hand-rolls.
  • A per-cluster identity story scoped to the fabric tenancy model.
  • A degraded-mode story for when Cloudflare is unreachable from a cluster.
  • A dev / prod split that prevents an OrbStack-running agent from polluting prod tables.

Topology

                ┌──────────────────────────────────────────┐
                │  data-fabric edge (Cloudflare Workers)   │
                │  fabric.lornu.ai  (prod)                 │
                │  fabric-dev.lornu.ai  (dev)              │
                └────────────────┬─────────────────────────┘
                                 │ HTTPS
       ┌─────────────┬───────────┼───────────┬─────────────┐
       │             │           │           │             │
       ▼             ▼           ▼           ▼             ▼
  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
  │ orbstack│  │ AKS hub │  │  GKE    │  │  EKS    │  │ (future)│
  │ (local) │  │         │  │  prod   │  │ steve-1 │  │  spoke  │
  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
       │            │            │            │            │
       ▼            ▼            ▼            ▼            ▼
   pods use    pods use      pods use     pods use     pods use
   lornu-     lornu-       lornu-       lornu-       lornu-
   sdk's      sdk's        sdk's        sdk's        sdk's
   FabricClient (Rust crate)

Single edge API, five (and growing) cluster clients. No fabric pods inside any K8s cluster in v1.

The link-in, end-to-end

1. lornu-sdk::FabricClient (a thin Rust client in bullpen/src/sdk/)

Every bullpen agent already depends on lornu-sdk. Add a FabricClient trait + impl that wraps the fabric edge API:

use async_trait::async_trait;
use bytes::Bytes;

// `async fn` in a plain trait is not dyn-compatible, so taking
// `Arc<dyn FabricClient>` needs the async_trait macro (or hand-boxed futures).
// `Result<T>` is assumed to be the SDK's own single-parameter alias.
#[async_trait]
pub trait FabricClient: Send + Sync {
    async fn ingest(&self, event: IngestEvent) -> Result<IngestAck>;
    async fn query(&self, q: Query) -> Result<QueryResult>;
    async fn memory_index(&self, mem: MemoryItem) -> Result<MemoryAck>;
    async fn memory_retrieve(&self, q: RetrievalQuery) -> Result<Vec<Memory>>;
    async fn mcp_next_task(&self) -> Result<Option<Task>>;
    async fn mcp_respond(&self, task_id: &str, resp: TaskResponse) -> Result<()>;
    async fn artifact_put(&self, key: &str, bytes: Bytes) -> Result<()>;
    async fn artifact_get(&self, key: &str) -> Result<Bytes>;
}

One concrete impl over reqwest with built-in retry (tokio-retry, using the same exponential 1s/2s/4s backoff plus jitter that the org's harvesters use), structured-trace propagation (tracing + the W3C traceparent header), and a swappable LocalBufferingFabricClient for the degraded-mode path (§4).
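The retry shape described above can be sketched without any crates. This is only an illustration of the 1s/2s/4s schedule: the function name, the 25% jitter ceiling, and the cap parameter are assumptions, and the random jitter value is supplied by the caller so the function stays deterministic and testable.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at `max`.
/// `jitter` is a caller-supplied random value in [0.0, 1.0]; up to 25% of the
/// base delay is added on top. The 25% figure and the cap are assumptions.
pub fn backoff_delay(attempt: u32, jitter: f64, max: Duration) -> Duration {
    // Clamp the exponent so the shift cannot overflow.
    let base_ms = 1_000u64.saturating_mul(1u64 << attempt.min(20));
    let jitter_ms = (base_ms as f64 * 0.25 * jitter.clamp(0.0, 1.0)) as u64;
    Duration::from_millis(base_ms + jitter_ms).min(max)
}
```

In the real client the same schedule would presumably come from tokio-retry's own strategy types; the point here is only the shape of the delays.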

Agents never write reqwest::Client::new() themselves. They take Arc<dyn FabricClient> in their constructor. Tests use an in-memory fake.
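A minimal sketch of that injection pattern follows. To stay dependency-free it uses a synchronous, ingest-only version of the trait, so it is a simplification of the async trait above; FakeFabricClient and Agent are hypothetical names, and events are plain strings rather than IngestEvent.

```rust
use std::sync::{Arc, Mutex};

/// Simplified, synchronous stand-in for the proposed trait (ingest only).
pub trait FabricClient: Send + Sync {
    fn ingest(&self, event: String) -> Result<(), String>;
}

/// In-memory fake for tests: records every event it is handed.
#[derive(Default)]
pub struct FakeFabricClient {
    pub events: Mutex<Vec<String>>,
}

impl FabricClient for FakeFabricClient {
    fn ingest(&self, event: String) -> Result<(), String> {
        self.events.lock().unwrap().push(event);
        Ok(())
    }
}

/// Hypothetical agent that receives its client by injection and never
/// constructs an HTTP client itself.
pub struct Agent {
    fabric: Arc<dyn FabricClient>,
}

impl Agent {
    pub fn new(fabric: Arc<dyn FabricClient>) -> Self {
        Self { fabric }
    }

    pub fn report(&self, msg: &str) -> Result<(), String> {
        self.fabric.ingest(format!("report:{msg}"))
    }
}
```

A test keeps its own Arc to the fake, hands a clone to the agent, and then inspects `events` to see what the agent wrote.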

2. Per-cluster identity

Each cluster gets its own fabric API token scoped to its tenant. Fabric's existing tenant.rs is the contract; add:

  • A cluster_id field on tenant_metadata (e.g. orbstack-steven, aks-hub, gke-lornu-prod, eks-steve-1).
  • One token per cluster, minted via a new admin endpoint (or wrangler script).
  • Tokens land in each cluster as a Secret synced from Azure Key Vault via the External Secrets Operator (ESO):
# crossplane/<cluster>/fabric/externalsecret.yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: fabric-credentials
  namespace: lornu-system
spec:
  secretStoreRef: { kind: ClusterSecretStore, name: lornu-keyvault }
  target: { name: fabric-credentials }
  data:
    - secretKey: FABRIC_API_TOKEN
      remoteRef: { key: fabric/<cluster_id>/api-token }
    - secretKey: FABRIC_BASE_URL
      remoteRef: { key: fabric/<cluster_id>/base-url }

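Pods pick these up as environment variables via envFrom: secretRef, and FabricClient::from_env() builds the client from them. One caveat: Azure Key Vault secret names may contain only alphanumerics and hyphens, so the slash-separated keys above are best read as logical paths; the stored names would need a hyphenated form.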

The FABRIC_BASE_URL indirection is what makes the dev/prod split (§5) free — point orbstack at fabric-dev.lornu.ai, point the cloud clusters at fabric.lornu.ai.
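As a rough sketch of what from_env() could look like, the snippet below reads the two keys named in the ExternalSecret. FabricConfig and from_lookup are hypothetical names; the lookup indirection exists only so tests can avoid mutating the process environment, and trimming a trailing slash from the base URL is an assumption.

```rust
/// Connection settings read from the two keys the ExternalSecret provides.
#[derive(Debug, PartialEq)]
pub struct FabricConfig {
    pub base_url: String,
    pub api_token: String,
}

impl FabricConfig {
    /// Build from any key lookup, so tests need not touch the process env.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Treat an empty or whitespace-only value the same as a missing one.
        let get = |key: &str| {
            lookup(key)
                .filter(|v| !v.trim().is_empty())
                .ok_or_else(|| format!("missing env var {key}"))
        };
        Ok(Self {
            base_url: get("FABRIC_BASE_URL")?.trim_end_matches('/').to_string(),
            api_token: get("FABRIC_API_TOKEN")?,
        })
    }

    /// Read the same two keys from the real process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Failing fast on a missing variable at boot is the design choice here: a pod with a half-synced Secret should crash-loop visibly rather than start without fabric access.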

3. Provenance / observability

Every ingest() call attaches:

  • cluster_id (from env at boot)
  • agent (e.g. relic-courier, boots, librarian)
  • run_id (per-invocation ULID)
  • trace_id (W3C traceparent, propagated up from whatever triggered the agent)

Fabric's silver schema gets a cluster_id column on the relevant tables; the existing tenant_id stays for billing/auth, cluster_id is for forensics ("which cluster wrote this row").

This unblocks a real on-call story: a malformed event in silver can be traced back to the exact pod that wrote it.
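The four provenance fields can be pictured as one small struct. The sketch below is an assumption about shape, not the SDK's actual type: Provenance is a hypothetical name, run_id is kept as an opaque string, and the only behavior shown is rendering the W3C traceparent header (version, 16-byte trace id, 8-byte parent id, flags, all lowercase hex).

```rust
/// Provenance attached to each ingested event (field names from the list above).
#[derive(Debug, Clone)]
pub struct Provenance {
    pub cluster_id: String,
    pub agent: String,
    /// A ULID in the proposal; treated as an opaque string in this sketch.
    pub run_id: String,
    /// 16-byte W3C trace id.
    pub trace_id: u128,
}

impl Provenance {
    /// Render a W3C `traceparent` header value for an outgoing request:
    /// `version-traceid-parentid-flags`, lowercase hex, version fixed at 00.
    pub fn traceparent(&self, span_id: u64, sampled: bool) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, span_id, sampled as u8
        )
    }
}
```

With the trace id and span id from the W3C Trace Context example, this yields `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`.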

4. Degraded-mode (CF unreachable)

Cloudflare outages happen. K8s agents can't block forever. The SDK gets a LocalBufferingFabricClient decorator:

  • On 5xx / network error: queue the event to a local SQLite file (/var/lib/lornu/fabric-buffer.db) per pod.
  • A sidecar (or a tokio task in the agent itself) drains the buffer back to fabric when reachable.
  • Idempotency keys on every event (already a design principle in the fabric MVP) make replay safe.

Query path can't be buffered (it needs an answer now). Document graceful-fail: queries return Err(FabricUnreachable), callers decide whether to block, use stale cache, or proceed without context.

KV cache on the fabric edge already absorbs some of this; the buffering layer covers the case where the edge itself is unreachable.
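The decorator idea can be sketched as follows, under several loud assumptions: the trait is a synchronous, ingest-only simplification; an in-memory VecDeque stands in for the SQLite file; FabricError and its variants are hypothetical; and FlakyFabric is a test double standing in for the reqwest-backed client.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub idempotency_key: String,
    pub payload: String,
}

#[derive(Debug, PartialEq)]
pub enum FabricError {
    Unreachable,      // network error or 5xx: worth buffering
    Rejected(String), // e.g. 4xx: replaying will not help
}

pub trait FabricClient: Send + Sync {
    fn ingest(&self, event: Event) -> Result<(), FabricError>;
}

/// Wraps any client; buffers events while the edge is unreachable.
pub struct LocalBufferingFabricClient<C: FabricClient> {
    pub inner: C,
    buffer: Mutex<VecDeque<Event>>, // stand-in for the SQLite buffer
}

impl<C: FabricClient> LocalBufferingFabricClient<C> {
    pub fn new(inner: C) -> Self {
        Self { inner, buffer: Mutex::new(VecDeque::new()) }
    }

    pub fn buffered(&self) -> usize {
        self.buffer.lock().unwrap().len()
    }

    /// Replay oldest-first; stop at the first unreachable error.
    /// Returns how many events were delivered.
    pub fn drain(&self) -> usize {
        let mut sent = 0;
        loop {
            // Take the lock only to pop, never across the network call.
            let next = self.buffer.lock().unwrap().pop_front();
            let Some(event) = next else { break };
            match self.inner.ingest(event.clone()) {
                Ok(()) => sent += 1,
                Err(FabricError::Unreachable) => {
                    self.buffer.lock().unwrap().push_front(event);
                    break;
                }
                // Simplification: a rejected event is dropped, not retried.
                Err(FabricError::Rejected(_)) => {}
            }
        }
        sent
    }
}

impl<C: FabricClient> FabricClient for LocalBufferingFabricClient<C> {
    fn ingest(&self, event: Event) -> Result<(), FabricError> {
        match self.inner.ingest(event.clone()) {
            Err(FabricError::Unreachable) => {
                self.buffer.lock().unwrap().push_back(event);
                Ok(()) // the caller sees success; delivery is deferred
            }
            other => other,
        }
    }
}

/// Test double: fails with Unreachable until `up` is set to true.
pub struct FlakyFabric {
    pub up: AtomicBool,
    pub received: Mutex<Vec<String>>,
}

impl FabricClient for FlakyFabric {
    fn ingest(&self, event: Event) -> Result<(), FabricError> {
        if !self.up.load(Ordering::SeqCst) {
            return Err(FabricError::Unreachable);
        }
        self.received.lock().unwrap().push(event.idempotency_key);
        Ok(())
    }
}
```

Replay is only safe because each event carries its idempotency key: if a write reached the edge but the acknowledgement was lost, the drained duplicate can be deduplicated server-side.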

5. Dev / prod split

Two parallel deployments:

  • fabric-dev.lornu.ai — separate Worker, separate D1 / R2 / Vectorize namespaces, separate KV. orbstack cluster points here. Developers can also point local CLI tools here.
  • fabric.lornu.ai — prod. Cloud clusters (aks-hub, gke-prod, eks-steve-1) point here.

One tenant per (env × cluster) → 8 tokens total at the current cluster count (2 environments × 4 named clusters). ESO sync keeps this manageable.
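One way to make the split harder to get wrong is a boot-time check in the SDK. The sketch below is hypothetical and not part of the plan as written: it assumes local cluster ids start with "orbstack" (as in orbstack-steven) and that the two hostnames above are the only valid targets.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Env {
    Dev,
    Prod,
}

/// Classify a base URL by host. The two hostnames come from the plan.
pub fn env_for_base_url(base_url: &str) -> Option<Env> {
    let host = base_url.strip_prefix("https://")?.split('/').next()?;
    match host {
        "fabric-dev.lornu.ai" => Some(Env::Dev),
        "fabric.lornu.ai" => Some(Env::Prod),
        _ => None,
    }
}

/// Assumed rule: cluster ids beginning with "orbstack" may only target dev.
pub fn check_target(cluster_id: &str, base_url: &str) -> Result<Env, String> {
    let env = env_for_base_url(base_url)
        .ok_or_else(|| format!("unknown fabric host in {base_url}"))?;
    if cluster_id.starts_with("orbstack") && env == Env::Prod {
        return Err(format!("{cluster_id} must not write to prod"));
    }
    Ok(env)
}
```

Token scoping on the server remains the real enforcement; a client-side check like this only turns a misconfigured Secret into an early, readable error.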

6. MCP task loop

The fabric already has /mcp/task/next and /mcp/response for an agent-pull task model. To use this from K8s:

  • Each agent runs FabricClient::mcp_next_task() in a loop with backoff.
  • Tasks come back with a target_cluster field; an agent only claims tasks scoped to its own cluster (or to any).
  • Long-running tasks heartbeat via mcp_respond(..., status: InProgress) to refresh the Durable Object lease.

This is the path that makes the fabric not just a passive data store but an active task router — a single dispatcher across all five clusters.
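The claim rule in the list above (own cluster, or any) is small enough to show directly. This sketch assumes that a task scoped to "any" cluster is encoded as a missing target_cluster; the Task struct here is a hypothetical two-field reduction of whatever the fabric actually returns.

```rust
#[derive(Debug, Clone)]
pub struct Task {
    pub id: String,
    /// `None` models a task scoped to "any" cluster (an assumed encoding).
    pub target_cluster: Option<String>,
}

/// An agent claims a task only if it targets this cluster or any cluster.
pub fn should_claim(task: &Task, my_cluster_id: &str) -> bool {
    match task.target_cluster.as_deref() {
        None => true,
        Some(target) => target == my_cluster_id,
    }
}

/// Pick the first claimable task from a batch offered by the fabric.
pub fn first_claimable<'a>(tasks: &'a [Task], my_cluster_id: &str) -> Option<&'a Task> {
    tasks.iter().find(|t| should_claim(t, my_cluster_id))
}
```

Filtering on the client is only a courtesy; the Durable Object lease is what actually prevents two agents from working the same task.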

7. Bootstrapping a new cluster (runbook)

1. Provision tenant + cluster_id in fabric (admin script).
2. Mint API token; write to Azure Key Vault at fabric/<cluster_id>/api-token.
3. Add the fabric.lornu.ai (or fabric-dev.lornu.ai) URL to Azure Key Vault at fabric/<cluster_id>/base-url.
4. Apply crossplane/<cluster>/fabric/externalsecret.yaml.
5. Verify: kubectl exec into any lornu-sdk pod → curl $FABRIC_BASE_URL/health.
6. Tag the cluster_id in the fabric's tenant registry for billing/forensics.

Six steps, scriptable. New cluster onboards in <10 minutes.

8. Local OrbStack specifics

The local cluster doesn't have Azure Key Vault access. Two options:

  • a) Local ESO with a kubernetes-backed SecretStore pointing at a sealed secret committed to a developer-only repo (not great).
  • b) A lornu-cli subcommand that writes the dev fabric credentials directly into the local cluster as a regular Secret (lornu cli fabric link --env dev). Simple, no Key Vault dependency, per-developer scope.

Recommend (b) for local. Cloud clusters use ESO.


The bigger question — extend MVP or build new?

The user noted this is "just the MVP" and "we could make a new, better DF." Three honest options:

Option A — extend the Cloudflare MVP (recommended)

  • Add the cluster_id field + per-cluster token machinery to the existing data-fabric repo.
  • Build FabricClient in bullpen/src/sdk/.
  • Ship in 2-3 weeks. Reuses 100% of existing schema / endpoints / Durable Object logic.
  • Net cost: zero new infra.

Pros: Fast. Honors all the WS1–WS5 work that's already shipped. CF edge gives consistent low-latency egress from any cloud.
Cons: All five clusters depend on the public internet to function. CF outage = all agents blocked beyond their local buffer.

Option B — K8s-native fabric inside the AKS hub

  • New repo (lornu-ai/data-fabric-k8s). Rust pod (axum) + Postgres + S3-compat + Qdrant in-cluster.
  • Clusters in different clouds reach via the AKS hub's public ingress (same egress dependency as A, just to a different endpoint we run).
  • Could federate (one fabric pod per region) but that re-creates the multi-master problem the MVP avoided by being edge-native.

Pros: Fewer external SaaS deps. Could run fully airgapped in principle.
Cons: 10× the ops surface (Postgres backups, Qdrant tuning, pod sizing, etc.). Doesn't actually solve the "CF is unreachable" problem — it just moves the dep from CF to AKS-hub's ingress, which has its own uptime story.

Option C — hybrid (CF system-of-record + per-cluster read cache)

  • Keep CF MVP as the canonical writer.
  • Each K8s cluster runs a small read-cache pod (just Vectorize + KV mirror, no D1) for hot retrieval paths.
  • Cache invalidation via fabric → cluster webhooks (or pulled on schedule).

Pros: Survives CF outage for queries (cluster pods serve from local cache); writes still go to CF when reachable, buffered when not.
Cons: Cache invalidation is the second hardest problem in CS. Worth doing only after CF outages are an observed pain.

Recommendation: Option A now. Get the link-in working end-to-end with all five clusters as clients of the MVP. Revisit Option C if buffered-write / queries-from-stale-cache becomes a real operational concern (probably 3-6 months out, based on actual CF reliability and team pain).

Option B is the wrong shape for now — it ports the edge MVP back to a cluster without any of the edge benefits.


Phased delivery (under Option A)

Phase 1 — SDK + identity (1 week)

  • lornu-sdk::FabricClient trait + reqwest impl + tests
  • ESO manifest template + per-cluster token-mint script
  • Bootstrap one cluster (orbstack-dev) end-to-end; smoke-test ingest + query from a hello-world agent

Phase 2 — provenance + observability (1 week)

  • cluster_id schema migration on silver tables
  • Trace propagation + cluster_id / agent / run_id enforcement at the SDK boundary
  • Operator dashboard pane: "events by cluster, last 24h"

Phase 3 — degraded-mode (1 week)

  • LocalBufferingFabricClient decorator + SQLite buffer
  • Drain task + replay tests
  • Runbook for "fabric unreachable" page

Phase 4 — MCP task loop adoption (gate on first real consumer)

  • Wire FabricClient::mcp_next_task into one bullpen agent (likely boots — it's the simplest)
  • Heartbeat lease pattern proven in prod

Phase 5 — production cluster rollouts (gated per cluster)

  • Bootstrap AKS hub → 7-day soak
  • Bootstrap GKE prod → 7-day soak
  • Bootstrap EKS steve-1 → 7-day soak

Total: ~6 weeks calendar, ~2.5 weeks engineering, mostly serial. Phases 4 and 5 can happen in parallel.


Open questions for the team

  1. Token rotation cadence. Per-cluster tokens: quarterly? On-demand only? Tied to the ciso-agent#15 token-lifecycle work.
  2. What's the first real consumer? Phase 4 wants a real bullpen agent on the MCP loop. Best candidate: boots (platform-health) or librarian (cross-repo audit). Either way, the agent's existing Cargo.toml needs a lornu-sdk bump.
  3. Multi-tenancy beyond cluster boundary. Today's tenant model is per-project; cluster is orthogonal. Does the org want one tenant per cluster, one tenant per agent-class, or one tenant per project-team? Affects token scoping and billing.
  4. Local OrbStack creds. §8 — lornu cli fabric link --env dev writes a Secret; should this Secret be tied to the developer's GitHub identity (via gh OIDC) or just a static token? OIDC is right but takes another week.
  5. Read-cache (Option C) trigger. What CF outage signal would make us escalate to Option C? Suggest: any outage > 30 min OR three outages in a month above 10 min each.
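The trigger suggested in question 5 can be written down as a predicate, which makes the thresholds easy to argue about. In this sketch the function name is hypothetical and the input is assumed to be the durations of Cloudflare outages observed in the trailing month; "above 10 min" and "> 30 min" are read as strict comparisons.

```rust
use std::time::Duration;

/// `outages`: durations of CF outages observed in the trailing month.
/// Escalate if any outage exceeded 30 minutes, or if at least three
/// outages each exceeded 10 minutes.
pub fn should_escalate_to_option_c(outages: &[Duration]) -> bool {
    let long = Duration::from_secs(30 * 60);
    let medium = Duration::from_secs(10 * 60);
    let any_long = outages.iter().any(|d| *d > long);
    let medium_count = outages.iter().filter(|d| **d > medium).count();
    any_long || medium_count >= 3
}
```

For example, outages of 12, 15, and 11 minutes in one month would trip the rule, while two 25-minute outages would not.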

Cross-references

  • stevedores-org/data-fabric — this repo (MVP)
  • lornu-ai/bullpen — agent roster that will consume the SDK
  • lornu-ai/relic-swarm — the courier already has a fabric-shaped use case (per its data-flow doc §8 aivcs feedback loop)
  • lornu-ai/ciso-agent#15 — token lifecycle automation; per-cluster fabric tokens land in this scope
  • lornu-ai/ciso-agent#17 — multi-session arbitration via aivcs; fabric's MCP loop is a natural place to record session intent
  • lornu-ai/ciso-agent#21 — CISO agent's own security posture; the CISO agent is itself a fabric client (read-everything access via tenant + token)

🤖 Filed as a link-in plan / design proposal. Comments + amendments welcome before any code lands.
