Plan: link in local OrbStack + GKE/AKS/EKS as first-class fabric clients #116

# Link-in plan: connecting local OrbStack k8s + GKE/AKS/EKS to the data fabric

> Drafted 2026-05-29 as a design proposal. Not a PR — a plan for review. The MVP shape (this repo) stays the system of record; this issue is about how K8s clusters become first-class fabric clients, and whether a v2 ("new, better DF") is warranted.

---

## What we have today

- **Edge runtime:** Rust on Cloudflare Workers — `/v1/*` API (ingest, query, memory, artifacts, MCP task loop, policy).
- **Storage planes:** D1 (silver), R2 (artifacts), KV (hot cache), Vectorize (retrieval).
- **Coordination:** Durable Objects (leases, idempotency), Queues (async enrichment).
- **Tenancy:** `tenant.rs` / `tenant_security.rs` shapes already exist (#102 area).

## What we don't have yet

- A **K8s-native client surface**. Today, any K8s pod that wants to ingest or query must hand-roll an HTTP client, hand-roll auth, hand-roll retries, hand-roll trace propagation. Five separate clusters (`orbstack`, `lornu-aks-hub`, `gke_…_lornu-gke-prod`, `steve-1-eks`, plus dev clusters) means five hand-rolls.
- A **per-cluster identity story** scoped to the fabric tenancy model.
- A **degraded-mode story** for when Cloudflare is unreachable from a cluster.
- A **dev / prod split** that prevents an OrbStack-running agent from polluting prod tables.

## Topology

```
                ┌──────────────────────────────────────────┐
                │  data-fabric edge (Cloudflare Workers)   │
                │  fabric.lornu.ai  (prod)                 │
                │  fabric-dev.lornu.ai  (dev)              │
                └────────────────┬─────────────────────────┘
                                 │ HTTPS
       ┌─────────────┬───────────┼───────────┬─────────────┐
       │             │           │           │             │
       ▼             ▼           ▼           ▼             ▼
  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
  │ orbstack│  │ AKS hub │  │  GKE    │  │  EKS    │  │ (future)│
  │ (local) │  │         │  │  prod   │  │ steve-1 │  │  spoke  │
  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
       │            │            │            │            │
       ▼            ▼            ▼            ▼            ▼
   pods use    pods use      pods use     pods use     pods use
   lornu-     lornu-       lornu-       lornu-       lornu-
   sdk's      sdk's        sdk's        sdk's        sdk's
   FabricClient (Rust crate)
```

Single edge API, five (and growing) cluster clients. No fabric pods inside any K8s cluster in v1.

## The link-in, end-to-end

### 1. `lornu-sdk::FabricClient` (a thin Rust client in `bullpen/src/sdk/`)

Every bullpen agent already depends on `lornu-sdk`. Add a `FabricClient` trait + impl that wraps the fabric edge API:

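```rust
use async_trait::async_trait;
use bytes::Bytes;

// Sketch: `Result` and the request/response types are assumed to come from `lornu-sdk`.
// `#[async_trait]` (an assumed dependency) is one way to keep the trait usable as
// `Arc<dyn FabricClient>`; a plain `async fn` in a trait is not dyn-compatible.
#[async_trait]
pub trait FabricClient: Send + Sync {
    async fn ingest(&self, event: IngestEvent) -> Result<IngestAck>;
    async fn query(&self, q: Query) -> Result<QueryResult>;
    async fn memory_index(&self, mem: MemoryItem) -> Result<MemoryAck>;
    async fn memory_retrieve(&self, q: RetrievalQuery) -> Result<Vec<Memory>>;
    async fn mcp_next_task(&self) -> Result<Option<Task>>;
    async fn mcp_respond(&self, task_id: &str, resp: TaskResponse) -> Result<()>;
    async fn artifact_put(&self, key: &str, bytes: Bytes) -> Result<()>;
    async fn artifact_get(&self, key: &str) -> Result<Bytes>;
}
```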

One concrete impl over `reqwest` with built-in retry (`tokio-retry` with the same exp-1s/2s/4s + jitter shape the org's harvesters use), structured-trace propagation (`tracing` + `traceparent` header), and a swappable `LocalBufferingFabricClient` for the degraded-mode path (§4).
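To make the retry shape concrete, here is a minimal std-only sketch of the 1s/2s/4s schedule with jitter. It is not the `tokio-retry` wiring itself: `base_delay`, `with_jitter`, and `retry` are hypothetical names, the 25% jitter ceiling is an assumption (the plan does not fix one), and the sleep and random source are injected so the logic can be exercised without a runtime.

```rust
use std::time::Duration;

/// Base delay for a zero-indexed retry attempt: 1s, 2s, 4s, ... capped at `max`.
fn base_delay(attempt: u32, max: Duration) -> Duration {
    let secs = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(secs).min(max)
}

/// Add up to 25% extra delay. `unit` is expected in [0, 1]; in production it
/// would come from a random source. The 25% ceiling is an assumption.
fn with_jitter(base: Duration, unit: f64) -> Duration {
    let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.0 };
    base + base.mul_f64(0.25 * unit)
}

/// Run `op` until it succeeds or `max_attempts` calls have been made
/// (at least one call is always made). `sleep` and `unit` are injected.
fn retry<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
    mut unit: impl FnMut() -> f64,
) -> Result<T, E> {
    let cap = Duration::from_secs(4);
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(with_jitter(base_delay(attempt, cap), unit()));
                attempt += 1;
            }
        }
    }
}
```

A real client would likely retry only on network errors and 5xx responses, and would pass `std::thread::sleep` or an async equivalent as `sleep`.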

Agents never write `reqwest::Client::new()` themselves. They take `Arc<dyn FabricClient>` in their constructor. Tests use an in-memory fake.
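The constructor-injection pattern described above can be sketched as follows. To stay std-only, this uses a simplified synchronous trait covering just the two artifact methods; `ArtifactFabric`, `InMemoryFabric`, and `Agent::save_report` are hypothetical names for illustration, not part of `lornu-sdk`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified, synchronous stand-in for two `FabricClient` methods (assumption:
/// the real trait is async and uses `Bytes`).
trait ArtifactFabric: Send + Sync {
    fn artifact_put(&self, key: &str, bytes: Vec<u8>) -> Result<(), String>;
    fn artifact_get(&self, key: &str) -> Result<Vec<u8>, String>;
}

/// In-memory fake for tests: artifacts live in a mutex-guarded map.
#[derive(Default)]
struct InMemoryFabric {
    artifacts: Mutex<HashMap<String, Vec<u8>>>,
}

impl ArtifactFabric for InMemoryFabric {
    fn artifact_put(&self, key: &str, bytes: Vec<u8>) -> Result<(), String> {
        let mut map = self.artifacts.lock().map_err(|e| e.to_string())?;
        map.insert(key.to_owned(), bytes);
        Ok(())
    }

    fn artifact_get(&self, key: &str) -> Result<Vec<u8>, String> {
        let map = self.artifacts.lock().map_err(|e| e.to_string())?;
        map.get(key)
            .cloned()
            .ok_or_else(|| format!("no artifact at {key}"))
    }
}

/// The agent never builds its own HTTP client; it only sees the trait object.
struct Agent {
    fabric: Arc<dyn ArtifactFabric>,
}

impl Agent {
    fn new(fabric: Arc<dyn ArtifactFabric>) -> Self {
        Self { fabric }
    }

    /// Store a report under a run-scoped key and return that key.
    fn save_report(&self, run_id: &str, body: &str) -> Result<String, String> {
        let key = format!("reports/{run_id}.txt");
        self.fabric.artifact_put(&key, body.as_bytes().to_vec())?;
        Ok(key)
    }
}
```

Because the agent holds only `Arc<dyn ArtifactFabric>`, a test can keep its own handle to the fake and inspect what the agent wrote.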

### 2. Per-cluster identity

Each cluster gets its own **fabric API token** scoped to its tenant. Fabric's existing `tenant.rs` is the contract; add:

- A `cluster_id` field on `tenant_metadata` (e.g. `orbstack-steven`, `aks-hub`, `gke-lornu-prod`, `eks-steve-1`).
- One token per cluster, minted via a new admin endpoint (or `wrangler` script).
- Tokens land in each cluster as a Secret synced from Azure Key Vault via the External Secrets Operator (ESO):

```yaml
# crossplane/<cluster>/fabric/externalsecret.yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: fabric-credentials
  namespace: lornu-system
spec:
  secretStoreRef: { kind: ClusterSecretStore, name: lornu-keyvault }
  target: { name: fabric-credentials }
  data:
    - secretKey: FABRIC_API_TOKEN
      remoteRef: { key: fabric/<cluster_id>/api-token }
    - secretKey: FABRIC_BASE_URL
      remoteRef: { key: fabric/<cluster_id>/base-url }
```

Pods pick these env vars up via `envFrom: secretRef`. `FabricClient::from_env()` constructs.
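A possible shape for the configuration half of `FabricClient::from_env()` is sketched below. `FabricConfig`, `ConfigError`, and `from_lookup` are hypothetical names; the sketch also assumes `FABRIC_BASE_URL` holds a full `https://` URL, which the plan implies but does not state.

```rust
use std::env;

#[derive(Debug, PartialEq)]
struct FabricConfig {
    base_url: String,
    api_token: String,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    Missing(&'static str),
    InsecureBaseUrl(String),
}

impl FabricConfig {
    /// Build the config from any key lookup, so tests need not touch the
    /// process environment.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, ConfigError> {
        let get = |name: &'static str| {
            lookup(name)
                .map(|v| v.trim().to_owned())
                .filter(|v| !v.is_empty())
                .ok_or(ConfigError::Missing(name))
        };
        let base_url = get("FABRIC_BASE_URL")?;
        let api_token = get("FABRIC_API_TOKEN")?;
        // Assumption: the edge is only ever reached over HTTPS.
        if !base_url.starts_with("https://") {
            return Err(ConfigError::InsecureBaseUrl(base_url));
        }
        Ok(Self {
            base_url: base_url.trim_end_matches('/').to_owned(),
            api_token,
        })
    }

    /// Read the two variables injected via `envFrom: secretRef`.
    fn from_env() -> Result<Self, ConfigError> {
        Self::from_lookup(|name| env::var(name).ok())
    }
}
```

Failing fast at boot on a missing or empty variable keeps a misconfigured pod from silently writing nowhere.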

The `FABRIC_BASE_URL` indirection is what makes the dev/prod split (§5) free — point `orbstack` at `fabric-dev.lornu.ai`, point the cloud clusters at `fabric.lornu.ai`.

### 3. Provenance / observability

Every `ingest()` call attaches:

- `cluster_id` (from env at boot)
- `agent` (e.g. `relic-courier`, `boots`, `librarian`)
- `run_id` (per-invocation ULID)
- `trace_id` (W3C traceparent, propagated up from whatever triggered the agent)

Fabric's silver schema gets a `cluster_id` column on the relevant tables; the existing `tenant_id` stays for billing/auth, `cluster_id` is for forensics ("which cluster wrote this row").

This unblocks a real on-call story: a malformed event in silver can be traced back to the exact pod that wrote it.
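As an illustration of the four provenance fields, the sketch below derives `trace_id` from an incoming W3C `traceparent` header (version `00`: `00-<32 hex>-<16 hex>-<2 hex>`, lowercase, with all-zero ids invalid). The `Provenance` struct and its methods are hypothetical; the ULID `run_id` is taken as input because generating one would need an extra crate.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Provenance {
    cluster_id: String,
    agent: String,
    run_id: String,
    trace_id: String,
}

/// Return the trace id from a version-00 `traceparent` value, or `None` if
/// the header is malformed.
fn trace_id_from_traceparent(header: &str) -> Option<&str> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    if parts.next().is_some() || version != "00" {
        return None;
    }
    let is_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    let non_zero = |s: &str| s.bytes().any(|b| b != b'0');
    let valid = is_hex(trace_id, 32)
        && is_hex(parent_id, 16)
        && is_hex(flags, 2)
        && non_zero(trace_id)
        && non_zero(parent_id);
    valid.then_some(trace_id)
}

impl Provenance {
    /// Build provenance for one invocation from boot-time and per-run values.
    fn from_traceparent(
        cluster_id: &str,
        agent: &str,
        run_id: &str,
        traceparent: &str,
    ) -> Option<Self> {
        let trace_id = trace_id_from_traceparent(traceparent)?;
        Some(Self {
            cluster_id: cluster_id.to_owned(),
            agent: agent.to_owned(),
            run_id: run_id.to_owned(),
            trace_id: trace_id.to_owned(),
        })
    }

    /// Key/value pairs to attach to an ingest call.
    fn fields(&self) -> [(&'static str, &str); 4] {
        [
            ("cluster_id", self.cluster_id.as_str()),
            ("agent", self.agent.as_str()),
            ("run_id", self.run_id.as_str()),
            ("trace_id", self.trace_id.as_str()),
        ]
    }
}
```

Rejecting a malformed header at the SDK boundary, rather than storing it, keeps the `trace_id` column joinable against tracing backends.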

### 4. Degraded-mode (CF unreachable)

Cloudflare outages happen. K8s agents can't block forever. The SDK gets a `LocalBufferingFabricClient` decorator:

- On 5xx / network error: queue the event to a local SQLite file (`/var/lib/lornu/fabric-buffer.db`) per pod.
- A sidecar (or a tokio task in the agent itself) drains the buffer back to fabric when reachable.
- Idempotency keys on every event (already a design principle in the fabric MVP) make replay safe.

Query path can't be buffered (it needs an answer now). Document graceful-fail: queries return `Err(FabricUnreachable)`, callers decide whether to block, use stale cache, or proceed without context.

The Cloudflare KV hot cache on the fabric edge already absorbs some of this; the buffering layer covers the case where the edge itself is unreachable.
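The write-side behaviour can be sketched with an in-memory queue standing in for the SQLite file. Everything here is illustrative: `Upstream`, `BufferingClient`, and `FakeEdge` are hypothetical names, the trait is synchronous for brevity, and dropping events on a 4xx rejection during drain is an assumption the plan does not address.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct Event {
    idempotency_key: String,
    payload: String,
}

#[derive(Debug, PartialEq)]
enum SendError {
    Unreachable,   // network error or 5xx: safe to retry later
    Rejected(u16), // 4xx: retrying will not help
}

trait Upstream {
    fn send(&mut self, event: &Event) -> Result<(), SendError>;
}

/// Decorator: forwards to `inner`, queueing events while the edge is down.
struct BufferingClient<U: Upstream> {
    inner: U,
    pending: VecDeque<Event>, // stand-in for the per-pod SQLite buffer
}

impl<U: Upstream> BufferingClient<U> {
    fn new(inner: U) -> Self {
        Self { inner, pending: VecDeque::new() }
    }

    /// Send now if possible; otherwise buffer and report success to the caller.
    fn ingest(&mut self, event: Event) -> Result<(), SendError> {
        if !self.pending.is_empty() {
            // Preserve ordering: never overtake already-buffered events.
            self.pending.push_back(event);
            return Ok(());
        }
        match self.inner.send(&event) {
            Err(SendError::Unreachable) => {
                self.pending.push_back(event);
                Ok(())
            }
            other => other,
        }
    }

    /// Replay buffered events in order; stop at the first unreachable error.
    /// Returns how many events were delivered.
    fn drain(&mut self) -> usize {
        let mut delivered = 0;
        while let Some(event) = self.pending.front() {
            match self.inner.send(event) {
                Err(SendError::Unreachable) => break,
                Ok(()) => delivered += 1,
                Err(SendError::Rejected(_)) => {} // drop: assumption, see lead-in
            }
            self.pending.pop_front();
        }
        delivered
    }
}

/// Fake edge that deduplicates on idempotency key, as the fabric is said to.
struct FakeEdge {
    reachable: bool,
    stored: Vec<String>,
}

impl Upstream for FakeEdge {
    fn send(&mut self, event: &Event) -> Result<(), SendError> {
        if !self.reachable {
            return Err(SendError::Unreachable);
        }
        if !self.stored.contains(&event.idempotency_key) {
            self.stored.push(event.idempotency_key.clone());
        }
        Ok(())
    }
}
```

An event is removed from the buffer only after the edge has acknowledged it, so a crash mid-drain can resend an event; the idempotency key is what makes that resend harmless.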

### 5. Dev / prod split

Two parallel deployments:

- **fabric-dev.lornu.ai** — separate Worker, separate D1 / R2 / Vectorize namespaces, separate KV. `orbstack` cluster points here. Developers can also point local CLI tools here.
- **fabric.lornu.ai** — prod. Cloud clusters (`aks-hub`, `gke-prod`, `eks-steve-1`) point here.

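One tenant per (env × cluster) pair gives at most 8 tokens (2 environments × 4 clusters) at the current cluster count; fewer if each cluster is only provisioned in the environment it points at. ESO sync makes this manageable.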

### 6. MCP task loop

The fabric already has `/mcp/task/next` and `/mcp/response` for an agent-pull task model. To use this from K8s:

- Each agent runs `FabricClient::mcp_next_task()` in a loop with backoff.
- Tasks come back with a `target_cluster` field; an agent only claims tasks scoped to its own cluster (or to `any`).
- Long-running tasks heartbeat via `mcp_respond(..., status: InProgress)` to refresh the Durable Object lease.

This is the path that makes the fabric not just a passive data store but an active task router — a single dispatcher across all five clusters.
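The claim rule and heartbeat timing from the list above might look like this. `TargetCluster`, `may_claim`, and `heartbeat_due` are hypothetical names, the literal `any` is taken from the text, and heartbeating at half the lease TTL is an assumption; the plan does not specify a lease duration or refresh interval.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
enum TargetCluster {
    Any,
    Only(String),
}

impl TargetCluster {
    /// Parse the wire value of `target_cluster`; the literal "any" is a wildcard.
    fn parse(raw: &str) -> Self {
        if raw == "any" {
            TargetCluster::Any
        } else {
            TargetCluster::Only(raw.to_owned())
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Task {
    id: String,
    target_cluster: TargetCluster,
}

/// An agent claims a task only if it targets its own cluster or `any`.
fn may_claim(task: &Task, my_cluster_id: &str) -> bool {
    match &task.target_cluster {
        TargetCluster::Any => true,
        TargetCluster::Only(id) => id == my_cluster_id,
    }
}

/// Heartbeat once half the lease has elapsed since the last refresh
/// (the half-TTL margin is an assumption).
fn heartbeat_due(last_refresh: Instant, now: Instant, lease_ttl: Duration) -> bool {
    now.duration_since(last_refresh) >= lease_ttl / 2
}
```

One consequence of the wildcard is that no cluster should be given the `cluster_id` value `any`.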

### 7. Bootstrapping a new cluster (runbook)

```
1. Provision tenant + cluster_id in fabric (admin script).
2. Mint API token; write to Azure Key Vault at fabric/<cluster_id>/api-token.
3. Add fabric.lornu.ai (or fabric-dev) URL to Azure Key Vault: fabric/<cluster_id>/base-url.
4. Apply crossplane/<cluster>/fabric/externalsecret.yaml.
5. Verify: kubectl exec into any lornu-sdk pod → curl $FABRIC_BASE_URL/health.
6. Tag the cluster_id in the fabric's tenant registry for billing/forensics.
```

Six steps, scriptable. New cluster onboards in <10 minutes.

### 8. Local OrbStack specifics

The local cluster doesn't have Azure Key Vault access. Two options:

- **a)** Local ESO with a `kubernetes`-backed `SecretStore` pointing at a sealed secret committed to a developer-only repo (not great).
- **b)** A `lornu-cli` subcommand that writes the dev fabric credentials directly into the local cluster as a regular Secret (`lornu cli fabric link --env dev`). Simple, no Key Vault dependency, per-developer scope.

Recommend **(b)** for local. Cloud clusters use ESO.

---

## The bigger question — extend MVP or build new?

The user noted this is "just the MVP" and "we could make a new, better DF." Three honest options:

### Option A — extend the Cloudflare MVP (recommended)

- Add the `cluster_id` field + per-cluster token machinery to the existing data-fabric repo.
- Build `FabricClient` in `bullpen/src/sdk/`.
- Ship in 2-3 weeks. Reuses 100% of existing schema / endpoints / Durable Object logic.
- Net cost: zero new infra.

**Pros:** Fast. Honors all the WS1–WS5 work that's already shipped. CF edge gives consistent low-latency egress from any cloud.
**Cons:** All five clusters depend on the public internet to function. CF outage = all agents blocked beyond their local buffer.

### Option B — K8s-native fabric inside the AKS hub

- New repo (`lornu-ai/data-fabric-k8s`). Rust pod (axum) + Postgres + S3-compat + Qdrant in-cluster.
- Clusters in different clouds reach via the AKS hub's public ingress (same egress dependency as A, just to a different endpoint we run).
- Could federate (one fabric pod per region) but that re-creates the multi-master problem the MVP avoided by being edge-native.

**Pros:** Fewer external SaaS deps. Could run fully airgapped in principle.
**Cons:** 10× the ops surface (Postgres backups, Qdrant tuning, pod sizing, etc.). Doesn't actually solve the "CF is unreachable" problem — it just moves the dep from CF to AKS-hub's ingress, which has its own uptime story.

### Option C — hybrid (CF system-of-record + per-cluster read cache)

- Keep CF MVP as the canonical writer.
- Each K8s cluster runs a small read-cache pod (just Vectorize + KV mirror, no D1) for hot retrieval paths.
- Cache invalidation via fabric → cluster webhooks (or pulled on schedule).

**Pros:** Survives CF outage for queries (cluster pods serve from local cache); writes still go to CF when reachable, buffered when not.
**Cons:** Cache invalidation is the second hardest problem in CS. Worth doing only after CF outages are an observed pain.

**Recommendation:** Option A now. Get the link-in working end-to-end with all five clusters as clients of the MVP. Revisit Option C if buffered-write / queries-from-stale-cache becomes a real operational concern (probably 3-6 months out, based on actual CF reliability and team pain).

Option B is the wrong shape for now — it ports the edge MVP back to a cluster without any of the edge benefits.

---

## Phased delivery (under Option A)

**Phase 1 — SDK + identity** (1 week)
- `lornu-sdk::FabricClient` trait + reqwest impl + tests
- ESO manifest template + per-cluster token-mint script
- Bootstrap one cluster (orbstack-dev) end-to-end; smoke-test ingest + query from a hello-world agent

**Phase 2 — provenance + observability** (1 week)
- `cluster_id` schema migration on silver tables
- Trace propagation + `cluster_id` / `agent` / `run_id` enforcement at the SDK boundary
- Operator dashboard pane: "events by cluster, last 24h"

**Phase 3 — degraded-mode** (1 week)
- `LocalBufferingFabricClient` decorator + SQLite buffer
- Drain task + replay tests
- Runbook for "fabric unreachable" page

**Phase 4 — MCP task loop adoption** (gate on first real consumer)
- Wire `FabricClient::mcp_next_task` into one bullpen agent (likely `boots` — it's the simplest)
- Heartbeat lease pattern proven in prod

**Phase 5 — production cluster rollouts** (gated per cluster)
- Bootstrap AKS hub → 7-day soak
- Bootstrap GKE prod → 7-day soak
- Bootstrap EKS steve-1 → 7-day soak

Total: ~6 weeks calendar, ~2.5 weeks engineering, mostly serial. Phases 4 and 5 can happen in parallel.

---

## Open questions for the team

1. **Token rotation cadence.** Per-cluster tokens — quarterly? On-demand only? Tied to the ciso-agent #15 token-lifecycle work.
2. **What's the first real consumer?** Phase 4 wants a real bullpen agent on the MCP loop. Best candidate: `boots` (platform-health) or `librarian` (cross-repo audit). Either way, the agent's existing Cargo.toml needs a `lornu-sdk` bump.
3. **Multi-tenancy beyond cluster boundary.** Today's `tenant` model is per-project; cluster is orthogonal. Does the org want one tenant per cluster, one tenant per agent-class, or one tenant per project-team? Affects token scoping and billing.
4. **Local OrbStack creds.** §8 — `lornu cli fabric link --env dev` writes a Secret; should this Secret be tied to the developer's GitHub identity (via gh OIDC) or just a static token? OIDC is right but takes another week.
5. **Read-cache (Option C) trigger.** What CF outage signal would make us escalate to Option C? Suggest: any outage > 30 min OR three outages in a month above 10 min each.
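The trigger suggested in question 5 is simple enough to state as a predicate. This is a sketch of that suggested rule only; `should_escalate_to_option_c` is a hypothetical name, and the input is assumed to be the durations of outages observed in one calendar month.

```rust
use std::time::Duration;

/// Escalate if any single outage exceeded 30 minutes, or if at least three
/// outages in the month each exceeded 10 minutes.
fn should_escalate_to_option_c(outages_this_month: &[Duration]) -> bool {
    let long_threshold = Duration::from_secs(30 * 60);
    let repeat_threshold = Duration::from_secs(10 * 60);

    let any_long = outages_this_month.iter().any(|d| *d > long_threshold);
    let repeated = outages_this_month
        .iter()
        .filter(|d| **d > repeat_threshold)
        .count()
        >= 3;

    any_long || repeated
}
```

Both comparisons are strict, matching the wording "> 30 min" and "above 10 min": an outage of exactly 30 minutes does not trip the first condition on its own.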

---

## Cross-references

- `stevedores-org/data-fabric` — this repo (MVP)
- `lornu-ai/bullpen` — agent roster that will consume the SDK
- `lornu-ai/relic-swarm` — the courier already has a fabric-shaped use case (per its data-flow doc §8 aivcs feedback loop)
- `lornu-ai/ciso-agent#15` — token lifecycle automation; per-cluster fabric tokens land in this scope
- `lornu-ai/ciso-agent#17` — multi-session arbitration via aivcs; fabric's MCP loop is a natural place to record session intent
- `lornu-ai/ciso-agent#21` — CISO agent's own security posture; the CISO agent is itself a fabric client (read-everything access via tenant + token)

🤖 Filed as a link-in plan / design proposal. Comments + amendments welcome before any code lands.

