Link-in plan: connecting local OrbStack k8s + GKE/AKS/EKS to the data fabric
Drafted 2026-05-29 as a design proposal. Not a PR — a plan for review. The MVP shape (this repo) stays the system of record; this issue is about how K8s clusters become first-class fabric clients, and whether a v2 ("new, better DF") is warranted.
What we have today
Edge runtime: Rust on Cloudflare Workers — /v1/* API (ingest, query, memory, artifacts, MCP task loop, policy).
tenant.rs / tenant_security.rs shapes already exist (#102 area).
What we don't have yet
A K8s-native client surface. Today, any K8s pod that wants to ingest or query must hand-roll an HTTP client, hand-roll auth, hand-roll retries, hand-roll trace propagation. Five separate clusters (orbstack, lornu-aks-hub, gke_…_lornu-gke-prod, steve-1-eks, plus dev clusters) means five hand-rolls.
A per-cluster identity story scoped to the fabric tenancy model.
A degraded-mode story for when Cloudflare is unreachable from a cluster.
A dev / prod split that prevents an OrbStack-running agent from polluting prod tables.
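The link-in, end-to-end
1. lornu-sdk::FabricClient (a thin Rust client in bullpen/src/sdk/)
Every bullpen agent already depends on lornu-sdk. Add a FabricClient trait + impl that wraps the fabric edge API.
One concrete impl over reqwest with built-in retry (tokio-retry with the same exp-1s/2s/4s + jitter shape the org's harvesters use), structured-trace propagation (tracing + traceparent header), and a swappable LocalBufferingFabricClient for the degraded-mode path (§4).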
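The retry shape can be illustrated without any crates. The sketch below only computes the delays. It assumes the budget stops after the third retry and that jitter adds up to 50% of the base delay; neither detail is stated above, and `base_delay` and `jittered_delay` are hypothetical names. The real client would presumably hand equivalent values to tokio-retry.

```rust
use std::time::Duration;

/// Base delay for a zero-indexed retry attempt: 1s, 2s, 4s.
/// Returns None once the (assumed) budget of three retries is spent.
pub fn base_delay(attempt: u32) -> Option<Duration> {
    const MAX_RETRIES: u32 = 3; // assumption: "1s/2s/4s" means three retries
    if attempt >= MAX_RETRIES {
        return None;
    }
    Some(Duration::from_secs(1u64 << attempt))
}

/// Adds jitter on top of the base delay.
/// `unit` is a random sample in [0, 1] supplied by the caller's RNG;
/// the 50% ceiling is an illustrative choice, not the harvesters' value.
pub fn jittered_delay(attempt: u32, unit: f64) -> Option<Duration> {
    let base = base_delay(attempt)?;
    let unit = unit.clamp(0.0, 1.0);
    Some(base + base.mul_f64(0.5 * unit))
}
```

Passing the random sample in as a parameter keeps the function deterministic under test.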
Agents never write reqwest::Client::new() themselves. They take Arc<dyn FabricClient> in their constructor. Tests use an in-memory fake.
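A minimal sketch of that injection pattern follows. It is synchronous and std-only for brevity, so it is not the proposed API: the real trait would most likely be async, and `Event`, `InMemoryFabric`, `Agent`, and the method signatures are hypothetical stand-ins.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical event payload; the real schema lives in the fabric repo.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub idempotency_key: String,
    pub body: String,
}

/// Error name borrowed from the degraded-mode section (§4).
#[derive(Debug, PartialEq)]
pub struct FabricUnreachable;

/// Assumed client surface: only ingest and query are sketched here.
pub trait FabricClient: Send + Sync {
    fn ingest(&self, event: Event) -> Result<(), FabricUnreachable>;
    fn query(&self, needle: &str) -> Result<Vec<String>, FabricUnreachable>;
}

/// In-memory fake for tests: stores events in a Vec, never fails.
#[derive(Default)]
pub struct InMemoryFabric {
    events: Mutex<Vec<Event>>,
}

impl InMemoryFabric {
    pub fn ingested(&self) -> Vec<Event> {
        self.events.lock().unwrap().clone()
    }
}

impl FabricClient for InMemoryFabric {
    fn ingest(&self, event: Event) -> Result<(), FabricUnreachable> {
        self.events.lock().unwrap().push(event);
        Ok(())
    }

    fn query(&self, needle: &str) -> Result<Vec<String>, FabricUnreachable> {
        // Naive substring match, standing in for the real query endpoint.
        let events = self.events.lock().unwrap();
        Ok(events
            .iter()
            .filter(|e| e.body.contains(needle))
            .map(|e| e.body.clone())
            .collect())
    }
}

/// An agent receives its client by injection and never builds one itself.
pub struct Agent {
    fabric: Arc<dyn FabricClient>,
}

impl Agent {
    pub fn new(fabric: Arc<dyn FabricClient>) -> Self {
        Self { fabric }
    }

    pub fn report(&self, key: &str, body: &str) -> Result<(), FabricUnreachable> {
        self.fabric.ingest(Event {
            idempotency_key: key.to_string(),
            body: body.to_string(),
        })
    }
}
```

A test would construct `Agent::new(Arc::new(InMemoryFabric::default()))` and assert on `ingested()`; production code would pass the reqwest-backed implementation through the same constructor.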
2. Per-cluster identity
Each cluster gets its own fabric API token scoped to its tenant. Fabric's existing tenant.rs is the contract; add:
A cluster_id field on tenant_metadata (e.g. orbstack-steven, aks-hub, gke-lornu-prod, eks-steve-1).
One token per cluster, minted via a new admin endpoint (or wrangler script).
Tokens land in each cluster as a Secret synced from Azure Key Vault (KV) via the External Secrets Operator (ESO).
Pods pick the resulting env vars up via envFrom: secretRef. FabricClient::from_env() constructs the client from them.
The FABRIC_BASE_URL indirection is what makes the dev/prod split (§5) free — point orbstack at fabric-dev.lornu.ai, point the cloud clusters at fabric.lornu.ai.
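As a rough illustration of what from_env() might read, the sketch below builds a config from environment variables. Only FABRIC_BASE_URL is named above; `FABRIC_API_TOKEN`, `FABRIC_CLUSTER_ID`, `FabricConfig`, and `MissingVar` are assumed names for illustration.

```rust
use std::env;

/// Hypothetical config shape; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct FabricConfig {
    pub base_url: String,
    pub api_token: String,
    pub cluster_id: String,
}

/// Names the first variable that was missing or empty.
#[derive(Debug, PartialEq)]
pub struct MissingVar(pub &'static str);

impl FabricConfig {
    /// Builds the config from any key lookup, so tests need not
    /// mutate the process environment.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, MissingVar>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &'static str| {
            lookup(key)
                .filter(|value| !value.is_empty())
                .ok_or(MissingVar(key))
        };
        Ok(Self {
            // Strip a trailing slash so path joins stay predictable.
            base_url: get("FABRIC_BASE_URL")?.trim_end_matches('/').to_string(),
            api_token: get("FABRIC_API_TOKEN")?, // assumed variable name
            cluster_id: get("FABRIC_CLUSTER_ID")?, // assumed variable name
        })
    }

    /// Reads the same keys from the pod's environment.
    pub fn from_env() -> Result<Self, MissingVar> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Because the base URL is plain configuration, switching a cluster between fabric-dev.lornu.ai and fabric.lornu.ai needs no code change.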
3. Provenance / observability
Every ingest() call attaches:
cluster_id (from env at boot)
agent (e.g. relic-courier, boots, librarian)
run_id (per-invocation ULID)
trace_id (W3C traceparent, propagated up from whatever triggered the agent)
Fabric's silver schema gets a cluster_id column on the relevant tables; the existing tenant_id stays for billing/auth, cluster_id is for forensics ("which cluster wrote this row").
This unblocks a real on-call story: a malformed event in silver can be traced back to the exact pod that wrote it.
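The four provenance fields could travel as one small struct. The sketch below also shows one way to pull the trace id out of a W3C traceparent header (`version-traceid-parentid-flags`). `Provenance` and `trace_id_from_traceparent` are hypothetical names, and the parser ignores anything after the fourth field.

```rust
/// Hypothetical provenance record attached to every ingest() call.
#[derive(Debug, Clone, PartialEq)]
pub struct Provenance {
    pub cluster_id: String,
    pub agent: String,
    pub run_id: String,
    pub trace_id: String,
}

/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Extracts the 32-hex-character trace id from a traceparent header.
/// Returns None for malformed headers and for the all-zero trace id,
/// which the W3C Trace Context spec treats as invalid.
pub fn trace_id_from_traceparent(header: &str) -> Option<&str> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    let well_formed = is_lower_hex(version, 2)
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2)
        && trace_id.bytes().any(|b| b != b'0');
    if well_formed {
        Some(trace_id)
    } else {
        None
    }
}

impl Provenance {
    /// Builds a record from boot-time values plus the incoming header.
    pub fn new(cluster_id: &str, agent: &str, run_id: &str, traceparent: &str) -> Option<Self> {
        Some(Self {
            cluster_id: cluster_id.to_string(),
            agent: agent.to_string(),
            run_id: run_id.to_string(),
            trace_id: trace_id_from_traceparent(traceparent)?.to_string(),
        })
    }
}
```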
4. Degraded-mode (CF unreachable)
Cloudflare outages happen. K8s agents can't block forever. The SDK gets a LocalBufferingFabricClient decorator:
On 5xx / network error: queue the event to a local SQLite file (/var/lib/lornu/fabric-buffer.db) per pod.
A sidecar (or a tokio task in the agent itself) drains the buffer back to fabric when reachable.
Idempotency keys on every event (already a design principle in the fabric MVP) make replay safe.
Query path can't be buffered (it needs an answer now). Document graceful-fail: queries return Err(FabricUnreachable), callers decide whether to block, use stale cache, or proceed without context.
KV cache on the fabric edge already absorbs some of this; the buffering layer covers the case where the edge itself is unreachable.
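A std-only sketch of the decorator idea follows. An in-memory queue stands in for the SQLite file, the trait is cut down to ingest, and `ToggleClient` is a hypothetical test double whose dedupe-by-key behaviour mimics what idempotency keys would give the real edge.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub idempotency_key: String,
    pub body: String,
}

#[derive(Debug, PartialEq)]
pub struct FabricUnreachable;

/// Cut-down client surface for this sketch.
pub trait FabricClient {
    fn ingest(&self, event: &Event) -> Result<(), FabricUnreachable>;
}

/// Decorator: forwards to `inner`, queues the event when that fails.
pub struct LocalBufferingFabricClient<C: FabricClient> {
    inner: C,
    buffer: Mutex<VecDeque<Event>>, // stand-in for the per-pod SQLite buffer
}

impl<C: FabricClient> LocalBufferingFabricClient<C> {
    pub fn new(inner: C) -> Self {
        Self { inner, buffer: Mutex::new(VecDeque::new()) }
    }

    pub fn inner(&self) -> &C {
        &self.inner
    }

    pub fn buffered(&self) -> usize {
        self.buffer.lock().unwrap().len()
    }

    /// Replays queued events oldest-first and stops at the first failure.
    /// Returns how many events were delivered.
    pub fn drain(&self) -> usize {
        let mut buffer = self.buffer.lock().unwrap();
        let mut sent = 0;
        while let Some(event) = buffer.front() {
            if self.inner.ingest(event).is_err() {
                break;
            }
            buffer.pop_front();
            sent += 1;
        }
        sent
    }
}

impl<C: FabricClient> FabricClient for LocalBufferingFabricClient<C> {
    fn ingest(&self, event: &Event) -> Result<(), FabricUnreachable> {
        if self.inner.ingest(event).is_err() {
            self.buffer.lock().unwrap().push_back(event.clone());
        }
        Ok(()) // the caller never blocks on an unreachable edge
    }
}

/// Test double: reachable or not, and dedupes by idempotency key.
pub struct ToggleClient {
    up: AtomicBool,
    received: Mutex<Vec<String>>,
}

impl ToggleClient {
    pub fn new(up: bool) -> Self {
        Self { up: AtomicBool::new(up), received: Mutex::new(Vec::new()) }
    }

    pub fn set_up(&self, up: bool) {
        self.up.store(up, Ordering::SeqCst);
    }

    pub fn received(&self) -> Vec<String> {
        self.received.lock().unwrap().clone()
    }
}

impl FabricClient for ToggleClient {
    fn ingest(&self, event: &Event) -> Result<(), FabricUnreachable> {
        if !self.up.load(Ordering::SeqCst) {
            return Err(FabricUnreachable);
        }
        let mut seen = self.received.lock().unwrap();
        if !seen.contains(&event.idempotency_key) {
            seen.push(event.idempotency_key.clone());
        }
        Ok(())
    }
}
```

Draining from the front and stopping at the first failure keeps replay order stable. A crash between a successful send and the dequeue can cause a resend, which is safe only because the receiver dedupes on the idempotency key.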
5. Dev / prod split
Two parallel deployments:
fabric-dev.lornu.ai — separate Worker, separate D1 / R2 / Vectorize namespaces, separate KV. orbstack cluster points here. Developers can also point local CLI tools here.
fabric.lornu.ai — prod. Cloud clusters (aks-hub, gke-prod, eks-steve-1) point here.
Tenant per (env × cluster) → 8 tokens total at the current cluster count. ESO sync makes this manageable.
6. MCP task loop
The fabric already has /mcp/task/next and /mcp/response for an agent-pull task model. To use this from K8s:
Each agent runs FabricClient::mcp_next_task() in a loop with backoff.
Tasks come back with a target_cluster field; an agent only claims tasks scoped to its own cluster (or to any).
Long-running tasks heartbeat via mcp_response(..., status: InProgress) to refresh the Durable Object lease.
This is the path that makes the fabric not just a passive data store but an active task router — a single dispatcher across all five clusters.
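The claim rule and the poll backoff can be sketched as two small pure functions. `Task`, `TargetCluster`, and the 1s/60s bounds are assumptions for illustration; the actual task payload and lease timing are defined by the fabric's MCP endpoints.

```rust
use std::time::Duration;

/// Hypothetical encoding of the target_cluster field.
#[derive(Debug, Clone, PartialEq)]
pub enum TargetCluster {
    Any,
    Only(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub id: String,
    pub target_cluster: TargetCluster,
}

/// An agent claims a task only if it targets this cluster or any cluster.
pub fn should_claim(task: &Task, my_cluster_id: &str) -> bool {
    match &task.target_cluster {
        TargetCluster::Any => true,
        TargetCluster::Only(id) => id.as_str() == my_cluster_id,
    }
}

/// Poll interval for the next mcp_next_task() call: reset to the floor
/// after receiving work, otherwise double up to a ceiling.
pub fn next_poll_delay(current: Duration, got_task: bool) -> Duration {
    const MIN: Duration = Duration::from_secs(1); // assumed floor
    const MAX: Duration = Duration::from_secs(60); // assumed ceiling
    if got_task {
        MIN
    } else {
        (current * 2).clamp(MIN, MAX)
    }
}
```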
7. Bootstrapping a new cluster (runbook)
1. Provision tenant + cluster_id in fabric (admin script).
2. Mint API token; write to Azure Key Vault at fabric/<cluster_id>/api-token.
3. Add fabric.lornu.ai (or fabric-dev) URL to KV: fabric/<cluster_id>/base-url.
4. Apply crossplane/<cluster>/fabric/externalsecret.yaml.
5. Verify: kubectl exec into any lornu-sdk pod → curl $FABRIC_BASE_URL/health.
6. Tag the cluster_id in the fabric's tenant registry for billing/forensics.
Six steps, scriptable. New cluster onboards in <10 minutes.
8. Local OrbStack specifics
The local cluster doesn't have Azure Key Vault access. Two options:
a) Local ESO with a kubernetes-backed SecretStore pointing at a sealed secret committed to a developer-only repo (not great).
b) A lornu-cli subcommand that writes the dev fabric credentials directly into the local cluster as a regular Secret (lornu cli fabric link --env dev). Simple, no KV dependency, per-developer scope.
Recommend (b) for local. Cloud clusters use ESO.
The bigger question — extend MVP or build new?
The user noted this is "just the MVP" and "we could make a new, better DF." Three honest options:
Option A — extend the Cloudflare MVP (recommended)
Add the cluster_id field + per-cluster token machinery to the existing data-fabric repo.
Build FabricClient in bullpen/src/sdk/.
Ship in 2-3 weeks. Reuses 100% of existing schema / endpoints / Durable Object logic.
Net cost: zero new infra.
Pros: Fast. Honors all the WS1–WS5 work that's already shipped. CF edge gives consistent low-latency egress from any cloud.
Cons: All five clusters depend on the public internet to function. CF outage = all agents blocked beyond their local buffer.
Option B — K8s-native fabric inside the AKS hub
New repo (lornu-ai/data-fabric-k8s). Rust pod (axum) + Postgres + S3-compat + Qdrant in-cluster.
Clusters in different clouds reach via the AKS hub's public ingress (same egress dependency as A, just to a different endpoint we run).
Could federate (one fabric pod per region) but that re-creates the multi-master problem the MVP avoided by being edge-native.
Pros: Fewer external SaaS deps. Could run fully airgapped in principle.
Cons: 10× the ops surface (Postgres backups, Qdrant tuning, pod sizing, etc.). Doesn't actually solve the "CF is unreachable" problem; it just moves the dep from CF to AKS-hub's ingress, which has its own uptime story.
Option C — hybrid (CF system-of-record + per-cluster read cache)
Keep CF MVP as the canonical writer.
Each K8s cluster runs a small read-cache pod (just Vectorize + KV mirror, no D1) for hot retrieval paths.
Cache invalidation via fabric → cluster webhooks (or pulled on schedule).
Pros: Survives CF outage for queries (cluster pods serve from local cache); writes still go to CF when reachable, buffered when not.
Cons: Cache invalidation is the second hardest problem in CS. Worth doing only after CF outages are an observed pain.
Recommendation: Option A now. Get the link-in working end-to-end with all five clusters as clients of the MVP. Revisit Option C if buffered-write / queries-from-stale-cache becomes a real operational concern (probably 3-6 months out, based on actual CF reliability and team pain).
Option B is the wrong shape for now — it ports the edge MVP back to a cluster without any of the edge benefits.
Open questions for the team
What's the first real consumer? Phase 4 wants a real bullpen agent on the MCP loop. Best candidate: boots (platform-health) or librarian (cross-repo audit). Either way, the agent's existing Cargo.toml needs a lornu-sdk bump.
Multi-tenancy beyond cluster boundary. Today's tenant model is per-project; cluster is orthogonal. Does the org want one tenant per cluster, one tenant per agent-class, or one tenant per project-team? Affects token scoping and billing.
Local OrbStack creds. §8 — lornu cli fabric link --env dev writes a Secret; should this Secret be tied to the developer's GitHub identity (via gh OIDC) or just a static token? OIDC is right but takes another week.
Read-cache (Option C) trigger. What CF outage signal would make us escalate to Option C? Suggest: any outage > 30 min OR three outages in a month above 10 min each.
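The suggested trigger is simple enough to state as a predicate over one month of observed outage durations. This is only a sketch of the rule as worded above; `should_escalate` is a hypothetical name, and both thresholds are read as strict ("greater than").

```rust
use std::time::Duration;

/// True when a month's outages meet the suggested Option C trigger:
/// any single outage longer than 30 minutes, or at least three
/// outages longer than 10 minutes each.
pub fn should_escalate(outages_this_month: &[Duration]) -> bool {
    let long = Duration::from_secs(30 * 60);
    let medium = Duration::from_secs(10 * 60);
    let any_long = outages_this_month.iter().any(|d| *d > long);
    let medium_count = outages_this_month.iter().filter(|d| **d > medium).count();
    any_long || medium_count >= 3
}
```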
Cross-references
stevedores-org/data-fabric — this repo (MVP)
lornu-ai/bullpen — agent roster that will consume the SDK
lornu-ai/relic-swarm — the courier already has a fabric-shaped use case (per its data-flow doc §8 aivcs feedback loop)
lornu-ai/ciso-agent#15 — token lifecycle automation; per-cluster fabric tokens land in this scope
lornu-ai/ciso-agent#17 — multi-session arbitration via aivcs; fabric's MCP loop is a natural place to record session intent
lornu-ai/ciso-agent#21 — CISO agent's own security posture; the CISO agent is itself a fabric client (read-everything access via tenant + token)
🤖 Filed as a link-in plan / design proposal. Comments + amendments welcome before any code lands.
Topology
Single edge API, five (and growing) cluster clients. No fabric pods inside any K8s cluster in v1.
Phased delivery (under Option A)
Phase 1: SDK + identity (1 week)
lornu-sdk::FabricClient trait + reqwest impl + tests
Phase 2: provenance + observability (1 week)
cluster_id schema migration on silver tables
cluster_id / agent / run_id enforcement at the SDK boundary
Phase 3: degraded-mode (1 week)
LocalBufferingFabricClient decorator + SQLite buffer
Phase 4: MCP task loop adoption (gate on first real consumer)
FabricClient::mcp_next_task into one bullpen agent (likely boots, since it's the simplest)
Phase 5: production cluster rollouts (gated per cluster)
Total: ~6 weeks calendar, ~2.5 weeks engineering, mostly serial. Phases 4 and 5 can happen in parallel.