From 72b738964a3f779b328871efafa1f464bd9adc5b Mon Sep 17 00:00:00 2001
From: Centaur AI <ai@centaur.local>
Date: Fri, 12 Jun 2026 18:38:32 +0000
Subject: [PATCH] docs: document 1password shared quota handling

Amp-Thread-ID: https://ampcode.com/threads/T-019ebd1a-327f-764a-aba3-3664cf9c41bc
---
 contrib/chart/values.yaml                   |  5 +-
 docs/pages/deploying-in-production.mdx      |  7 +-
 docs/pages/operate/onepassword-quota.mdx    | 84 +++++++++++++++++++++
 docs/pages/secrets/onepassword.mdx          | 16 +++-
 docs/public/md/deploying-in-production.md   |  7 +-
 docs/public/md/operate/onepassword-quota.md | 84 +++++++++++++++++++++
 docs/public/md/secrets/onepassword.md       | 16 +++-
 docs/sidebar.ts                             |  1 +
 8 files changed, 211 insertions(+), 9 deletions(-)
 create mode 100644 docs/pages/operate/onepassword-quota.mdx
 create mode 100644 docs/public/md/operate/onepassword-quota.md

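Reviewer note on the "6x" figure quoted in the new runbook: the sketch below is a back-of-the-envelope model, not a measurement. It assumes each live proxy re-reads each cached secret once per TTL, which this patch implies but does not spell out. The function name, the proxy count (50), and the secrets-per-proxy count (4) are hypothetical values chosen for illustration.

```python
def reads_per_hour(live_proxies: int, secrets_per_proxy: int, ttl_seconds: int) -> float:
    """Estimate steady-state 1Password refresh reads per hour.

    Assumption (not confirmed by the patch): every live proxy re-reads
    every cached secret exactly once each time the TTL expires.
    """
    refreshes_per_hour = 3600 / ttl_seconds
    return live_proxies * secrets_per_proxy * refreshes_per_hour


# Hypothetical fleet: 50 live proxies, 4 secrets each.
old_rate = reads_per_hour(50, 4, ttl_seconds=600)   # 10m TTL
new_rate = reads_per_hour(50, 4, ttl_seconds=3600)  # 1h TTL
ratio = old_rate / new_rate
```

Under this model the ratio depends only on the two TTLs, so the reduction holds for any fleet size, while the absolute read rate still grows with the number of live proxies. That is why the runbook lists proxy cleanup ahead of the TTL change.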
diff --git a/contrib/chart/values.yaml b/contrib/chart/values.yaml
index b1667b165..011733db4 100644
--- a/contrib/chart/values.yaml
+++ b/contrib/chart/values.yaml
@@ -29,7 +29,10 @@ ironProxy:
     # one port so callers can reach it.
     pgPort: 5432
   secretSource: onepassword
-  secretTtl: 10m
+  # Deployment default. The binary fallback remains 10m, but a longer chart
+  # value lowers steady-state 1Password reads, because every live proxy
+  # re-reads its secrets each time the TTL expires.
+  secretTtl: 1h
 
 # Base tools delivery for api-rs sandboxes. When enabled, repo-cache keeps `repo`
 # fresh on each node and a tools-bootstrap init container copies `subdir` into
diff --git a/docs/pages/deploying-in-production.mdx b/docs/pages/deploying-in-production.mdx
index 38374abbb..fe8eb42c3 100644
--- a/docs/pages/deploying-in-production.mdx
+++ b/docs/pages/deploying-in-production.mdx
@@ -230,7 +230,7 @@ api:
 
 ironProxy:
   secretSource: onepassword-connect
-  secretTtl: 10m
+  secretTtl: 1h
 
 onepasswordConnect:
   connect:
@@ -249,6 +249,11 @@ sandbox:
 The Kubernetes sandbox backend is the active runtime backend; there is no chart
 switch named `api.sandboxBackend`.
 
+`1h` is the chart's steady-state default because shorter TTLs make every live
+proxy re-read secrets more often. If you run the `onepassword` service-account
+path, treat the 1Password read budget as shared across all service accounts on
+the account.
+
 Install or upgrade:
 
 ```bash
diff --git a/docs/pages/operate/onepassword-quota.mdx b/docs/pages/operate/onepassword-quota.mdx
new file mode 100644
index 000000000..235124bbb
--- /dev/null
+++ b/docs/pages/operate/onepassword-quota.mdx
@@ -0,0 +1,84 @@
+---
+title: Recover from 1Password quota exhaustion
+description: Runbook for sandbox failures caused by 1Password service-account throttling.
+---
+
+# Recover from 1Password quota exhaustion
+
+Centaur's `onepassword` secret source reads `op://...` refs directly from the
+1Password service-account API. That read budget is **account-wide**, not per
+service account. Separate service accounts still help with identity separation
+and audit trails, but they do not isolate quota.
+
+:::warning[Shared budget]
+If operator CLIs, background proxy churn, and the cluster all read through the
+same 1Password account, they consume one shared rolling-window budget.
+Creating another service account during an incident will not reset it.
+:::
+
+## Symptom signature
+
+When the budget is exhausted, the failure shows up in two places:
+
+| Surface | What you see |
+|---------|---------------|
+| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. |
+| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. |
+
+Useful checks:
+
+```bash
+kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \
+  rg 'secret_unavailable|rate limit exceeded'
+```
+
+```bash
+kubectl get pods -n centaur -l centaur.ai/iron-proxy=true
+```
+
+## Immediate recovery
+
+1. Stop the bleed.
+   Pause any operator or CLI workflows that are repeatedly reading from
+   1Password, and clean up stale per-sandbox proxies that no longer correspond
+   to live work. REV-14's terminal-run garbage collection is the primary fix
+   for this class of incident.
+2. Wait for the rolling window to clear.
+   Do not rotate to another service account expecting fresh quota; the limit is
+   shared across the account.
+3. Verify the error signature stops.
+   Re-check the proxy logs and confirm a fresh sandbox can start without the
+   `Invalid or missing API key` loop.
+
+## Reduce steady-state load
+
+Apply the levers in this order:
+
+1. Eliminate background proxy churn.
+   Orphaned sandboxes and proxies keep refreshing secrets even after the user
+   work is over. Keep REV-14 deployed anywhere this incident matters.
+2. Keep the proxy secret TTL long enough for steady state.
+   The chart default is `ironProxy.secretTtl: 1h`, which cuts per-proxy refresh
+   reads to about one-sixth of the old `10m` chart default. Override it only
+   when you need faster propagation of secret changes.
+3. Separate identities, but do not count on quota isolation.
+   Use one service account for cluster secret resolution and another for
+   operator reads if you want cleaner audit trails. Assume they still share one
+   1Password budget.
+4. Revisit the architecture as concurrency grows.
+   If live sandbox count keeps climbing, prefer `onepassword-connect`, move
+   cluster boot secrets off live 1Password reads, or evaluate the 1Password
+   plan tier that changes service-account limits.
+
+## Verify the fix
+
+After changing TTLs or cleaning up leaked proxies, verify that request volume is
+driven by live work rather than background churn:
+
+1. Count live iron-proxy pods and compare that with active sandboxes.
+2. Check recent proxy logs for the absence of `rate limit exceeded`.
+3. Start one new sandbox and confirm its first provider call succeeds.
+
+If the rate-limit signature returns while pod counts stay flat, the remaining
+load is likely coming from operator or external readers rather than sandbox
+lifecycle leaks.
diff --git a/docs/pages/secrets/onepassword.mdx b/docs/pages/secrets/onepassword.mdx
index caf91494b..f7c8990d7 100644
--- a/docs/pages/secrets/onepassword.mdx
+++ b/docs/pages/secrets/onepassword.mdx
@@ -26,7 +26,7 @@ There are two source modes:
 ```yaml
 ironProxy:
   secretSource: onepassword-connect
-  secretTtl: 10m
+  secretTtl: 1h
 
 onepasswordConnect:
   connect:
@@ -48,12 +48,16 @@ OP_CONNECT_TOKEN
 OP_VAULT
 ```
 
+Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh
+their cached secret material less often. If you shorten it, expect more
+background 1Password traffic.
+
 ## Configure the chart (service account)
 
 ```yaml
 ironProxy:
   secretSource: onepassword
-  secretTtl: 10m
+  secretTtl: 1h
 
 secretManager:
   existingSecretName: centaur-infra-env
@@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN
 OP_VAULT
 ```
 
+1Password's service-account rate limit is account-wide, not per service
+account. A second service account separates operator and cluster identities
+and audit trails, but it does **not** buy a second read budget.
+
 It must also include infrastructure secrets such as:
 
 ```text
@@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO
 
 Then run a tool or harness call that reaches an allowed host. If injection
 fails, check the secret entry's `hosts` and `match_*` fields, the 1Password
-item name, `OP_VAULT`, and whether the item has a `credential` field.
+item name, `OP_VAULT`, and whether the item has a `credential` field. If the
+proxy logs `secret_unavailable` with `rate limit exceeded`, see
+[Recover from 1Password quota exhaustion](/operate/onepassword-quota).
diff --git a/docs/public/md/deploying-in-production.md b/docs/public/md/deploying-in-production.md
index 1cf0a945c..025c27cd1 100644
--- a/docs/public/md/deploying-in-production.md
+++ b/docs/public/md/deploying-in-production.md
@@ -231,7 +231,7 @@ api:
 
 ironProxy:
   secretSource: onepassword-connect
-  secretTtl: 10m
+  secretTtl: 1h
 
 onepasswordConnect:
   connect:
@@ -250,6 +250,11 @@ sandbox:
 The Kubernetes sandbox backend is the active runtime backend; there is no chart
 switch named `api.sandboxBackend`.
 
+`1h` is the chart's steady-state default because shorter TTLs make every live
+proxy re-read secrets more often. If you run the `onepassword` service-account
+path, treat the 1Password read budget as shared across all service accounts on
+the account.
+
 Install or upgrade:
 
 ```bash
diff --git a/docs/public/md/operate/onepassword-quota.md b/docs/public/md/operate/onepassword-quota.md
new file mode 100644
index 000000000..235124bbb
--- /dev/null
+++ b/docs/public/md/operate/onepassword-quota.md
@@ -0,0 +1,84 @@
+---
+title: Recover from 1Password quota exhaustion
+description: Runbook for sandbox failures caused by 1Password service-account throttling.
+---
+
+# Recover from 1Password quota exhaustion
+
+Centaur's `onepassword` secret source reads `op://...` refs directly from the
+1Password service-account API. That read budget is **account-wide**, not per
+service account. Separate service accounts still help with identity separation
+and audit trails, but they do not isolate quota.
+
+:::warning[Shared budget]
+If operator CLIs, background proxy churn, and the cluster all read through the
+same 1Password account, they consume one shared rolling-window budget.
+Creating another service account during an incident will not reset it.
+:::
+
+## Symptom signature
+
+When the budget is exhausted, the failure shows up in two places:
+
+| Surface | What you see |
+|---------|---------------|
+| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. |
+| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. |
+
+Useful checks:
+
+```bash
+kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \
+  rg 'secret_unavailable|rate limit exceeded'
+```
+
+```bash
+kubectl get pods -n centaur -l centaur.ai/iron-proxy=true
+```
+
+## Immediate recovery
+
+1. Stop the bleed.
+   Pause any operator or CLI workflows that are repeatedly reading from
+   1Password, and clean up stale per-sandbox proxies that no longer correspond
+   to live work. REV-14's terminal-run garbage collection is the primary fix
+   for this class of incident.
+2. Wait for the rolling window to clear.
+   Do not rotate to another service account expecting fresh quota; the limit is
+   shared across the account.
+3. Verify the error signature stops.
+   Re-check the proxy logs and confirm a fresh sandbox can start without the
+   `Invalid or missing API key` loop.
+
+## Reduce steady-state load
+
+Apply the levers in this order:
+
+1. Eliminate background proxy churn.
+   Orphaned sandboxes and proxies keep refreshing secrets even after the user
+   work is over. Keep REV-14 deployed anywhere this incident matters.
+2. Keep the proxy secret TTL long enough for steady state.
+   The chart default is `ironProxy.secretTtl: 1h`, which cuts per-proxy refresh
+   reads to about one-sixth of the old `10m` chart default. Override it only
+   when you need faster propagation of secret changes.
+3. Separate identities, but do not count on quota isolation.
+   Use one service account for cluster secret resolution and another for
+   operator reads if you want cleaner audit trails. Assume they still share one
+   1Password budget.
+4. Revisit the architecture as concurrency grows.
+   If live sandbox count keeps climbing, prefer `onepassword-connect`, move
+   cluster boot secrets off live 1Password reads, or evaluate the 1Password
+   plan tier that changes service-account limits.
+
+## Verify the fix
+
+After changing TTLs or cleaning up leaked proxies, verify that request volume is
+driven by live work rather than background churn:
+
+1. Count live iron-proxy pods and compare that with active sandboxes.
+2. Check recent proxy logs for the absence of `rate limit exceeded`.
+3. Start one new sandbox and confirm its first provider call succeeds.
+
+If the rate-limit signature returns while pod counts stay flat, the remaining
+load is likely coming from operator or external readers rather than sandbox
+lifecycle leaks.
diff --git a/docs/public/md/secrets/onepassword.md b/docs/public/md/secrets/onepassword.md
index caf91494b..f7c8990d7 100644
--- a/docs/public/md/secrets/onepassword.md
+++ b/docs/public/md/secrets/onepassword.md
@@ -26,7 +26,7 @@ There are two source modes:
 ```yaml
 ironProxy:
   secretSource: onepassword-connect
-  secretTtl: 10m
+  secretTtl: 1h
 
 onepasswordConnect:
   connect:
@@ -48,12 +48,16 @@ OP_CONNECT_TOKEN
 OP_VAULT
 ```
 
+Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh
+their cached secret material less often. If you shorten it, expect more
+background 1Password traffic.
+
 ## Configure the chart (service account)
 
 ```yaml
 ironProxy:
   secretSource: onepassword
-  secretTtl: 10m
+  secretTtl: 1h
 
 secretManager:
   existingSecretName: centaur-infra-env
@@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN
 OP_VAULT
 ```
 
+1Password's service-account rate limit is account-wide, not per service
+account. A second service account separates operator and cluster identities
+and audit trails, but it does **not** buy a second read budget.
+
 It must also include infrastructure secrets such as:
 
 ```text
@@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO
 
 Then run a tool or harness call that reaches an allowed host. If injection
 fails, check the secret entry's `hosts` and `match_*` fields, the 1Password
-item name, `OP_VAULT`, and whether the item has a `credential` field.
+item name, `OP_VAULT`, and whether the item has a `credential` field. If the
+proxy logs `secret_unavailable` with `rate limit exceeded`, see
+[Recover from 1Password quota exhaustion](/operate/onepassword-quota).
diff --git a/docs/sidebar.ts b/docs/sidebar.ts
index 9bb284d7c..c46c16030 100644
--- a/docs/sidebar.ts
+++ b/docs/sidebar.ts
@@ -14,6 +14,7 @@ export const sidebar = [
   {
     text: 'Operate',
     items: [
+      { text: 'Recover from 1Password quota exhaustion', link: '/operate/onepassword-quota' },
       { text: 'Slack ETL', link: '/operate/slack-etl' },
       { text: 'Expose Slackbot with Tailscale Funnel', link: '/operate/tailscale-funnel' },
     ],