From 72b738964a3f779b328871efafa1f464bd9adc5b Mon Sep 17 00:00:00 2001 From: Centaur AI Date: Fri, 12 Jun 2026 18:38:32 +0000 Subject: [PATCH] docs: document 1password shared quota handling Amp-Thread-ID: https://ampcode.com/threads/T-019ebd1a-327f-764a-aba3-3664cf9c41bc --- contrib/chart/values.yaml | 5 +- docs/pages/deploying-in-production.mdx | 7 +- docs/pages/operate/onepassword-quota.mdx | 84 +++++++++++++++++++++ docs/pages/secrets/onepassword.mdx | 16 +++- docs/public/md/deploying-in-production.md | 7 +- docs/public/md/operate/onepassword-quota.md | 84 +++++++++++++++++++++ docs/public/md/secrets/onepassword.md | 16 +++- docs/sidebar.ts | 1 + 8 files changed, 211 insertions(+), 9 deletions(-) create mode 100644 docs/pages/operate/onepassword-quota.mdx create mode 100644 docs/public/md/operate/onepassword-quota.md diff --git a/contrib/chart/values.yaml b/contrib/chart/values.yaml index b1667b165..011733db4 100644 --- a/contrib/chart/values.yaml +++ b/contrib/chart/values.yaml @@ -29,7 +29,10 @@ ironProxy: # one port so callers can reach it. pgPort: 5432 secretSource: onepassword - secretTtl: 10m + # Deployment default. The binary fallback remains 10m, but a longer chart + # value keeps steady-state 1Password reads proportional to live proxies + # instead of every proxy refreshing on a short fixed cadence. + secretTtl: 1h # Base tools delivery for api-rs sandboxes. 
When enabled, repo-cache keeps `repo` # fresh on each node and a tools-bootstrap init container copies `subdir` into diff --git a/docs/pages/deploying-in-production.mdx b/docs/pages/deploying-in-production.mdx index 38374abbb..fe8eb42c3 100644 --- a/docs/pages/deploying-in-production.mdx +++ b/docs/pages/deploying-in-production.mdx @@ -230,7 +230,7 @@ api: ironProxy: secretSource: onepassword-connect - secretTtl: 10m + secretTtl: 1h onepasswordConnect: connect: @@ -249,6 +249,11 @@ sandbox: The Kubernetes sandbox backend is the active runtime backend; there is no chart switch named `api.sandboxBackend`. +`1h` is the chart's steady-state default because shorter TTLs make every live +proxy re-read secrets more often. If you run the `onepassword` service-account +path, treat that budget as shared across all service accounts on the 1Password +account. + Install or upgrade: ```bash diff --git a/docs/pages/operate/onepassword-quota.mdx b/docs/pages/operate/onepassword-quota.mdx new file mode 100644 index 000000000..235124bbb --- /dev/null +++ b/docs/pages/operate/onepassword-quota.mdx @@ -0,0 +1,84 @@ +--- +title: Recover from 1Password quota exhaustion +description: Runbook for sandbox failures caused by 1Password service-account throttling. +--- + +# Recover from 1Password quota exhaustion + +Centaur's `onepassword` secret source reads `op://...` refs directly from the +1Password service-account API. That read budget is **account-wide**, not per +service account. Separate service accounts still help with identity separation +and audit trails, but they do not isolate quota. + +:::warning[Shared budget] +If operator CLIs, background proxy churn, and the cluster all read through the +same 1Password account, they consume one shared rolling-window budget. +Creating another service account during an incident will not reset it. 
+::: + +## Symptom signature + +When the budget is exhausted, the failure shows up in two places: + +| Surface | What you see | +|---------|---------------| +| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. | +| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. | + +Useful checks: + +```bash +kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \ + rg 'secret_unavailable|rate limit exceeded' +``` + +```bash +kubectl get pods -n centaur -l centaur.ai/iron-proxy=true +``` + +## Immediate recovery + +1. Stop the bleed. + Pause any operator or CLI workflows that are repeatedly reading from + 1Password, and clean up stale per-sandbox proxies that no longer correspond + to live work. REV-14's terminal-run garbage collection is the primary fix + for this class of incident. +2. Wait for the rolling window to clear. + Do not rotate to another service account expecting fresh quota; the limit is + shared across the account. +3. Verify the error signature stops. + Re-check the proxy logs and confirm a fresh sandbox can start without the + `Invalid or missing API key` loop. + +## Reduce steady-state load + +Apply the levers in this order: + +1. Eliminate background proxy churn. + Orphaned sandboxes and proxies keep refreshing secrets even after the user + work is over. Keep REV-14 deployed anywhere this incident matters. +2. Keep the proxy secret TTL long enough for steady state. + The chart default is `ironProxy.secretTtl: 1h`, which cuts 1Password reads + by 6x versus the old `10m` default. Override it only when you need faster + propagation of secret changes. +3. Separate identities, but do not count on quota isolation. + Use one service account for cluster secret resolution and another for + operator reads if you want cleaner audit trails. Assume they still share one + 1Password budget. +4. 
Revisit the architecture as concurrency grows. + If live sandbox count keeps climbing, prefer `onepassword-connect`, move + cluster boot secrets off live 1Password reads, or evaluate the 1Password + plan tier that changes service-account limits. + +## Verify the fix + +After changing TTLs or cleaning up leaked proxies, verify that request volume is +driven by live work rather than background churn: + +1. Count live iron-proxy pods and compare that with active sandboxes. +2. Check recent proxy logs for the absence of `rate limit exceeded`. +3. Start one new sandbox and confirm its first provider call succeeds. + +If the rate-limit signature returns while pod counts stay flat, the remaining +load is likely coming from operator or external readers rather than sandbox +lifecycle leaks. diff --git a/docs/pages/secrets/onepassword.mdx b/docs/pages/secrets/onepassword.mdx index caf91494b..f7c8990d7 100644 --- a/docs/pages/secrets/onepassword.mdx +++ b/docs/pages/secrets/onepassword.mdx @@ -26,7 +26,7 @@ There are two source modes: ```yaml ironProxy: secretSource: onepassword-connect - secretTtl: 10m + secretTtl: 1h onepasswordConnect: connect: @@ -48,12 +48,16 @@ OP_CONNECT_TOKEN OP_VAULT ``` +Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh +their cached secret material less often. If you shorten it, expect more +background 1Password traffic. + ## Configure the chart (service account) ```yaml ironProxy: secretSource: onepassword - secretTtl: 10m + secretTtl: 1h secretManager: existingSecretName: centaur-infra-env @@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN OP_VAULT ``` +1Password's service-account rate limit is account-wide, not per service +account. A second service account helps separate operator and cluster identity +or audit trails, but it does **not** buy a second read budget. 
+ It must also include infrastructure secrets such as: ```text @@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO Then run a tool or harness call that reaches an allowed host. If injection fails, check the secret entry's `hosts` and `match_*` fields, the 1Password -item name, `OP_VAULT`, and whether the item has a `credential` field. +item name, `OP_VAULT`, and whether the item has a `credential` field. If the +proxy logs `secret_unavailable` with `rate limit exceeded`, see +[Recover from 1Password quota exhaustion](/operate/onepassword-quota). diff --git a/docs/public/md/deploying-in-production.md b/docs/public/md/deploying-in-production.md index 1cf0a945c..025c27cd1 100644 --- a/docs/public/md/deploying-in-production.md +++ b/docs/public/md/deploying-in-production.md @@ -231,7 +231,7 @@ api: ironProxy: secretSource: onepassword-connect - secretTtl: 10m + secretTtl: 1h onepasswordConnect: connect: @@ -250,6 +250,11 @@ sandbox: The Kubernetes sandbox backend is the active runtime backend; there is no chart switch named `api.sandboxBackend`. +`1h` is the chart's steady-state default because shorter TTLs make every live +proxy re-read secrets more often. If you run the `onepassword` service-account +path, treat that budget as shared across all service accounts on the 1Password +account. + Install or upgrade: ```bash diff --git a/docs/public/md/operate/onepassword-quota.md b/docs/public/md/operate/onepassword-quota.md new file mode 100644 index 000000000..235124bbb --- /dev/null +++ b/docs/public/md/operate/onepassword-quota.md @@ -0,0 +1,84 @@ +--- +title: Recover from 1Password quota exhaustion +description: Runbook for sandbox failures caused by 1Password service-account throttling. +--- + +# Recover from 1Password quota exhaustion + +Centaur's `onepassword` secret source reads `op://...` refs directly from the +1Password service-account API. That read budget is **account-wide**, not per +service account. 
Separate service accounts still help with identity separation +and audit trails, but they do not isolate quota. + +:::warning[Shared budget] +If operator CLIs, background proxy churn, and the cluster all read through the +same 1Password account, they consume one shared rolling-window budget. +Creating another service account during an incident will not reset it. +::: + +## Symptom signature + +When the budget is exhausted, the failure shows up in two places: + +| Surface | What you see | +|---------|---------------| +| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. | +| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. | + +Useful checks: + +```bash +kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \ + rg 'secret_unavailable|rate limit exceeded' +``` + +```bash +kubectl get pods -n centaur -l centaur.ai/iron-proxy=true +``` + +## Immediate recovery + +1. Stop the bleed. + Pause any operator or CLI workflows that are repeatedly reading from + 1Password, and clean up stale per-sandbox proxies that no longer correspond + to live work. REV-14's terminal-run garbage collection is the primary fix + for this class of incident. +2. Wait for the rolling window to clear. + Do not rotate to another service account expecting fresh quota; the limit is + shared across the account. +3. Verify the error signature stops. + Re-check the proxy logs and confirm a fresh sandbox can start without the + `Invalid or missing API key` loop. + +## Reduce steady-state load + +Apply the levers in this order: + +1. Eliminate background proxy churn. + Orphaned sandboxes and proxies keep refreshing secrets even after the user + work is over. Keep REV-14 deployed anywhere this incident matters. +2. Keep the proxy secret TTL long enough for steady state. 
+ The chart default is `ironProxy.secretTtl: 1h`, which cuts 1Password reads + by 6x versus the old `10m` default. Override it only when you need faster + propagation of secret changes. +3. Separate identities, but do not count on quota isolation. + Use one service account for cluster secret resolution and another for + operator reads if you want cleaner audit trails. Assume they still share one + 1Password budget. +4. Revisit the architecture as concurrency grows. + If live sandbox count keeps climbing, prefer `onepassword-connect`, move + cluster boot secrets off live 1Password reads, or evaluate the 1Password + plan tier that changes service-account limits. + +## Verify the fix + +After changing TTLs or cleaning up leaked proxies, verify that request volume is +driven by live work rather than background churn: + +1. Count live iron-proxy pods and compare that with active sandboxes. +2. Check recent proxy logs for the absence of `rate limit exceeded`. +3. Start one new sandbox and confirm its first provider call succeeds. + +If the rate-limit signature returns while pod counts stay flat, the remaining +load is likely coming from operator or external readers rather than sandbox +lifecycle leaks. diff --git a/docs/public/md/secrets/onepassword.md b/docs/public/md/secrets/onepassword.md index caf91494b..f7c8990d7 100644 --- a/docs/public/md/secrets/onepassword.md +++ b/docs/public/md/secrets/onepassword.md @@ -26,7 +26,7 @@ There are two source modes: ```yaml ironProxy: secretSource: onepassword-connect - secretTtl: 10m + secretTtl: 1h onepasswordConnect: connect: @@ -48,12 +48,16 @@ OP_CONNECT_TOKEN OP_VAULT ``` +Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh +their cached secret material less often. If you shorten it, expect more +background 1Password traffic. 
+ ## Configure the chart (service account) ```yaml ironProxy: secretSource: onepassword - secretTtl: 10m + secretTtl: 1h secretManager: existingSecretName: centaur-infra-env @@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN OP_VAULT ``` +1Password's service-account rate limit is account-wide, not per service +account. A second service account helps separate operator and cluster identity +or audit trails, but it does **not** buy a second read budget. + It must also include infrastructure secrets such as: ```text @@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO Then run a tool or harness call that reaches an allowed host. If injection fails, check the secret entry's `hosts` and `match_*` fields, the 1Password -item name, `OP_VAULT`, and whether the item has a `credential` field. +item name, `OP_VAULT`, and whether the item has a `credential` field. If the +proxy logs `secret_unavailable` with `rate limit exceeded`, see +[Recover from 1Password quota exhaustion](/operate/onepassword-quota). diff --git a/docs/sidebar.ts b/docs/sidebar.ts index 9bb284d7c..c46c16030 100644 --- a/docs/sidebar.ts +++ b/docs/sidebar.ts @@ -14,6 +14,7 @@ export const sidebar = [ { text: 'Operate', items: [ + { text: 'Recover from 1Password quota exhaustion', link: '/operate/onepassword-quota' }, { text: 'Slack ETL', link: '/operate/slack-etl' }, { text: 'Expose Slackbot with Tailscale Funnel', link: '/operate/tailscale-funnel' }, ],
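
Note on the TTL change in this patch: the runbook says `ironProxy.secretTtl: 1h` cuts 1Password reads by 6x versus the old `10m` default. The sketch below checks that figure with a back-of-envelope model. It assumes each live proxy re-reads each of its secrets once per TTL window, which this patch implies but does not spell out. The function names, proxy count, and secrets-per-proxy figure are hypothetical and chosen only for illustration; they are not part of iron-proxy or the chart.

```python
# Rough model of steady-state 1Password read volume.
# Assumption (not confirmed by the patch): every live proxy refreshes every
# secret exactly once per TTL window, with no jitter or request coalescing.

# Supported TTL suffixes for this sketch only; the chart may accept others.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_ttl_seconds(ttl: str) -> int:
    """Convert a TTL such as '10m' or '1h' into seconds."""
    value, unit = ttl[:-1], ttl[-1]
    if unit not in _UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unsupported TTL format: {ttl!r}")
    return int(value) * _UNIT_SECONDS[unit]


def reads_per_hour(live_proxies: int, secrets_per_proxy: int, ttl: str) -> float:
    """Estimate account-wide reads per hour under the model above."""
    refreshes_per_hour = 3600 / parse_ttl_seconds(ttl)
    return live_proxies * secrets_per_proxy * refreshes_per_hour


if __name__ == "__main__":
    # Illustrative numbers only: 50 live proxies, 4 secrets each.
    old = reads_per_hour(50, 4, "10m")
    new = reads_per_hour(50, 4, "1h")
    print(f"10m TTL: {old:.0f} reads/hour")
    print(f"1h TTL:  {new:.0f} reads/hour")
    print(f"reduction factor: {old / new:.0f}x")
```

Under this model the estimate scales linearly with the number of live proxies, which is why the runbook lists eliminating orphaned proxies ahead of tuning the TTL: a longer TTL lowers the per-proxy rate, but leaked proxies still add to the shared account-wide budget.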