5 changes: 4 additions & 1 deletion contrib/chart/values.yaml
@@ -29,7 +29,10 @@ ironProxy:
# one port so callers can reach it.
pgPort: 5432
secretSource: onepassword
-secretTtl: 10m
+# Deployment default. The binary fallback remains 10m, but a longer chart
+# value keeps steady-state 1Password reads proportional to live proxies
+# instead of every proxy refreshing on a short fixed cadence.
+secretTtl: 1h

# Base tools delivery for api-rs sandboxes. When enabled, repo-cache keeps `repo`
# fresh on each node and a tools-bootstrap init container copies `subdir` into
7 changes: 6 additions & 1 deletion docs/pages/deploying-in-production.mdx
@@ -230,7 +230,7 @@ api:

ironProxy:
secretSource: onepassword-connect
-secretTtl: 10m
+secretTtl: 1h

onepasswordConnect:
connect:
@@ -249,6 +249,11 @@ sandbox:
The Kubernetes sandbox backend is the active runtime backend; there is no chart
switch named `api.sandboxBackend`.

`1h` is the chart's steady-state default because shorter TTLs make every live
proxy re-read secrets more often. If you run the `onepassword` service-account
path, treat the 1Password read budget as shared across all service accounts on
the 1Password account.

Install or upgrade:

```bash
84 changes: 84 additions & 0 deletions docs/pages/operate/onepassword-quota.mdx
@@ -0,0 +1,84 @@
---
title: Recover from 1Password quota exhaustion
description: Runbook for sandbox failures caused by 1Password service-account throttling.
---

# Recover from 1Password quota exhaustion

Centaur's `onepassword` secret source reads `op://...` refs directly from the
1Password service-account API. That read budget is **account-wide**, not per
service account. Separate service accounts still help with identity separation
and audit trails, but they do not isolate quota.

:::warning[Shared budget]
If operator CLIs, background proxy churn, and the cluster all read through the
same 1Password account, they consume one shared rolling-window budget.
Creating another service account during an incident will not reset it.
:::

## Symptom signature

When the budget is exhausted, the failure shows up in two places:

| Surface | What you see |
|---------|---------------|
| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. |
| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. |

Useful checks:

```bash
kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \
rg 'secret_unavailable|rate limit exceeded'
```

```bash
kubectl get pods -n centaur -l centaur.ai/iron-proxy=true
```

## Immediate recovery

1. Stop the bleed.
Pause any operator or CLI workflows that are repeatedly reading from
1Password, and clean up stale per-sandbox proxies that no longer correspond
to live work. REV-14's terminal-run garbage collection is the primary fix
for this class of incident.
2. Wait for the rolling window to clear.
Do not rotate to another service account expecting fresh quota; the limit is
shared across the account.
3. Verify the error signature stops.
Re-check the proxy logs and confirm a fresh sandbox can start without the
`Invalid or missing API key` loop.

## Reduce steady-state load

Apply the levers in this order:

1. Eliminate background proxy churn.
Orphaned sandboxes and proxies keep refreshing secrets even after the user
work is over. Keep REV-14 deployed anywhere this incident matters.
2. Keep the proxy secret TTL long enough for steady state.
The chart default is `ironProxy.secretTtl: 1h`, which cuts TTL-driven refresh
reads per live proxy by up to 6x versus the old `10m` default. Override it only
when you need faster propagation of secret changes.
3. Separate identities, but do not count on quota isolation.
Use one service account for cluster secret resolution and another for
operator reads if you want cleaner audit trails. Assume they still share one
1Password budget.
4. Revisit the architecture as concurrency grows.
If live sandbox count keeps climbing, prefer `onepassword-connect`, move
cluster boot secrets off live 1Password reads, or evaluate the 1Password
plan tier that changes service-account limits.
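
As a rough back-of-envelope for the TTL lever above, the sketch below estimates
TTL-driven reads per hour. It assumes each live proxy re-reads each of its
secrets once per TTL expiry, which is a simplification and not a statement
about how iron-proxy actually schedules refreshes. The proxy and secret counts
are hypothetical placeholders.

```shell
# Hypothetical inputs; replace with numbers from your own cluster.
proxies=50            # live iron-proxy pods (assumed)
secrets_per_proxy=4   # op:// refs each proxy resolves (assumed)

# Assumption: one read per secret per proxy per TTL expiry.
reads_per_hour() {
  ttl_seconds=$1
  echo $(( proxies * secrets_per_proxy * 3600 / ttl_seconds ))
}

old=$(reads_per_hour 600)    # 10m TTL
new=$(reads_per_hour 3600)   # 1h TTL
echo "10m: ${old} reads/hour, 1h: ${new} reads/hour"
```

Under these assumptions the estimate scales linearly with live proxy count,
which is why leaked proxies matter as much as the TTL itself.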

## Verify the fix

After changing TTLs or cleaning up leaked proxies, verify that request volume is
driven by live work rather than background churn:

1. Count live iron-proxy pods and compare that with active sandboxes.
2. Check recent proxy logs for the absence of `rate limit exceeded`.
3. Start one new sandbox and confirm its first provider call succeeds.
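
A minimal sketch of the log check in step 2, assuming the two error strings
from the symptom table appear verbatim in the log lines. `count_throttle_lines`
is a hypothetical helper name, and it uses `grep` instead of `rg` only so it
runs where ripgrep is not installed.

```shell
# Hypothetical helper: counts log lines on stdin that carry either
# throttling signature. Prints 0 when nothing matches.
count_throttle_lines() {
  # grep -c exits non-zero on zero matches but still prints the count.
  grep -cE 'secret_unavailable|rate limit exceeded' || true
}

# Demo with sample text; real input would come from `kubectl logs`.
printf 'resolved op://vault/item\nrate limit exceeded\n' | count_throttle_lines
```

To run it against the cluster, pipe the same `kubectl logs` command shown under
"Symptom signature" into `count_throttle_lines`; a result of `0` over a recent
window is consistent with the budget having recovered.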

If the rate-limit signature returns while pod counts stay flat, the remaining
load is likely coming from operator or external readers rather than sandbox
lifecycle leaks.
16 changes: 13 additions & 3 deletions docs/pages/secrets/onepassword.mdx
@@ -26,7 +26,7 @@ There are two source modes:
```yaml
ironProxy:
secretSource: onepassword-connect
-secretTtl: 10m
+secretTtl: 1h

onepasswordConnect:
connect:
@@ -48,12 +48,16 @@ OP_CONNECT_TOKEN
OP_VAULT
```

Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh
their cached secret material less often. If you shorten it, expect more
background 1Password traffic.

## Configure the chart (service account)

```yaml
ironProxy:
secretSource: onepassword
-secretTtl: 10m
+secretTtl: 1h

secretManager:
existingSecretName: centaur-infra-env
@@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN
OP_VAULT
```

1Password's service-account rate limit is account-wide, not per service
account. A second service account helps separate operator and cluster identity
or audit trails, but it does **not** buy a second read budget.

It must also include infrastructure secrets such as:

```text
@@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO

Then run a tool or harness call that reaches an allowed host. If injection
fails, check the secret entry's `hosts` and `match_*` fields, the 1Password
-item name, `OP_VAULT`, and whether the item has a `credential` field.
+item name, `OP_VAULT`, and whether the item has a `credential` field. If the
+proxy logs `secret_unavailable` with `rate limit exceeded`, see
+[Recover from 1Password quota exhaustion](/operate/onepassword-quota).
7 changes: 6 additions & 1 deletion docs/public/md/deploying-in-production.md
@@ -231,7 +231,7 @@ api:

ironProxy:
secretSource: onepassword-connect
-secretTtl: 10m
+secretTtl: 1h

onepasswordConnect:
connect:
@@ -250,6 +250,11 @@ sandbox:
The Kubernetes sandbox backend is the active runtime backend; there is no chart
switch named `api.sandboxBackend`.

`1h` is the chart's steady-state default because shorter TTLs make every live
proxy re-read secrets more often. If you run the `onepassword` service-account
path, treat the 1Password read budget as shared across all service accounts on
the 1Password account.

Install or upgrade:

```bash
84 changes: 84 additions & 0 deletions docs/public/md/operate/onepassword-quota.md
@@ -0,0 +1,84 @@
---
title: Recover from 1Password quota exhaustion
description: Runbook for sandbox failures caused by 1Password service-account throttling.
---

# Recover from 1Password quota exhaustion

Centaur's `onepassword` secret source reads `op://...` refs directly from the
1Password service-account API. That read budget is **account-wide**, not per
service account. Separate service accounts still help with identity separation
and audit trails, but they do not isolate quota.

:::warning[Shared budget]
If operator CLIs, background proxy churn, and the cluster all read through the
same 1Password account, they consume one shared rolling-window budget.
Creating another service account during an incident will not reset it.
:::

## Symptom signature

When the budget is exhausted, the failure shows up in two places:

| Surface | What you see |
|---------|---------------|
| iron-proxy logs | `secret_unavailable` and `rate limit exceeded` while resolving `op://...` refs. |
| Agent or harness boot | New runs fail early and crash-loop with `Invalid or missing API key` because the proxy cannot swap the placeholder credential for the real secret. |

Useful checks:

```bash
kubectl logs -n centaur -l centaur.ai/iron-proxy=true --since=15m | \
rg 'secret_unavailable|rate limit exceeded'
```

```bash
kubectl get pods -n centaur -l centaur.ai/iron-proxy=true
```

## Immediate recovery

1. Stop the bleed.
Pause any operator or CLI workflows that are repeatedly reading from
1Password, and clean up stale per-sandbox proxies that no longer correspond
to live work. REV-14's terminal-run garbage collection is the primary fix
for this class of incident.
2. Wait for the rolling window to clear.
Do not rotate to another service account expecting fresh quota; the limit is
shared across the account.
3. Verify the error signature stops.
Re-check the proxy logs and confirm a fresh sandbox can start without the
`Invalid or missing API key` loop.

## Reduce steady-state load

Apply the levers in this order:

1. Eliminate background proxy churn.
Orphaned sandboxes and proxies keep refreshing secrets even after the user
work is over. Keep REV-14 deployed anywhere this incident matters.
2. Keep the proxy secret TTL long enough for steady state.
The chart default is `ironProxy.secretTtl: 1h`, which cuts TTL-driven refresh
reads per live proxy by up to 6x versus the old `10m` default. Override it only
when you need faster propagation of secret changes.
3. Separate identities, but do not count on quota isolation.
Use one service account for cluster secret resolution and another for
operator reads if you want cleaner audit trails. Assume they still share one
1Password budget.
4. Revisit the architecture as concurrency grows.
If live sandbox count keeps climbing, prefer `onepassword-connect`, move
cluster boot secrets off live 1Password reads, or evaluate the 1Password
plan tier that changes service-account limits.

## Verify the fix

After changing TTLs or cleaning up leaked proxies, verify that request volume is
driven by live work rather than background churn:

1. Count live iron-proxy pods and compare that with active sandboxes.
2. Check recent proxy logs for the absence of `rate limit exceeded`.
3. Start one new sandbox and confirm its first provider call succeeds.

If the rate-limit signature returns while pod counts stay flat, the remaining
load is likely coming from operator or external readers rather than sandbox
lifecycle leaks.
16 changes: 13 additions & 3 deletions docs/public/md/secrets/onepassword.md
@@ -26,7 +26,7 @@ There are two source modes:
```yaml
ironProxy:
secretSource: onepassword-connect
-secretTtl: 10m
+secretTtl: 1h

onepasswordConnect:
connect:
@@ -48,12 +48,16 @@ OP_CONNECT_TOKEN
OP_VAULT
```

Centaur's chart defaults `ironProxy.secretTtl` to `1h` so live proxies refresh
their cached secret material less often. If you shorten it, expect more
background 1Password traffic.

## Configure the chart (service account)

```yaml
ironProxy:
secretSource: onepassword
-secretTtl: 10m
+secretTtl: 1h

secretManager:
existingSecretName: centaur-infra-env
@@ -67,6 +71,10 @@ OP_SERVICE_ACCOUNT_TOKEN
OP_VAULT
```

1Password's service-account rate limit is account-wide, not per service
account. A second service account helps separate operator and cluster identity
or audit trails, but it does **not** buy a second read budget.

It must also include infrastructure secrets such as:

```text
@@ -134,4 +142,6 @@ kubectl get secret -n centaur-system centaur-infra-env -o jsonpath='{.data.OP_CO

Then run a tool or harness call that reaches an allowed host. If injection
fails, check the secret entry's `hosts` and `match_*` fields, the 1Password
-item name, `OP_VAULT`, and whether the item has a `credential` field.
+item name, `OP_VAULT`, and whether the item has a `credential` field. If the
+proxy logs `secret_unavailable` with `rate limit exceeded`, see
+[Recover from 1Password quota exhaustion](/operate/onepassword-quota).
1 change: 1 addition & 0 deletions docs/sidebar.ts
@@ -14,6 +14,7 @@ export const sidebar = [
{
text: 'Operate',
items: [
{ text: 'Recover from 1Password quota exhaustion', link: '/operate/onepassword-quota' },
{ text: 'Slack ETL', link: '/operate/slack-etl' },
{ text: 'Expose Slackbot with Tailscale Funnel', link: '/operate/tailscale-funnel' },
],