feat: SHEK-16 — in-cluster Kubernetes auto-discovery from ConfigMap#29
Conversation
New module shekel/integrations/kubernetes.py:
- is_k8s_environment(): detects KUBERNETES_SERVICE_HOST + SHEKEL_BUDGET_NAME
- _fetch_configmap(): loads shekel-budget-{name} from the pod's namespace
via kubernetes.client.CoreV1Api; soft-imports kubernetes (no crash if absent)
- apply_k8s_config(): applies ConfigMap values to Budget fields where still
None (priority: explicit kwarg > AGENT_BUDGET_USD env var > ConfigMap)
- KubernetesPoller: daemon thread that polls paused key every
SHEKEL_POLL_INTERVAL_SECONDS (default 10s); sets _paused_externally
Budget._record_spend(): raises BudgetExceededError immediately when
_paused_externally is True (before spend accumulation).
Budget.__exit__ / __aexit__: stop the poller thread on context exit.
pyproject.toml: add [k8s] extra (kubernetes>=28.0); add to [all];
add kubernetes mypy override.
36 tests; 100% coverage on kubernetes.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
New KubernetesSpendReporter daemon thread (shekel/integrations/kubernetes.py):
- Active when ConfigMap has backend=k8s; skipped for backend=redis or absent
- Accumulates cumulative LLM spend/calls under threading.Lock (never hold lock
across network call)
- Flush triggers: flush_every_seconds (time-based), flush_every_usd (USD
threshold on delta since last flush), Budget.__exit__/__aexit__ (always,
including on exception)
- Patch-or-create ConfigMap shekel-spend-{HOSTNAME}: patch first, create on
404 ApiException; any failure logs WARNING and never raises to caller
- After successful write updates _last_flush_spent so next flush computes
correct delta; baseline unchanged on failure so full cumulative total retried
- HOSTNAME absent → flush silently skipped
- Correct labels: shekel.dev/spend-report, shekel.dev/budget,
shekel.dev/group (omitted when SHEKEL_GROUP_VALUE is empty)
Budget._record_spend: calls reporter.on_spend(cost) after each LLM call.
Budget.__exit__ / __aexit__: calls reporter.flush_and_stop() on context exit.
41 new tests; 100% coverage on kubernetes.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
/improve |
Code Review by Qodo
Context used✅ Compliance rules (platform):
7 rules 1. per_pod_cap not child Budget
|
| while not self._stop_event.wait(self._interval): | ||
| cm = _fetch_configmap(self._budget_name, self._namespace) | ||
| if cm is not None: | ||
| self._budget._paused_externally = cm.get("paused") == "true" | ||
|
|
There was a problem hiding this comment.
2. K8s paused flag not locked 📎 Requirement gap ☼ Reliability
Kubernetes polling writes to budget._paused_externally from a background thread without any synchronization, and the budget does not define or use a lock to protect concurrent reads/writes. This violates the thread-safety requirement and can cause race conditions during spend recording.
Agent Prompt
## Issue description
The K8s poller mutates `budget._paused_externally` from a background thread without a lock, and `Budget._record_spend()` reads it without synchronization. The compliance rule requires writes to `_paused_externally` to be protected by a `threading.Lock` (or equivalent) and also specifies async budgets should use `asyncio.create_task` when an event loop is running.
## Issue Context
`KubernetesPoller` runs in a daemon thread and periodically updates the paused flag. Without a lock, concurrent access can race with `_record_spend()`.
## Fix Focus Areas
- shekel/_budget.py[303-320]
- shekel/_budget.py[662-671]
- shekel/integrations/kubernetes.py[85-88]
- shekel/integrations/kubernetes.py[153-158]
- shekel/integrations/kubernetes.py[160-189]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| if cm.get("backend") == "redis": | ||
| redis_url = os.environ.get("REDIS_URL") | ||
| if redis_url: | ||
| try: | ||
| from shekel.backends.redis import RedisBackend # noqa: PLC0415 | ||
|
|
||
| budget._k8s_redis_backend = RedisBackend(url=redis_url) | ||
| budget._k8s_redis_name = cm.get("redis_key", f"shekel:{namespace}:{budget_name}") | ||
| except ImportError: |
There was a problem hiding this comment.
3. Redisbackend missing name arg 📎 Requirement gap ≡ Correctness
When backend=="redis" and REDIS_URL is set, the code instantiates RedisBackend(url=redis_url) but does not pass name=redis_key as required. This can break correct Redis key naming and violates the specified ConfigMap-to-budget mapping for Redis backend activation.
Agent Prompt
## Issue description
Redis backend activation must instantiate `RedisBackend(url=REDIS_URL, name=redis_key)` (using `redis_key` from ConfigMap, or the default `shekel:{namespace}:{budget_name}`), but the current code does not pass `name=`.
## Issue Context
The compliance spec explicitly requires the Redis backend to use the `redis_key` naming so controller/materialized budgets map to the intended Redis budget key.
## Fix Focus Areas
- shekel/integrations/kubernetes.py[118-131]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
…ate infinite recursion Constructing a child Budget for per_pod_cap caused unbounded recursion because KUBERNETES_SERVICE_HOST/SHEKEL_BUDGET_NAME are still set in the process, triggering apply_k8s_config() on every child __init__. Store the cap as _per_pod_cap_usd: float | None instead, and enforce it via a new _check_per_pod_limit() method called inside _record_spend(). Adds 4 regression tests including an explicit no-recursion guard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
/review |
|
Code review by qodo was updated up to the latest commit f2cb46e |
After __exit__ stops the K8s threads, re-entering the same Budget instance (session budget pattern) left K8s polling and spend reporting permanently dead. Python threads cannot be restarted, so a new instance must be created. Persist budget_name, namespace, and poll_interval on the budget during apply_k8s_config(), then call _restart_k8s_threads() from __enter__ and __aenter__ to rebuild any stopped threads idempotently. Adds 6 regression tests including sync and async re-entry, idempotency, and no-K8s no-op. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
||
| # --- per_pod_cap --- | ||
| if "per_pod_cap" in cm: | ||
| budget._per_pod_cap_usd = float(cm["per_pod_cap"]) |
There was a problem hiding this comment.
1. per_pod_cap not child budget 📎 Requirement gap ≡ Correctness
per_pod_cap is applied by setting budget._per_pod_cap_usd and enforcing it via _check_per_pod_limit(), instead of creating a child Budget(max_usd=float(v)) as required by the ConfigMap-to-Budget mapping. This breaks the specified behavior for per-pod limiting and the accompanying tests explicitly assert the non-child implementation.
Agent Prompt
## Issue description
The `per_pod_cap` ConfigMap key is required to create a child `Budget(max_usd=float(v))` for per-pod limiting, but the current implementation stores a float on the parent budget and enforces it directly.
## Issue Context
Compliance requires the mapping behavior: `per_pod_cap` → create a child `Budget(max_usd=float(v))` when per-pod limiting is used. Current tests also encode the non-child behavior, so they will need updating alongside the implementation.
## Fix Focus Areas
- shekel/integrations/kubernetes.py[112-115]
- shekel/_budget.py[703-706]
- shekel/_budget.py[814-823]
- tests/test_kubernetes_integration.py[577-593]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Blanket except Exception: pass hid bad ConfigMap values, recursion bugs, and any other apply_k8s_config failure with no log or signal. Split into ImportError (silent — optional dep not installed) and Exception (warning with exc_info so operators can diagnose misconfigured ConfigMaps). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures limit-exceeding call cost Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xture; fix black line-length Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r kill-switch from budget exhaustion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ibuted enforcement Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cope_mode=shared (group-scoped Redis key) into runtime Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@qodo-code-review[bot] - did the last commit resolve the gaps?> |
- Fix _check_redis_limit raise path (lines 859-861): test used max_usd=0.05 which caused _check_limit to fire first, never reaching redis path - Add chain() tests to cover happy path and invalid-arg guard (lines 1362-1365) - Add _litellm_available() skipif to TestLiteLLMPatching and TestLiteLLMCostRecording so they skip when litellm optional dep is absent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Known issues — tracked for follow-up
Code review identified 9 bugs. All critical/medium issues are resolved; one open for design clarification:
Budgetconstruction recurses infinitely whenSHEKEL_BUDGET_NAMEis setBudgetis reused across multiplewith-blocksBudgetExceededErrorraised on external pause is indistinguishable from normal budget exhaustion_record_spend_check_per_pod_limit()raises unconditionally, ignoringwarn_onlymodescope_mode/scope_group_bystored from ConfigMap but never usedTest plan
pytest tests/test_kubernetes_integration.py— all 97 tests passpytest --cov=shekel --cov-report=term-missing— 97% overall coverage[k8s]extra and confirm no import errorsblack,isort,ruff,mypy🤖 Generated with Claude Code