
feat: SHEK-16 — in-cluster Kubernetes auto-discovery from ConfigMap#29

Merged
arieradle merged 12 commits into main from feat/shek-16 on Jun 17, 2026
Conversation

arieradle (Owner) commented Jun 4, 2026

Summary

  • New `shekel/integrations/kubernetes.py` module with `is_k8s_environment()`, `apply_k8s_config()`, `KubernetesPoller`, and `KubernetesSpendReporter` daemon threads
  • `Budget._record_spend()` raises `BudgetPausedError` (subclass of `BudgetExceededError`) immediately when `_paused_externally` is set by the poller
  • `Budget.exit` / `aexit` stop poller and reporter on context exit; threads restart on re-entry
  • Redis backend (`backend=redis` in ConfigMap) wired into `_record_spend()` for distributed cross-pod enforcement
  • Adds `[k8s]` optional extra (`kubernetes>=28.0`) in `pyproject.toml`; included in `[all]`
  • 97 tests, 97% overall coverage

Known issues — tracked for follow-up

Code review identified 9 bugs. Eight are fixed; one (SHEK-34) remains open for design clarification:

| Ticket | Summary | Priority | Status |
|---|---|---|---|
| SHEK-26 | Per-pod cap Budget construction recurses infinitely when SHEKEL_BUDGET_NAME is set | High | ✅ Fixed |
| SHEK-27 | K8s poller/reporter threads not restarted when a Budget is reused across multiple with-blocks | High | ✅ Fixed |
| SHEK-28 | K8s config errors silently swallowed: misconfigured ConfigMap disables K8s features with no log | Medium | ✅ Fixed |
| SHEK-29 | BudgetExceededError raised on external pause is indistinguishable from normal budget exhaustion | Medium | ✅ Fixed |
| SHEK-30 | Redis backend configured from ConfigMap is stored but never used in _record_spend | Medium | ✅ Fixed |
| SHEK-31 | K8s poller thread leaks in tests that don't enter the budget context | Low | ✅ Fixed |
| SHEK-32 | K8s spend reporter misses the exceeding call's cost when a budget limit raises | Medium | ✅ Fixed |
| SHEK-33 | _check_per_pod_limit() raises unconditionally, ignoring warn_only mode | Medium | ✅ Fixed |
| SHEK-34 | scope_mode / scope_group_by stored from ConfigMap but never used | Medium | 🔲 Open |

Test plan

  • pytest tests/test_kubernetes_integration.py — all 97 tests pass
  • pytest --cov=shekel --cov-report=term-missing — 97% overall coverage
  • Install without [k8s] extra and confirm no import errors
  • All linters pass: black, isort, ruff, mypy

🤖 Generated with Claude Code

New module shekel/integrations/kubernetes.py:
- is_k8s_environment(): detects KUBERNETES_SERVICE_HOST + SHEKEL_BUDGET_NAME
- _fetch_configmap(): loads shekel-budget-{name} from the pod's namespace
  via kubernetes.client.CoreV1Api; soft-imports kubernetes (no crash if absent)
- apply_k8s_config(): applies ConfigMap values to Budget fields where still
  None (priority: explicit kwarg > AGENT_BUDGET_USD env var > ConfigMap)
- KubernetesPoller: daemon thread that polls paused key every
  SHEKEL_POLL_INTERVAL_SECONDS (default 10s); sets _paused_externally
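
The detection rule and the priority order described above can be sketched as follows. This is a minimal illustration, not the module's code: `detect_k8s`, `resolve_max_usd`, and the `max_usd` ConfigMap key are hypothetical names chosen for the example, and the real functions may take different arguments.

```python
import os
from typing import Mapping, Optional


def detect_k8s(env: Mapping[str, str] = os.environ) -> bool:
    """True only when both variables named in the commit message are set and non-empty."""
    return bool(env.get("KUBERNETES_SERVICE_HOST")) and bool(env.get("SHEKEL_BUDGET_NAME"))


def resolve_max_usd(
    explicit: Optional[float],
    env: Mapping[str, str],
    configmap: Mapping[str, str],
) -> Optional[float]:
    """Pick a limit by priority: explicit kwarg > AGENT_BUDGET_USD > ConfigMap."""
    if explicit is not None:
        return explicit
    if env.get("AGENT_BUDGET_USD"):
        return float(env["AGENT_BUDGET_USD"])
    if "max_usd" in configmap:  # assumed key name, for illustration only
        return float(configmap["max_usd"])
    return None
```

Passing the environment as a parameter keeps the sketch testable without patching `os.environ`.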

Budget._record_spend(): raises BudgetExceededError immediately when
_paused_externally is True (before spend accumulation).

Budget.__exit__ / __aexit__: stop the poller thread on context exit.

pyproject.toml: add [k8s] extra (kubernetes>=28.0); add to [all];
add kubernetes mypy override.

36 tests; 100% coverage on kubernetes.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 4, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


New KubernetesSpendReporter daemon thread (shekel/integrations/kubernetes.py):
- Active when ConfigMap has backend=k8s; skipped for backend=redis or absent
- Accumulates cumulative LLM spend/calls under threading.Lock (never hold lock
  across network call)
- Flush triggers: flush_every_seconds (time-based), flush_every_usd (USD
  threshold on delta since last flush), Budget.__exit__/__aexit__ (always,
  including on exception)
- Patch-or-create ConfigMap shekel-spend-{HOSTNAME}: patch first, create on
  404 ApiException; any failure logs WARNING and never raises to caller
- After successful write updates _last_flush_spent so next flush computes
  correct delta; baseline unchanged on failure so full cumulative total retried
- HOSTNAME absent → flush silently skipped
- Correct labels: shekel.dev/spend-report, shekel.dev/budget,
  shekel.dev/group (omitted when SHEKEL_GROUP_VALUE is empty)
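
The flush rules above (lock only around in-memory state, baseline advanced only after a successful write) can be sketched roughly as below. `SpendReporterSketch` and its `write` callable are hypothetical stand-ins: the callable replaces the patch-or-create ConfigMap call, and the time-based trigger is omitted for brevity.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("shekel.sketch")


class SpendReporterSketch:
    """Accumulates spend under a lock; the write happens outside the lock."""

    def __init__(self, write: Callable[[float, int], None], flush_every_usd: float) -> None:
        self._write = write  # stand-in for the patch-or-create ConfigMap call
        self._flush_every_usd = flush_every_usd
        self._lock = threading.Lock()
        self._spent = 0.0
        self._calls = 0
        self._last_flush_spent = 0.0

    def on_spend(self, cost: float) -> None:
        with self._lock:
            self._spent += cost
            self._calls += 1
            due = (self._spent - self._last_flush_spent) >= self._flush_every_usd
        if due:
            self.flush()

    def flush(self) -> bool:
        # Snapshot under the lock, then release it before the (slow) write.
        with self._lock:
            spent, calls = self._spent, self._calls
        try:
            self._write(spent, calls)
        except Exception:
            # Baseline is left unchanged, so the next flush retries the full total.
            logger.warning("spend flush failed", exc_info=True)
            return False
        with self._lock:
            self._last_flush_spent = spent
        return True
```

Because the cumulative total is written each time, a failed flush loses nothing: the next successful write carries the whole amount.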

Budget._record_spend: calls reporter.on_spend(cost) after each LLM call.
Budget.__exit__ / __aexit__: calls reporter.flush_and_stop() on context exit.

41 new tests; 100% coverage on kubernetes.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arieradle

Owner Author

/improve

@qodo-code-review

qodo-code-review Bot commented Jun 10, 2026


Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (3) 📜 Skill insights (0)

Context used
✅ Compliance rules (platform): 7 rules



Action required

1. per_pod_cap not child Budget 📎 Requirement gap ≡ Correctness
Description
per_pod_cap is applied by setting budget._per_pod_cap_usd and enforcing it via
_check_per_pod_limit(), instead of creating a child Budget(max_usd=float(v)) as required by the
ConfigMap-to-Budget mapping. This breaks the specified behavior for per-pod limiting and the
accompanying tests explicitly assert the non-child implementation.
Code

shekel/integrations/kubernetes.py[114]

+        budget._per_pod_cap_usd = float(cm["per_pod_cap"])
Evidence
The checklist mapping explicitly requires per_pod_cap to create a child
Budget(max_usd=float(v)). The new code sets budget._per_pod_cap_usd from the ConfigMap and
enforces it directly in Budget._check_per_pod_limit(), and the new tests assert that no per-pod
child budget exists (assert not hasattr(b, "_per_pod_budget")).

ConfigMap keys are correctly mapped to Budget parameters and behaviors
shekel/integrations/kubernetes.py[112-115]
shekel/_budget.py[814-823]
tests/test_kubernetes_integration.py[577-583]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `per_pod_cap` ConfigMap key is required to create a child `Budget(max_usd=float(v))` for per-pod limiting, but the current implementation stores a float on the parent budget and enforces it directly.

## Issue Context
Compliance requires the mapping behavior: `per_pod_cap` → create a child `Budget(max_usd=float(v))` when per-pod limiting is used. Current tests also encode the non-child behavior, so they will need updating alongside the implementation.

## Fix Focus Areas
- shekel/integrations/kubernetes.py[112-115]
- shekel/_budget.py[703-706]
- shekel/_budget.py[814-823]
- tests/test_kubernetes_integration.py[577-593]



2. Per-pod cap bypasses warn_only ✓ Resolved 🐞 Bug ≡ Correctness
Description
Budget._check_per_pod_limit() unconditionally raises BudgetExceededError when the per-pod cap is
exceeded, even if warn_only=True. This violates the existing warn_only contract used by other limit
checks and can cause unexpected exceptions in environments relying on warn-only behavior.
Code

shekel/_budget.py[R814-823]

+    def _check_per_pod_limit(self) -> None:
+        """Enforce the per-pod USD cap set via ConfigMap per_pod_cap (SHEK-26)."""
+        if self._per_pod_cap_usd is None:
+            return
+        if self._spent > self._per_pod_cap_usd:
+            from shekel.exceptions import BudgetExceededError  # noqa: PLC0415
+
+            raise BudgetExceededError(
+                self._spent, self._per_pod_cap_usd, self._last_model, self._last_tokens
+            )
Evidence
Other enforcement checks explicitly suppress exceptions when warn_only=True, but the new per-pod cap
check has no such guard, so it will raise in warn-only mode.

shekel/_budget.py[755-807]
shekel/_budget.py[814-823]
tests/test_budget_warn_only.py[13-34]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Budget._check_per_pod_limit()` raises even when `Budget.warn_only=True`, unlike `_check_limit()` / `_check_call_limit()` which suppress exceptions in warn-only mode.

### Issue Context
`warn_only=True` is documented and tested to mean “enforce silently, never raise”. Per-pod cap should follow the same semantics for consistency.

### Fix Focus Areas
- shekel/_budget.py[814-823]

### Suggested fix
- Mirror `_check_limit()` behavior:
 - call `self._emit_budget_exceeded_event()` when cap exceeded
 - if `self.warn_only`: optionally fire `_check_warn()` and `return`
 - else: raise `BudgetExceededError(...)`
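
A minimal sketch of the suggested shape, assuming a simplified budget object. `BudgetSketch`, `CapExceeded`, and the `events` list are hypothetical stand-ins for the real `Budget`, `BudgetExceededError`, and `_emit_budget_exceeded_event()`.

```python
from typing import List, Optional


class CapExceeded(Exception):
    """Hypothetical stand-in for shekel's BudgetExceededError."""


class BudgetSketch:
    def __init__(self, per_pod_cap_usd: Optional[float], warn_only: bool = False) -> None:
        self._per_pod_cap_usd = per_pod_cap_usd
        self.warn_only = warn_only
        self._spent = 0.0
        self.events: List[str] = []

    def _check_per_pod_limit(self) -> None:
        if self._per_pod_cap_usd is None or self._spent <= self._per_pod_cap_usd:
            return
        # Always record the event, mirroring _emit_budget_exceeded_event().
        self.events.append("budget_exceeded")
        if self.warn_only:
            return  # warn-only mode: signal, but never raise
        raise CapExceeded(f"per-pod cap of ${self._per_pod_cap_usd:.2f} exceeded")
```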



3. RedisBackend missing name arg 📎 Requirement gap ≡ Correctness
Description
When backend=="redis" and REDIS_URL is set, the code instantiates RedisBackend(url=redis_url)
but does not pass name=redis_key as required. This can break correct Redis key naming and violates
the specified ConfigMap-to-budget mapping for Redis backend activation.
Code

shekel/integrations/kubernetes.py[R119-127]

+    if cm.get("backend") == "redis":
+        redis_url = os.environ.get("REDIS_URL")
+        if redis_url:
+            try:
+                from shekel.backends.redis import RedisBackend  # noqa: PLC0415
+
+                budget._k8s_redis_backend = RedisBackend(url=redis_url)
+                budget._k8s_redis_name = cm.get("redis_key", f"shekel:{namespace}:{budget_name}")
+            except ImportError:
Evidence
The rules require that when ConfigMap backend is redis and REDIS_URL is set, RedisBackend
must be created with name=redis_key. The new code creates RedisBackend(url=redis_url) and stores
the key separately in _k8s_redis_name, which does not satisfy the required constructor usage.

Redis backend activation follows ConfigMap/backend and REDIS_URL rules
ConfigMap key-to-budget parameter mapping is implemented as specified
shekel/integrations/kubernetes.py[119-127]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Redis backend activation must instantiate `RedisBackend(url=REDIS_URL, name=redis_key)` (using `redis_key` from ConfigMap, or the default `shekel:{namespace}:{budget_name}`), but the current code does not pass `name=`.

## Issue Context
The compliance spec explicitly requires the Redis backend to use the `redis_key` naming so controller/materialized budgets map to the intended Redis budget key.

## Fix Focus Areas
- shekel/integrations/kubernetes.py[118-131]



4. Paused error missing message ✓ Resolved 📎 Requirement gap ≡ Correctness
Description
Budget._record_spend() raises BudgetExceededError when an external pause is active, but it does
so by directly referencing self._paused_externally (instead of guarding with `getattr(self,
"_paused_externally", False)`) and by omitting the required "Budget paused by Kubernetes controller"
reason, while also often populating limit=self.max_usd or 0.0 which can produce misleading “Budget
of $0.00 exceeded” messages. This breaks the documented paused enforcement semantics and makes
pause-triggered failures hard to distinguish from real budget exhaustion (including in track-only
mode and for nested budgets), confusing downstream handling/logging.
Code

shekel/_budget.py[R662-671]

    def _record_spend(self, cost: float, model: str, tokens: dict[str, int]) -> None:
+        if self._paused_externally:
+            from shekel.exceptions import BudgetExceededError  # noqa: PLC0415
+
+            raise BudgetExceededError(
+                spent=self._spent,
+                limit=self.max_usd or 0.0,
+                model=model,
+                tokens=tokens,
+            )
Evidence
The compliance rule requires _record_spend() to begin with a paused check using `getattr(self,
"_paused_externally", False) and to raise BudgetExceededError` with the specific message/reason
"Budget paused by Kubernetes controller", but the current implementation checks
self._paused_externally directly and raises BudgetExceededError without any pause-specific
reason. Additionally, the paused path supplies limit=self.max_usd or 0.0, and since
BudgetExceededError.__str__ formats errors as "Budget of ${limit:.2f} exceeded", paused budgets
without max_usd will be rendered as "$0.00 exceeded", masking the kill-switch condition and
potentially misreporting the effective limit for nested budgets.

Paused enforcement check is executed at the top of Budget._record_spend()
shekel/_budget.py[662-671]
shekel/exceptions.py[52-92]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Fix `Budget._record_spend()` so that external pause enforcement uses the required `getattr(self, "_paused_externally", False)` guard at the top of the method and raises a `BudgetExceededError` that clearly indicates "Budget paused by Kubernetes controller" (and does not format as a misleading normal "Budget of $X exceeded" exhaustion error, especially when `max_usd` is unset).

## Issue Context
The Kubernetes kill-switch compliance spec requires the paused check to run first in `_record_spend()` and to raise the specified error/reason. Today the code references `self._paused_externally` directly and raises `BudgetExceededError` without the required paused-specific reason; it also commonly sets `limit=self.max_usd or 0.0`, which—given `BudgetExceededError.__str__` formats as "Budget of ${limit:.2f} exceeded"—causes paused budgets (notably when `max_usd` is unset / track-only) to appear as "$0.00 exceeded" and makes pause vs exceed indistinguishable to logs and downstream handlers.

## Fix Focus Areas
- shekel/_budget.py[662-705]
- shekel/exceptions.py[52-92]
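
One way to make a pause distinguishable while keeping existing `except BudgetExceededError` handlers working is a dedicated subclass, which is the direction the PR summary describes with `BudgetPausedError`. The sketch below uses simplified stand-in classes; the real constructors in `shekel/exceptions.py` take more arguments, and `record_spend` is a hypothetical free function standing in for `Budget._record_spend()`.

```python
from types import SimpleNamespace


class BudgetExceededError(Exception):
    """Simplified stand-in for the class in shekel/exceptions.py."""

    def __init__(self, spent: float, limit: float) -> None:
        super().__init__(spent, limit)
        self.spent = spent
        self.limit = limit

    def __str__(self) -> str:
        return f"Budget of ${self.limit:.2f} exceeded"


class BudgetPausedError(BudgetExceededError):
    """Raised when the controller pauses the budget; no limit is involved."""

    def __init__(self, spent: float) -> None:
        super().__init__(spent=spent, limit=0.0)

    def __str__(self) -> str:
        return "Budget paused by Kubernetes controller"


def record_spend(budget, cost: float) -> None:
    # The getattr guard tolerates budgets that never had the attribute set.
    if getattr(budget, "_paused_externally", False):
        raise BudgetPausedError(spent=budget._spent)
    budget._spent += cost


budget = SimpleNamespace(_spent=0.0, _paused_externally=True)
try:
    record_spend(budget, 0.01)
except BudgetExceededError as err:
    paused = isinstance(err, BudgetPausedError)
    message = str(err)
```

Callers that need to treat a pause differently can catch `BudgetPausedError` first; everything else continues to see a `BudgetExceededError`.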



5. Poller not restarted ✓ Resolved 🐞 Bug ☼ Reliability
Description
Budget.__exit__/__aexit__ stop the K8s poller/reporter, but those threads are only started once
during Budget.__init__, so reusing a Budget across multiple with blocks leaves kill-switch
polling and spend reporting permanently disabled after the first exit. This contradicts the
documented “session budget” usage where the same Budget instance is entered multiple times.
Code

shekel/_budget.py[R491-494]

+        if self._k8s_reporter is not None:
+            self._k8s_reporter.flush_and_stop()
+        if self._k8s_poller is not None:
+            self._k8s_poller.stop()
Evidence
The codebase documents reusing a single Budget instance across multiple with blocks, but the new
K8s threads are started only in __init__ and are explicitly stopped on every context exit; there
is no corresponding restart path on __enter__/__aenter__.

shekel/_budget.py[128-134]
shekel/_budget.py[439-495]
shekel/_budget.py[584-639]
shekel/_budget.py[303-319]
shekel/integrations/kubernetes.py[140-157]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`__exit__`/`__aexit__` call `stop()`/`flush_and_stop()` on the K8s poller/reporter, but the only place they are created/started is `Budget.__init__` via `apply_k8s_config(self)`. Budgets are documented as reusable across multiple `with` blocks, so after the first context exit the background threads will never be restarted and K8s kill-switch/reporting silently stops working.

## Issue Context
The `Budget` docstring shows a “Session budget (accumulates across multiple with-blocks)” pattern, which re-enters the same instance multiple times.

## Fix Focus Areas
- shekel/_budget.py[128-134]
- shekel/_budget.py[439-495]
- shekel/_budget.py[584-639]
- shekel/_budget.py[303-319]
- shekel/integrations/kubernetes.py[140-157]

## Suggested fix approach
- Make poller/reporter lifecycle match budget lifecycle:
 - Either **do not stop** the poller/reporter in `__exit__/__aexit__` (let them run for the object’s lifetime),
 - OR, if stopping on exit is desired, then **restart** them on `__enter__/__aenter__` when K8s mode is active and the prior thread is stopped.
- If restarting: set `_k8s_poller/_k8s_reporter` to `None` after stopping, and ensure start logic is idempotent (don’t spawn duplicates on nested enters).
- Consider joining threads (bounded) if you require a clean shutdown before returning from exit.
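
The restart-on-enter option might look like the sketch below. A stopped `threading.Thread` cannot be started again, so a fresh instance is created on each outermost entry. `PollerSketch`, `BudgetSketch`, and the depth counter are hypothetical; the real change may track nesting differently.

```python
import threading
from typing import Optional


class PollerSketch(threading.Thread):
    def __init__(self, interval: float) -> None:
        super().__init__(daemon=True)
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self) -> None:
        while not self._stop_event.wait(self._interval):
            pass  # the real poller would fetch the ConfigMap here

    def stop(self) -> None:
        self._stop_event.set()


class BudgetSketch:
    def __init__(self, poll_interval: float = 0.01) -> None:
        self._poll_interval = poll_interval
        self._depth = 0
        self._poller: Optional[PollerSketch] = None

    def __enter__(self) -> "BudgetSketch":
        self._depth += 1
        if self._poller is None:  # idempotent: nested enters reuse the thread
            self._poller = PollerSketch(self._poll_interval)
            self._poller.start()
        return self

    def __exit__(self, *exc) -> bool:
        self._depth -= 1
        if self._depth == 0 and self._poller is not None:
            self._poller.stop()
            self._poller.join(timeout=1)  # bounded wait for a clean shutdown
            self._poller = None
        return False
```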



6. Per-pod cap recursion ✓ Resolved 🐞 Bug ≡ Correctness
Description
apply_k8s_config() creates a new Budget for per_pod_cap, but every Budget.__init__ also
calls apply_k8s_config() in a K8s environment, causing recursive Budget construction until a
RecursionError / huge nested object chain. This can make Budget() construction extremely slow or
unstable whenever the ConfigMap includes per_pod_cap.
Code

shekel/integrations/kubernetes.py[R112-117]

+    # --- per_pod_cap ---
+    if "per_pod_cap" in cm:
+        from shekel._budget import Budget as _Budget  # noqa: PLC0415
+
+        budget._per_pod_budget = _Budget(max_usd=float(cm["per_pod_cap"]))
+
Evidence
apply_k8s_config() constructs a Budget when per_pod_cap is present, and Budget.__init__
unconditionally invokes apply_k8s_config(self) (under the env gate). Since the env gate is
process-wide, the child budget re-enters the same path and repeats the construction.

shekel/integrations/kubernetes.py[31-33]
shekel/integrations/kubernetes.py[69-117]
shekel/_budget.py[303-319]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`apply_k8s_config()` constructs a child `Budget` when `per_pod_cap` is present, but `Budget.__init__` always calls `apply_k8s_config()` again under the same env gate. This creates unbounded recursive construction and can lead to `RecursionError`, excessive ConfigMap reads, and runaway object graphs.

## Issue Context
- K8s integration is activated purely by process env vars (`KUBERNETES_SERVICE_HOST` + `SHEKEL_BUDGET_NAME`), so any internal `Budget(...)` created in-process inherits the same K8s activation.
- The per-pod cap concept does not require a fully auto-discovered `Budget` instance; it can be represented as a plain float cap or a `Budget` constructed with a “skip k8s” guard.

## Fix Focus Areas
- shekel/integrations/kubernetes.py[69-117]
- shekel/_budget.py[303-319]

## Suggested fix approach
- Option A (simplest): store `per_pod_cap` as a float on the budget (e.g. `_per_pod_cap_usd: float | None`) instead of constructing a new `Budget`.
- Option B: add an internal constructor flag (e.g. `Budget(..., _skip_k8s=True)` or similar) so internal helper budgets do not invoke `apply_k8s_config`.
- Ensure the chosen approach cannot recurse when `per_pod_cap` is present in the ConfigMap.
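
Option A can be illustrated with a toy version of the constructor path. `BudgetSketch`, `apply_config`, and the module-level `CONFIGMAP` are hypothetical stand-ins for `Budget`, `apply_k8s_config()`, and the fetched ConfigMap.

```python
from typing import Mapping, Optional

CONFIGMAP = {"per_pod_cap": "0.25"}  # stand-in for the fetched ConfigMap


class BudgetSketch:
    def __init__(self, max_usd: Optional[float] = None) -> None:
        self.max_usd = max_usd
        self._per_pod_cap_usd: Optional[float] = None
        # Every constructor call applies the config, as in the K8s env gate.
        apply_config(self, CONFIGMAP)


def apply_config(budget: BudgetSketch, cm: Mapping[str, str]) -> None:
    # Option A: keep the cap as a plain float. No BudgetSketch is constructed
    # here, so __init__ cannot be re-entered and recursion is impossible.
    if "per_pod_cap" in cm:
        budget._per_pod_cap_usd = float(cm["per_pod_cap"])
```

Had `apply_config` built a `BudgetSketch(max_usd=...)` instead, each child constructor would call `apply_config` again and never terminate.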



7. K8s paused flag not locked 📎 Requirement gap ☼ Reliability
Description
Kubernetes polling writes to budget._paused_externally from a background thread without any
synchronization, and the budget does not define or use a lock to protect concurrent reads/writes.
This violates the thread-safety requirement and can cause race conditions during spend recording.
Code

shekel/integrations/kubernetes.py[R185-189]

+        while not self._stop_event.wait(self._interval):
+            cm = _fetch_configmap(self._budget_name, self._namespace)
+            if cm is not None:
+                self._budget._paused_externally = cm.get("paused") == "true"
+
Evidence
The rule requires thread-safe access to _paused_externally (writes protected by a lock) and
correct async scheduling behavior. The new code assigns budget._paused_externally directly in both
initial ConfigMap application and in the poller thread loop, with no lock shown on the budget or
around these writes.

Background poll runs as daemon and is stopped safely on context exit; thread safety is maintained
shekel/_budget.py[303-320]
shekel/integrations/kubernetes.py[85-88]
shekel/integrations/kubernetes.py[153-158]
shekel/integrations/kubernetes.py[185-189]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The K8s poller mutates `budget._paused_externally` from a background thread without a lock, and `Budget._record_spend()` reads it without synchronization. The compliance rule requires writes to `_paused_externally` to be protected by a `threading.Lock` (or equivalent) and also specifies async budgets should use `asyncio.create_task` when an event loop is running.

## Issue Context
`KubernetesPoller` runs in a daemon thread and periodically updates the paused flag. Without a lock, concurrent access can race with `_record_spend()`.

## Fix Focus Areas
- shekel/_budget.py[303-320]
- shekel/_budget.py[662-671]
- shekel/integrations/kubernetes.py[85-88]
- shekel/integrations/kubernetes.py[153-158]
- shekel/integrations/kubernetes.py[160-189]
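
A lock-protected flag is one way to satisfy the rule quoted above; `threading.Event` would be another. The sketch below is an assumption about shape only: `PausedFlag` and `poll_once` are hypothetical names, and `poll_once` mirrors a single iteration of the poller loop shown in the diff.

```python
import threading
from typing import Mapping, Optional


class PausedFlag:
    """Boolean whose reads and writes both go through one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._paused = False

    def set(self, value: bool) -> None:
        with self._lock:
            self._paused = value

    def get(self) -> bool:
        with self._lock:
            return self._paused


def poll_once(flag: PausedFlag, cm: Optional[Mapping[str, str]]) -> None:
    # A failed fetch (None) leaves the previous value in place.
    if cm is not None:
        flag.set(cm.get("paused") == "true")
```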




Remediation recommended

8. Reporter skips spend on raises ✓ Resolved 🐞 Bug ≡ Correctness
Description
Budget._record_spend() calls KubernetesSpendReporter.on_spend(cost) only after running checks that
may raise (USD limit, call limit, per-pod cap). When an exception is raised, the reporter never
records the cost of that already-incurred LLM call, under-reporting spend.
Code

shekel/_budget.py[R700-705]

        self._check_warn()
        self._check_limit()
        self._check_call_limit()
+        self._check_per_pod_limit()
+        if self._k8s_reporter is not None:
+            self._k8s_reporter.on_spend(cost)
Evidence
_record_spend runs several limit checks before calling the reporter; any of those checks can raise
after spend was accumulated. The reporter’s totals only advance via on_spend(), so skipping it loses
that call’s cost.

shekel/_budget.py[681-706]
shekel/_budget.py[774-812]
shekel/integrations/kubernetes.py[221-228]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Spend reporting (`_k8s_reporter.on_spend(cost)`) happens after enforcement checks that can raise, so the final (exceeding) call’s cost is not included in reported totals.

### Issue Context
`_record_spend` increments `self._spent` first, so the cost is real and should be reported even if a limit is exceeded on that call.

### Fix Focus Areas
- shekel/_budget.py[681-706]

### Suggested fix
- Ensure `on_spend(cost)` is executed after `self._spent += cost` but before any checks that may raise, **or** wrap the check section in a `try: ... finally:` that calls `on_spend(cost)` when `_k8s_reporter` is set.
- Keep semantics consistent with existing spend/accounting: reporter should reflect the same `_spent` that enforcement uses.
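
The `try: ... finally:` variant can be sketched as below. `record_spend`, `LimitExceeded`, and the dict-based state are hypothetical simplifications of `Budget._record_spend()` and its checks.

```python
from typing import Callable, Dict


class LimitExceeded(Exception):
    """Hypothetical stand-in for BudgetExceededError."""


def record_spend(
    state: Dict[str, float],
    cost: float,
    limit: float,
    on_spend: Callable[[float], None],
) -> None:
    state["spent"] += cost  # the cost is already incurred at this point
    try:
        if state["spent"] > limit:
            raise LimitExceeded(f"limit of ${limit:.2f} exceeded")
    finally:
        # Runs whether or not a check raised, so the exceeding call is reported.
        on_spend(cost)
```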



9. K8s errors swallowed ✓ Resolved 🐞 Bug ◔ Observability
Description
Budget.__init__ wraps apply_k8s_config(self) in except Exception: pass, so parsing bugs,
recursion issues, or other unexpected failures can silently disable K8s
discovery/kill-switch/reporting with no log or signal. This makes production misconfiguration and
regressions extremely difficult to detect and debug.
Code

shekel/_budget.py[R314-319]

+        try:
+            from shekel.integrations.kubernetes import apply_k8s_config  # noqa: PLC0415
+
+            apply_k8s_config(self)
+        except Exception:
+            pass
Evidence
The new try/except Exception: pass around apply_k8s_config will suppress any exception escaping
config application, even though config parsing includes float()/int() conversions that can raise
and would otherwise indicate bad ConfigMap data or a logic bug.

shekel/_budget.py[303-319]
shekel/integrations/kubernetes.py[89-110]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`Budget.__init__` swallows all exceptions from `apply_k8s_config`, which can hide real failures (e.g., value parsing errors, recursion errors, reporter startup issues) and lead to silent loss of K8s features.

## Issue Context
`_fetch_configmap()` already handles and logs many “can’t talk to Kubernetes” failures; the outer blanket `except Exception` mainly hides bugs and config parsing issues inside `apply_k8s_config()`.

## Fix Focus Areas
- shekel/_budget.py[303-319]
- shekel/integrations/kubernetes.py[69-158]

## Suggested fix approach
- Narrow the exception handling:
 - Catch `ImportError` (or a small known set) to preserve the “optional dependency” behavior.
 - For other exceptions, at minimum `logger.warning(..., exc_info=True)` so failures are visible.
- Avoid swallowing `RecursionError`/`KeyboardInterrupt`/`SystemExit`.
- Consider returning early with a warning when ConfigMap values are invalid (e.g., float/int conversion fails) so users can fix their ConfigMap.
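
A rough sketch of the narrowed handling, assuming the config step is passed in as a callable. `safe_apply` and the logger name are hypothetical; the real code lives inline in `Budget.__init__`.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shekel.k8s.sketch")


def safe_apply(apply: Callable[[Any], None], budget: Any) -> bool:
    """Return True when the config was applied, False when it was skipped."""
    try:
        apply(budget)
    except ImportError:
        # Optional dependency not installed: stay silent by design.
        return False
    except RecursionError:
        # RecursionError is an Exception subclass, so re-raise it explicitly.
        raise
    except Exception:
        # Bad ConfigMap values and other bugs become visible in the logs.
        logger.warning("Kubernetes config could not be applied", exc_info=True)
        return False
    return True
```

`KeyboardInterrupt` and `SystemExit` derive from `BaseException`, so `except Exception` does not catch them and no extra clause is needed.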




Informational

10. Per-pod tests leave pollers ✓ Resolved 🐞 Bug ☼ Reliability
Description
The new per-pod cap tests construct K8s-enabled Budgets without stopping the background poller
thread afterwards. This can leave extra daemon threads running across the test suite and introduce
nondeterministic background activity/noise.
Code

tests/test_kubernetes_integration.py[R573-582]

+    def test_per_pod_cap_stored_as_float(self) -> None:
+        b = _budget_with_k8s({"per_pod_cap": "0.25"})
+        assert b._per_pod_cap_usd == pytest.approx(0.25)
+
+    def test_per_pod_cap_does_not_recurse(self) -> None:
+        # Regression for SHEK-26: constructing a Budget with per_pod_cap in the
+        # ConfigMap must not trigger infinite recursion via nested Budget.__init__ calls.
+        b = _budget_with_k8s({"per_pod_cap": "0.10"})
+        assert b._per_pod_cap_usd == pytest.approx(0.10)
+        assert not hasattr(b, "_per_pod_budget")
Evidence
The helper returns a Budget without cleanup, and the new per-pod cap tests call it without using the
context manager, so no stop path is triggered for the poller.

tests/test_kubernetes_integration.py[37-74]
tests/test_kubernetes_integration.py[572-599]
shekel/integrations/kubernetes.py[151-156]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The per-pod cap tests create budgets via `_budget_with_k8s()` and never stop the K8s poller thread for tests that don’t enter/exit the budget context.

### Issue Context
`_budget_with_k8s()` returns a constructed `Budget` and does not perform any cleanup; `Budget.__exit__` is the path that stops the poller.

### Fix Focus Areas
- tests/test_kubernetes_integration.py[573-599]

### Suggested fix
- In tests that don’t use `with b:`, explicitly stop (and ideally `join`) the poller after assertions:
 - `if b._k8s_poller: b._k8s_poller.stop(); b._k8s_poller.join(timeout=1)`
- Alternatively, use `with b:` in these tests to ensure `__exit__` runs and stops the poller.



…ate infinite recursion

Constructing a child Budget for per_pod_cap caused unbounded recursion
because KUBERNETES_SERVICE_HOST/SHEKEL_BUDGET_NAME are still set in the
process, triggering apply_k8s_config() on every child __init__.

Store the cap as _per_pod_cap_usd: float | None instead, and enforce it
via a new _check_per_pod_limit() method called inside _record_spend().
Adds 4 regression tests including an explicit no-recursion guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arieradle

Owner Author

/review

@qodo-code-review

qodo-code-review Bot commented Jun 10, 2026


Code review by qodo was updated up to the latest commit f2cb46e

After __exit__ stops the K8s threads, re-entering the same Budget instance
(session budget pattern) left K8s polling and spend reporting permanently
dead. Python threads cannot be restarted, so a new instance must be created.

Persist budget_name, namespace, and poll_interval on the budget during
apply_k8s_config(), then call _restart_k8s_threads() from __enter__ and
__aenter__ to rebuild any stopped threads idempotently. Adds 6 regression
tests including sync and async re-entry, idempotency, and no-K8s no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arieradle and others added 7 commits June 10, 2026 08:41
Blanket except Exception: pass hid bad ConfigMap values, recursion bugs,
and any other apply_k8s_config failure with no log or signal. Split into
ImportError (silent — optional dep not installed) and Exception (warning
with exc_info so operators can diagnose misconfigured ConfigMaps).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tures limit-exceeding call cost

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xture; fix black line-length

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r kill-switch from budget exhaustion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ibuted enforcement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cope_mode=shared (group-scoped Redis key) into runtime

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arieradle

Owner Author

@qodo-code-review[bot] - did the last commit resolve the gaps?

- Fix _check_redis_limit raise path (lines 859-861): test used max_usd=0.05
  which caused _check_limit to fire first, never reaching redis path
- Add chain() tests to cover happy path and invalid-arg guard (lines 1362-1365)
- Add _litellm_available() skipif to TestLiteLLMPatching and
  TestLiteLLMCostRecording so they skip when litellm optional dep is absent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@arieradle arieradle merged commit 4163b71 into main Jun 17, 2026
9 checks passed