fix: reduce noisy PodClique NotFound logs during cascade-delete #641

Open
AsadShahid04 wants to merge 1 commit into ai-dynamo:main from AsadShahid04:fix/podclique-notfound-log-level-622

Summary

  • Lower the `GetPodClique` not-found log from `Info` to `V(1)` (debug) so routine cascade-delete reconciles no longer spam the log at default verbosity
  • Add a contextual `Info`-level log in the PCS gang-termination path (`getMinAvailableBreachedPCLQsNotInPCSG`) when a PodClique expected by a live PodCliqueSet replica is missing, preserving visibility for unexpected absences
  • Add unit tests: rename the `not_found` test case to clarify it exercises the ignore-not-found path, and add a new `TestGetPodCliqueIgnoreNotFoundFalse` covering error propagation when `ignoreNotFound=false`

Problem

After the cascade-delete change in #556, deleting a PodCliqueSet removes the PCS and its owned PodCliques. The PodClique controller then receives reconcile events for already-deleted PodCliques and logs an info-level "PodClique not found" message for each one. At scale (e.g. 5000-replica workloads) this produces thousands of expected but noisy log entries during normal deletion.

Solution

operator/internal/controller/utils/reconciler.go: Change the GetPodClique not-found branch from logger.Info to logger.V(1).Info. The operator normally runs at verbosity 0, so this message no longer appears in default logs but remains available with -v=1.

operator/internal/controller/podcliqueset/components/podcliquesetreplica/gangterminate.go: In getMinAvailableBreachedPCLQsNotInPCSG, the function already collects the names of expected PodCliques that are missing. Previously it silently skipped the replica's MinAvailable evaluation. Now it first emits an Info log naming the missing PodCliques and the affected replica index. Info is the correct level here: the PCS is alive and not deleting, so a missing PodClique is a transient but observable state (e.g. creation still in flight) rather than an expected side effect of cascade-delete.
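The skip-and-log behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the code in gangterminate.go: `missingPodCliques`, `evaluateReplica`, the message wording, and the example PodClique names are all hypothetical, and the real function works on Kubernetes objects rather than plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// missingPodCliques returns the expected names that are absent from existing,
// sorted so the resulting log message is stable.
func missingPodCliques(expected []string, existing map[string]bool) []string {
	var missing []string
	for _, name := range expected {
		if !existing[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

// evaluateReplica reports whether MinAvailable evaluation can proceed for the
// replica. When expected PodCliques are missing it returns false together with
// the Info-level message that names them and the replica index.
func evaluateReplica(replicaIndex int, expected []string, existing map[string]bool) (bool, string) {
	missing := missingPodCliques(expected, existing)
	if len(missing) == 0 {
		return true, ""
	}
	msg := fmt.Sprintf("skipping MinAvailable evaluation: replicaIndex=%d missingPodCliques=%v",
		replicaIndex, missing)
	return false, msg
}

func main() {
	// Hypothetical names: one PodClique exists, one is still missing.
	existing := map[string]bool{"pcs-0-frontend": true}
	proceed, msg := evaluateReplica(0, []string{"pcs-0-frontend", "pcs-0-worker"}, existing)
	fmt.Println(proceed, msg)
}
```

Returning the message instead of logging it directly keeps the sketch easy to check; the point is only that the skip is now accompanied by a message carrying both the missing names and the replica index.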

Testing

  • go build ./... — passed
  • go test ./internal/... — all unit tests pass (no cluster required)
  • New test TestGetPodCliqueIgnoreNotFoundFalse verifies error propagation; updated not_found_ignore case documents debug-level behaviour

Closes #622

During normal PCS cascade-delete, the PodClique controller received
reconcile requests for already-deleted PodCliques and emitted one
info-level "PodClique not found" log per deleted object, producing
significant log noise at scale.

Lower the GetPodClique not-found log from Info to V(1) (debug) so
expected cascade-delete reconciles are silent at the default log level.

Add a contextual Info log in getMinAvailableBreachedPCLQsNotInPCSG so
that when a PodClique expected by a live PodCliqueSet replica is missing
(unexpected while the parent is not deleting) the situation remains
visible: the log records which PodCliques are absent and which replica
index was skipped for MinAvailable evaluation.

Extend reconciler_test.go to cover GetPodClique with
ignoreNotFound=false, confirming the error is propagated correctly.

Closes ai-dynamo#622

Signed-off-by: OpenClaw Agent <agent@openclaw.local>
copy-pr-bot Bot commented Jun 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Development

Successfully merging this pull request may close these issues.

Reduce noisy PodClique NotFound logs during normal cascade-delete flow