WIP Add retry diagnostic logging and enable pprof on all metrics servers #3166
afcollins wants to merge 2 commits into openshift:master
Conversation
Signed-off-by: Andrew Collins <ancollin@redhat.com>
Walkthrough
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✅ Passed checks (4 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: afcollins. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
go-controller/pkg/retry/obj_retry.go (2)
364-364: Retry attempt counter is off/ambiguous in logs.
Line 364 uses `entry.failedAttempts` as the attempt number, which can emit `attempt 0` on the first retry-loop execution. Consider logging `entry.failedAttempts + 1` (or renaming the label to "failed attempts") for clearer diagnostics.
Suggested fix
- klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+ attempt := int(entry.failedAttempts) + 1
+ klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, attempt, entry.backoff)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go-controller/pkg/retry/obj_retry.go` at line 364, The log prints a confusing attempt count using entry.failedAttempts which is zero on the first retry; update the logging in the retry setup (the klog.Infof call that references r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff) to show a 1-based attempt number by logging entry.failedAttempts + 1 (or alternatively change the label to "failed attempts" if you prefer keeping it zero-based) so the message reads an intuitive attempt count.
472-472: Consider lowering sweep logs from info to verbose level.
These two logs run every sweep while entries exist and can create significant log volume in busy clusters. Moving them to a verbose level keeps diagnostics available without default noise.
Suggested fix
- klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+ klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
...
- klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+ klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
Also applies to: 485-487
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go-controller/pkg/retry/obj_retry.go` at line 472, The sweep logging in obj_retry.go is too noisy at Info level; change the klog.Infof calls that report pending retry entries (the line using r.name, len(entriesKeys), and r.ResourceHandler.ObjType and the similar block at lines ~485-487) to a verbose level (e.g., klog.V(2).Infof) so the same formatted message remains available but only appears when verbose logging is enabled.
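The verbosity gating the nitpick asks for follows klog's contract: `klog.V(4).Infof(...)` only emits when the process runs with `-v=4` or higher. The stdlib-only sketch below reproduces that contract in isolation so the gating behavior is easy to see; the `verbose` type and `V` function here are stand-ins for klog, not its real implementation.

```go
package main

import (
	"fmt"
	"os"
)

// verbosity stands in for klog's -v flag; klog.V(n).Infof only emits
// when the configured verbosity is >= n.
var verbosity = 2

type verbose struct {
	enabled bool
}

// V reports whether messages at the given level should be emitted.
func V(level int) verbose { return verbose{enabled: level <= verbosity} }

// Infof prints the message only when the level was enabled at call time.
func (v verbose) Infof(format string, args ...any) {
	if v.enabled {
		fmt.Fprintf(os.Stdout, format+"\n", args...)
	}
}

func main() {
	V(2).Infof("always useful: %d entries failed", 3)  // emitted at default verbosity
	V(4).Infof("sweep detail: %d entries pending", 17) // suppressed unless -v>=4
	verbosity = 4
	V(4).Infof("sweep detail: %d entries pending", 17) // now emitted
}
```

This is why moving the sweep logs to `V(4)` keeps the exact same message and arguments available for debugging while removing them from default output.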
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go-controller/pkg/retry/obj_retry.go`:
- Around line 266-270: The first-failure log uses the hardcoded initialBackoff
constant which can be incorrect for entries initialized with different backoff
(e.g., noBackoff); update the klog.Infof in the block checking
entry.failedAttempts == 1 to log entry.backoff instead of initialBackoff so the
message reflects the actual backoff state (keep r.name and
r.ResourceHandler.ObjType and entry.failedAttempts in the format string as
before).
---
Nitpick comments:
In `@go-controller/pkg/retry/obj_retry.go`:
- Line 364: The log prints a confusing attempt count using entry.failedAttempts
which is zero on the first retry; update the logging in the retry setup (the
klog.Infof call that references r.name, r.ResourceHandler.ObjType, objKey,
entry.failedAttempts, entry.backoff) to show a 1-based attempt number by logging
entry.failedAttempts + 1 (or alternatively change the label to "failed attempts"
if you prefer keeping it zero-based) so the message reads an intuitive attempt
count.
- Line 472: The sweep logging in obj_retry.go is too noisy at Info level; change
the klog.Infof calls that report pending retry entries (the line using r.name,
len(entriesKeys), and r.ResourceHandler.ObjType and the similar block at lines
~485-487) to a verbose level (e.g., klog.V(2).Infof) so the same formatted
message remains available but only appears when verbose logging is enabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: acb791d5-f2f0-4e58-ade0-ce7bcf877c4b
📒 Files selected for processing (2)
go-controller/cmd/ovnkube/ovnkube.go
go-controller/pkg/retry/obj_retry.go
  if entry.failedAttempts == 1 {
  	klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
  } else if entry.failedAttempts%5 == 0 {
  	klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
  }
First-failure backoff message can report the wrong value.
Line 267 logs a hardcoded initialBackoff, but the effective backoff state may differ (e.g., entries inserted with noBackoff). Logging entry.backoff is more accurate for diagnostics.
Suggested fix
- if entry.failedAttempts == 1 {
- klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
+ if entry.failedAttempts == 1 {
+ klog.Infof("%s: %s first retry failure queued, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.backoff)
} else if entry.failedAttempts%5 == 0 {
klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go-controller/pkg/retry/obj_retry.go` around lines 266 - 270, The
first-failure log uses the hardcoded initialBackoff constant which can be
incorrect for entries initialized with different backoff (e.g., noBackoff);
update the klog.Infof in the block checking entry.failedAttempts == 1 to log
entry.backoff instead of initialBackoff so the message reflects the actual
backoff state (keep r.name and r.ResourceHandler.ObjType and
entry.failedAttempts in the format string as before).
Force-pushed 940c64f to 94e9e4d (Compare)
/test pull-ci-openshift-ovn-kubernetes-master-qe-perfscale-payload-control-plane-6nodes
/test qe-perfscale-payload-control-plane-6nodes
Signed-off-by: Andrew Collins <ancollin@redhat.com>
Force-pushed 94e9e4d to 7e4ae70 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
go-controller/pkg/retry/obj_retry.go (1)
261-270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Prevent warning-log spam after failed-attempt counter saturation.
Once `failedAttempts` reaches 255, the `%5 == 0` condition remains true forever, so infinite-retry entries can warn on every failure cycle.
Suggested fix
  func (r *RetryFramework) increaseFailedAttemptsCounter(entry *retryObjEntry) {
+ 	prev := entry.failedAttempts
  	// avoid overflowing the counter for infinite retries
  	if entry.failedAttempts < 255 {
  		entry.failedAttempts++
  	}
  	if entry.failedAttempts == 1 {
  		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
- 	} else if entry.failedAttempts%5 == 0 {
+ 	} else if entry.failedAttempts%5 == 0 && entry.failedAttempts != prev {
  		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
  	}
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go-controller/pkg/retry/obj_retry.go` around lines 261 - 270, The periodic warning log in increaseFailedAttemptsCounter() will spam once failedAttempts saturates at 255 because 255%5==0; update the logging condition so the "%5==0" branch only runs for values below the saturation threshold: keep the existing increment and first-failure log, but change the else-if to check both entry.failedAttempts%5==0 AND entry.failedAttempts < 255 (or use a symbolic constant for maxAttempts) so warnings stop once the counter is saturated; reference function increaseFailedAttemptsCounter and field failedAttempts and backoff in retryObjEntry.
🧹 Nitpick comments (1)
go-controller/pkg/retry/obj_retry.go (1)
472-472: ⚡ Quick win | Consider moving sweep-level diagnostics behind verbosity gating.
These periodic sweep logs at info level can become noisy under sustained retries; `V(4/5)` would preserve diagnostics with less default log pressure.
Suggested refactor
- klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+ klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
@@
- klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+ klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
Also applies to: 485-487
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go-controller/pkg/retry/obj_retry.go` at line 472, The periodic sweep diagnostic log using klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType) (and the similar logs around the 485-487 region) should be gated behind a verbosity check; change these Infof calls to use klog.V(4).Infof (or V(5) per desired noise level) so the messages are only emitted when verbose logging is enabled, ensuring the same message and interpolation of r.name, len(entriesKeys) and r.ResourceHandler.ObjType are preserved but conditional on klog.V(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go-controller/pkg/retry/obj_retry.go`:
- Line 364: The diagnostic log uses entry.backoff and entry.failedAttempts after
they were advanced, so it reports the next values instead of the current ones;
capture the current values before they are mutated (e.g., store currentBackoff
:= entry.backoff and currentAttempts := entry.failedAttempts) and use those
captured variables in the klog.Infof call that references r.name,
r.ResourceHandler.ObjType, objKey, entry.failedAttempts, and entry.backoff so
the log accurately reflects the state at the time of this retry step.
---
Outside diff comments:
In `@go-controller/pkg/retry/obj_retry.go`:
- Around line 261-270: The periodic warning log in
increaseFailedAttemptsCounter() will spam once failedAttempts saturates at 255
because 255%5==0; update the logging condition so the "%5==0" branch only runs
for values below the saturation threshold: keep the existing increment and
first-failure log, but change the else-if to check both
entry.failedAttempts%5==0 AND entry.failedAttempts < 255 (or use a symbolic
constant for maxAttempts) so warnings stop once the counter is saturated;
reference function increaseFailedAttemptsCounter and field failedAttempts and
backoff in retryObjEntry.
---
Nitpick comments:
In `@go-controller/pkg/retry/obj_retry.go`:
- Line 472: The periodic sweep diagnostic log using klog.Infof("%s: retry cache
has %d %s entries pending retry", r.name, len(entriesKeys),
r.ResourceHandler.ObjType) (and the similar logs around the 485-487 region)
should be gated behind a verbosity check; change these Infof calls to use
klog.V(4).Infof (or V(5) per desired noise level) so the messages are only
emitted when verbose logging is enabled, ensuring the same message and
interpolation of r.name, len(entriesKeys) and r.ResourceHandler.ObjType are
preserved but conditional on klog.V(...).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 7e36603a-4b30-49a9-b2d8-28d9a38a1f8b
📒 Files selected for processing (1)
go-controller/pkg/retry/obj_retry.go
  }

- klog.Infof("%s: retry object setup: %s %s", r.name, r.ResourceHandler.ObjType, objKey)
+ klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
Retry diagnostic log reports ambiguous attempt/backoff values.
At Line 364, entry.backoff is already updated to the next value, so the log can misrepresent the current retry timing.
Suggested fix
+ currentBackoff := entry.backoff
+ attempt := int(entry.failedAttempts) + 1
// update backoff for future attempts in case of failure
- entry.backoff = entry.backoff * 2
+ entry.backoff = currentBackoff * 2
if entry.backoff > maxBackoff {
entry.backoff = maxBackoff
}
@@
- klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+ klog.Infof("%s: retry object setup: %s %s (attempt %d, current backoff %s, next backoff %s)",
+ 		r.name, r.ResourceHandler.ObjType, objKey, attempt, currentBackoff, entry.backoff)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@go-controller/pkg/retry/obj_retry.go` at line 364, The diagnostic log uses
entry.backoff and entry.failedAttempts after they were advanced, so it reports
the next values instead of the current ones; capture the current values before
they are mutated (e.g., store currentBackoff := entry.backoff and
currentAttempts := entry.failedAttempts) and use those captured variables in the
klog.Infof call that references r.name, r.ResourceHandler.ObjType, objKey,
entry.failedAttempts, and entry.backoff so the log accurately reflects the state
at the time of this retry step.
@afcollins: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
📑 Description
We sometimes see a lot of retries. I need to run some tests from a PR to see if I can gather more information.
Additional Information for reviewers
✅ Checks
How to verify it
Summary by CodeRabbit
Refactor
Chores