
WIP Add retry diagnostic logging and enable pprof on all metrics servers #3166

Open

afcollins wants to merge 2 commits into openshift:master from afcollins:perfscale-retry-logging

Conversation


@afcollins afcollins commented Apr 29, 2026

📑 Description

We are sometimes seeing a lot of retries. I need to run some tests from a PR to see whether I can gather more information.

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

  • Refactor

    • Simplified OVN metrics server profiling and endpoint sharing behavior to make metrics profiling handling more predictable.
  • Chores

    • Improved retry-queue observability: clearer logs when failures are first enqueued, periodic warnings on repeated failures with backoff info, summaries of sweep progress and completion, and reduced noise for deletes of terminal-state resources.

Signed-off-by: Andrew Collins <ancollin@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026

coderabbitai Bot commented Apr 29, 2026

Walkthrough

ovnkube moves pprof enabling to metrics.MetricServerOptions (set from config.Metrics.EnablePprof); combineMetricsEndpoints no longer mutates opts.EnablePprof. The retry subsystem adds first-failure and periodic warning logs (every 5th attempt with backoff), sweep start/completion logs, and lowers delete-on-terminal-state log verbosity to V(5).

Changes

  • Metrics Server Configuration (go-controller/cmd/ovnkube/ovnkube.go)
    Assigns EnablePprof directly from config.Metrics.EnablePprof into metrics.MetricServerOptions. combineMetricsEndpoints(runMode) no longer mutates opts.EnablePprof; it only switches the Prometheus registry to prometheus.DefaultRegisterer when sharing endpoints.
  • Retry-Queue Observability (go-controller/pkg/retry/obj_retry.go)
    Adds structured logging: logs the initial failure enqueue at first increment, emits warning logs every 5th failed attempt including the current backoff, appends the attempt number and backoff to per-object setup logs, logs pending counts at sweep start and elapsed time with remaining entries at sweep completion, and downgrades delete-on-terminal-state messages to V(5).
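The logging cadence in the retry-queue changes can be sketched as a toy model. The names here (retryEntry, recordFailure, maxAttempts) are assumptions for illustration, not the actual obj_retry.go code: the counter saturates at 255 to avoid overflow on infinite retries, an informational message fires on the first failure, and a warning fires on every fifth.

```go
package main

import "fmt"

// maxAttempts caps the counter so infinite retries do not overflow it.
const maxAttempts = 255

type retryEntry struct {
	failedAttempts uint8
	backoff        int // seconds; doubled elsewhere up to a cap
}

// recordFailure bumps the counter and returns a tag describing what a
// real implementation would log at this point.
func recordFailure(e *retryEntry) string {
	if e.failedAttempts >= maxAttempts {
		return "saturated" // counter stays at 255; stop warning
	}
	e.failedAttempts++
	switch {
	case e.failedAttempts == 1:
		return "first-failure" // Infof: first retry failure queued
	case e.failedAttempts%5 == 0:
		return "warn" // Warningf: failed N times, current backoff
	default:
		return "quiet"
	}
}

func main() {
	e := &retryEntry{backoff: 1}
	for i := 0; i < 6; i++ {
		fmt.Println(recordFailure(e))
	}
}
```

With six consecutive failures this prints first-failure, four quiet/warn entries, and then quiet again, matching the "log once, then warn every fifth" cadence.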

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped in code where metrics hide,

set pprof gently, no more tide.
Retries counted, warnings hum,
sweeps report when errands done.
Small patch, big-heart thump — yum!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately reflects the main changes, adding retry diagnostic logging and enabling pprof on metrics servers, both evident in the file modifications.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: afcollins
Once this PR has been reviewed and has the lgtm label, please assign jcaamano for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
go-controller/pkg/retry/obj_retry.go (2)

364-364: Retry attempt counter is off by one and ambiguous in logs.

Line 364 uses entry.failedAttempts as the attempt number, which can emit attempt 0 on first retry-loop execution. Consider logging entry.failedAttempts+1 (or rename to “failed attempts”) for clearer diagnostics.

Suggested fix
-		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+		attempt := int(entry.failedAttempts) + 1
+		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, attempt, entry.backoff)

472-472: Consider lowering sweep logs from info to verbose level.

These two logs run every sweep while entries exist and can create significant log volume in busy clusters. Moving them to a verbose level keeps diagnostics available without default noise.

Suggested fix
-	klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+	klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
...
-	klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+	klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))

Also applies to: 485-487


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: acb791d5-f2f0-4e58-ade0-ce7bcf877c4b

📥 Commits

Reviewing files that changed from the base of the PR and between ab5bb67 and 940c64f.

📒 Files selected for processing (2)
  • go-controller/cmd/ovnkube/ovnkube.go
  • go-controller/pkg/retry/obj_retry.go

Comment on lines +266 to +270
	if entry.failedAttempts == 1 {
		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
	} else if entry.failedAttempts%5 == 0 {
		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
	}

⚠️ Potential issue | 🟡 Minor

First-failure backoff message can report the wrong value.

Line 267 logs a hardcoded initialBackoff, but the effective backoff state may differ (e.g., entries inserted with noBackoff). Logging entry.backoff is more accurate for diagnostics.

Suggested fix
-	if entry.failedAttempts == 1 {
-		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
+	if entry.failedAttempts == 1 {
+		klog.Infof("%s: %s first retry failure queued, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.backoff)
 	} else if entry.failedAttempts%5 == 0 {
 		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
 	}

@afcollins afcollins force-pushed the perfscale-retry-logging branch from 940c64f to 94e9e4d on April 29, 2026 17:56
@afcollins (Author)

/test pull-ci-openshift-ovn-kubernetes-master-qe-perfscale-payload-control-plane-6nodes

@afcollins (Author)

/test qe-perfscale-payload-control-plane-6nodes

Signed-off-by: Andrew Collins <ancollin@redhat.com>
@afcollins afcollins force-pushed the perfscale-retry-logging branch from 94e9e4d to 7e4ae70 on April 29, 2026 23:53

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
go-controller/pkg/retry/obj_retry.go (1)

261-270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent warning-log spam after failed-attempt counter saturation.

Once failedAttempts reaches 255, the %5==0 condition remains true forever, so infinite-retry entries can warn on every failure cycle.

Suggested fix
func (r *RetryFramework) increaseFailedAttemptsCounter(entry *retryObjEntry) {
+	prev := entry.failedAttempts
 	// avoid overflowing the counter for infinite retries
 	if entry.failedAttempts < 255 {
 		entry.failedAttempts++
 	}
 	if entry.failedAttempts == 1 {
 		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
-	} else if entry.failedAttempts%5 == 0 {
+	} else if entry.failedAttempts%5 == 0 && entry.failedAttempts != prev {
 		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
 	}
}
🧹 Nitpick comments (1)
go-controller/pkg/retry/obj_retry.go (1)

472-472: ⚡ Quick win

Consider moving sweep-level diagnostics behind verbosity gating.

These periodic sweep logs at info level can become noisy under sustained retries; V(4/5) would preserve diagnostics with less default log pressure.

Suggested refactor
-	klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+	klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
@@
-	klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+	klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))

Also applies to: 485-487

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e36603a-4b30-49a9-b2d8-28d9a38a1f8b

📥 Commits

Reviewing files that changed from the base of the PR and between 94e9e4d and 7e4ae70.

📒 Files selected for processing (1)
  • go-controller/pkg/retry/obj_retry.go

 	}

-	klog.Infof("%s: retry object setup: %s %s", r.name, r.ResourceHandler.ObjType, objKey)
+	klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Retry diagnostic log reports ambiguous attempt/backoff values.

At Line 364, entry.backoff is already updated to the next value, so the log can misrepresent the current retry timing.

Suggested fix
+		currentBackoff := entry.backoff
+		attempt := int(entry.failedAttempts) + 1
 		// update backoff for future attempts in case of failure
-		entry.backoff = entry.backoff * 2
+		entry.backoff = currentBackoff * 2
 		if entry.backoff > maxBackoff {
 			entry.backoff = maxBackoff
 		}
@@
-		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+		klog.Infof("%s: retry object setup: %s %s (attempt %d, current backoff %s, next backoff %s)",
+			r.name, r.ResourceHandler.ObjType, objKey, attempt, currentBackoff, entry.backoff)


openshift-ci Bot commented Apr 30, 2026

@afcollins: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-upgrade | 7e4ae70 | link | true | /test e2e-aws-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
ci/prow/e2e-aws-ovn-rhcos10-techpreview | 7e4ae70 | link | false | /test e2e-aws-ovn-rhcos10-techpreview
ci/prow/security | 7e4ae70 | link | false | /test security
ci/prow/e2e-aws-ovn-upgrade-local-gateway | 7e4ae70 | link | true | /test e2e-aws-ovn-upgrade-local-gateway
ci/prow/e2e-azure-ovn-upgrade | 7e4ae70 | link | true | /test e2e-azure-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp
ci/prow/e2e-metal-ipi-ovn-ipv6 | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-ipv6
ci/prow/qe-perfscale-payload-control-plane-6nodes | 7e4ae70 | link | true | /test qe-perfscale-payload-control-plane-6nodes
ci/prow/e2e-aws-ovn-serial | 7e4ae70 | link | true | /test e2e-aws-ovn-serial
ci/prow/e2e-metal-ipi-ovn-dualstack | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
