
WIP Add retry diagnostic logging and enable pprof on all metrics servers #3166

Open

afcollins wants to merge 2 commits into openshift:master from afcollins:perfscale-retry-logging

Conversation


@afcollins afcollins commented Apr 29, 2026

📑 Description

We are sometimes seeing a lot of retries. I need to run some tests from a PR to see whether I can gather more information.

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

  • Refactor

    • Simplified OVN metrics server profiling and endpoint sharing behavior to make metrics profiling handling more predictable.
  • Chores

    • Improved retry-queue observability: clearer logs when failures are first enqueued, periodic warnings on repeated failures with backoff info, summaries of sweep progress and completion, and reduced noise for deletes of terminal-state resources.

Signed-off-by: Andrew Collins <ancollin@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026

coderabbitai Bot commented Apr 29, 2026

Walkthrough

ovnkube moves pprof enabling to metrics.MetricServerOptions (set from config.Metrics.EnablePprof); combineMetricsEndpoints no longer mutates opts.EnablePprof. The retry subsystem adds first-failure and periodic warning logs (every 5th attempt with backoff), sweep start/completion logs, and lowers delete-on-terminal-state log verbosity to V(5).

Changes

  • Metrics Server Configuration (go-controller/cmd/ovnkube/ovnkube.go)
    Assigns EnablePprof directly from config.Metrics.EnablePprof into metrics.MetricServerOptions. combineMetricsEndpoints(runMode) no longer mutates opts.EnablePprof; it only switches the Prometheus registry to prometheus.DefaultRegisterer when sharing endpoints.
  • Retry-Queue Observability (go-controller/pkg/retry/obj_retry.go)
    Adds structured logging: logs the initial failure enqueue at first increment, emits warning logs every 5th failed attempt including the current backoff, appends the attempt number and backoff to per-object setup logs, logs pending counts at sweep start and elapsed time with remaining entries at sweep completion, and downgrades delete-on-terminal-state messages to V(5).
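The logging cadence in the retry-queue changes can be sketched as a toy model. The names here (retryEntry, recordFailure, maxAttempts) are assumptions for illustration, not the actual obj_retry.go code: the counter saturates at 255 to avoid overflow on infinite retries, an informational message fires on the first failure, and a warning fires on every fifth.

```go
package main

import "fmt"

// maxAttempts caps the counter so infinite retries do not overflow it.
const maxAttempts = 255

type retryEntry struct {
	failedAttempts uint8
	backoff        int // seconds; doubled elsewhere up to a cap
}

// recordFailure bumps the counter and returns a tag describing what a
// real implementation would log at this point.
func recordFailure(e *retryEntry) string {
	if e.failedAttempts >= maxAttempts {
		return "saturated" // counter stays at 255; stop warning
	}
	e.failedAttempts++
	switch {
	case e.failedAttempts == 1:
		return "first-failure" // Infof: first retry failure queued
	case e.failedAttempts%5 == 0:
		return "warn" // Warningf: failed N times, current backoff
	default:
		return "quiet"
	}
}

func main() {
	e := &retryEntry{backoff: 1}
	for i := 0; i < 6; i++ {
		fmt.Println(recordFailure(e))
	}
}
```

With six consecutive failures this prints first-failure, four quiet/warn entries, and then quiet again, matching the "log once, then warn every fifth" cadence.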

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped in code where metrics hide,

set pprof gently, no more tide.
Retries counted, warnings hum,
sweeps report when errands done.
Small patch, big-heart thump — yum!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately reflects the main changes, adding retry diagnostic logging and enabling pprof on metrics servers, both evident in the file modifications.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: afcollins
Once this PR has been reviewed and has the lgtm label, please assign jcaamano for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
go-controller/pkg/retry/obj_retry.go (2)

364-364: Retry attempt counter is off by one and ambiguous in logs.

Line 364 uses entry.failedAttempts as the attempt number, which can emit attempt 0 on first retry-loop execution. Consider logging entry.failedAttempts+1 (or rename to “failed attempts”) for clearer diagnostics.

Suggested fix
-		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+		attempt := int(entry.failedAttempts) + 1
+		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, attempt, entry.backoff)

472-472: Consider lowering sweep logs from info to verbose level.

These two logs run every sweep while entries exist and can create significant log volume in busy clusters. Moving them to a verbose level keeps diagnostics available without default noise.

Suggested fix
-	klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+	klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
...
-	klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+	klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))

Also applies to: 485-487


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: acb791d5-f2f0-4e58-ade0-ce7bcf877c4b

📥 Commits

Reviewing files that changed from the base of the PR and between ab5bb67 and 940c64f.

📒 Files selected for processing (2)
  • go-controller/cmd/ovnkube/ovnkube.go
  • go-controller/pkg/retry/obj_retry.go

Comment on lines +266 to +270
	if entry.failedAttempts == 1 {
		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
	} else if entry.failedAttempts%5 == 0 {
		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
	}

⚠️ Potential issue | 🟡 Minor

First-failure backoff message can report the wrong value.

Line 267 logs a hardcoded initialBackoff, but the effective backoff state may differ (e.g., entries inserted with noBackoff). Logging entry.backoff is more accurate for diagnostics.

Suggested fix
-	if entry.failedAttempts == 1 {
-		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
+	if entry.failedAttempts == 1 {
+		klog.Infof("%s: %s first retry failure queued, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.backoff)
 	} else if entry.failedAttempts%5 == 0 {
 		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
 	}

@afcollins afcollins force-pushed the perfscale-retry-logging branch from 940c64f to 94e9e4d on April 29, 2026 17:56
@afcollins (Author)

/test pull-ci-openshift-ovn-kubernetes-master-qe-perfscale-payload-control-plane-6nodes

@afcollins (Author)

/test qe-perfscale-payload-control-plane-6nodes

Signed-off-by: Andrew Collins <ancollin@redhat.com>
@afcollins afcollins force-pushed the perfscale-retry-logging branch from 94e9e4d to 7e4ae70 on April 29, 2026 23:53

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
go-controller/pkg/retry/obj_retry.go (1)

261-270: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent warning-log spam after failed-attempt counter saturation.

Once failedAttempts reaches 255, the %5==0 condition remains true forever, so infinite-retry entries can warn on every failure cycle.

Suggested fix
func (r *RetryFramework) increaseFailedAttemptsCounter(entry *retryObjEntry) {
+	prev := entry.failedAttempts
 	// avoid overflowing the counter for infinite retries
 	if entry.failedAttempts < 255 {
 		entry.failedAttempts++
 	}
 	if entry.failedAttempts == 1 {
 		klog.Infof("%s: %s first retry failure queued, backoff will start at %s", r.name, r.ResourceHandler.ObjType, initialBackoff)
-	} else if entry.failedAttempts%5 == 0 {
+	} else if entry.failedAttempts%5 == 0 && entry.failedAttempts != prev {
 		klog.Warningf("%s: %s has now failed %d times, current backoff %s", r.name, r.ResourceHandler.ObjType, entry.failedAttempts, entry.backoff)
 	}
}
🧹 Nitpick comments (1)
go-controller/pkg/retry/obj_retry.go (1)

472-472: ⚡ Quick win

Consider moving sweep-level diagnostics behind verbosity gating.

These periodic sweep logs at info level can become noisy under sustained retries; V(4/5) would preserve diagnostics with less default log pressure.

Suggested refactor
-	klog.Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
+	klog.V(4).Infof("%s: retry cache has %d %s entries pending retry", r.name, len(entriesKeys), r.ResourceHandler.ObjType)
@@
-	klog.Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))
+	klog.V(4).Infof("%s: retry sweep for %s completed in %v, %d entries remaining in cache", r.name, r.ResourceHandler.ObjType, elapsed, len(remainingKeys))

Also applies to: 485-487

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e36603a-4b30-49a9-b2d8-28d9a38a1f8b

📥 Commits

Reviewing files that changed from the base of the PR and between 94e9e4d and 7e4ae70.

📒 Files selected for processing (1)
  • go-controller/pkg/retry/obj_retry.go

 	}

-	klog.Infof("%s: retry object setup: %s %s", r.name, r.ResourceHandler.ObjType, objKey)
+	klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Retry diagnostic log reports ambiguous attempt/backoff values.

At Line 364, entry.backoff is already updated to the next value, so the log can misrepresent the current retry timing.

Suggested fix
+		currentBackoff := entry.backoff
+		attempt := int(entry.failedAttempts) + 1
 		// update backoff for future attempts in case of failure
-		entry.backoff = entry.backoff * 2
+		entry.backoff = currentBackoff * 2
 		if entry.backoff > maxBackoff {
 			entry.backoff = maxBackoff
 		}
@@
-		klog.Infof("%s: retry object setup: %s %s (attempt %d, backoff %s)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts, entry.backoff)
+		klog.Infof("%s: retry object setup: %s %s (attempt %d, current backoff %s, next backoff %s)",
+			r.name, r.ResourceHandler.ObjType, objKey, attempt, currentBackoff, entry.backoff)


openshift-ci Bot commented Apr 30, 2026

@afcollins: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-upgrade | 7e4ae70 | link | true | /test e2e-aws-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
ci/prow/e2e-aws-ovn-rhcos10-techpreview | 7e4ae70 | link | false | /test e2e-aws-ovn-rhcos10-techpreview
ci/prow/security | 7e4ae70 | link | false | /test security
ci/prow/e2e-aws-ovn-upgrade-local-gateway | 7e4ae70 | link | true | /test e2e-aws-ovn-upgrade-local-gateway
ci/prow/e2e-azure-ovn-upgrade | 7e4ae70 | link | true | /test e2e-azure-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack-bgp
ci/prow/e2e-metal-ipi-ovn-ipv6 | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-ipv6
ci/prow/qe-perfscale-payload-control-plane-6nodes | 7e4ae70 | link | true | /test qe-perfscale-payload-control-plane-6nodes
ci/prow/e2e-aws-ovn-serial | 7e4ae70 | link | true | /test e2e-aws-ovn-serial
ci/prow/e2e-metal-ipi-ovn-dualstack | 7e4ae70 | link | true | /test e2e-metal-ipi-ovn-dualstack

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
