mgmt-agent: add informer-driven data dumping#5361

Open
stevekuznetsov wants to merge 1 commit into
Azure:mainfrom
stevekuznetsov:skuznets/mgmt-agent-data-dumper

@stevekuznetsov
Contributor

The management cluster hosts a large number of GVRs that we do not directly create or control from the service cluster, but that are nevertheless crucial for a deep understanding of errors in HostedCluster state. We can start a set of informers for the API groups of interest and dump state as we observe it.

This approach is naive, but should be functional. Areas for improvement:

  • backoff/rate-limiting: high churn rate on status may cause large log volumes; we can try to alleviate this with a logging backoff policy that could give us guarantees within one pod restart period
  • resource consumption: in order to not put a high load on the management cluster's API server, we use LIST+WATCH and keep an in-memory store of all the resources we care about; if we find the data set is larger than we'd like, we can revisit this

Copilot AI review requested due to automatic review settings May 21, 2026 15:58
@openshift-ci openshift-ci Bot requested review from deads2k and roivaz May 21, 2026 15:58
@openshift-ci

openshift-ci Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The management cluster hosts a large number of GVRs that we do not
directly create or control from the service cluster, but that are
nevertheless crucial for a deep understanding of errors in HostedCluster
state. We can start a set of informers for the API groups of interest
and dump state as we observe it.

This approach is naive, but should be functional. Areas for improvement:
- backoff/rate-limiting: high churn rate on status may cause large log
  volumes; we can try to alleviate this with a logging backoff policy
  that could give us guarantees within one pod restart period
- resource consumption: in order to not put a high load on the
  management cluster's API server, we use LIST+WATCH and keep an
  in-memory store of all the resources we care about; if we find the
  data set is larger than we'd like, we can revisit this

Signed-off-by: Steve Kuznetsov <stekuznetsov@microsoft.com>
@stevekuznetsov stevekuznetsov force-pushed the skuznets/mgmt-agent-data-dumper branch from f209abf to 7d4e754 Compare May 21, 2026 15:59
Contributor

Copilot AI left a comment

Pull request overview

Adds an informer-driven “resource watcher” to the mgmt-agent so it can continuously LIST+WATCH selected API groups on the management cluster and dump observed objects/events to logs for deeper HostedCluster debugging.

Changes:

  • Introduces ResourceWatcher that discovers GVRs by API group suffix and logs Add/Update/Delete events from dynamic informers.
  • Wires the watcher into the mgmt-agent controller process under leader election alongside the existing SwiftNIC controller.
  • Updates Helm deployment to emit JSON-formatted klog output and expands RBAC to allow list/watch broadly.
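
As a rough illustration of the discovery filtering described above, the sketch below shows one plausible way to match API groups by suffix and to check that a resource supports both list and watch. The function names and the exact matching rule (whole group, or a suffix on a dot boundary) are assumptions and may differ from what resourcewatcher.go actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesGroupSuffix reports whether group equals one of the suffixes or ends
// with "." + suffix. Matching on the dot boundary is an assumption; it avoids
// treating "notexample.io" as part of "example.io".
func matchesGroupSuffix(group string, suffixes []string) bool {
	for _, s := range suffixes {
		if group == s || strings.HasSuffix(group, "."+s) {
			return true
		}
	}
	return false
}

// supportsListWatch reports whether a resource's advertised verbs include both
// "list" and "watch", which an informer needs.
func supportsListWatch(verbs []string) bool {
	var list, watch bool
	for _, v := range verbs {
		switch v {
		case "list":
			list = true
		case "watch":
			watch = true
		}
	}
	return list && watch
}

func main() {
	suffixes := []string{"example.io"} // hypothetical group suffix of interest

	fmt.Println(matchesGroupSuffix("foo.example.io", suffixes)) // true
	fmt.Println(matchesGroupSuffix("notexample.io", suffixes))  // false
	fmt.Println(supportsListWatch([]string{"get", "list"}))     // false
	fmt.Println(supportsListWatch([]string{"list", "watch"}))   // true
}
```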

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| mgmt-agent/pkg/controller/resourcewatcher.go | New dynamic informer-based watcher that discovers and logs events for matching API groups. |
| mgmt-agent/pkg/controller/resourcewatcher_test.go | Unit tests for group-suffix matching, list/watch verb detection, and discovery filtering. |
| mgmt-agent/cmd/options.go | Creates dynamic+discovery clients, instantiates ResourceWatcher, and runs it under leader election. |
| mgmt-agent/deploy/templates/deployment.yaml | Switches klog output to JSON format in the container args. |
| mgmt-agent/deploy/templates/clusterrole.yaml | Grants additional list/watch permissions intended to support the new watcher. |

return nil
}

factory := dynamicinformer.NewDynamicSharedInformerFactory(w.dynamicClient, 10*time.Hour)
Comment on lines +105 to +108
factory.Start(ctx.Done())
factory.WaitForCacheSync(ctx.Done())

logger.Info("Resource watcher informers synced and running")
Comment on lines +188 to +194
logger.Info("resource event",
	"event", eventType,
	"gvr", gvr.String(),
	"namespace", u.GetNamespace(),
	"name", u.GetName(),
	"object", u.Object,
)
Comment on lines +23 to +29
- apiGroups:
  - "*"
  resources:
  - "*"
  verbs:
  - list
  - watch
Contributor Author

Do we care? Maybe we exclude core/v1?

@openshift-ci

openshift-ci Bot commented May 21, 2026

@stevekuznetsov: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | 7d4e754 | link | true | /test verify |
| ci/prow/test-unit | 7d4e754 | link | true | /test test-unit |
| ci/prow/config-change-detection | 7d4e754 | link | true | /test config-change-detection |
| ci/prow/e2e-parallel | 7d4e754 | link | true | /test e2e-parallel |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
