mgmt-agent: add informer-driven data dumping #5361
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: stevekuznetsov
We have a large number of GVRs on the management cluster which we do not directly create or control from the service cluster, but which are nevertheless crucial for a deep understanding of errors in HostedCluster state. We can start a set of informers for API groups of interest and dump state as we see it.

This approach is naive, but should be functional. Areas for improvement:

- backoff/rate-limiting: a high churn rate on status may cause large log volumes; we can try to alleviate this with a logging backoff policy that could give us guarantees within one pod restart period
- resource consumption: in order not to put a high load on the management cluster's API server, we use LIST+WATCH and keep an in-memory store of all the resources we care about; if we find the data set is larger than we'd like, we can revisit this

Signed-off-by: Steve Kuznetsov <stekuznetsov@microsoft.com>
Force-pushed from f209abf to 7d4e754.
Pull request overview
Adds an informer-driven “resource watcher” to the mgmt-agent so it can continuously LIST+WATCH selected API groups on the management cluster and dump observed objects/events to logs for deeper HostedCluster debugging.
Changes:
- Introduces `ResourceWatcher` that discovers GVRs by API group suffix and logs Add/Update/Delete events from dynamic informers.
- Wires the watcher into the mgmt-agent controller process under leader election alongside the existing SwiftNIC controller.
- Updates Helm deployment to emit JSON-formatted klog output and expands RBAC to allow list/watch broadly.
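To make the discovery step concrete, here is a minimal sketch of group-suffix matching combined with list/watch verb detection. It uses only the standard library; `apiResource` is a hypothetical stand-in for the discovery metadata, the function names are illustrative, and the suffix `openshift.io` is an assumed example rather than the PR's actual configuration. Whether the real matcher enforces a dot boundary, as this one does, is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// apiResource is a hypothetical, trimmed-down stand-in for the discovery
// metadata of one resource: its API group, resource name, and verbs.
type apiResource struct {
	Group    string
	Resource string
	Verbs    []string
}

// matchesGroupSuffix reports whether group equals one of the suffixes or
// ends with "."+suffix, so "notopenshift.io" does not match "openshift.io".
func matchesGroupSuffix(group string, suffixes []string) bool {
	for _, s := range suffixes {
		if group == s || strings.HasSuffix(group, "."+s) {
			return true
		}
	}
	return false
}

// supportsListWatch reports whether both verbs an informer needs are present.
func supportsListWatch(verbs []string) bool {
	var list, watch bool
	for _, v := range verbs {
		switch v {
		case "list":
			list = true
		case "watch":
			watch = true
		}
	}
	return list && watch
}

// selectResources keeps resources that an informer could be started for.
func selectResources(all []apiResource, suffixes []string) []apiResource {
	var out []apiResource
	for _, r := range all {
		// Skip subresources such as "hostedclusters/status".
		if strings.Contains(r.Resource, "/") {
			continue
		}
		if matchesGroupSuffix(r.Group, suffixes) && supportsListWatch(r.Verbs) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	discovered := []apiResource{
		{Group: "hypershift.openshift.io", Resource: "hostedclusters", Verbs: []string{"get", "list", "watch"}},
		{Group: "hypershift.openshift.io", Resource: "hostedclusters/status", Verbs: []string{"get", "update"}},
		{Group: "apps", Resource: "deployments", Verbs: []string{"get", "list", "watch"}},
		{Group: "example.openshift.io", Resource: "writeonly", Verbs: []string{"create"}},
	}
	for _, r := range selectResources(discovered, []string{"openshift.io"}) {
		fmt.Printf("%s/%s\n", r.Group, r.Resource)
	}
}
```

Filtering on verbs before starting informers avoids creating watches that the API server would reject for resources that cannot be listed or watched.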
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mgmt-agent/pkg/controller/resourcewatcher.go | New dynamic informer-based watcher that discovers and logs events for matching API groups. |
| mgmt-agent/pkg/controller/resourcewatcher_test.go | Unit tests for group-suffix matching, list/watch verb detection, and discovery filtering. |
| mgmt-agent/cmd/options.go | Creates dynamic+discovery clients, instantiates ResourceWatcher, and runs it under leader election. |
| mgmt-agent/deploy/templates/deployment.yaml | Switches klog output to JSON format in the container args. |
| mgmt-agent/deploy/templates/clusterrole.yaml | Grants additional list/watch permissions intended to support the new watcher. |

    		return nil
    	}

    	factory := dynamicinformer.NewDynamicSharedInformerFactory(w.dynamicClient, 10*time.Hour)

    	factory.Start(ctx.Done())
    	factory.WaitForCacheSync(ctx.Done())

    	logger.Info("Resource watcher informers synced and running")

    	logger.Info("resource event",
    		"event", eventType,
    		"gvr", gvr.String(),
    		"namespace", u.GetNamespace(),
    		"name", u.GetName(),
    		"object", u.Object,
    	)

    - apiGroups:
      - "*"
      resources:
      - "*"
      verbs:
      - list
      - watch

Do we care? Maybe we exclude core/v1?