CA: Persist and expose NodeInfos computed by TemplateNodeInfoProvider #8882

@towca

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

The TemplateNodeInfoProvider interface is responsible for computing template NodeInfos for every autoscaled NodeGroup for the purpose of scale-up simulations. The main implementation for the interface is MixedTemplateNodeInfoProvider, which tries to create the templates based on sanitized real Nodes, and falls back to NodeGroup.TemplateNodeInfo() if there aren't any healthy Nodes to sanitize in a given NodeGroup.

The provider is called near the beginning of StaticAutoscaler.RunOnce(). The map of NodeInfo templates it produces is just a local variable that gets passed to various pieces of RunOnce() logic, most notably scale-up.

This has the following problems:

  • Most CA processors don't have access to the NodeInfo map, even though some processor implementations need NodeInfo templates for their logic. One example is the DRA readiness processor, which needs the template to know which DRA Devices the readiness logic should wait for. Right now, this and similar processors have to fall back to NodeGroup.TemplateNodeInfo() as the template. That is much less reliable than sanitizing a real Node - each part of the template has to be crafted from scratch based on hardcoded logic in CA. If some part of the Node (e.g. a new label, or a DRA Device) isn't correctly predicted by NodeGroup.TemplateNodeInfo(), these processor implementations stop working correctly - even if there's at least one Node in the NodeGroup that could have been sanitized instead.
  • Some of the processors that need the template map (e.g. the DRA processor mentioned above) are executed before TemplateNodeInfoProvider computes the map in a given CA loop. This ordering is intentional and necessary - MixedTemplateNodeInfoProvider uses Node readiness to decide whether a Node is a good candidate for being sanitized into a template, so the DRA processor has to run earlier in order to adjust the readiness correctly.

Describe the solution you'd like.:

We should introduce a new component responsible for storing, updating, and exposing template NodeInfos computed by TemplateNodeInfoProvider. Such a component should:

  • Embed TemplateNodeInfoProvider, use it for computing template NodeInfos, and cache the results internally until the next recomputation.
  • Recompute the cached templates every CA loop, in the same place where they are computed now.
  • Expose both the full map of computed templates, and a way to get a template for a single NodeGroup.
  • Be accessible from CA processors. This is probably best achieved by placing it in AutoscalingContext.
  • Be usable in any part of CA logic - including the parts of the main CA loop that run before the templates are recomputed, and fully separate goroutines. If the templates are accessed before the recomputation, the component should return the previously computed ones. The component should be thread-safe.

Additional context.:

#8881 can be trivially solved after this is completed.

Labels

  • area/cluster-autoscaler
  • area/core-autoscaler: Denotes an issue that is related to the core autoscaler and is not specific to any provider.
  • kind/feature: Categorizes issue or PR as related to a new feature.
