CA: introduce internal NodeInfo #7447
Conversation
@@ -42,7 +42,7 @@ func NewNodes() *Nodes {

	// NodeInfoGetter is anything that can return NodeInfo object by name.
	type NodeInfoGetter interface {
		Get(name string) (*schedulerframework.NodeInfo, error)
Would it be safer to keep the legacy Get method in here given that this is an exported, public interface?
The idea is for all of CA codebase to use the new types instead of the scheduler ones, unless it's absolutely necessary (e.g. when actually interacting with the scheduler framework). So most NodeInfos().Get() calls have to be translated to GetNodeInfo() - including the ones made via this interface. There is no reason to leave the scheduler version here IMO.
This interface looks like the classic "local interface with a relevant subset of methods from some other interface" pattern, but it was exported probably by mistake. As expected, the interface is only used locally, so I changed it to be unexported, WDYT?
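For illustration, the unexported variant being described could look roughly like this (a sketch only - the exact method set may differ from the PR):

```go
// nodeInfoGetter is anything that can return an internal NodeInfo by name.
// Kept unexported on purpose: it's a local "subset of methods" interface,
// not a public API surface.
type nodeInfoGetter interface {
	GetNodeInfo(name string) (*framework.NodeInfo, error)
}
```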
Right, I think this is defensible. Maybe worth calling it out in release notes.
I think we might not be on the same page here. Why would changing an arbitrary exported interface in CA code be release-note-worthy? There's a policy around breaking changes in the CloudProvider/NodeGroup interfaces specifically, but others should be fair game to change as needed, right?
It could be argued that the NodeGroup.TemplateNodeInfo() signature change in this PR should be highlighted in release notes (or actually go through a deprecation?) - I'm not sure about that part. All in-tree cloud providers are automatically adapted so the change doesn't break them, but it will break out-of-tree cloud providers. Not sure if that's also covered under the policy though; I think we've done similar "mechanical" changes to all in-tree cloud providers without going through the deprecation process.
@MaciekPytel WDYT?
I'm just speaking to the fact that it's theoretically possible for an external dependency (outside of the local CA codebase) to exist in the wild that reuses NodeInfoGetter, and those downstream use cases will need to be adapted to these changes.
Not a big deal either way. If that doesn't rise to the level of being included as a release note, that's fine.
I don't think we promise to keep any sort of backward compatibility when it comes to internal interfaces in CA, besides the CloudProvider interface and processors (we did go through deprecation and multiple releases of supporting NodeInfoProcessor before dropping it).
I'm mostly basing my opinion on kubernetes/kubernetes - I don't think there is any promise of golang interface level backward compatibility there. We certainly need to modify predicate checker pretty much every minor just to handle backward incompatible refactors in scheduler framework. So far our general practice when it comes to deprecations, supported versions and similar was to generally follow kubernetes.
All that being said - we all know there are CA forks out there. It's no secret we have one and I also assume it's no secret that so does Azure (I'm pretty sure there are issues in this repo that mention upstreaming from that fork). With processors/ CA is pretty much asking to be forked and customized. If we think this interface is somehow particularly useful for fork maintainers (e.g. based on experience of anyone in this thread) it could be a reason to leave it - perhaps with a comment discouraging new use?
My vote would be slightly in favor of not worrying about this given the historical precedent in the k/k ecosystem of projects.
Forks are probably O.K. because they may already have implemented their own opinionated NodeInfoGetter anyways. :)
@@ -43,6 +44,16 @@ type ClusterSnapshot interface {
	// IsPVCUsedByPods returns if the pvc is used by any pod, key = <namespace>/<pvc_name>
	IsPVCUsedByPods(key string) bool

	// AddNodeInfo adds the given NodeInfo to the snapshot. The Node and the Pods are added, as well as
	// any DRA objects passed along them.
Rather than refer to "DRA objects" generally, should we be more explicit and refer to "ResourceClaims" and "ResourceSlices"?
I just tried to make the comment generic enough so it doesn't get outdated if there are more DRA objects to track in the future.
	// Pods should always match the embedded NodeInfo.Pods. This way the schedulerframework.PodInfo can be
	// accessed via .NodeInfo.Pods[i], and the internal one via .Pods[i]. CA code normally only needs to
	// access the Pod, so iterating over the internal version should be enough.
	Pods []*PodInfo
I understand why you need Pods with this extended information, but it's a bit scary that you now have 2 sources of truth for the list of pods on a node (NodeInfo.Pods and NodeInfo.NodeInfo.Pods). It seems like this is asking for the two to somehow get out of sync.
I'm not yet sure what the right solution is. Intuitively the correct way to handle this would be to have an internal representation with a single list of pods and convert on-demand, but I am worried about performance implications of that.
Rather than embed a new, local PodInfo type, can we simply embed NeededResourceClaims []*resourceapi.ResourceClaim directly? And follow ResourceClaim.Status.ReservedFor to get the associated pod references?
@MaciekPytel fully agreed on this being scary, but the alternatives that I see are:
1. As you mention, changing the ClusterSnapshot implementations to operate only on the internal NodeInfos as the single source of truth, and converting them to schedulerframework.NodeInfo on-demand whenever scheduler code calls NodeInfos().List(). I went with this in the first prototype, but then decided against it for 2 reasons:
   - As you point out, it's unclear how it would affect performance. This would have to be measured, but I suspect it might not be insignificant given that some things are precomputed during schedulerframework.NodeInfo.AddPod().
   - It added unnecessary complexity to ClusterSnapshot implementations, and was annoyingly bug-prone. This is less important IMO, as at least the problems were limited to simulator/snapshot.
2. We could remove some redundancy by having the pod list only inside the wrapped schedulerframework.NodeInfo, and only keeping the additional information in PodInfo. So PodInfo would only contain NeededResourceSlices, and NodeInfo would contain schedulerframework.NodeInfo and a map from pod UIDs to PodInfo. We could also have a Pods() method on the internal NodeInfo that would merge the two data sources together for ease-of-use. This should be less prone to getting out-of-sync, but at a cost of more complexity. Do you think it's enough/worth it?
FWIW I tried to assert in the unit test that things stay in sync during normal usage patterns that we have in the codebase right now.
@jackfrancis Unfortunately we can't rely on any information inside a ResourceClaim to identify the pod(s) that need it. ReservedFor is only filled out if the claim is actually allocated for the pod. But I think the spirit of your suggestion is similar to option 2 that I describe above, right?
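To make option 2 concrete, here is a rough sketch of what the wrapper could look like (identifiers, package layout and the DRA API version are assumptions, not the final code; the pod-level extras are shown as NeededResourceClaims per the earlier suggestion):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1alpha3" // DRA API version is an assumption
	apitypes "k8s.io/apimachinery/pkg/types"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo pairs a Pod with the CA-only extras it needs (here: DRA claims).
type PodInfo struct {
	Pod                  *apiv1.Pod
	NeededResourceClaims []*resourceapi.ResourceClaim
}

// NodeInfo wraps the scheduler framework NodeInfo. The wrapped object stays
// the single source of truth for the pod list; only the extras live in a
// side map keyed by pod UID.
type NodeInfo struct {
	schedNodeInfo *schedulerframework.NodeInfo
	podExtras     map[apitypes.UID][]*resourceapi.ResourceClaim
}

// Pods merges the two data sources on demand, so callers never need to know
// that the extras are stored separately from the wrapped NodeInfo.
func (n *NodeInfo) Pods() []*PodInfo {
	result := make([]*PodInfo, 0, len(n.schedNodeInfo.Pods))
	for _, pi := range n.schedNodeInfo.Pods {
		result = append(result, &PodInfo{
			Pod:                  pi.Pod,
			NeededResourceClaims: n.podExtras[pi.Pod.UID],
		})
	}
	return result
}
```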
IMO that additional complexity is preferable to storing the same data in two places.
Agree with @jackfrancis, duplicating data seems like the worst option. I think both approaches you listed could work.
For approach 1 - you could opportunistically cache schedulerframework.NodeInfo to minimize performance impact (i.e. cache sf.NI whenever you build it, invalidate on any change to framework.NI). That still involves duplicated lists to an extent, but I think it's very different in that there is a clearly defined source of truth. It probably doesn't solve all the performance problems, as you'd be building sf.NI each time and not updating it, and IIRC you're correct that AddPod() was indeed optimized to efficiently update precomputes (for antiaffinity probably?). I think this may be workable, but I don't love this option.
Approach 2 seems better to me. As long as we can come up with a good interface that encapsulates the complexity (e.g. the Pods() method you described), I think it's a good option. The main thing is to make sure users of framework.NodeInfo won't have to know the information is stored separately (I'd go as far as making fields private and only allowing access via methods outside of simulator).
I think the best approach would be to extend NodeInfo in scheduler framework (i.e. make change to kubernetes/kubernetes).
I think the fact that scheduler snapshots nodes and pods, but just uses in-plugin informers for everything else is inconsistent, error prone and all around terrible. Including DRA objects in scheduler SharedLister seems like a step in the right direction from perspective of just the scheduler. And it has the added benefit of completely removing the need for any custom wrappers in CA together with all the complexity they bring.
Unfortunately, I'm guessing we're out of time for such a change in the 1.32 scheduler. Even more unfortunate is that once we implement the wrappers, I can't see anyone going back and cleaning this up for 1.33 :(
So, to sum up - I'm +1 for approach 2.
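For reference, the opportunistic caching mentioned for approach 1 could look roughly like this (hypothetical names, reusing the imports from the sketch above; locking and finer-grained invalidation are ignored):

```go
// cachingNodeInfo keeps the internal pod list as the source of truth and
// rebuilds the scheduler view lazily, invalidating it on every mutation.
type cachingNodeInfo struct {
	node                *apiv1.Node
	pods                []*PodInfo
	cachedSchedNodeInfo *schedulerframework.NodeInfo
}

func (n *cachingNodeInfo) AddPod(pod *PodInfo) {
	n.pods = append(n.pods, pod)
	n.cachedSchedNodeInfo = nil // any change invalidates the cached conversion
}

// ToScheduler returns the cached schedulerframework.NodeInfo, rebuilding it
// from scratch if any mutation happened since the last call.
func (n *cachingNodeInfo) ToScheduler() *schedulerframework.NodeInfo {
	if n.cachedSchedNodeInfo == nil {
		schedPods := make([]*apiv1.Pod, 0, len(n.pods))
		for _, p := range n.pods {
			schedPods = append(schedPods, p.Pod)
		}
		sni := schedulerframework.NewNodeInfo(schedPods...)
		sni.SetNode(n.node)
		n.cachedSchedNodeInfo = sni
	}
	return n.cachedSchedNodeInfo
}
```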
Is there a case to be made that extending NodeInfo in scheduler could have general reuse value? If so, we could probably find someone to do that work (I'd be happy to give that a shot).
This is just to say IMO approach 2 is the best overall, and we may still have a chance at an even better long term outcome if we have the appetite.
@MaciekPytel Done, went with approach 2. Also made the fields private as you suggested. WDYT?
IMO having an internal NodeInfo type wrapping the schedulerframework one is useful for CA outside of the DRA logic. Not all info that CA would want to attach to a Node is relevant to scheduler, and putting it there just for CA would be confusing. E.g. now that we have this type, we could keep information like whether a Node is upcoming properly in there instead of hacking it inside Node annotations. Same for e.g. the boot-disk info that GCE cloud provider passes through annotations. The added complexity, on the other hand, should be hidden from most of the codebase.
Separately, I agree that the ideal solution would be for the scheduler to make all API objects that plugins might need a part of its Snapshot - not just Nodes and Pods. I explored that solution, but came to the conclusion that it'd be a massive change, and maybe it wouldn't even solve the problem fully:
- The Node/Pod cache that feeds the scheduler Snapshot is quite complex. If we wanted to correlate the DRA objects to Nodes/Pods on the fly during informer updates, we'd need to be very careful not to bloat the complexity further.
- Not all DRA objects are "bound" to a Node or a Pod. So we'd need to expand the scheduler Snapshot beyond the current NodeInfo/PodInfo anyway to track those. At this point, maybe it makes more sense not to correlate the objects on the fly, and just essentially have an additional snapshot of DRA objects alongside NodeInfos. In which case we'd need extra fields in NodeInfo just for CA anyway.
- The plugins would still need access to live informers to handle queuing hints. Not a problem for CA since CA doesn't run the queuing logic, but it's a readability challenge on the scheduler side.
@jackfrancis IMO extending the NodeInfo on scheduler side without doing the huge scheduler effort described above doesn't make sense, as the scheduler plugin wouldn't use these fields anyway - they would only be used by CA.
	return framework.WrapSchedulerNodeInfos(schedNodeInfos), nil
}

func (snapshot *DeltaClusterSnapshot) AddNodeInfo(nodeInfo *framework.NodeInfo) error {
Why have this method at all (here and elsewhere) if you're discarding all the extra data in framework.NodeInfo and just storing the scheduler representation? I'd expect the result of:
snapshot.AddNodeInfo(x)
y = snapshot.GetNodeInfo(x.Node.Name)
To be that x and y are the same. I think anyone would instinctively assume the same, but that's not the case here and that strikes me as confusing and ultimately error prone when someone makes a wrong assumption. If you only want to store scheduler stuff, I'd rather this method didn't exist at all and everyone would just keep using AddNodeWithPods.
Even if that requires a tiny bit of boilerplate to unpack NodeInfo, it is much clearer what's happening here. I know ResourceClaims/Slices will not be stored if I can't even pass them. If I do pass them to snapshot, I expect it will store them as well.
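As an illustration of the round-trip property being asked for here, a test-style sketch (helper names and the exact GetNodeInfo signature are assumptions; standard testing/reflect imports omitted):

```go
// assertNodeInfoRoundTrip adds a NodeInfo to the snapshot and checks that the
// same pods come back from GetNodeInfo. It assumes GetNodeInfo returns
// (*framework.NodeInfo, error); the real signature may differ.
func assertNodeInfoRoundTrip(t *testing.T, snapshot ClusterSnapshot, nodeInfo *framework.NodeInfo) {
	t.Helper()
	if err := snapshot.AddNodeInfo(nodeInfo); err != nil {
		t.Fatalf("AddNodeInfo: %v", err)
	}
	got, err := snapshot.GetNodeInfo(nodeInfo.Node().Name)
	if err != nil {
		t.Fatalf("GetNodeInfo: %v", err)
	}
	want, have := podNames(nodeInfo.Pods()), podNames(got.Pods())
	if !reflect.DeepEqual(want, have) {
		t.Errorf("pods not preserved by snapshot: want %v, got %v", want, have)
	}
}

func podNames(pods []*framework.PodInfo) []string {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.Pod.Name)
	}
	return names
}
```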
The extra data will be stored and your example will hold; this just comes in later commits (ef9d420#diff-29296088fba933fa837b2efdce93b2423b19247d99f59f1d8ef1bc8d5e8c6915).
I can remove the DRA fields from the new types until then, but then we have wrapper types for seemingly no reason - either way isn't ideal in the interim, but the end state should be clear IMO.
utils/test is supposed to be usable in any CA package. Having a dependency on cloudprovider makes it unusable in any package that cloudprovider depends on because of import cycles. The cloudprovider import is only needed by GetGpuConfigFromNode, which is only used in info_test.go. This commit just moves GetGpuConfigFromNode there as an unexported function.
Methods to interact with the new internal types are added to ClusterSnapshot. Cluster Autoscaler code will be migrated to only use these methods and work on the internal types instead of directly using the framework types. The new types are designed so that they can be used as close to the framework types as possible, which should make the migration manageable. This allows easily adding additional data to the Nodes and Pods tracked in ClusterSnapshot, without having to change the scheduler framework. This will be needed to support DRA, as we'll need to track ResourceSlices and ResourceClaims.
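Based on the snippets visible in this PR, the added methods look roughly like the following (a sketch; the full method set and doc comments may differ):

```go
type ClusterSnapshot interface {
	// ...existing methods such as IsPVCUsedByPods...

	// AddNodeInfo adds the given internal NodeInfo to the snapshot, including
	// its Pods and any DRA objects attached to them.
	AddNodeInfo(nodeInfo *framework.NodeInfo) error

	// GetNodeInfo returns the internal NodeInfo for the node with the given name.
	GetNodeInfo(nodeName string) (*framework.NodeInfo, error)
}
```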
The new wrapper types should behave like the direct schedulerframework types for most purposes, so most of the migration is just changing the imported package. Constructors look a bit different, so they have to be adapted - mostly in test code. Accesses to the Pods field have to be changed to a method call. After this, the schedulerframework types are only used in the new wrappers, and in the parts of simulator/ that directly interact with the scheduler framework. The rest of CA codebase operates on the new wrapper types.
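A hedged before/after illustration of the mechanical part of this migration (identifiers are examples, not exact lines from the PR):

```go
// Before: building and iterating a schedulerframework.NodeInfo directly.
func buildOld(node *apiv1.Node, pods []*apiv1.Pod) *schedulerframework.NodeInfo {
	nodeInfo := schedulerframework.NewNodeInfo(pods...)
	nodeInfo.SetNode(node)
	return nodeInfo
}

// After: the internal wrapper type. The constructor takes the Node (and, later,
// DRA objects) up front, and pods are accessed via a method call instead of the
// Pods field.
func buildNew(node *apiv1.Node, podInfos []*framework.PodInfo) *framework.NodeInfo {
	return framework.NewNodeInfo(node, nil, podInfos...)
}

func podsOf(nodeInfo *framework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods()))
	for _, podInfo := range nodeInfo.Pods() {
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```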
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis, towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
}

	// Build a new node info.
-	newNodeInfo := schedulerframework.NewNodeInfo(newPods...)
-	newNodeInfo.SetNode(nodeInfo.Node().DeepCopy())
+	newNodeInfo := framework.NewNodeInfo(nodeInfo.Node().DeepCopy(), nil, newPods...)
Shouldn't this also copy DRA stuff?
Yes, definitely. This will come in a later commit though, after these NodeInfo utils are refactored a bit: #7479. I'm trying to have these first PRs as just refactors so that we can get them in easier, and then have all of the DRA logic in the last PRs.
}

-	sanitizedNodeInfo := schedulerframework.NewNodeInfo(SanitizePods(pods, sanitizedNode)...)
-	sanitizedNodeInfo.SetNode(sanitizedNode)
+	sanitizedNodeInfo := framework.NewNodeInfo(sanitizedNode, nil, SanitizePods(pods, sanitizedNode)...)
Shouldn't this also copy DRA stuff?
Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.
	sanitizedPod.Spec.NodeName = sanitizedNode.Name
-	sanitizedPods = append(sanitizedPods, sanitizedPod)
+	sanitizedPods = append(sanitizedPods, &framework.PodInfo{Pod: sanitizedPod})
Here I'm less sure, but once again - feels like we should also include info about needed claims in the copy, no?
Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.
@@ -44,7 +43,7 @@ func addExpectedPods(nodeInfo *schedulerframework.NodeInfo, scheduledPods []*api
	}
	// Add scheduled mirror and DS pods
	if pod_util.IsMirrorPod(pod) || pod_util.IsDaemonSetPod(pod) {
-		nodeInfo.AddPod(pod)
+		nodeInfo.AddPod(&framework.PodInfo{Pod: pod})
This type of code makes me nervous - is every place where we pass pod list going to have to be updated to work with DRA? Do we need to never again use pod and just use PodInfo everywhere?
Why nervous? It's a fair bit of toil when refactoring now, but doesn't seem risky going forward.
I don't think using PodInfo instead of Pod is strictly needed everywhere in the codebase, but certainly everywhere where the pods are expected to interact with the snapshot. But there's a lot of code that just grabs something from the snapshot, processes the pods and doesn't modify the snapshot. In this case just working on pods is fine IMO.
Right, but we need to refactor things like PodListProcessor to take and return a list of PodInfos (we definitely interact with snapshot/scheduling in "filter out schedulable"). At which point we probably want to change it in all processors for consistency. At which point we probably just want to use PodInfo everywhere for consistency.
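To make that concrete, a hypothetical sketch of the direction being discussed (the existing PodListProcessor is paraphrased and may not match its actual definition):

```go
// Today (roughly): processors exchange plain pod lists.
type PodListProcessor interface {
	Process(ctx *context.AutoscalingContext, unschedulablePods []*apiv1.Pod) ([]*apiv1.Pod, error)
	CleanUp()
}

// With DRA: processors would exchange PodInfos, so needed claims travel
// together with the pods through the whole pipeline.
type PodInfoListProcessor interface {
	Process(ctx *context.AutoscalingContext, unschedulablePods []*framework.PodInfo) ([]*framework.PodInfo, error)
	CleanUp()
}
```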
	for _, ds := range daemonsets {
		shouldRun, _ := daemon.NodeShouldRunDaemonPod(nodeInfo.Node(), ds)
		if shouldRun {
			pod := daemon.NewPod(ds, nodeInfo.Node().Name)
			pod.Name = fmt.Sprintf("%s-pod-%d", ds.Name, rand.Int63())
-			result = append(result, pod)
+			result = append(result, &framework.PodInfo{Pod: pod})
Similar to above and probably plenty of other places -> should we change this to somehow include DRA claims (where would we get them from though)?
Yes, all this logic will be added in DRA-specific PRs. Though this particular place will probably be a TODO in the initial implementation (meaning that forcing missing DS pods that use DRA won't work initially, but that's probably not a popular use-case).
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. The CA codebase is migrated to use new NodeInfo/PodInfo types wrapping the corresponding schedulerframework types. This allows attaching additional data to Nodes/Pods in autoscaling simulations, which will be needed for DRA.
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
This is intended to be a no-op refactor. It was extracted from #7350 as the easiest part to tackle first while reviewing.
There are 3 meaningful commits to be reviewed sequentially, though most of the delta is in the last one.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @MaciekPytel
/assign @jackfrancis