
CA: introduce internal NodeInfo #7447

Merged
3 commits merged into kubernetes:master from jtuznik/dra-node-info on Nov 6, 2024

Conversation

@towca (Collaborator) commented Oct 30, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. The CA codebase is migrated to use new NodeInfo/PodInfo types wrapping the corresponding schedulerframework types. This allows attaching additional data to Nodes/Pods in autoscaling simulations, which will be needed for DRA.
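
For illustration, a minimal sketch of the wrapping described above, with assumed field sets (not the exact merged code):

```go
// Minimal sketch of the wrapper shape; the DRA-specific fields the wrappers
// exist to carry are only indicated in comments here.
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo wraps a Pod so that pod-level data needed for simulations
// (e.g. the ResourceClaims a pod needs, once DRA lands) can travel with it.
type PodInfo struct {
	*apiv1.Pod
	// DRA fields (e.g. needed ResourceClaims) come in follow-up PRs.
}

// NodeInfo wraps the scheduler framework's NodeInfo so node-level data
// can be attached without changing the scheduler type itself.
type NodeInfo struct {
	*schedulerframework.NodeInfo
	// DRA fields (e.g. node-local ResourceSlices) come in follow-up PRs.
}
```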

Which issue(s) this PR fixes:

The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.

Special notes for your reviewer:

This is intended to be a no-op refactor. It was extracted from #7350 as the easiest part to tackle first while reviewing.

There are 3 meaningful commits to be reviewed sequentially, though most of the delta is in the last one.

Does this PR introduce a user-facing change?

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

/assign @MaciekPytel
/assign @jackfrancis

@towca added labels: area/cluster-autoscaler, area/core-autoscaler (Oct 30, 2024)
@k8s-ci-robot added labels: kind/feature, cncf-cla: yes, area/provider/alicloud, area/provider/aws, approved, area/provider/azure, area/provider/cluster-api (Oct 30, 2024)
@k8s-ci-robot added labels: area/provider/digitalocean, area/provider/equinixmetal, area/provider/externalgrpc, area/provider/gce, area/provider/hetzner, area/provider/ionoscloud, area/provider/kwok, size/XXL, area/provider/linode, area/provider/magnum, area/provider/oci, area/provider/rancher (Oct 30, 2024)
@@ -42,7 +42,7 @@ func NewNodes() *Nodes {

// NodeInfoGetter is anything that can return NodeInfo object by name.
type NodeInfoGetter interface {
Get(name string) (*schedulerframework.NodeInfo, error)
Contributor

Would it be safer to keep the legacy Get method in here given that this is an exported, public interface?

@towca (Collaborator, author) commented Oct 30, 2024

The idea is for the whole CA codebase to use the new types instead of the scheduler ones, unless it's absolutely necessary (e.g. when actually interacting with the scheduler framework). So most NodeInfos().Get() calls have to be translated to GetNodeInfo() - including the ones made via this interface. There's no reason to leave the scheduler version here IMO.

This interface looks like the classic "local interface with a relevant subset of methods from some other interface" pattern, but it was probably exported by mistake. As expected, the interface is only used locally, so I changed it to be unexported. WDYT?
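
A minimal sketch of the unexported form, assuming the wrapper package from this PR (package name and import path assumed):

```go
package unneeded // package name assumed for illustration

import "k8s.io/autoscaler/cluster-autoscaler/simulator/framework" // path assumed

// nodeInfoGetter is the "local subset" interface discussed above, now
// unexported and returning the CA-internal NodeInfo wrapper.
type nodeInfoGetter interface {
	GetNodeInfo(name string) (*framework.NodeInfo, error)
}
```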

Contributor

Right, I think this is defensible. Maybe worth calling it out in release notes.

Collaborator (author)

I think we might not be on the same page here. Why would changing an arbitrary exported interface in CA code be release-note-worthy? There's a policy around breaking changes in the CloudProvider/NodeGroup interfaces specifically, but others should be fair game to change as needed, right?

It could be argued that the NodeGroup.TemplateNodeInfo() signature change in this PR should be highlighted in release notes (or actually go through a deprecation?), I'm not sure about that part. All in-tree cloud providers are automatically adapted so the change doesn't break them, but it will break out-of-tree cloud providers. Not sure if that's also covered under the policy though, I think we've done similar "mechanical" changes to all in-tree cloud providers without going through the deprecation process.

@MaciekPytel WDYT?
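
For reference, the signature change under discussion looks roughly like this (schematic; the real NodeGroup interface has many more methods):

```go
package cloudprovider

import "k8s.io/autoscaler/cluster-autoscaler/simulator/framework" // path assumed

type NodeGroup interface {
	// Before this PR: TemplateNodeInfo() (*schedulerframework.NodeInfo, error)
	// After this PR the CA-internal wrapper is returned instead, which is the
	// mechanical change out-of-tree providers would have to mirror.
	TemplateNodeInfo() (*framework.NodeInfo, error)
}
```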

Contributor

I'm just speaking to the fact that it's theoretically possible for an external dependency (outside of the local CA codebase) to exist in the wild that reuses NodeInfoGetter, and those downstream use cases would need to be adapted to these changes.

Not a big deal either way. If that doesn't rise to the level of being included as a release note, that's fine.

Contributor

I don't think we promise to keep any sort of backward compatibility when it comes to internal interfaces in CA, besides the CloudProvider interface and processors (we did go through deprecation and multiple releases of supporting NodeInfoProcessor before dropping it).

I'm mostly basing my opinion on kubernetes/kubernetes - I don't think there is any promise of golang interface-level backward compatibility there. We certainly need to modify the predicate checker pretty much every minor release just to handle backward-incompatible refactors in the scheduler framework. So far our general practice when it comes to deprecations, supported versions and similar has been to generally follow Kubernetes.

All that being said - we all know there are CA forks out there. It's no secret we have one and I also assume it's no secret that so does Azure (I'm pretty sure there are issues in this repo that mention upstreaming from that fork). With processors/ CA is pretty much asking to be forked and customized. If we think this interface is somehow particularly useful for fork maintainers (e.g. based on experience of anyone in this thread) it could be a reason to leave it - perhaps with a comment discouraging new use?

Contributor

My vote would be slightly in favor of not worrying about this given the historical precedent in the k/k ecosystem of projects.

Forks are probably OK because they may already have implemented their own opinionated NodeInfoGetter anyway. :)

@@ -43,6 +44,16 @@ type ClusterSnapshot interface {
// IsPVCUsedByPods returns if the pvc is used by any pod, key = <namespace>/<pvc_name>
IsPVCUsedByPods(key string) bool

// AddNodeInfo adds the given NodeInfo to the snapshot. The Node and the Pods are added, as well as
// any DRA objects passed along them.
Contributor

rather than refer to "DRA objects" generally, should we be more explicit and refer to "ResourceClaims" and "ResourceSlices"?

Collaborator (author)

I just tried to make the comment generic enough that it doesn't get outdated if there are more DRA objects to track in the future.

// Pods should always match the embedded NodeInfo.Pods. This way the schedulerframework.PodInfo can be
// accessed via .NodeInfo.Pods[i], and the internal one via .Pods[i]. CA code normally only needs to
// access the Pod, so iterating over the internal version should be enough.
Pods []*PodInfo
Contributor

I understand why you need Pods with this extended information, but it's a bit scary that you now have two sources of truth for the list of pods on a node (NodeInfo.Pods and NodeInfo.NodeInfo.Pods). It seems like this is asking for the two to somehow get out of sync.

I'm not yet sure what the right solution is. Intuitively the correct way to handle this would be to have an internal representation with a single list of pods and convert on-demand, but I am worried about performance implications of that.

Contributor

Rather than embed a new, local PodInfo type can we simply embed NeededResourceClaims []*resourceapi.ResourceClaim directly? And follow ResourceClaim.Status.ReservedFor to get the associated pod references?

Collaborator (author)

@MaciekPytel fully agreed on this being scary, but the alternatives that I see are:

  1. As you mention, changing the ClusterSnapshot implementations to operate only on the internal NodeInfos as the single source of truth, and converting them to schedulerframework.NodeInfo on-demand whenever scheduler code calls NodeInfos().List(). I went with this in the first prototype, but then decided against it for two reasons:
  • As you point out, it's unclear how it would affect performance. This would have to be measured, but I suspect it might not be insignificant given that some things are precomputed during schedulerframework.NodeInfo.AddPod().
  • It added unnecessary complexity to the ClusterSnapshot implementations, and was annoyingly bug-prone. This is less important IMO, as at least the problems were limited to simulator/snapshot.
  2. We could remove some redundancy by having the pod list only inside the wrapped schedulerframework.NodeInfo, and only keeping the additional information in PodInfo. So PodInfo would only contain NeededResourceClaims, and NodeInfo would contain schedulerframework.NodeInfo and a map from pod UIDs to PodInfo. We could also have a Pods() method on the internal NodeInfo that would merge the two data sources together for ease of use. This should be less prone to getting out of sync, but at the cost of more complexity. Do you think it's enough/worth it?

FWIW I tried to assert in the unit test that things stay in sync during normal usage patterns that we have in the codebase right now.

@jackfrancis Unfortunately we can't rely on any information inside a ResourceClaim to identify the pod(s) that need it. ReservedFor is only filled out if the claim is actually allocated for the pod. But I think the spirit of your suggestion is similar to option 2 I describe above, right?

Contributor

IMO that additional complexity is preferable to storing the same data in two places.

Contributor

Agree with @jackfrancis, duplicating data seems like the worst option. I think both approaches you listed could work.

For approach 1 - you could opportunistically cache schedulerframework.NodeInfo to minimize the performance impact (i.e. cache the sf.NI whenever you build it, and invalidate it on any change to framework.NI). That still involves duplicated lists to an extent, but I think it's very different in that there is a clearly defined source of truth. It probably doesn't solve all the performance problems, as you'd be building sf.NI each time rather than updating it, and IIRC you're correct that AddPod() was indeed optimized to efficiently update precomputes (for anti-affinity probably?). I think this may be workable, but I don't love this option.

Approach 2 seems better to me. As long as we can come up with a good interface that encapsulates the complexity (e.g. the Pods() method you described), I think it's a good option. The main thing is to make sure users of framework.NodeInfo won't have to know the information is stored separately (I'd go as far as making fields private and only allowing access via methods outside of simulator).

I think the best approach would be to extend NodeInfo in the scheduler framework (i.e. make the change in kubernetes/kubernetes).
I think the fact that the scheduler snapshots nodes and pods, but just uses in-plugin informers for everything else, is inconsistent, error-prone and all around terrible. Including DRA objects in the scheduler SharedLister seems like a step in the right direction from the perspective of the scheduler alone. And it has the added benefit of completely removing the need for any custom wrappers in CA, together with all the complexity they bring.
Unfortunately, I'm guessing we're out of time for such a change in the 1.32 scheduler. Even more unfortunate is that once we implement the wrappers, I can't see anyone going back and cleaning this up for 1.33 :(

So, to sum up - I'm +1 for approach 2.

Contributor

Is there a case to be made that extending NodeInfo in scheduler could have general reuse value? If so, we could probably find someone to do that work (I'd be happy to give that a shot).

This is just to say IMO approach 2 is the best overall, and we may still have a chance at an even better long term outcome if we have the appetite.

Collaborator (author)

@MaciekPytel Done, went with approach 2. Also made the fields private as you suggested. WDYT?

IMO having an internal NodeInfo type wrapping the schedulerframework one is useful for CA outside of the DRA logic. Not all info that CA would want to attach to a Node is relevant to the scheduler, and putting it there just for CA would be confusing. E.g. now that we have this type, we could properly keep information like whether a Node is upcoming in there, instead of hacking it into Node annotations. Same for e.g. the boot-disk info that the GCE cloud provider passes through annotations. The added complexity, on the other hand, should be hidden from most of the codebase.

Separately, I agree that the ideal solution would be for the scheduler to make all API objects that plugins might need a part of its Snapshot - not just Nodes and Pods. I explored that solution, but got to a conclusion that it'd be a massive change, and maybe it wouldn't even solve the problem fully:

  1. The Node/Pod cache that feeds the scheduler Snapshot is quite complex. If we wanted to correlate the DRA objects to Nodes/Pods on the fly during informer updates, we'd need to be very careful not to bloat the complexity further.
  2. Not all DRA objects are "bound" to a Node or a Pod. So we'd need to expand the scheduler Snapshot beyond the current NodeInfo/PodInfo anyway to track those. At that point, maybe it makes more sense not to correlate the objects on the fly, and just essentially keep an additional snapshot of DRA objects alongside the NodeInfos. In which case we'd need the extra NodeInfo fields just for CA anyway.
  3. The plugins would still need access to live informers to handle queuing hints. Not a problem for CA since CA doesn't run the queuing logic, but it's a readability challenge on the scheduler side.

@jackfrancis IMO extending the NodeInfo on scheduler side without doing the huge scheduler effort described above doesn't make sense, as the scheduler plugin wouldn't use these fields anyway - they would only be used by CA.
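
Sketching the approach-2 shape for clarity (a simplified sketch; names are illustrative and may differ from the merged code):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// podExtraInfo holds only the CA-specific, pod-level data (e.g. needed
// ResourceClaims once DRA lands); it is not a second copy of the pod.
type podExtraInfo struct{}

// PodInfo is what callers see: the Pod plus the CA-only extras.
type PodInfo struct {
	*apiv1.Pod
	podExtraInfo
}

// NodeInfo keeps the wrapped scheduler NodeInfo as the single source of
// truth for the pod list and stores the extras in a side map keyed by UID.
type NodeInfo struct {
	schedNodeInfo *schedulerframework.NodeInfo
	podsExtraInfo map[types.UID]podExtraInfo
}

// Pods merges the two on demand, so callers never need to know that the
// CA-only data is stored separately from the scheduler pod list.
func (n *NodeInfo) Pods() []*PodInfo {
	result := make([]*PodInfo, 0, len(n.schedNodeInfo.Pods))
	for _, p := range n.schedNodeInfo.Pods {
		result = append(result, &PodInfo{Pod: p.Pod, podExtraInfo: n.podsExtraInfo[p.Pod.UID]})
	}
	return result
}
```

With this shape, callers only ever iterate nodeInfo.Pods(), which is the ease-of-use point made above.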

return framework.WrapSchedulerNodeInfos(schedNodeInfos), nil
}

func (snapshot *DeltaClusterSnapshot) AddNodeInfo(nodeInfo *framework.NodeInfo) error {
Contributor

Why have this method at all (here and elsewhere) if you're discarding all the extra data in framework.NodeInfo and just storing the scheduler representation? I'd expect the result of:

snapshot.AddNodeInfo(x)
y = snapshot.GetNodeInfo(x.Node.Name)

to be that x and y are the same. I think anyone would instinctively assume the same, but that's not the case here, and that strikes me as confusing and ultimately error-prone when someone makes a wrong assumption. If you only want to store the scheduler stuff, I'd rather this method didn't exist at all and everyone just kept using AddNodeWithPods.
Even if that requires a tiny bit of boilerplate to unpack NodeInfo, it is much clearer what's happening: I know ResourceClaims/Slices will not be stored if I can't even pass them, and if I do pass them to the snapshot, I expect it to store them as well.
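
To make the expected round-trip concrete, a hedged test-style sketch (package paths, the helper name, and GetNodeInfo's exact signature are assumed; setup is elided):

```go
package clustersnapshot_test

import (
	"testing"

	"k8s.io/autoscaler/cluster-autoscaler/simulator/clustersnapshot"
	"k8s.io/autoscaler/cluster-autoscaler/simulator/framework"
)

func assertNodeInfoRoundTrip(t *testing.T, snapshot clustersnapshot.ClusterSnapshot, nodeInfo *framework.NodeInfo) {
	t.Helper()
	if err := snapshot.AddNodeInfo(nodeInfo); err != nil {
		t.Fatalf("AddNodeInfo: %v", err)
	}
	got, err := snapshot.GetNodeInfo(nodeInfo.Node().Name)
	if err != nil {
		t.Fatalf("GetNodeInfo: %v", err)
	}
	// The expectation stated above: everything that was passed in, including
	// any DRA objects, should come back out, not just the scheduler-level data.
	if len(got.Pods()) != len(nodeInfo.Pods()) {
		t.Errorf("pod count changed after round-trip: got %d, want %d", len(got.Pods()), len(nodeInfo.Pods()))
	}
}
```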

Collaborator (author)

The extra data will be stored and your example will hold; this just comes in later commits (ef9d420#diff-29296088fba933fa837b2efdce93b2423b19247d99f59f1d8ef1bc8d5e8c6915).

I can remove the DRA fields from the new types until then, but then we'd have wrapper types for seemingly no reason - neither way is ideal in the interim, but the end state should be clear IMO.

@towca towca force-pushed the jtuznik/dra-node-info branch 3 times, most recently from ed7acb7 to 8f2e1ba on November 4, 2024 15:05
utils/test is supposed to be usable in any CA package. Having a
dependency on cloudprovider makes it unusable in any package
that cloudprovider depends on because of import cycles.

The cloudprovider import is only needed by GetGpuConfigFromNode,
which is only used in info_test.go. This commit just moves
GetGpuConfigFromNode there as an unexported function.
Methods to interact with the new internal types are added to
ClusterSnapshot. Cluster Autoscaler code will be migrated to only
use these methods and work on the internal types instead of directly
using the framework types.

The new types are designed so that they can be used as close to the
framework types as possible, which should make the migration manageable.

This allows easily adding additional data to the Nodes and Pods tracked
in ClusterSnapshot, without having to change the scheduler framework.
This will be needed to support DRA, as we'll need to track
ResourceSlices and ResourceClaims.
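
Schematically, the additions look like this (only the methods visible in this PR's diffs and discussion are shown; GetNodeInfo's exact signature is assumed):

```go
// Sketch of the ClusterSnapshot additions; the existing methods
// (IsPVCUsedByPods and the rest) are elided and unchanged.
package clustersnapshot

import "k8s.io/autoscaler/cluster-autoscaler/simulator/framework"

type ClusterSnapshot interface {
	// AddNodeInfo adds the given NodeInfo to the snapshot: the Node, the Pods,
	// and any DRA objects attached to them (signature from this PR's diff).
	AddNodeInfo(nodeInfo *framework.NodeInfo) error
	// GetNodeInfo returns the internal NodeInfo for the named node.
	GetNodeInfo(nodeName string) (*framework.NodeInfo, error)
}
```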
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
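
The mechanical shape of the migration, shown as a schematic fragment (variable and package names are illustrative):

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"

	"k8s.io/autoscaler/cluster-autoscaler/simulator/framework" // path assumed
)

// beforeAndAfter shows the same loop before and after the migration.
func beforeAndAfter(node *apiv1.Node, pods []*apiv1.Pod, podInfos []*framework.PodInfo) {
	// Before: scheduler framework types used directly.
	oldNodeInfo := schedulerframework.NewNodeInfo(pods...)
	oldNodeInfo.SetNode(node)
	for _, podInfo := range oldNodeInfo.Pods {
		_ = podInfo.Pod
	}

	// After: the CA wrapper constructor from this PR's diffs. The second
	// argument is reserved for node-local DRA objects (nil in this no-op
	// refactor), and the Pods field access becomes a Pods() method call.
	newNodeInfo := framework.NewNodeInfo(node, nil, podInfos...)
	for _, podInfo := range newNodeInfo.Pods() {
		_ = podInfo.Pod
	}
}
```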
towca added a commit to towca/autoscaler that referenced this pull request Nov 5, 2024
@jackfrancis (Contributor) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9e15df8 into kubernetes:master Nov 6, 2024
7 checks passed
}

// Build a new node info.
newNodeInfo := schedulerframework.NewNodeInfo(newPods...)
newNodeInfo.SetNode(nodeInfo.Node().DeepCopy())
newNodeInfo := framework.NewNodeInfo(nodeInfo.Node().DeepCopy(), nil, newPods...)
Contributor

Shouldn't this also copy DRA stuff?

Collaborator (author)

Yes, definitely. This will come in a later commit though, after these NodeInfo utils are refactored a bit: #7479. I'm trying to keep these first PRs as pure refactors so that we can get them in more easily, and then have all of the DRA logic in the last PRs.

}

sanitizedNodeInfo := schedulerframework.NewNodeInfo(SanitizePods(pods, sanitizedNode)...)
sanitizedNodeInfo.SetNode(sanitizedNode)
sanitizedNodeInfo := framework.NewNodeInfo(sanitizedNode, nil, SanitizePods(pods, sanitizedNode)...)
Contributor

Shouldn't this also copy DRA stuff?

Collaborator (author)

Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.

sanitizedPod.Spec.NodeName = sanitizedNode.Name
sanitizedPods = append(sanitizedPods, sanitizedPod)
sanitizedPods = append(sanitizedPods, &framework.PodInfo{Pod: sanitizedPod})
Contributor

Here I'm less sure, but once again - feels like we should also include info about needed claims in the copy, no?

Collaborator (author)

Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.

@@ -44,7 +43,7 @@ func addExpectedPods(nodeInfo *schedulerframework.NodeInfo, scheduledPods []*api
}
// Add scheduled mirror and DS pods
if pod_util.IsMirrorPod(pod) || pod_util.IsDaemonSetPod(pod) {
nodeInfo.AddPod(pod)
nodeInfo.AddPod(&framework.PodInfo{Pod: pod})
Contributor

This type of code makes me nervous - is every place where we pass a pod list going to have to be updated to work with DRA? Do we need to never use Pod again and just use PodInfo everywhere?

Collaborator (author)

Why nervous? It's a fair bit of toil when refactoring now, but doesn't seem risky going forward.

I don't think using PodInfo instead of Pod is strictly needed everywhere in the codebase, but certainly everywhere where the pods are expected to interact with the snapshot. But there's a lot of code that just grabs something from the snapshot, processes the pods and doesn't modify the snapshot. In this case just working on pods is fine IMO.

Contributor

Right, but we need to refactor things like PodListProcessor to take and return lists of PodInfos (we definitely interact with the snapshot/scheduling in filter-out-schedulable). At which point we probably want to change it in all processors for consistency. At which point we probably just want to use PodInfo everywhere for consistency.
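
A hypothetical illustration of that kind of signature change (not the actual PodListProcessor interface; names are invented for the example):

```go
// Hypothetical sketch only: a processor that today takes and returns bare
// Pods would take and return PodInfos instead, so that pod-level DRA data
// can flow through the processor chain.
package processors

import "k8s.io/autoscaler/cluster-autoscaler/simulator/framework" // path assumed

type podInfoListProcessor interface {
	Process(unschedulablePods []*framework.PodInfo) ([]*framework.PodInfo, error)
	CleanUp()
}
```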

for _, ds := range daemonsets {
shouldRun, _ := daemon.NodeShouldRunDaemonPod(nodeInfo.Node(), ds)
if shouldRun {
pod := daemon.NewPod(ds, nodeInfo.Node().Name)
pod.Name = fmt.Sprintf("%s-pod-%d", ds.Name, rand.Int63())
result = append(result, pod)
result = append(result, &framework.PodInfo{Pod: pod})
Contributor

Similar to above and probably plenty of other places -> should we change this to somehow include DRA claims (where would we get them from though)?

Collaborator (author)

Yes, all this logic will be added in DRA-specific PRs. Though this particular place will probably be a TODO in the initial implementation (meaning that forcing missing DS pods that use DRA won't work initially, but that's probably not a popular use-case).
