CA: introduce internal NodeInfo #7447
Conversation
@@ -42,7 +42,7 @@ func NewNodes() *Nodes {

	// NodeInfoGetter is anything that can return NodeInfo object by name.
	type NodeInfoGetter interface {
		Get(name string) (*schedulerframework.NodeInfo, error)
Would it be safer to keep the legacy Get method in here given that this is an exported, public interface?
The idea is for all of CA codebase to use the new types instead of the scheduler ones, unless it's absolutely necessary (e.g. when actually interacting with the scheduler framework). So most NodeInfos().Get() calls have to be translated to GetNodeInfo() - including the ones made via this interface. There is no reason to leave the scheduler version here IMO.
This interface looks like the classic "local interface with a relevant subset of methods from some other interface" pattern, but it was exported probably by mistake. As expected, the interface is only used locally, so I changed it to be unexported, WDYT?
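For illustration, the unexported variant being described could look roughly like this (a sketch only - the exact method set may differ from the PR):

```go
// nodeInfoGetter is anything that can return an internal NodeInfo by name.
// Kept unexported on purpose: it's a local "subset of methods" interface,
// not a public API surface.
type nodeInfoGetter interface {
	GetNodeInfo(name string) (*framework.NodeInfo, error)
}
```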
Right, I think this is defensible. Maybe worth calling it out in release notes.
I think we might not be on the same page here. Why would changing an arbitrary exported interface in CA code be release-note-worthy? There's a policy around breaking changes in the CloudProvider/NodeGroup interfaces specifically, but others should be fair game to change as needed, right?
It could be argued that the NodeGroup.TemplateNodeInfo() signature change in this PR should be highlighted in release notes (or actually go through a deprecation?) - I'm not sure about that part. All in-tree cloud providers are automatically adapted so the change doesn't break them, but it will break out-of-tree cloud providers. Not sure if that's also covered under the policy though; I think we've done similar "mechanical" changes to all in-tree cloud providers without going through the deprecation process.
@MaciekPytel WDYT?
I'm just speaking to the fact that it's theoretically possible for an external dependency (outside of the local CA codebase) to exist in the wild that reuses NodeInfoGetter, and those downstream use cases will need to be adapted to these changes.
Not a big deal either way. If that doesn't rise to the level of being included as a release note, that's fine.
I don't think we promise to keep any sort of backward compatibility when it comes to internal interfaces in CA, besides the CloudProvider interface and processors (we did go through deprecation and multiple releases of supporting NodeInfoProcessor before dropping it).
I'm mostly basing my opinion on kubernetes/kubernetes - I don't think there is any promise of golang interface level backward compatibility there. We certainly need to modify predicate checker pretty much every minor just to handle backward incompatible refactors in scheduler framework. So far our general practice when it comes to deprecations, supported versions and similar was to generally follow kubernetes.
All that being said - we all know there are CA forks out there. It's no secret we have one and I also assume it's no secret that so does Azure (I'm pretty sure there are issues in this repo that mention upstreaming from that fork). With processors/ CA is pretty much asking to be forked and customized. If we think this interface is somehow particularly useful for fork maintainers (e.g. based on experience of anyone in this thread) it could be a reason to leave it - perhaps with a comment discouraging new use?
My vote would be slightly in favor of not worrying about this given the historical precedent in the k/k ecosystem of projects.
Forks are probably O.K. because they may already have implemented their own opinionated NodeInfoGetter anyways. :)
@@ -43,6 +44,16 @@ type ClusterSnapshot interface {
	// IsPVCUsedByPods returns if the pvc is used by any pod, key = <namespace>/<pvc_name>
	IsPVCUsedByPods(key string) bool

	// AddNodeInfo adds the given NodeInfo to the snapshot. The Node and the Pods are added, as well as
	// any DRA objects passed along them.
Rather than refer to "DRA objects" generally, should we be more explicit and refer to "ResourceClaims" and "ResourceSlices"?
I just tried to make the comment generic enough so it doesn't get outdated if there are more DRA objects to track in the future.
	// Pods should always match the embedded NodeInfo.Pods. This way the schedulerframework.PodInfo can be
	// accessed via .NodeInfo.Pods[i], and the internal one via .Pods[i]. CA code normally only needs to
	// access the Pod, so iterating over the internal version should be enough.
	Pods []*PodInfo
I understand why you need Pods with this extended information, but it's a bit scary that you now have 2 sources of truth for the list of pods on a node (NodeInfo.Pods and NodeInfo.NodeInfo.Pods). It seems like this is asking for the two to somehow get out of sync.
I'm not yet sure what the right solution is. Intuitively the correct way to handle this would be to have an internal representation with a single list of pods and convert on-demand, but I am worried about performance implications of that.
Rather than embed a new, local PodInfo type, can we simply embed NeededResourceClaims []*resourceapi.ResourceClaim directly? And follow ResourceClaim.Status.ReservedFor to get the associated pod references?
@MaciekPytel fully agreed on this being scary, but the alternatives that I see are:
1. As you mention, changing the ClusterSnapshot implementations to operate only on the internal NodeInfos as the single source of truth, and converting them to schedulerframework.NodeInfo on-demand whenever scheduler code calls NodeInfos().List(). I went with this in the first prototype, but then decided against it for 2 reasons:
   - As you point out, it's unclear how it would affect performance. This would have to be measured, but I suspect it might not be insignificant given that some things are precomputed during schedulerframework.NodeInfo.AddPod().
   - It added unnecessary complexity to ClusterSnapshot implementations, and was annoyingly bug-prone. This is less important IMO, as at least the problems were limited to simulator/snapshot.
2. We could remove some redundancy by having the pod list only inside the wrapped schedulerframework.NodeInfo, and only keeping the additional information in PodInfo. So PodInfo would only contain NeededResourceSlices, and NodeInfo would contain schedulerframework.NodeInfo and a map from pod UIDs to PodInfo. We could also have a Pods() method on the internal NodeInfo that would merge the two data sources together for ease-of-use. This should be less prone to getting out-of-sync, but at a cost of more complexity. Do you think it's enough/worth it?
FWIW I tried to assert in the unit test that things stay in sync during normal usage patterns that we have in the codebase right now.
@jackfrancis Unfortunately we can't rely on any information inside a ResourceClaim to identify the pod(s) that need it. ReservedFor is only filled out if the claim is actually allocated for the pod. But I think the spirit of your suggestion is similar to option 2 that I describe above, right?
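To make option 2 concrete, here is a rough sketch of what the wrapper could look like (identifiers, package layout and the DRA API version are assumptions, not the final code; the pod-level extras are shown as NeededResourceClaims per the earlier suggestion):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1alpha3" // DRA API version is an assumption
	apitypes "k8s.io/apimachinery/pkg/types"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo pairs a Pod with the CA-only extras it needs (here: DRA claims).
type PodInfo struct {
	Pod                  *apiv1.Pod
	NeededResourceClaims []*resourceapi.ResourceClaim
}

// NodeInfo wraps the scheduler framework NodeInfo. The wrapped object stays
// the single source of truth for the pod list; only the extras live in a
// side map keyed by pod UID.
type NodeInfo struct {
	schedNodeInfo *schedulerframework.NodeInfo
	podExtras     map[apitypes.UID][]*resourceapi.ResourceClaim
}

// Pods merges the two data sources on demand, so callers never need to know
// that the extras are stored separately from the wrapped NodeInfo.
func (n *NodeInfo) Pods() []*PodInfo {
	result := make([]*PodInfo, 0, len(n.schedNodeInfo.Pods))
	for _, pi := range n.schedNodeInfo.Pods {
		result = append(result, &PodInfo{
			Pod:                  pi.Pod,
			NeededResourceClaims: n.podExtras[pi.Pod.UID],
		})
	}
	return result
}
```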
IMO that additional complexity is preferable to storing the same data in two places.
Agree with @jackfrancis, duplicating data seems like the worst option. I think both approaches you listed could work.
For approach 1 - you could opportunistically cache schedulerframework.NodeInfo to minimize performance impact (i.e. cache sf.NI whenever you build it, invalidate on any change to framework.NI). That still involves duplicated lists to an extent, but I think it's very different in that there is a clearly defined source of truth. It probably doesn't solve all the performance problems, as you'd be building sf.NI each time and not updating it, and IIRC you're correct that AddPod() was indeed optimized to efficiently update precomputes (for antiaffinity probably?). I think this may be workable, but I don't love this option.
Approach 2 seems better to me. As long as we can come up with a good interface that encapsulates the complexity (e.g. the Pods() method you described), I think it's a good option. The main thing is to make sure users of framework.NodeInfo won't have to know the information is stored separately (I'd go as far as making fields private and only allowing access via methods outside of simulator).
I think the best approach would be to extend NodeInfo in scheduler framework (i.e. make change to kubernetes/kubernetes).
I think the fact that scheduler snapshots nodes and pods, but just uses in-plugin informers for everything else is inconsistent, error prone and all around terrible. Including DRA objects in scheduler SharedLister seems like a step in the right direction from perspective of just the scheduler. And it has the added benefit of completely removing the need for any custom wrappers in CA together with all the complexity they bring.
Unfortunately, I'm guessing we're out of time for such a change in the 1.32 scheduler. Even more unfortunate is that once we implement the wrappers, I can't see anyone going back and cleaning this up for 1.33 :(
So, to sum up - I'm +1 for approach 2.
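For reference, the opportunistic caching mentioned for approach 1 could look roughly like this (hypothetical names, reusing the imports from the sketch above; locking and finer-grained invalidation are ignored):

```go
// cachingNodeInfo keeps the internal pod list as the source of truth and
// rebuilds the scheduler view lazily, invalidating it on every mutation.
type cachingNodeInfo struct {
	node                *apiv1.Node
	pods                []*PodInfo
	cachedSchedNodeInfo *schedulerframework.NodeInfo
}

func (n *cachingNodeInfo) AddPod(pod *PodInfo) {
	n.pods = append(n.pods, pod)
	n.cachedSchedNodeInfo = nil // any change invalidates the cached conversion
}

// ToScheduler returns the cached schedulerframework.NodeInfo, rebuilding it
// from scratch if any mutation happened since the last call.
func (n *cachingNodeInfo) ToScheduler() *schedulerframework.NodeInfo {
	if n.cachedSchedNodeInfo == nil {
		schedPods := make([]*apiv1.Pod, 0, len(n.pods))
		for _, p := range n.pods {
			schedPods = append(schedPods, p.Pod)
		}
		sni := schedulerframework.NewNodeInfo(schedPods...)
		sni.SetNode(n.node)
		n.cachedSchedNodeInfo = sni
	}
	return n.cachedSchedNodeInfo
}
```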
Is there a case to be made that extending NodeInfo in scheduler could have general reuse value? If so, we could probably find someone to do that work (I'd be happy to give that a shot).
This is just to say IMO approach 2 is the best overall, and we may still have a chance at an even better long term outcome if we have the appetite.
@MaciekPytel Done, went with approach 2. Also made the fields private as you suggested. WDYT?
IMO having an internal NodeInfo type wrapping the schedulerframework one is useful for CA outside of the DRA logic. Not all info that CA would want to attach to a Node is relevant to scheduler, and putting it there just for CA would be confusing. E.g. now that we have this type, we could keep information like whether a Node is upcoming properly in there instead of hacking it inside Node annotations. Same for e.g. the boot-disk info that GCE cloud provider passes through annotations. The added complexity, on the other hand, should be hidden from most of the codebase.
Separately, I agree that the ideal solution would be for the scheduler to make all API objects that plugins might need a part of its Snapshot - not just Nodes and Pods. I explored that solution, but came to the conclusion that it'd be a massive change, and maybe it wouldn't even solve the problem fully:
- The Node/Pod cache that feeds the scheduler Snapshot is quite complex. If we wanted to correlate the DRA objects to Nodes/Pods on the fly during informer updates, we'd need to be very careful not to bloat the complexity further.
- Not all DRA objects are "bound" to a Node or a Pod. So we'd need to expand the scheduler Snapshot beyond the current NodeInfo/PodInfo anyway to track those. At this point, maybe it makes more sense not to correlate the objects on the fly, and just essentially have an additional snapshot of DRA objects alongside NodeInfos. In which case we'd need extra fields in NodeInfo just for CA anyway.
- The plugins would still need access to live informers to handle queuing hints. Not a problem for CA since CA doesn't run the queuing logic, but it's a readability challenge on the scheduler side.
@jackfrancis IMO extending the NodeInfo on scheduler side without doing the huge scheduler effort described above doesn't make sense, as the scheduler plugin wouldn't use these fields anyway - they would only be used by CA.
	return framework.WrapSchedulerNodeInfos(schedNodeInfos), nil
}

func (snapshot *DeltaClusterSnapshot) AddNodeInfo(nodeInfo *framework.NodeInfo) error {
Why have this method at all (here and elsewhere) if you're discarding all the extra data in framework.NodeInfo and just storing the scheduler representation? I'd expect the result of:
snapshot.AddNodeInfo(x)
y = snapshot.GetNodeInfo(x.Node.Name)
To be that x and y are the same. I think anyone would instinctively assume the same, but that's not the case here and that strikes me as confusing and ultimately error prone when someone makes a wrong assumption. If you only want to store scheduler stuff, I'd rather this method didn't exist at all and everyone would just keep using AddNodeWithPods.
Even if that requires a tiny bit of boilerplate to unpack NodeInfo, it is much clearer what's happening here. I know ResourceClaims/Slices will not be stored if I can't even pass them. If I do pass them to snapshot, I expect it will store them as well.
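As an illustration of the round-trip property being asked for here, a test-style sketch (helper names and the exact GetNodeInfo signature are assumptions; standard testing/reflect imports omitted):

```go
// assertNodeInfoRoundTrip adds a NodeInfo to the snapshot and checks that the
// same pods come back from GetNodeInfo. It assumes GetNodeInfo returns
// (*framework.NodeInfo, error); the real signature may differ.
func assertNodeInfoRoundTrip(t *testing.T, snapshot ClusterSnapshot, nodeInfo *framework.NodeInfo) {
	t.Helper()
	if err := snapshot.AddNodeInfo(nodeInfo); err != nil {
		t.Fatalf("AddNodeInfo: %v", err)
	}
	got, err := snapshot.GetNodeInfo(nodeInfo.Node().Name)
	if err != nil {
		t.Fatalf("GetNodeInfo: %v", err)
	}
	want, have := podNames(nodeInfo.Pods()), podNames(got.Pods())
	if !reflect.DeepEqual(want, have) {
		t.Errorf("pods not preserved by snapshot: want %v, got %v", want, have)
	}
}

func podNames(pods []*framework.PodInfo) []string {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.Pod.Name)
	}
	return names
}
```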
The extra data will be stored and your example will hold; this just comes in later commits (ef9d420#diff-29296088fba933fa837b2efdce93b2423b19247d99f59f1d8ef1bc8d5e8c6915).
I can remove the DRA fields from the new types until then, but then we have wrapper types for seemingly no reason - either way isn't ideal in the interim, but the end state should be clear IMO.
utils/test is supposed to be usable in any CA package. Having a dependency on cloudprovider makes it unusable in any package that cloudprovider depends on because of import cycles. The cloudprovider import is only needed by GetGpuConfigFromNode, which is only used in info_test.go. This commit just moves GetGpuConfigFromNode there as an unexported function.
Methods to interact with the new internal types are added to ClusterSnapshot. Cluster Autoscaler code will be migrated to only use these methods and work on the internal types instead of directly using the framework types. The new types are designed so that they can be used as close to the framework types as possible, which should make the migration manageable. This allows easily adding additional data to the Nodes and Pods tracked in ClusterSnapshot, without having to change the scheduler framework. This will be needed to support DRA, as we'll need to track ResourceSlices and ResourceClaims.
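Based on the snippets visible in this PR, the added methods look roughly like the following (a sketch; the full method set and doc comments may differ):

```go
type ClusterSnapshot interface {
	// ...existing methods such as IsPVCUsedByPods...

	// AddNodeInfo adds the given internal NodeInfo to the snapshot, including
	// its Pods and any DRA objects attached to them.
	AddNodeInfo(nodeInfo *framework.NodeInfo) error

	// GetNodeInfo returns the internal NodeInfo for the node with the given name.
	GetNodeInfo(nodeName string) (*framework.NodeInfo, error)
}
```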
The new wrapper types should behave like the direct schedulerframework types for most purposes, so most of the migration is just changing the imported package. Constructors look a bit different, so they have to be adapted - mostly in test code. Accesses to the Pods field have to be changed to a method call. After this, the schedulerframework types are only used in the new wrappers, and in the parts of simulator/ that directly interact with the scheduler framework. The rest of CA codebase operates on the new wrapper types.
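A hedged before/after illustration of the mechanical part of this migration (identifiers are examples, not exact lines from the PR):

```go
// Before: building and iterating a schedulerframework.NodeInfo directly.
func buildOld(node *apiv1.Node, pods []*apiv1.Pod) *schedulerframework.NodeInfo {
	nodeInfo := schedulerframework.NewNodeInfo(pods...)
	nodeInfo.SetNode(node)
	return nodeInfo
}

// After: the internal wrapper type. The constructor takes the Node (and, later,
// DRA objects) up front, and pods are accessed via a method call instead of the
// Pods field.
func buildNew(node *apiv1.Node, podInfos []*framework.PodInfo) *framework.NodeInfo {
	return framework.NewNodeInfo(node, nil, podInfos...)
}

func podsOf(nodeInfo *framework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods()))
	for _, podInfo := range nodeInfo.Pods() {
		pods = append(pods, podInfo.Pod)
	}
	return pods
}
```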
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis, towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
}

	// Build a new node info.
-	newNodeInfo := schedulerframework.NewNodeInfo(newPods...)
-	newNodeInfo.SetNode(nodeInfo.Node().DeepCopy())
+	newNodeInfo := framework.NewNodeInfo(nodeInfo.Node().DeepCopy(), nil, newPods...)
Shouldn't this also copy DRA stuff?
Yes, definitely. This will come in a later commit though, after these NodeInfo utils are refactored a bit: #7479. I'm trying to have these first PRs as just refactors so that we can get them in easier, and then have all of the DRA logic in the last PRs.
}

-	sanitizedNodeInfo := schedulerframework.NewNodeInfo(SanitizePods(pods, sanitizedNode)...)
-	sanitizedNodeInfo.SetNode(sanitizedNode)
+	sanitizedNodeInfo := framework.NewNodeInfo(sanitizedNode, nil, SanitizePods(pods, sanitizedNode)...)
Shouldn't this also copy DRA stuff?
Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.
	sanitizedPod.Spec.NodeName = sanitizedNode.Name
-	sanitizedPods = append(sanitizedPods, sanitizedPod)
+	sanitizedPods = append(sanitizedPods, &framework.PodInfo{Pod: sanitizedPod})
Here I'm less sure, but once again - feels like we should also include info about needed claims in the copy, no?
Same answer as the previous comment - the DRA logic will be added later, this just refactors the non-DRA parts.
@@ -44,7 +43,7 @@ func addExpectedPods(nodeInfo *schedulerframework.NodeInfo, scheduledPods []*api
	}
	// Add scheduled mirror and DS pods
	if pod_util.IsMirrorPod(pod) || pod_util.IsDaemonSetPod(pod) {
-		nodeInfo.AddPod(pod)
+		nodeInfo.AddPod(&framework.PodInfo{Pod: pod})
This type of code makes me nervous - is every place where we pass pod list going to have to be updated to work with DRA? Do we need to never again use pod and just use PodInfo everywhere?
Why nervous? It's a fair bit of toil when refactoring now, but doesn't seem risky going forward.
I don't think using PodInfo instead of Pod is strictly needed everywhere in the codebase, but certainly everywhere where the pods are expected to interact with the snapshot. But there's a lot of code that just grabs something from the snapshot, processes the pods and doesn't modify the snapshot. In this case just working on pods is fine IMO.
Right, but we need to refactor things like PodListProcessor to take and return a list of PodInfos (we definitely interact with snapshot/scheduling in "filter out schedulable"). At which point we probably want to change it in all processors for consistency. At which point we probably just want to use PodInfo everywhere for consistency.
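To make that concrete, a hypothetical sketch of the direction being discussed (the existing PodListProcessor is paraphrased and may not match its actual definition):

```go
// Today (roughly): processors exchange plain pod lists.
type PodListProcessor interface {
	Process(ctx *context.AutoscalingContext, unschedulablePods []*apiv1.Pod) ([]*apiv1.Pod, error)
	CleanUp()
}

// With DRA: processors would exchange PodInfos, so needed claims travel
// together with the pods through the whole pipeline.
type PodInfoListProcessor interface {
	Process(ctx *context.AutoscalingContext, unschedulablePods []*framework.PodInfo) ([]*framework.PodInfo, error)
	CleanUp()
}
```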
	for _, ds := range daemonsets {
		shouldRun, _ := daemon.NodeShouldRunDaemonPod(nodeInfo.Node(), ds)
		if shouldRun {
			pod := daemon.NewPod(ds, nodeInfo.Node().Name)
			pod.Name = fmt.Sprintf("%s-pod-%d", ds.Name, rand.Int63())
-			result = append(result, pod)
+			result = append(result, &framework.PodInfo{Pod: pod})
Similar to above and probably plenty of other places -> should we change this to somehow include DRA claims (where would we get them from though)?
Yes, all this logic will be added in DRA-specific PRs. Though this particular place will probably be a TODO in the initial implementation (meaning that forcing missing DS pods that use DRA won't work initially, but that's probably not a popular use-case).
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. The CA codebase is migrated to use new NodeInfo/PodInfo types wrapping the corresponding schedulerframework types. This allows attaching additional data to Nodes/Pods in autoscaling simulations, which will be needed for DRA.
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.
Special notes for your reviewer:
This is intended to be a no-op refactor. It was extracted from #7350 as the easiest part to tackle first while reviewing.
There are 3 meaningful commits to be reviewed sequentially, though most of the delta is in the last one.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @MaciekPytel
/assign @jackfrancis