Optional garbage collection of finished workloads #2686
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: mwysokin. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Hi @mwysokin. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Overall lgtm. I'm not certain about the finalizer; it might be worth adding an integration test for it.
I think the issue with suspending a Job when deleting a workload is pre-existing, and can be handled separately from this feature (feel free to open a separate one).
Obj(),
wantError: nil,
},
"should delete the workload because the retention period has elapsed, object retention configured": {
Can we add a test case for a workload with finalizer? I would like to also consider an integration test for it.
Sure thing 👍 It'd greatly help me to understand the case if you could respond to my comment above first about cleaning up Workloads with a finalizer 🙏
In my opinion it's better to be done with this PR; it should just be a matter of moving
kueue/pkg/controller/jobframework/reconciler.go Lines 615 to 625 in 014d4ec
(I don't think the job is really suspended; the logs are lying :))
Co-authored-by: Michał Woźniak <[email protected]>
…letion events occurrences Co-authored-by: Michał Woźniak <[email protected]>
Thank you @mimowo @tenzen-y @trasc for your comments 🙏
It would be better to open a dedicated PR for the KEP :)
Sounds like a great plan.
…t retention policies to be consistent with the KEP.
It took me a while, but I applied all the requested updates. Thank you so much for your time spent on the initial review 🙏 May I please ask for a little bit more of your time for a second round? 🙏
/ok-to-test
I'll take a look at the failing test on Friday. I'm pretty sure I ran them last week and they worked fine. Maybe one of the main merges broke something. I'll investigate.
FYI I just wanted to let everyone know that I haven't abandoned the work. I was on vacation last week and I'll be back in September. Sorry for the delay 🙇
I'm leaving the API discussion to #2742
@@ -193,4 +193,7 @@ func SetDefaults_Configuration(cfg *Configuration) {
	if fs := cfg.FairSharing; fs != nil && fs.Enable && len(fs.PreemptionStrategies) == 0 {
		fs.PreemptionStrategies = []PreemptionStrategy{LessThanOrEqualToFinalShare, LessThanInitialShare}
	}
	if cfg.ObjectRetentionPolicies == nil {
		cfg.ObjectRetentionPolicies = &ObjectRetentionPolicies{}
If we are not adding anything relevant, let's not add any code to "defaults"
add a test
log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
- r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
+ r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention")
Because the event already targets the workload, we don't need to add its name again in the message.
log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
We are missing a return for this happy path.
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
} else {
remainingTime := expirationTime.Sub(now)
log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
- log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
+ log.V(3).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
Or even 4.
expirationTime := finishedCond.LastTransitionTime.Add(r.objectRetention.Duration)
if now.After(expirationTime) {
log.V(2).Info("Deleting workload because it has finished and the retention period has elapsed", "retention", r.objectRetention.Duration)
if err := r.client.Delete(ctx, &wl); err != nil {
we should probably ignore a NotFound error.
Obj(),
wantError: nil,
},
"shouldn't try to delete the workload (no event emitted) because it is already being deleted by kubernetes, object retention configured": {
what if it has an owner reference?
ginkgo.By("workload should be deleted after the retention period")
gomega.Eventually(func() error {
return k8sClient.Get(ctx, client.ObjectKeyFromObject(wl), &createdWorkload)
}, 2*time.Second, time.Second).ShouldNot(gomega.Succeed())
use the existing timeout constants.
func managerSetupWithWorkloadRetentionEnabled(ctx context.Context, mgr manager.Manager) {
objectRetention := &config.ObjectRetentionPolicies{FinishedWorkloadRetention: &v1.Duration{
Duration: 3 * time.Second,
a bit too long, given that this influences how long the test runs for.
I'm really sorry it's been dragging. I'm coming back to it starting Monday.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mwysokin: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds optional, configurable garbage collection of finished Workloads. If an object retention policy is configured for Workloads, a completed Workload is either requeued for later deletion (when the policy is configured but its retention period hasn't elapsed yet) or deleted. Deletion applies both to existing Workloads that were created by a previous version or instance of Kueue and have since expired according to the retention policy, and to newly created Workloads once their retention period elapses.
This decreases both Kueue's memory footprint and etcd's storage usage.
Which issue(s) this PR fixes:
Fixes #1618
Special notes for your reviewer:
This PR comes with a couple of configuration scenarios:
Known issue
The Workload deletion triggers a Workload delete event to be emitted. This is needed because, in response to that event, Kueue performs its internal cleanup (e.g. of caches and queues). However, it seems that emitting that event also causes the Job related to the Workload to be suspended. In this scenario that is confusing and unnecessary because the Job is already completed. This is how it looks chronologically:
The author of this PR commits to creating a fix/improvement for the scenario shown above. It's up to the reviewers to decide whether the fix should be part of this PR or its own follow-up issue/PR, and whether this PR should be blocked until the fix is done.
Does this PR introduce a user-facing change?