
Optional garbage collection of finished workloads #2686

Open

wants to merge 12 commits into base: main

Conversation

mwysokin
Contributor

@mwysokin mwysokin commented Jul 24, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds optional, configurable garbage collection of finished Workloads. If an object retention policy is configured for Workloads, a finished Workload is either re-queued for later deletion (when the policy is configured but its retention period has not yet elapsed) or deleted. Deletion applies both to existing Workloads that were created by a previous version or instance of kueue and have already expired under the retention policy, and to newly created Workloads once their retention period elapses.

This reduces both kueue's memory footprint and etcd's storage usage.

Which issue(s) this PR fixes:

Fixes #1618

Special notes for your reviewer:

This PR supports a few configuration scenarios (a short sketch of the corresponding configuration follows the list):

  • if the object retention policy is not configured for Workloads (the default), or the policy is explicitly set to null, all Workloads are kept indefinitely, as in previous versions of kueue,
  • if the object retention policy is set to 0s, there is effectively no retention and all finished Workloads are deleted immediately,
  • if the object retention policy is set to any value other than 0s, kueue will delete Workloads once their retention period elapses (both pre-existing and newly created Workloads).
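For reference, here is a minimal sketch of how the three scenarios map onto the new configuration type, based on the ObjectRetentionPolicies and FinishedWorkloadRetention fields used in the test helper quoted later in this thread. The package name, import aliases, and variable names are illustrative only.

// retention_example.go - a sketch, not part of this PR's diff.
package retentionexample

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	config "sigs.k8s.io/kueue/apis/config/v1beta1"
)

var (
	// Scenario 1: no policy (or an explicit null) - finished Workloads are kept indefinitely.
	keepForever = &config.ObjectRetentionPolicies{FinishedWorkloadRetention: nil}

	// Scenario 2: 0s - finished Workloads are deleted immediately after they finish.
	deleteImmediately = &config.ObjectRetentionPolicies{
		FinishedWorkloadRetention: &metav1.Duration{Duration: 0},
	}

	// Scenario 3: any positive duration - finished Workloads are deleted once
	// that period has elapsed after they finish (here, 24 hours).
	deleteAfterOneDay = &config.ObjectRetentionPolicies{
		FinishedWorkloadRetention: &metav1.Duration{Duration: 24 * time.Hour},
	}
)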
Known issue

The Workload deletion triggers a Workload delete event to be emitted. This is needed because kueue performs its internal cleanup (e.g. caches and queues) in response to that event. However, it seems that emitting that event also causes the Job related to the Workload to be suspended. In this scenario that is confusing and unnecessary because the Job has already completed. This is how it looks chronologically:

{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:154', 'msg': 'Reconciling Workload'}
{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:169', 'msg': 'Deleting workload because it has finished and the retention period has elapsed'}
{'level': 'debug', 'caller': 'recorder/recorder.go:104', 'msg': 'Deleted workload job-stress-job-1-pqsxf-dd783'}
{'level': 'Level(-2)', 'caller': 'core/workload_controller.go:627', 'msg': 'Workload delete event'}
{'level': 'Level(-2)', 'caller': 'core/clusterqueue_controller.go:330', 'msg': 'Got generic event'}
{'level': 'Level(-2)', 'caller': 'jobframework/reconciler.go:313', 'msg': 'Reconciling Job'}
{'level': 'Level(-2)', 'caller': 'jobframework/reconciler.go:606', 'msg': 'job with no matching workload, suspending'}

The author of this PR commits to providing a fix or improvement for the scenario shown above. It is up to the reviewers to decide whether the fix should be part of this PR or its own follow-up issue/PR, and whether this PR should be blocked until the fix is done.

Does this PR introduce a user-facing change?

Introduce the ObjectRetentionPolicies.FinishedWorkloadRetention configuration field, which allows configuring a retention policy for Workloads. Workloads will be deleted when their retention period elapses. By default this new behavior is disabled.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 24, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mwysokin
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 24, 2024
@k8s-ci-robot
Contributor

Hi @mwysokin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 24, 2024

netlify bot commented Jul 24, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 0b9933f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66b4c01c7ad9d90008a181a2

Contributor

@mimowo mimowo left a comment


Overall lgtm. The one thing I'm not certain about is the finalizer - it might be worth adding an integration test for it.

I think the issue with suspending a Job when deleting a workload is pre-existing, and can be handled separately from this feature (feel free to open a separate one).

apis/config/v1beta1/configuration_types.go
pkg/controller/core/workload_controller.go (outdated)
pkg/controller/core/workload_controller.go (outdated)
pkg/controller/core/workload_controller_test.go (outdated)
pkg/controller/core/workload_controller.go
Obj(),
wantError: nil,
},
"should delete the workload because the retention period has elapsed, object retention configured": {
Contributor

Can we add a test case for a workload with finalizer? I would like to also consider an integration test for it.

Contributor Author

@mwysokin mwysokin Jul 25, 2024

Sure thing 👍 It'd greatly help me to understand the case if you could respond to my comment above first about cleaning up Workloads with a finalizer 🙏

@trasc
Contributor

trasc commented Jul 24, 2024

I think the issue with suspending a Job when deleting a workload is pre-existing, and can be handled separately from this feature (feel free to open a separate one).

In my opinion it's better to handle this in this PR; it should just be a matter of moving

log.V(2).Info("job with no matching workload, suspending")

within

if _, _, finished := job.Finished(); !finished {
	var msg string
	if w == nil {
		msg = "Missing Workload; unable to restore pod templates"
	} else {
		msg = "No matching Workload; restoring pod templates according to existent Workload"
	}
	if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, msg); err != nil {
		return nil, fmt.Errorf("stopping job with no matching workload: %w", err)
	}
}

(I don't think the job is really suspended; the logs are lying :) )
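A sketch of the move trasc describes, built only from the snippet quoted above (not the final diff): with the log call inside the !finished check, the message is only emitted when the job is actually going to be stopped, so an already-finished job no longer logs about being suspended.

if _, _, finished := job.Finished(); !finished {
	// Log only when the job is actually about to be stopped/suspended.
	log.V(2).Info("job with no matching workload, suspending")
	var msg string
	if w == nil {
		msg = "Missing Workload; unable to restore pod templates"
	} else {
		msg = "No matching Workload; restoring pod templates according to existent Workload"
	}
	if err := r.stopJob(ctx, job, w, StopReasonNoMatchingWorkload, msg); err != nil {
		return nil, fmt.Errorf("stopping job with no matching workload: %w", err)
	}
}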

@mwysokin
Contributor Author

mwysokin commented Jul 25, 2024

Thank you @mimowo @tenzen-y @trasc for your comments 🙏
The plan for this PR is the following:

  • create a small KEP for it (KEP 1618: Optional garbage collection of finished Workloads #2742),
  • add logic/a test case for an expired Workload with a finalizer once I understand that case correctly (26d083e),
  • either fix the suspension or remove the confusing message about removing a Job that was completed if it's a small enough change so that it doesn't pollute the PR.

@tenzen-y
Member

Thank you @mimowo @tenzen-y @trasc for your comments 🙏 The plan with this PR is following:

  • create a small KEP for it,
  • add logic/a test case for an expired Workload with a finalizer once I understand that case correctly,
  • either fix the suspension or remove the confusing message about removing a Job that was completed if it's a small enough change so that it doesn't pollute the PR.

It would be better to open a dedicated PR for the KEP :)

@mimowo
Contributor

mimowo commented Jul 25, 2024

Sounds like a great plan.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 7, 2024
@mwysokin mwysokin requested a review from mimowo August 8, 2024 12:55
@mwysokin mwysokin requested a review from tenzen-y August 8, 2024 12:55
@mwysokin
Contributor Author

mwysokin commented Aug 8, 2024

It took me a while, but I applied all the requested updates. Thank you so much for your time spent on the initial review 🙏 May I please ask for a little bit more of your time for a second round? 🙏

@alculquicondor
Contributor

/ok-to-test
/cc

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 14, 2024
@mwysokin
Contributor Author

I'll take a look at the failing test on Friday. I'm pretty sure I ran them last week and they worked fine. Maybe one of the merges from main broke something. I'll investigate.

@mwysokin
Contributor Author

FYI I just wanted to let everyone know that I haven't abandoned the work. I was on vacation last week and I'll be back in September. Sorry for the delay 🙇

Contributor

I'm leaving the API discussion to #2742

@@ -193,4 +193,7 @@ func SetDefaults_Configuration(cfg *Configuration) {
if fs := cfg.FairSharing; fs != nil && fs.Enable && len(fs.PreemptionStrategies) == 0 {
fs.PreemptionStrategies = []PreemptionStrategy{LessThanOrEqualToFinalShare, LessThanInitialShare}
}
if cfg.ObjectRetentionPolicies == nil {
cfg.ObjectRetentionPolicies = &ObjectRetentionPolicies{}
Contributor

If we are not adding anything relevant, let's not add any code to "defaults"

Contributor

add a test

log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
Contributor

Suggested change
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention")

Because the event already targets the workload, we don't need to add its name again in the message.

log.Error(err, "Failed to delete workload from the API server")
return ctrl.Result{}, fmt.Errorf("deleting workflow from the API server: %w", err)
}
r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
Contributor

we are missing a return for this happy path.

r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention: %v", workload.Key(&wl))
} else {
remainingTime := expirationTime.Sub(now)
log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
Contributor

Suggested change
log.V(2).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
log.V(3).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)

Or even 4.

expirationTime := finishedCond.LastTransitionTime.Add(r.objectRetention.Duration)
if now.After(expirationTime) {
log.V(2).Info("Deleting workload because it has finished and the retention period has elapsed", "retention", r.objectRetention.Duration)
if err := r.client.Delete(ctx, &wl); err != nil {
Contributor

we should probably ignore a NotFound error.
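Taken together, the suggestions in this review (drop the name from the event message, return on the happy path, tolerate NotFound, and lower the log level for the requeue message) would make the retention branch look roughly like the sketch below. It is only an illustration assembled from the quoted fragments: identifiers follow the diff, apierrors is assumed to be the usual k8s.io/apimachinery/pkg/api/errors alias, and the RequeueAfter return is an assumption about how the requeue is implemented.

expirationTime := finishedCond.LastTransitionTime.Add(r.objectRetention.Duration)
if now.After(expirationTime) {
	log.V(2).Info("Deleting workload because it has finished and the retention period has elapsed",
		"retention", r.objectRetention.Duration)
	if err := r.client.Delete(ctx, &wl); err != nil {
		if apierrors.IsNotFound(err) {
			// The workload was already deleted by someone else; nothing left to do.
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to delete workload from the API server")
		return ctrl.Result{}, fmt.Errorf("deleting workload from the API server: %w", err)
	}
	r.recorder.Eventf(&wl, corev1.EventTypeNormal, "Deleted", "Deleted finished workload due to elapsed retention")
	// Return here so the happy path does not fall through to the rest of Reconcile.
	return ctrl.Result{}, nil
}
// Not yet expired: revisit the workload once the remaining retention period has passed.
remainingTime := expirationTime.Sub(now)
log.V(3).Info("Requeueing workload for deletion after retention period", "remainingTime", remainingTime)
return ctrl.Result{RequeueAfter: remainingTime}, nil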

Obj(),
wantError: nil,
},
"shouldn't try to delete the workload (no event emitted) because it is already being deleted by kubernetes, object retention configured": {
Contributor

what if it has an owner reference?

ginkgo.By("workload should be deleted after the retention period")
gomega.Eventually(func() error {
return k8sClient.Get(ctx, client.ObjectKeyFromObject(wl), &createdWorkload)
}, 2*time.Second, time.Second).ShouldNot(gomega.Succeed())
Contributor

use the existing timeout constants.
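A sketch of what the reviewer is asking for, assuming the shared integration-test constants are exposed as util.Timeout and util.Interval (the exact names are an assumption; they are not visible in this diff):

ginkgo.By("workload should be deleted after the retention period")
gomega.Eventually(func() error {
	// Same assertion as above, but driven by the shared timeout/interval
	// constants instead of hard-coded durations.
	return k8sClient.Get(ctx, client.ObjectKeyFromObject(wl), &createdWorkload)
}, util.Timeout, util.Interval).ShouldNot(gomega.Succeed())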


func managerSetupWithWorkloadRetentionEnabled(ctx context.Context, mgr manager.Manager) {
objectRetention := &config.ObjectRetentionPolicies{FinishedWorkloadRetention: &v1.Duration{
Duration: 3 * time.Second,
Contributor

a bit too long, given that this influences how long the test runs for.

@mwysokin
Contributor Author

I'm really sorry this has been dragging on. I'm coming back to it starting Monday.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2024
@k8s-ci-robot
Contributor

PR needs rebase.


@k8s-ci-robot
Contributor

@mwysokin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kueue-test-scheduling-perf-main | 0b9933f | link | true | /test pull-kueue-test-scheduling-perf-main
pull-kueue-verify-main | 0b9933f | link | true | /test pull-kueue-verify-main
pull-kueue-test-integration-main | 0b9933f | link | true | /test pull-kueue-test-integration-main
pull-kueue-test-tas-e2e-main | 0b9933f | link | true | /test pull-kueue-test-tas-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

