RFC for Degraded NodePool Status Condition #1910

Open
wants to merge 7 commits into
base: main

Conversation

jigisha620
Contributor

Description

Adding RFC for Degraded NodePool Status Condition.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign bwagner5 for approval. For more information see the Code Review Process.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot
Contributor

Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2025
@jigisha620 force-pushed the degraded-nodepool-rfc branch from 79262b2 to 1bc7741 on January 10, 2025 23:24
@coveralls

coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 12718791708

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.02%) to 81.184%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/disruption/drift.go | 2 | 89.66% |
| pkg/scheduling/requirements.go | 2 | 98.01% |

| Totals | |
| --- | --- |
| Change from base Build 12718181288 | -0.02% |
| Covered Lines | 9082 |
| Relevant Lines | 11187 |

💛 - Coveralls

Member

@jmdeal jmdeal left a comment

Checkpointing


#### Considerations

1. 👎 Heuristics can be wrong and mask failures
Member

Could you elaborate on what type of failures are being masked? As for it being wrong, I'm wondering if we should only consider degraded unknown or true. Maybe we don't ever transition it to false?


One example is that when a network path does not exist due to a misconfigured VPC (network access control lists, subnets, route tables), Karpenter will not be able to provision compute with that NodeClass that joins the cluster until the error is fixed. Crucially, this will continue to charge users for compute that can never be used in a cluster.

To improve visibility of these failure modes, this RFC proposes adding a `Degraded` status condition on the NodePool that indicates to cluster users there may be a problem with a NodePool/NodeClass combination that needs to be investigated and corrected.
Member

As Jason has called out online and offline, I think we should make our motivation front and center here. Why do we think that there is a need for something like this to exist? Does it make tracking down NodePool failures easier? Does alarming get easier with this kind of setup?


Karpenter can launch nodes with a NodePool that will never join a cluster when a NodeClass is misconfigured.

One example is that when a network path does not exist due to a misconfigured VPC (network access control lists, subnets, route tables), Karpenter will not be able to provision compute with that NodeClass that joins the cluster until the error is fixed. Crucially, this will continue to charge users for compute that can never be used in a cluster.
Member

I think listing out tangible, real failure types that this will catch will help illuminate how we design this.

Evaluation conditions -

1. We start with an empty buffer with `Degraded: Unknown`.
2. There must be at least 2 failures in the buffer for `Degraded` to transition to `True`, or basically the threshold would be 80%.
Member

One thing that I still feel like would be better here is if we considered flipping the polarity of this condition type at all -- Degraded: False meaning that it's healthy feels a bit weird to me but I get that we have to come up with some other word besides "Degraded" that isn't "Ready" and probably isn't "Healthy" to really reflect what this condition is evaluating

Unsuccessful Launch: -1

[] = 'Degraded: Unknown'
[-1] = 'Degraded: Unknown'
Member

nit: It's slightly confusing to call this "Degraded: Unknown". The only reason that I say that is because this doesn't necessarily mean that we transition the condition to Unknown when the condition is already set. I know this is said above, but I did find it a tad semantically odd as I was reading through this and trying to parse out the design.
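
For readers following this thread, here is a minimal Go sketch of the distinction raised above, based only on the quoted RFC lines. It assumes the evaluation keeps the condition's current status whenever the buffer holds fewer than 2 failures, so an empty or near-empty buffer reads as `Unknown` only when nothing has been set yet. The names `evaluateDegraded`, `minFailures`, and `launchSuccess` are hypothetical, the value `1` for a successful launch is an assumption (the excerpt only shows `-1`), and neither the 80% threshold nor any transition to `False` is modeled because the excerpt does not specify them.

```go
package main

// Launch outcomes stored in the buffer.
const (
	launchSuccess = 1  // assumed value; the quoted RFC lines only show -1
	launchFailure = -1 // "Unsuccessful Launch: -1" in the quoted RFC lines
)

// minFailures is the minimum number of failures in the buffer before
// Degraded transitions to True (the "2 minimum failures" in the RFC text).
const minFailures = 2

// ConditionStatus mirrors the three-valued status of a status condition.
type ConditionStatus string

const (
	StatusUnknown ConditionStatus = "Unknown"
	StatusTrue    ConditionStatus = "True"
)

// evaluateDegraded returns the next status of the Degraded condition.
// With too few failures it returns the current status unchanged, so a
// sparse buffer never resets a condition that is already set (assumption).
func evaluateDegraded(current ConditionStatus, buffer []int) ConditionStatus {
	failures := 0
	for _, outcome := range buffer {
		if outcome == launchFailure {
			failures++
		}
	}
	if failures >= minFailures {
		return StatusTrue
	}
	return current
}
```

Under these assumptions, `evaluateDegraded(StatusUnknown, []int{})` and `evaluateDegraded(StatusUnknown, []int{-1})` both stay `Unknown`, matching the two buffer examples quoted above, while a second failure moves the condition to `True`.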

@jigisha620 force-pushed the degraded-nodepool-rfc branch from 1bc7741 to 3ffdbcc on January 14, 2025 23:41
@jigisha620 changed the title from "WIP: RFC for Degraded NodePool Status Condition" to "RFC for Degraded NodePool Status Condition" on Jan 16, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2025
Last Transition Time: 2025-01-13T18:57:20Z
Message:
Observed Generation: 1
Reason: Degraded
Contributor

@saurav-agarwalla saurav-agarwalla Jan 16, 2025

One thing that I had discussed with Reed is making Reason a more structured object and putting a serialized string output of that here since I understand that this has to be a string. That way, we can expose more details including error codes mentioning the reason behind the degradation, expose resource IDs/dependents causing it as well as have more than one reason for the degradation. Making it a structured object will also allow us to parse it better for metrics.
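
A rough Go sketch of what such a structured object and its serialized string form could look like. Everything here is hypothetical: the type names `DegradedDetail` and `DegradedCause`, the JSON field names, and the choice of JSON as the encoding are assumptions for illustration, not part of the RFC. Note also that `metav1.Condition` validates `Reason` as a short CamelCase identifier, so a payload like this may fit better in `Message`; which field would carry it is left open.

```go
package main

import "encoding/json"

// DegradedCause is a hypothetical record for one cause of degradation.
type DegradedCause struct {
	Code       string `json:"code"`                 // error code naming the failure
	ResourceID string `json:"resourceID,omitempty"` // resource or dependent involved, if known
}

// DegradedDetail is a hypothetical structured object that can hold more
// than one cause at a time.
type DegradedDetail struct {
	Causes []DegradedCause `json:"causes"`
}

// Serialize renders the detail as a JSON string suitable for a
// string-typed condition field.
func (d DegradedDetail) Serialize() (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// ParseDegradedDetail recovers the structured form from the string,
// for example when deriving metrics from the condition.
func ParseDegradedDetail(s string) (DegradedDetail, error) {
	var d DegradedDetail
	err := json.Unmarshal([]byte(s), &d)
	return d, err
}
```

Keeping the causes in a slice is what allows several reasons to be reported at once, and the round trip through `ParseDegradedDetail` is what would let a metrics pipeline read error codes without string matching.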

Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
7 participants