🐛(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions #7449

Open
wants to merge 1 commit into
base: master

Conversation

thiha-min-thant

@thiha-min-thant thiha-min-thant commented Oct 31, 2024

  • Set initial count to zero for various autoscaler error types (e.g., CloudProviderError, ApiCallError)
  • Define failed scale-up reasons and initialize metrics (e.g., CloudProviderError, APIError)
  • Initialize pod eviction result counters for success and failure cases
  • Initialize skipped scale events for CPU and memory resource limits in both scale-up and scale-down directions

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR initializes failedScaleUpCount and other key metrics at startup, setting their values to zero so that they appear in Prometheus even if no events have occurred yet. With the series pre-defined, monitoring sees an explicit zero instead of a missing metric for scale-up and scale-down events, error types, and pod evictions.
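
For readers unfamiliar with the pattern, here is a minimal sketch of why the zero-initialization matters. It does not use the real metrics library: `CounterVec`, `InitReasons`, the `reason` label name, and the label values are hypothetical stand-ins chosen for illustration. The toy type mimics one property of labeled counters, namely that a series is only exposed after its label value has been touched at least once.

```go
package main

import (
	"fmt"
	"sort"
)

// CounterVec is a toy stand-in for a labeled counter (not the real client).
// A series only exists once its label value has been touched.
type CounterVec struct {
	name   string
	series map[string]float64
}

// NewCounterVec creates an empty counter family with no series.
func NewCounterVec(name string) *CounterVec {
	return &CounterVec{name: name, series: make(map[string]float64)}
}

// Add increments the series for the given reason, creating it if needed.
// Adding 0 creates the series without changing its value.
func (c *CounterVec) Add(reason string, delta float64) {
	c.series[reason] += delta
}

// Expose renders every existing series as a text line, sorted by label value.
func (c *CounterVec) Expose() []string {
	reasons := make([]string, 0, len(c.series))
	for r := range c.series {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	lines := make([]string, 0, len(reasons))
	for _, r := range reasons {
		lines = append(lines, fmt.Sprintf("%s{reason=%q} %g", c.name, r, c.series[r]))
	}
	return lines
}

// InitReasons pre-creates one zero-valued series per known reason (hypothetical helper).
func InitReasons(c *CounterVec, reasons []string) {
	for _, r := range reasons {
		c.Add(r, 0)
	}
}

func main() {
	failed := NewCounterVec("cluster_autoscaler_failed_scale_ups_total")

	// Before initialization there is nothing to scrape.
	fmt.Println("series before init:", len(failed.Expose()))

	// Label values below are illustrative assumptions, not the full real set.
	InitReasons(failed, []string{"cloudProviderError", "apiCallError"})
	for _, line := range failed.Expose() {
		fmt.Println(line)
	}
}
```

The same idea only works for label values that can be enumerated at startup, which is why the runtime-dependent metrics listed under the reviewer notes are left out.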

Which issue(s) this PR fixes:

Fixes #7448

Special notes for your reviewer:

Certain node metrics have not been initialized in this PR because they require runtime information. These metrics are tied to dynamic node states and cannot be set at startup.

Metrics Log Reference:
metrics.txt

Metrics requiring runtime data include:

node_group_min_count
node_group_max_count
node_group_target_count
node_group_healthiness
scaled_up_gpu_nodes_total
failed_gpu_scale_ups_total
scaled_down_gpu_nodes_total
unremovable_nodes_count ?
created_node_groups_total
deleted_node_groups_total
node_taints_count

Does this PR introduce a user-facing change?

NONE


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 31, 2024
@k8s-ci-robot
Contributor

Welcome @thiha-min-thant!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2024
@k8s-ci-robot
Contributor

Hi @thiha-min-thant. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 31, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thiha-min-thant
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@thiha-min-thant
Author

Hi @JoelSpeed @elmiko , could you please review this PR? Thanks!

@thiha-min-thant thiha-min-thant changed the title from "🐛 Initialize failed scale-up metrics with specific reasons" to "🐛 Initialize cluster_autoscaler_failed_scale_ups_total metrics with specific reasons" Oct 31, 2024
cluster-autoscaler/metrics/metrics.go (review thread: outdated, resolved)
cluster-autoscaler/main.go (review thread: outdated, resolved)
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 1, 2024
@elmiko
Contributor

elmiko commented Nov 1, 2024

thanks for picking this up @thiha-min-thant !

…d pod evictions

- Set initial count to zero for various autoscaler error types (e.g., CloudProviderError, ApiCallError)
- Define failed scale-up reasons and initialize metrics (e.g., CloudProviderError, APIError)
- Initialize pod eviction result counters for success and failure cases
- Initialize skipped scale events for CPU and memory resource limits in both scale-up and scale-down directions

Signed-off-by: Thiha Min Thant <[email protected]>
@thiha-min-thant thiha-min-thant changed the title from "🐛 Initialize cluster_autoscaler_failed_scale_ups_total metrics with specific reasons" to "🐛(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions" Nov 2, 2024
@thiha-min-thant
Author

Hi @JoelSpeed and @elmiko,

I've made updates to both the code and the PR description to clarify the initialization of metrics and added a metrics log file as a reference.

Thanks for your feedback, and please take a look at these updates when you have a chance!

@JoelSpeed
Contributor

Do you know if the following also need initialisation?

pending_node_deletions
errors_total
failed_scale_ups_total
scaled_down_nodes_total
skipped_scale_events_count

@thiha-min-thant
Author

> Do you know if the following also need initialisation?
>
> pending_node_deletions
> errors_total
> failed_scale_ups_total
> scaled_down_nodes_total
> skipped_scale_events_count

pending_node_deletions is already initialized, and I have added initialization for the rest.

@JoelSpeed
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2024
Contributor

@elmiko elmiko left a comment

this is great, will make it easier if we need to update in the future.

/lgtm

@thiha-min-thant
Author

Hi @JoelSpeed and @elmiko,

If the code looks good to both of you, could we proceed with the merge? We’re just one /approve label away. Thanks for your review!

@JoelSpeed
Contributor

JoelSpeed commented Nov 6, 2024

/assign @BigDarkClown

@BigDarkClown the assignment bot chose you for this one, could we get this approved?

@elmiko
Contributor

elmiko commented Nov 6, 2024

@thiha-min-thant we need one of the maintainers for this area of the code to approve it. that's the only thing we are waiting on currently.

Labels
area/cluster-autoscaler
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
needs-ok-to-test - Indicates a PR that requires an org member to verify it is safe to test.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cluster_autoscaler_failed_scale_ups_total metric is missing from metrics until an event is registered
5 participants