🐛(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions #7449

thiha-min-thant · 2024-10-31T15:02:53Z

Set initial count to zero for various autoscaler error types (e.g., CloudProviderError, ApiCallError)
Define failed scale-up reasons and initialize metrics (e.g., CloudProviderError, APIError)
Initialize pod eviction result counters for success and failure cases
Initialize skipped scale events for CPU and memory resource limits in both scale-up and scale-down directions

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR initializes the failedScaleUpCount and other key metrics at startup, setting their values to zero so they appear in Prometheus even if no events have occurred. By pre-defining these metrics, we ensure comprehensive monitoring and avoid gaps in visibility, particularly for scale-up and scale-down events, error types, and pod evictions.
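
To illustrate why zero-initialization matters, here is a minimal sketch using only the Go standard library. The `counterVec` type and `initFailedScaleUps` helper are hypothetical stand-ins, not the metrics library used by cluster-autoscaler. The sketch only models the fact that a labeled series is absent from the scrape output until its label combination has been touched at least once.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// counterVec is a toy stand-in for a labeled counter family.
// It is NOT the real metrics API; it only models that a series shows up
// in the exposition output once its label value has been touched.
type counterVec struct {
	name   string
	series map[string]float64 // key: value of the "reason" label
}

func newCounterVec(name string) *counterVec {
	return &counterVec{name: name, series: map[string]float64{}}
}

// Add creates the series for reason if needed and increases it by delta.
// Add(reason, 0) registers the series without changing its value.
func (c *counterVec) Add(reason string, delta float64) {
	c.series[reason] += delta
}

// Expose renders all known series in a Prometheus-like text form,
// sorted by label value so the output is deterministic.
func (c *counterVec) Expose() string {
	reasons := make([]string, 0, len(c.series))
	for r := range c.series {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons)
	var b strings.Builder
	for _, r := range reasons {
		fmt.Fprintf(&b, "%s{reason=%q} %g\n", c.name, r, c.series[r])
	}
	return b.String()
}

// initFailedScaleUps zero-initializes one series per known failure reason.
func initFailedScaleUps(c *counterVec, reasons []string) {
	for _, r := range reasons {
		c.Add(r, 0)
	}
}

func main() {
	c := newCounterVec("failed_scale_ups_total")
	fmt.Printf("before init: %q\n", c.Expose())

	// Reason names are taken from the PR description for illustration only.
	initFailedScaleUps(c, []string{"CloudProviderError", "ApiCallError"})
	fmt.Print(c.Expose())
}
```

Before initialization the exposition is empty, so a query or alert on the metric has nothing to match. After initialization every known reason is present with value 0.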

Which issue(s) this PR fixes:

Fixes #7448

Special notes for your reviewer:

Certain node metrics have not been initialized in this PR because they require runtime information. These metrics are tied to dynamic node states and cannot be set at startup.

Metrics Log Reference:
metrics.txt

Metrics requiring runtime data include:

node_group_min_count
node_group_max_count
node_group_target_count
node_group_healthiness
scaled_up_gpu_nodes_total
failed_gpu_scale_ups_total
scaled_down_gpu_nodes_total
unremovable_nodes_count ?
created_node_groups_total
deleted_node_groups_total
node_taints_count
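
The distinction above can be sketched as follows, assuming the runtime-dependent metrics carry label values (such as node group names) that are only discovered while the autoscaler runs. All type names, function names, and label values here are hypothetical and chosen for illustration. A fixed label set can be enumerated at startup, while a runtime-keyed series has nothing to enumerate until the first observation.

```go
package main

import "fmt"

// seriesKey identifies one series of a hypothetical two-label counter.
type seriesKey struct {
	direction string
	reason    string
}

// staticSeries enumerates every combination of a fixed label set.
// This works at startup because both lists are known in advance.
func staticSeries(directions, reasons []string) []seriesKey {
	out := make([]seriesKey, 0, len(directions)*len(reasons))
	for _, d := range directions {
		for _, r := range reasons {
			out = append(out, seriesKey{direction: d, reason: r})
		}
	}
	return out
}

// nodeGroupGauge is keyed by node group name. The names are assumed to be
// reported at runtime, so there is nothing to pre-populate at startup.
type nodeGroupGauge map[string]int

// set records a value once a node group has been discovered.
func (g nodeGroupGauge) set(nodeGroup string, value int) {
	g[nodeGroup] = value
}

func main() {
	// Label values below are illustrative, not taken from the real code.
	keys := staticSeries(
		[]string{"up", "down"},
		[]string{"CpuResourceLimit", "MemoryResourceLimit"},
	)
	fmt.Println(len(keys), "series can be zero-initialized at startup")

	minCount := nodeGroupGauge{}
	fmt.Println(len(minCount), "node group series known at startup")

	minCount.set("example-node-group", 1) // hypothetical node group name
	fmt.Println(len(minCount), "node group series after discovery")
}
```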

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-10-31T15:03:02Z

Welcome @thiha-min-thant!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2024-10-31T15:03:03Z

Hi @thiha-min-thant. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2024-10-31T15:03:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thiha-min-thant
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

thiha-min-thant · 2024-10-31T15:04:16Z

Hi @JoelSpeed @elmiko, could you please review this PR? Thanks!

cluster-autoscaler/metrics/metrics.go

cluster-autoscaler/main.go

elmiko · 2024-11-01T20:42:32Z

thanks for picking this up @thiha-min-thant!

…d pod evictions

- Set initial count to zero for various autoscaler error types (e.g., CloudProviderError, ApiCallError)
- Define failed scale-up reasons and initialize metrics (e.g., CloudProviderError, APIError)
- Initialize pod eviction result counters for success and failure cases
- Initialize skipped scale events for CPU and memory resource limits in both scale-up and scale-down directions

Signed-off-by: Thiha Min Thant <[email protected]>

thiha-min-thant · 2024-11-02T07:55:43Z

Hi @JoelSpeed and @elmiko,

I've made updates to both the code and the PR description to clarify the initialization of metrics and added a metrics log file as a reference.

Thanks for your feedback, and please take a look at these updates when you have a chance!

JoelSpeed · 2024-11-04T09:41:44Z

Do you know if the following also need initialisation?

pending_node_deletions
errors_total
failed_scale_ups_total
scaled_down_nodes_total
skipped_scale_events_count

thiha-min-thant · 2024-11-04T12:25:07Z

> Do you know if the following also need initialisation?
> pending_node_deletions
> errors_total
> failed_scale_ups_total
> scaled_down_nodes_total
> skipped_scale_events_count

pending_node_deletions is already initialized, and I have added initialization for the rest.

JoelSpeed · 2024-11-04T12:28:59Z

/lgtm

elmiko

this is great, will make it easier if we need to update in the future.

/lgtm

thiha-min-thant · 2024-11-06T05:14:10Z

Hi @JoelSpeed and @elmiko,

If the code looks good to both of you, could we proceed with the merge? We’re just one /approve label away. Thanks for your review!

JoelSpeed · 2024-11-06T10:48:03Z

/assign @BigDarkClown

@BigDarkClown the assignment bot chose you for this one, could we get this approved?

elmiko · 2024-11-06T13:54:13Z

@thiha-min-thant we need one of the maintainers for this area of the code to approve it. that's the only thing we are waiting on currently.

k8s-ci-robot added the kind/bug (categorizes issue or PR as related to a bug) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Oct 31, 2024

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Oct 31, 2024

k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Oct 31, 2024

k8s-ci-robot requested a review from aleksandra-malinowska October 31, 2024 15:03

k8s-ci-robot added the area/cluster-autoscaler label Oct 31, 2024

k8s-ci-robot requested a review from x13n October 31, 2024 15:03

thiha-min-thant changed the title ~~🐛 Initialize failed scale-up metrics with specific reasons~~ 🐛 Initialize cluster_autoscaler_failed_scale_ups_total metrics with specific reasons Oct 31, 2024

JoelSpeed reviewed Oct 31, 2024


cluster-autoscaler/metrics/metrics.go (outdated, resolved)

cluster-autoscaler/main.go (outdated, resolved)

thiha-min-thant force-pushed the failed-scale-ups-metrics branch from 5727d50 to 906dadc Compare November 1, 2024 12:23

k8s-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/XS label Nov 1, 2024

thiha-min-thant force-pushed the failed-scale-ups-metrics branch from 906dadc to ffd57af Compare November 2, 2024 07:44

thiha-min-thant changed the title ~~🐛 Initialize cluster_autoscaler_failed_scale_ups_total metrics with specific reasons~~ 🐛(metrics) Initialize metrics for autoscaler errors, scale events, and pod evictions Nov 2, 2024

thiha-min-thant requested review from elmiko and JoelSpeed November 2, 2024 15:43

k8s-ci-robot assigned JoelSpeed Nov 4, 2024

k8s-ci-robot added the lgtm label ("Looks good to me"; indicates that a PR is ready to be merged) Nov 4, 2024

elmiko reviewed Nov 4, 2024


k8s-ci-robot assigned elmiko Nov 4, 2024

k8s-ci-robot assigned BigDarkClown Nov 6, 2024
