Migrate `node-problem-detector` jobs to `eks-prow-build-cluster` #29751

Vyom-Yadav · 2023-06-10T04:27:28Z

This PR transitions the node-problem-detector jobs from the default cluster to eks-prow-build-cluster

ref: #29722

linux-foundation-easycla · 2023-06-10T04:27:31Z

The committers listed above are authorized under a signed CLA.

✅ login: Vyom-Yadav / name: Vyom Yadav (d3d851c)

Vyom-Yadav · 2023-06-10T04:34:23Z

I will add CPU/memory requests and limits if pull-test-infra-unit-test fails.

marquiz

You need to add resource (cpu and mem) requests and limits to the jobs.

Why not update the periodic jobs, too (in kubernetes/node-problem-detector/node-problem-detector-ci.yaml)?

Vyom-Yadav · 2023-06-13T14:28:05Z

> You need to add resource (cpu and mem) requests and limits to the jobs.

I have contacted the node-problem-detector folks regarding the limits but haven't received a response yet. Any suggestions?

https://kubernetes.slack.com/archives/CJA25LM6D/p1686373854184919?thread_ts=1686373854.184919&cid=CJA25LM6D

> Why not update the periodic jobs, too (in kubernetes/node-problem-detector/node-problem-detector-ci.yaml)?

Sure.

rjsadow · 2023-06-14T15:53:18Z

> I have contacted the node-problem-detector folks regarding the limits but haven't received a response yet. Any suggestions?

Since you haven't heard back from them yet, I would recommend starting with 2 CPU and 4 GB of memory. We can monitor their resource usage through Grafana (https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?from=now-24h&to=now) after the jobs are merged and fine-tune the capacity requirements later.

SergeyKanzhelev · 2023-06-14T17:18:14Z

/assign @mmiranda96

mmiranda96 · 2023-06-14T17:48:14Z

Please specify limits. As @rjsadow mentioned, 2 CPU and 4G memory per job should be a good starting point.

Thanks for your work!

/lgtm

Signed-off-by: Vyom-Yadav <[email protected]>

Vyom-Yadav · 2023-06-15T11:55:50Z

config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml

+        requests:
+          cpu: 2
+          memory: 4Gi


I set the requests equal to the limits. Prow configures the request as:

resources:
  requests:
    cpu: 250m
    memory: 1Gi

Ping me if you want me to change the requests to Prow's values.
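For readers following the thread, the resource settings discussed above could look roughly like the following in a Prow job definition. This is a minimal sketch, not the merged change: the job name `example-npd-job`, the image, and the command are hypothetical placeholders. Only the `cluster` value and the 2 CPU / 4Gi figures come from this conversation.

```yaml
presubmits:
  kubernetes/node-problem-detector:
  - name: example-npd-job              # hypothetical job name
    cluster: eks-prow-build-cluster    # target cluster named in this PR
    spec:
      containers:
      - image: example.com/test-image:latest   # placeholder image, not from the PR
        command:
        - make                         # placeholder command
        args:
        - test
        resources:
          requests:                    # starting point suggested in the thread
            cpu: 2
            memory: 4Gi
          limits:                      # kept equal to the requests
            cpu: 2
            memory: 4Gi
```

Keeping requests equal to limits for a container is what Kubernetes requires for the Guaranteed QoS class, provided every container in the pod is configured that way. The values can be tuned later once actual usage is visible on the monitoring dashboard.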

rjsadow · 2023-06-15T12:25:15Z

/lgtm

Vyom-Yadav · 2023-06-20T07:38:34Z

@marquiz ping

marquiz

/lgtm

ping @mmiranda96 @SergeyKanzhelev

mmiranda96 · 2023-06-23T21:21:31Z

/lgtm

dims · 2023-07-01T12:04:10Z

/approve

k8s-ci-robot · 2023-07-01T12:04:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, marquiz, Vyom-Yadav

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~config/jobs/kubernetes/node-problem-detector/OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2023-07-01T12:32:55Z

@Vyom-Yadav: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

key node-problem-detector-ci.yaml using file config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml
key node-problem-detector-presubmits.yaml using file config/jobs/kubernetes/node-problem-detector/node-problem-detector-presubmits.yaml

In response to this:

This PR transitions the node-problem-detector jobs from the default cluster to eks-prow-build-cluster

ref: #29722

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pacoxu · 2023-07-04T03:48:44Z

config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml

  annotations:
    testgrid-dashboards: sig-node-node-problem-detector
    testgrid-alert-email: [email protected],[email protected]
    testgrid-num-failures-to-alert: '12'
    testgrid-alert-stale-results-hours: '24'
    testgrid-num-columns-recent: '30'
 - name: ci-npd-e2e-kubernetes-gce-gci
+  cluster: eks-prow-build-cluster


https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-npd-e2e-kubernetes-gce-gci-custom-flags?buildId=1675443561614544896

2023/07/01 14:01:41 main.go:328: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found
Traceback (most recent call last):
  File "/workspace/scenarios/kubernetes_e2e.py", line 723, in <module>
    main(parse_args())
  File "/workspace/scenarios/kubernetes_e2e.py", line 569, in main
    mode.start(runner_args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 228, in start
    check_env(env, self.command, *args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 111, in check_env
    subprocess.check_call(cmd, env=env)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('kubetest', '--dump=/logs/artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=e2e-a52e116b34-75293', '--gcp-network=e2e-a52e116b34-75293', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=30', '--test_args=--ginkgo.focus=NodeProblemDetector', '--timeout=60m')' returned non-zero exit status 1.
+ EXIT_VALUE=1
+ set +o xtrace

Some tests failed after this PR was merged.

Vyom-Yadav · 2023-07-04

@pacoxu From the original issue:

> NOTE: if you see any entries under label that says gce skip this job and go to the next time as this is not ready to be moved yet.

The test name contains `gce`, but `gce` is not present in the labels. Could this be the problem? Judging from the logs as well, I think this job belongs on GCP and not on EKS.

pacoxu · 2023-07-04

I opened #29998, though I am not quite sure about that.
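The failure discussed above suggests that a job which provisions a GCE cluster needs GCP resources (a Boskos project and a service account) that may not be available from the EKS build cluster. A hedged sketch of what leaving such a job on the default cluster might look like follows. The job name, interval, image, and the `preset-service-account` label are assumptions for illustration and are not taken from the PR. The `--provider=gce` and `--gcp-service-account` arguments are taken from the log in this thread.

```yaml
periodics:
- name: example-gce-e2e-job            # hypothetical job name
  interval: 2h                         # hypothetical schedule
  # No `cluster:` field here: Prow then schedules the job on its default
  # build cluster rather than on eks-prow-build-cluster.
  labels:
    preset-service-account: "true"     # assumed example of a GCP-related preset label
  spec:
    containers:
    - image: example.com/kubekins-e2e:latest   # placeholder image
      args:
      - --provider=gce                 # from the failing job's log
      - --gcp-service-account=/etc/service-account/service-account.json
```

Checking for GCP-related labels or `--provider=gce` style arguments before adding `cluster: eks-prow-build-cluster` is one way to apply the note quoted from the umbrella issue.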

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 10, 2023

k8s-ci-robot requested review from mmiranda96 and SergeyKanzhelev June 10, 2023 04:27

rjsadow mentioned this pull request Jun 10, 2023

[Umbrella Issue] Migrate prow jobs to community clusters #29722

Closed

54 tasks

marquiz reviewed Jun 13, 2023

k8s-ci-robot assigned mmiranda96 Jun 14, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2023

Migrate node-problem-detector jobs to eks-prow-build-cluster

d3d851c

Signed-off-by: Vyom-Yadav <[email protected]>

Vyom-Yadav force-pushed the migrate-node-problem-detector-presubmits-jobs branch from fa2a9c9 to d3d851c Compare June 15, 2023 11:53

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 15, 2023

Vyom-Yadav commented Jun 15, 2023

k8s-ci-robot assigned rjsadow Jun 15, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023

Vyom-Yadav requested a review from marquiz June 16, 2023 08:53

marquiz approved these changes Jun 22, 2023

k8s-ci-robot assigned marquiz Jun 22, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2023

k8s-ci-robot merged commit 94ba495 into kubernetes:master Jul 1, 2023

pacoxu reviewed Jul 4, 2023

pacoxu mentioned this pull request Jul 4, 2023

Failure cluster [c24e5353...] ci-npd failed after migrating to eks #29998

Closed

rjsadow mentioned this pull request Jul 4, 2023

fix: update k/npd community jobs #30000

Merged
