Migrate node-problem-detector jobs to eks-prow-build-cluster #29751

Vyom-Yadav
Member

This PR transitions the node-problem-detector jobs from the default cluster to eks-prow-build-cluster.

ref: #29722

@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 10, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Vyom-Yadav / name: Vyom Yadav (d3d851c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 10, 2023
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 10, 2023
@Vyom-Yadav
Member Author

I will add CPU and memory requests and limits if pull-test-infra-unit-test fails.

@Vyom-Yadav

This comment was marked as resolved.

Contributor

@marquiz marquiz left a comment

You need to add resource (CPU and memory) requests and limits to the jobs.

Why not update the periodic jobs too (in kubernetes/node-problem-detector/node-problem-detector-ci.yaml)?

@Vyom-Yadav
Member Author

Vyom-Yadav commented Jun 13, 2023

> You need to add resource (CPU and memory) requests and limits to the jobs.

I have contacted the node-problem-detector folks regarding the limits but haven't received a response yet. Any suggestions?

https://kubernetes.slack.com/archives/CJA25LM6D/p1686373854184919?thread_ts=1686373854.184919&cid=CJA25LM6D

> Why not update the periodic jobs too (in kubernetes/node-problem-detector/node-problem-detector-ci.yaml)?

Sure.

@rjsadow
Contributor

rjsadow commented Jun 14, 2023

> I have contacted the node-problem-detector folks regarding the limits but haven't received a response yet. Any suggestions?

Since you've not heard anything back from them yet, I would recommend starting with 2 CPUs and 4 GB of memory. We can monitor the jobs' resource usage through Grafana (https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?from=now-24h&to=now) after they're merged and fine-tune the capacity requirements later.

@SergeyKanzhelev
Member

/assign @mmiranda96

@mmiranda96
Contributor

Please specify limits. As @rjsadow mentioned, 2 CPUs and 4G of memory per job should be a good starting point.

Thanks for your work!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2023
@Vyom-Yadav Vyom-Yadav force-pushed the migrate-node-problem-detector-presubmits-jobs branch from fa2a9c9 to d3d851c Compare June 15, 2023 11:53
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 15, 2023
Comment on lines +64 to +66
  requests:
    cpu: 2
    memory: 4Gi
Member Author

I kept the requests and limits the same. Prow configures the request as:

resources:
  requests:
    cpu: 250m
    memory: 1Gi

Ping me if you want me to change the requests to Prow's values.
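
The requirement discussed in this thread, that every job moved to eks-prow-build-cluster declares CPU and memory requests and limits, can be sketched as a small check. This is a hypothetical helper for illustration only: it assumes a job definition has already been loaded into a plain dict, the function name `missing_resources` is made up, and it is not the actual pull-test-infra-unit-test.

```python
# Hypothetical sketch, not the real test-infra unit test.
# Assumes each job is a dict shaped like an entry in a Prow job config file.
REQUIRED_KEYS = ("cpu", "memory")


def missing_resources(job, cluster="eks-prow-build-cluster"):
    """Return a list of problems for one job; empty if nothing is missing."""
    if job.get("cluster") != cluster:
        return []  # only jobs scheduled on the target cluster are checked
    problems = []
    containers = job.get("spec", {}).get("containers", [])
    for index, container in enumerate(containers):
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for key in REQUIRED_KEYS:
                if key not in values:
                    problems.append(
                        f"{job.get('name')}: container {index} lacks {section}.{key}"
                    )
    return problems


# Example: requests equal to limits, as chosen in this PR.
example_job = {
    "name": "ci-npd-e2e-kubernetes-gce-gci",
    "cluster": "eks-prow-build-cluster",
    "spec": {
        "containers": [
            {
                "resources": {
                    "requests": {"cpu": 2, "memory": "4Gi"},
                    "limits": {"cpu": 2, "memory": "4Gi"},
                }
            }
        ]
    },
}
print(missing_resources(example_job))
```

A job that only carries the `250m` / `1Gi` requests shown above and no limits would be reported with two problems, one for each missing limit.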

@rjsadow
Contributor

rjsadow commented Jun 15, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@Vyom-Yadav Vyom-Yadav requested a review from marquiz June 16, 2023 08:53
@Vyom-Yadav
Member Author

@marquiz ping

Contributor

@marquiz marquiz left a comment

@mmiranda96
Contributor

/lgtm

@dims
Member

dims commented Jul 1, 2023

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, marquiz, Vyom-Yadav

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2023
@k8s-ci-robot k8s-ci-robot merged commit 94ba495 into kubernetes:master Jul 1, 2023
@k8s-ci-robot
Contributor

@Vyom-Yadav: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key node-problem-detector-ci.yaml using file config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml
  • key node-problem-detector-presubmits.yaml using file config/jobs/kubernetes/node-problem-detector/node-problem-detector-presubmits.yaml

In response to this:

This PR transitions the node-problem-detector jobs from the default cluster to eks-prow-build-cluster

ref: #29722

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

  annotations:
    testgrid-dashboards: sig-node-node-problem-detector
    testgrid-alert-email: [email protected],[email protected]
    testgrid-num-failures-to-alert: '12'
    testgrid-alert-stale-results-hours: '24'
    testgrid-num-columns-recent: '30'
- name: ci-npd-e2e-kubernetes-gce-gci
  cluster: eks-prow-build-cluster
Member

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-npd-e2e-kubernetes-gce-gci-custom-flags?buildId=1675443561614544896

2023/07/01 14:01:41 main.go:328: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found
Traceback (most recent call last):
  File "/workspace/scenarios/kubernetes_e2e.py", line 723, in <module>
    main(parse_args())
  File "/workspace/scenarios/kubernetes_e2e.py", line 569, in main
    mode.start(runner_args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 228, in start
    check_env(env, self.command, *args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 111, in check_env
    subprocess.check_call(cmd, env=env)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('kubetest', '--dump=/logs/artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=e2e-a52e116b34-75293', '--gcp-network=e2e-a52e116b34-75293', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=30', '--test_args=--ginkgo.focus=NodeProblemDetector', '--timeout=60m')' returned non-zero exit status 1.
+ EXIT_VALUE=1
+ set +o xtrace

Some tests failed after this PR was merged.

Member Author

@Vyom-Yadav Vyom-Yadav Jul 4, 2023

@pacoxu From the original issue:

NOTE: if you see any entries under label that says gce skip this job and go to the next time as this is not ready to be moved yet.

The test name contains gce, but gce is not present in the labels. Could this be the problem? Judging from the logs too, I think this job belongs on GCP and not EKS.
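
The point raised here, that a job can depend on GCE without any label saying so, could be checked with a sketch like the one below. It is a hypothetical helper, not part of test-infra: the function name `gce_hints` and the dict layout are assumptions, the label in the example is illustrative, and the only flags it looks for are ones visible in the kubetest command in the log above (`--provider=gce` and the `--gcp-*` family).

```python
# Hypothetical sketch: flag jobs that look tied to GCE/GCP so they can be
# reviewed by hand before being moved to another build cluster.
def gce_hints(job):
    """Collect reasons a Prow job dict appears to depend on GCE/GCP."""
    hints = []
    if "gce" in job.get("name", ""):
        hints.append("job name contains 'gce'")
    for label in job.get("labels") or {}:
        if "gce" in label:
            hints.append(f"label {label}")
    for container in job.get("spec", {}).get("containers", []):
        for arg in container.get("args") or []:
            # Flags taken from the kubetest invocation shown in the failure log.
            if arg == "--provider=gce" or arg.startswith("--gcp-"):
                hints.append(f"argument {arg}")
    return hints


example_job = {
    "name": "ci-npd-e2e-kubernetes-gce-gci",
    "labels": {"preset-service-account": "true"},  # illustrative label, no 'gce' in it
    "spec": {
        "containers": [
            {"args": ["--provider=gce", "--gcp-zone=us-west1-b", "--timeout=60m"]}
        ]
    },
}
for hint in gce_hints(example_job):
    print(hint)
```

Checking arguments as well as labels matters in this case because the label check alone would return nothing for this job.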

Member

I opened #29998. I am not quite sure about that.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.