Migrate node-problem-detector
jobs to eks-prow-build-cluster
#29751
CPU/Memory requests/limits would be added if |
This comment was marked as resolved.
You need to add resource (cpu and mem) requests and limits to the jobs.
Why not update the periodic jobs, too (in kubernetes/node-problem-detector/node-problem-detector-ci.yaml)?
I have contacted the node-problem-detector folks regarding the limits but haven't received a response yet. Any suggestions?
Sure. |
Since you've not heard anything back from them yet, I would recommend starting with 2 CPUs and 4 GB of memory. We can monitor their resource usage through Grafana (https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?from=now-24h&to=now) after they're merged and fine-tune the capacity requirements later.
/assign @mmiranda96 |
Please specify limits. As @rjsadow mentioned, 2 CPU and 4G memory per job should be a good starting point. Thanks for your work! /lgtm |
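The thread writes the memory figure as "4GB", "4G", and (in the diff) "4Gi". In Kubernetes quantity notation these are not the same size: `G` is a decimal suffix and `Gi` a binary one. The following is a minimal sketch of the arithmetic, assuming only the suffixes that appear in this conversation; `memory_bytes` and `cpu_cores` are hypothetical helper names, not part of Prow or Kubernetes.

```python
# Sketch only: covers the suffixes used in this thread, not the full
# Kubernetes quantity grammar.
MEMORY_SUFFIXES = {
    "Gi": 2**30,  # binary gibibyte
    "Mi": 2**20,  # binary mebibyte
    "G": 10**9,   # decimal gigabyte
    "M": 10**6,   # decimal megabyte
}


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '4Gi' or '4G' to bytes."""
    # Try the two-letter binary suffixes before the one-letter decimal ones.
    for suffix in ("Gi", "Mi", "G", "M"):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * MEMORY_SUFFIXES[suffix]
    return int(quantity)  # no suffix: plain byte count


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):  # millicores
        return int(quantity[:-1]) / 1000
    return float(quantity)


print(memory_bytes("4Gi") - memory_bytes("4G"))  # difference in bytes
print(cpu_cores("250m"), cpu_cores("2"))
```

The gap between `4Gi` and `4G` is roughly 7%, which is unlikely to matter for a starting point that will be tuned later, but it is worth using one notation consistently in the job config.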
Signed-off-by: Vyom-Yadav <[email protected]>
fa2a9c9 to d3d851c
requests:
  cpu: 2
  memory: 4Gi
I kept the requests and limits the same. Prow configures the request as:
resources:
  requests:
    cpu: 250m
    memory: 1Gi
Ping me if you want me to change the request to Prow's request.
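Whether a job sets both requests and limits, and whether they match as described in the comment above, can be checked mechanically. The following is a rough sketch that works on plain dictionaries shaped like a container spec; `resources_match` is a hypothetical helper, and the two sample containers are assumptions built from the values quoted in this thread (the first with limits equal to the 2 CPU / 4Gi requests, the second with only the 250m / 1Gi request).

```python
def resources_match(container: dict) -> bool:
    """Return True if cpu and memory are set in both requests and limits
    and the two sections carry the same values."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    required = {"cpu", "memory"}
    # Both sections must define cpu and memory.
    if not required.issubset(requests) or not required.issubset(limits):
        return False
    # Compare as strings so that 2 and "2" are treated alike.
    return all(str(requests[key]) == str(limits[key]) for key in required)


# Assumed shape of the container after this PR (requests == limits).
migrated = {
    "resources": {
        "requests": {"cpu": 2, "memory": "4Gi"},
        "limits": {"cpu": 2, "memory": "4Gi"},
    }
}

# Assumed shape of a container that only carries the request quoted above.
request_only = {
    "resources": {"requests": {"cpu": "250m", "memory": "1Gi"}}
}

print(resources_match(migrated))
print(resources_match(request_only))
```

Setting requests equal to limits for every container is what gives a pod the Guaranteed QoS class in Kubernetes, which is one reason to keep them identical rather than falling back to a smaller request.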
/lgtm |
@marquiz ping |
/lgtm
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dims, marquiz, Vyom-Yadav
@Vyom-Yadav: Updated the
  annotations:
    testgrid-dashboards: sig-node-node-problem-detector
    testgrid-alert-email: [email protected],[email protected]
    testgrid-num-failures-to-alert: '12'
    testgrid-alert-stale-results-hours: '24'
    testgrid-num-columns-recent: '30'
- name: ci-npd-e2e-kubernetes-gce-gci
  cluster: eks-prow-build-cluster
2023/07/01 14:01:41 main.go:328: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found
Traceback (most recent call last):
  File "/workspace/scenarios/kubernetes_e2e.py", line 723, in <module>
    main(parse_args())
  File "/workspace/scenarios/kubernetes_e2e.py", line 569, in main
    mode.start(runner_args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 228, in start
    check_env(env, self.command, *args)
  File "/workspace/scenarios/kubernetes_e2e.py", line 111, in check_env
    subprocess.check_call(cmd, env=env)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('kubetest', '--dump=/logs/artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=e2e-a52e116b34-75293', '--gcp-network=e2e-a52e116b34-75293', '--extract=ci/latest', '--gcp-node-image=gci', '--gcp-zone=us-west1-b', '--ginkgo-parallel=30', '--test_args=--ginkgo.focus=NodeProblemDetector', '--timeout=60m')' returned non-zero exit status 1.
+ EXIT_VALUE=1
+ set +o xtrace
Some tests failed after this PR was merged.
@pacoxu From the original issue:
NOTE: if you see any entries under label that says gce skip this job and go to the next time as this is not ready to be moved yet.
The test name contains gce, but it's not present in the labels. Could this be the problem? Judging from the logs too, I think this job belongs on GCP and not EKS.
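The situation described above, where the job name mentions gce but the labels do not, suggests that a labels-only check can miss GCE-dependent jobs. As a rough illustration, the sketch below looks for "gce" in the name, the labels, and the container args. `gce_hints` is a hypothetical helper, the sample job is an assumption (the name and cluster come from the diff, the single arg is taken from the kubetest command in the log, and the labels are left empty), and a simple substring match is only a heuristic.

```python
def gce_hints(job: dict) -> list:
    """Return the parts of a job definition that mention 'gce'."""
    hints = []
    if "gce" in job.get("name", ""):
        hints.append("name")
    labels = job.get("labels", {})
    if any("gce" in key or "gce" in str(value) for key, value in labels.items()):
        hints.append("labels")
    for container in job.get("spec", {}).get("containers", []):
        if any("gce" in str(arg) for arg in container.get("args", [])):
            hints.append("args")
            break  # one match is enough to flag the args
    return hints


# Assumed, simplified job definition for illustration only.
job = {
    "name": "ci-npd-e2e-kubernetes-gce-gci",
    "cluster": "eks-prow-build-cluster",
    "labels": {},
    "spec": {"containers": [{"args": ["--provider=gce"]}]},
}

print(gce_hints(job))  # ['name', 'args']
```

A non-empty result would be a reason to look more closely before moving the job, even when the labels check from the original issue passes.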
I opened #29998. I am not quite sure about that.
This PR transitions the node-problem-detector jobs from the default cluster to eks-prow-build-cluster.
ref: #29722