Support automatic discovery of MIG devices #992

Open
DrAuYueng opened this issue Oct 15, 2024 · 1 comment

Comments

DrAuYueng commented Oct 15, 2024

While using k8s-device-plugin in our Kubernetes cluster, we found that in MIG mode:

  1. The device plugin instance corresponding to a newly created GPU instance (GI) is not started
  2. The status of a newly created compute instance (CI) is not displayed on the node

When we delete the Pod corresponding to k8s-device-plugin and trigger a rebuild, the resources are displayed normally.
It seems that the newly created MIG resources are not automatically discovered.

klueska (Contributor) commented Oct 15, 2024

That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
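
For a cluster without the GPU operator, the restart can be scripted. The following is a minimal sketch, not an official procedure: it assumes the plugin runs as a DaemonSet named `nvidia-device-plugin-daemonset` in the `kube-system` namespace (the names used by the static deployment manifest; a Helm install will likely differ), and the helper names are hypothetical.

```python
import subprocess


def restart_device_plugin_cmd(namespace="kube-system",
                              daemonset="nvidia-device-plugin-daemonset"):
    """Build the kubectl argv that restarts the device plugin DaemonSet.

    The default namespace and DaemonSet name are assumptions; adjust them
    to match how the plugin is deployed in your cluster.
    """
    return ["kubectl", "-n", namespace, "rollout", "restart",
            f"daemonset/{daemonset}"]


def restart_device_plugin(dry_run=True, **kwargs):
    """Return the command; run it only when dry_run is False."""
    cmd = restart_device_plugin_cmd(**kwargs)
    if not dry_run:
        # Raises CalledProcessError if kubectl exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(restart_device_plugin(dry_run=True)))
```

Running this after each MIG reconfiguration has the same effect as deleting the plugin Pod by hand: the recreated Pod enumerates the current MIG devices at startup.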

If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.

Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
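
As a rough illustration of that label-driven flow, the sketch below builds the two kubectl commands involved: one sets the `nvidia.com/mig.config` label that the mig-manager watches, the other shows the `nvidia.com/mig.config.state` label it reports back. The node name and the `all-1g.5gb` profile are placeholders; valid profile names depend on the GPU model and the mig-manager configuration, and the helper names are hypothetical.

```python
import subprocess


def mig_config_label_cmd(node, profile):
    """Build the kubectl argv that requests a MIG layout via a node label."""
    return ["kubectl", "label", "node", node,
            f"nvidia.com/mig.config={profile}", "--overwrite"]


def mig_config_state_cmd(node):
    """Build the kubectl argv that shows the reconfiguration state label."""
    return ["kubectl", "get", "node", node,
            "-L", "nvidia.com/mig.config.state"]


def run(cmd):
    """Run a kubectl command and return its stdout (requires cluster access)."""
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


if __name__ == "__main__":
    # Placeholder values: substitute a real node name and a profile that
    # exists in your mig-manager configuration.
    node, profile = "gpu-node-1", "all-1g.5gb"
    print(" ".join(mig_config_label_cmd(node, profile)))
    print(" ".join(mig_config_state_cmd(node)))
```

With this approach no manual restart of the device plugin is needed, since the mig-manager handles it as part of applying the new configuration.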
