Add AMD Support #173
@wookayin
Can you add some mocking tests for ROCM devices?
I'm not super familiar with mockito, but I've started looking into this.
LGTM.
For the testing part, we can mock a ROCML-based NVML library call such as NVMLGetFanSpeed so that it returns constant values.
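For illustration only, a minimal sketch of that kind of test. It uses the standard library's unittest.mock rather than mockito, and it assumes the wrapper exposes the pynvml-style name nvmlDeviceGetFanSpeed. The names read_fan_speed and fake_nvml are hypothetical and do not come from this PR.

```python
from unittest import mock


def read_fan_speed(nvml, handle):
    """Hypothetical caller: asks an NVML-like module for the fan speed."""
    return nvml.nvmlDeviceGetFanSpeed(handle)


# Stand-in for the rocmi-backed wrapper module; no GPU or driver is needed.
fake_nvml = mock.MagicMock(name="fake_nvml")
fake_nvml.nvmlDeviceGetFanSpeed.return_value = 42  # constant value for the test

handle = object()  # opaque placeholder for a device handle
result = read_fan_speed(fake_nvml, handle)

# The mock records the call, so the test can also check how it was invoked.
fake_nvml.nvmlDeviceGetFanSpeed.assert_called_once_with(handle)
```

The same pattern should extend to other queries (temperature, memory, utilization) by setting a return_value on each mocked function.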
@@ -612,6 +618,8 @@ def _wrapped(*args, **kwargs):
        gpu_stat = InvalidGPU(index, "((Unknown Error))", e)
    except N.NVMLError_GpuIsLost as e:
        gpu_stat = InvalidGPU(index, "((GPU is lost))", e)
    except Exception as e:
Should we raise the N.NVMLError_Unknown error for consistency?
PS: we can catch NVMLError instead of the base Exception; otherwise some native Python errors may be silently ignored.
        super().__init__(self.message)


class NVMLError_Unknown(Exception):
Should these NVMLError_xxx classes inherit from NVMLError?
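A minimal sketch of why the inheritance matters, tying in with the earlier suggestion to catch NVMLError rather than Exception. The classes below are stand-ins that only borrow names from this review, and query is a hypothetical helper, not code from the PR.

```python
class NVMLError(Exception):
    """Stand-in base class, mirroring the name pynvml uses."""


class NVMLError_Unknown(NVMLError):
    """Inherits from NVMLError, so `except NVMLError` also catches it."""


class NVMLError_GpuIsLost(NVMLError):
    """Another stand-in subclass with its own, more specific handler."""


def query(error):
    """Hypothetical per-GPU query that fails with the given exception."""
    try:
        raise error
    except NVMLError_GpuIsLost:
        return "((GPU is lost))"
    except NVMLError:
        # Any other NVML-style failure, including NVMLError_Unknown.
        return "((Unknown Error))"


unknown = query(NVMLError_Unknown())
lost = query(NVMLError_GpuIsLost())

# A plain Python bug is not swallowed; it propagates to the caller.
try:
    query(TypeError("bad argument"))
    propagated = False
except TypeError:
    propagated = True
```

With a shared base class, the handler can stay narrow without listing every NVMLError_xxx variant, and unrelated errors such as TypeError still surface.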
except (ImportError, SyntaxError, RuntimeError) as e:
    _rocmi = sys.modules.get("rocmi", None)

    raise ImportError(
Should we make this a dedicated NVMLError subclass?
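One possible shape for such a subclass, as a sketch under assumptions. NVMLError is a stand-in base class, NVMLError_LibraryNotFound is a hypothetical name chosen to follow the NVMLError_xxx pattern, and load_backend is an illustrative helper rather than the PR's actual import logic.

```python
import importlib


class NVMLError(Exception):
    """Stand-in base class (assumption; mirrors the name pynvml uses)."""


class NVMLError_LibraryNotFound(NVMLError):
    """Hypothetical dedicated error for a missing backend library."""


def load_backend(module_name):
    """Import a backend module, converting import failures to an NVMLError."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original exception so the root cause stays visible.
        raise NVMLError_LibraryNotFound(
            f"backend {module_name!r} is not installed"
        ) from e


# "no_such_gpu_backend" is a deliberately nonexistent module name.
try:
    load_backend("no_such_gpu_backend")
    message = None
except NVMLError as e:
    message = str(e)
```

Callers that already handle NVMLError would then see a missing library through the same path as other NVML failures, while the chained ImportError remains available via `__cause__`.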
Fixes #137
Design
To do this, I duplicate the pynvml interface already used by gpustat in a wrapper around rocmi, and dynamically import the correct library based on what hardware is present.
Current Status
The base functionality is currently working:
Remaining Tasks