Add AMD Support #173
Open
bethune-bryant wants to merge 20 commits into wookayin:master from bethune-bryant:brnelson/add_amd_spport
20 commits
65ba474 Begin adding AMD support. (bethune-bryant)
261faf7 Add pyrsmi depedency. (bethune-bryant)
ca650ba Add simple hardware switch functionalty. (bethune-bryant)
5b229f8 Move default exception to end (bethune-bryant)
3c1a744 Typo (bethune-bryant)
8ba8134 Default to nvidia. (bethune-bryant)
9f07c49 Typo... (bethune-bryant)
85d0dbf Hide output from rocml. (bethune-bryant)
2c9aadf add frequency. (bethune-bryant)
cc2d0f0 Switching to amdsmi (bethune-bryant)
c2ea30e Fix index lookup. (bethune-bryant)
3e0c2b1 Remove frequency stuff for now. (bethune-bryant)
173d144 Check for amdsmi. (bethune-bryant)
800bd0d Get driver version (bethune-bryant)
bf1a00a Format new file. (bethune-bryant)
6b731eb Typo. (bethune-bryant)
1a09222 Switch to rocmi. (bethune-bryant)
dfce699 Cleanup unneeded code. (bethune-bryant)
f1abc19 Add driver version. (bethune-bryant)
9a2e2af Fix power divisor. (bethune-bryant)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,181 @@ | ||
"""Imports rocmi and wraps it in a pynvml compatible interface.""" | ||
|
||
import sys | ||
import textwrap | ||
import warnings | ||
|
||
from collections import namedtuple | ||
|
||
try: | ||
# Check for rocmi. | ||
import rocmi | ||
except (ImportError, SyntaxError, RuntimeError) as e: | ||
_rocmi = sys.modules.get("rocmi", None) | ||
|
||
raise ImportError( | ||
Review comment: Should we make this a dedicated NVMLError subclass? |
||
textwrap.dedent( | ||
"""\ | ||
rocmi is missing or an outdated version is installed. | ||
|
||
The root cause: """ | ||
+ str(e) | ||
+ """ | ||
|
||
Your rocmi installation: """ | ||
+ repr(_rocmi) | ||
+ """ | ||
|
||
----------------------------------------------------------- | ||
(Suggested Fix) Please install rocmi using pip. | ||
""" | ||
) | ||
) from e | ||
|
||
NVML_TEMPERATURE_GPU = 1 | ||
|
||
|
||
class NVMLError(Exception): | ||
def __init__(self, message="ROCM Error"): | ||
self.message = message | ||
super().__init__(self.message) | ||
|
||
|
||
class NVMLError_Unknown(Exception): | ||
Review comment: Should these |
||
def __init__(self, message="An unknown ROCM Error has occurred"): | ||
self.message = message | ||
super().__init__(self.message) | ||
|
||
|
||
class NVMLError_GpuIsLost(Exception): | ||
def __init__(self, message="ROCM Device is lost."): | ||
self.message = message | ||
super().__init__(self.message) | ||
|
||
|
||
def nvmlDeviceGetCount(): | ||
return len(rocmi.get_devices()) | ||
|
||
|
||
def nvmlDeviceGetHandleByIndex(dev): | ||
return rocmi.get_devices()[dev] | ||
|
||
|
||
def nvmlDeviceGetIndex(handle): | ||
for i, d in enumerate(rocmi.get_devices()): | ||
if d.bus_id == handle.bus_id: | ||
return i | ||
|
||
return -1 | ||
|
||
|
||
def nvmlDeviceGetName(handle): | ||
return handle.name | ||
|
||
|
||
def nvmlDeviceGetUUID(handle): | ||
return handle.unique_id | ||
|
||
|
||
def nvmlDeviceGetTemperature(handle, loc=NVML_TEMPERATURE_GPU): | ||
metrics = handle.get_metrics() | ||
return metrics.temperature_hotspot | ||
|
||
|
||
def nvmlSystemGetDriverVersion(): | ||
retval = rocmi.get_driver_version() | ||
if retval is None: | ||
return "" | ||
return retval | ||
|
||
|
||
def check_driver_nvml_version(driver_version_str: str): | ||
"""Show warnings when an incompatible driver is used.""" | ||
|
||
def safeint(v) -> int: | ||
try: | ||
return int(v) | ||
except (ValueError, TypeError): | ||
return 0 | ||
|
||
driver_version = tuple(safeint(v) for v in driver_version_str.strip().split(".")) | ||
|
||
if len(driver_version) == 0 or driver_version <= (0,): | ||
return | ||
if driver_version < (6, 7, 8): | ||
warnings.warn(f"This version of ROCM Driver {driver_version_str} is untested, ") | ||
|
||
|
||
def nvmlDeviceGetFanSpeed(handle): | ||
try: | ||
speed = handle.get_metrics().current_fan_speed | ||
except AttributeError: | ||
return None | ||
|
||
return speed | ||
|
||
|
||
MemoryInfo = namedtuple("MemoryInfo", ["total", "used"]) | ||
|
||
|
||
def nvmlDeviceGetMemoryInfo(handle): | ||
|
||
return MemoryInfo( | ||
total=handle.vram_total, | ||
used=handle.vram_used, | ||
) | ||
|
||
|
||
UtilizationRates = namedtuple("UtilizationRates", ["gpu"]) | ||
|
||
|
||
def nvmlDeviceGetUtilizationRates(handle): | ||
metrics = handle.get_metrics() | ||
return UtilizationRates(gpu=metrics.average_gfx_activity) | ||
|
||
|
||
def nvmlDeviceGetEncoderUtilization(dev): | ||
return None | ||
|
||
|
||
def nvmlDeviceGetDecoderUtilization(dev): | ||
return None | ||
|
||
|
||
def nvmlDeviceGetPowerUsage(handle): | ||
return handle.current_power / 1000 | ||
|
||
|
||
def nvmlDeviceGetEnforcedPowerLimit(handle): | ||
return handle.power_limit / 1000 | ||
|
||
|
||
ComputeProcess = namedtuple("ComputeProcess", ["pid", "usedGpuMemory"]) | ||
|
||
|
||
def nvmlDeviceGetComputeRunningProcesses(handle): | ||
results = handle.get_processes() | ||
return [ComputeProcess(pid=x.pid, usedGpuMemory=x.vram_usage) for x in results] | ||
|
||
|
||
def nvmlDeviceGetGraphicsRunningProcesses(dev): | ||
return None | ||
|
||
|
||
def nvmlDeviceGetClockInfo(handle): | ||
metrics = handle.get_metrics() | ||
|
||
try: | ||
clk = metrics.current_gfxclks[0] | ||
except AttributeError: | ||
clk = metrics.current_gfxclk | ||
|
||
return clk | ||
|
||
|
||
def nvmlDeviceGetMaxClockInfo(handle): | ||
return handle.get_clock_info()[-1] | ||
|
||
|
||
# rocmi does not require initialization | ||
def ensure_initialized(): | ||
pass |
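
The version check in `check_driver_nvml_version` above relies on Python comparing tuples element by element. Below is a minimal standalone sketch of that parsing step, lifted from the diff. `parse_version` is a hypothetical helper name that does not appear in the PR, and the sample version strings are made up for illustration.

```python
def safeint(v) -> int:
    """Convert one version component to int; anything non-numeric becomes 0."""
    try:
        return int(v)
    except (ValueError, TypeError):
        return 0


def parse_version(version_str: str) -> tuple:
    """Split a dotted version string into a tuple of ints (hypothetical helper)."""
    return tuple(safeint(v) for v in version_str.strip().split("."))


# Tuples compare element by element, so (6, 3, 6) sorts before (6, 7, 8).
print(parse_version("6.3.6") < (6, 7, 8))

# Components are compared as numbers, so 10 is greater than 7 here.
print(parse_version("6.10.5") < (6, 7, 8))

# An empty string parses to (0,), which the `<= (0,)` guard in the diff skips.
print(parse_version(""))
```

Because `"".split(".")` yields one empty component, an empty driver version never produces an empty tuple. In this sketch it is the `driver_version <= (0,)` half of the guard that handles that case.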
Review comment: Should we raise the N.NVMLError_Unknown error for consistency?
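
This page does not show which line the comment is attached to, so the following is only one possible reading. It is a sketch of an index lookup that raises `NVMLError_Unknown` instead of returning a sentinel such as `-1`. `FakeDevice` and `device_index` are hypothetical names used for illustration, and making `NVMLError_Unknown` a subclass of `NVMLError` is an assumption, not something the diff does.

```python
from collections import namedtuple


class NVMLError(Exception):
    """Base error, mirroring the class of the same name in the diff."""


class NVMLError_Unknown(NVMLError):
    """Assumption: derives from NVMLError so callers can catch either one."""

    def __init__(self, message="An unknown ROCM Error has occurred"):
        super().__init__(message)


# Hypothetical stand-in for a rocmi device handle; only bus_id is modeled.
FakeDevice = namedtuple("FakeDevice", ["bus_id"])


def device_index(devices, handle):
    """Return the position of handle in devices; raise if it is not found."""
    for i, d in enumerate(devices):
        if d.bus_id == handle.bus_id:
            return i
    raise NVMLError_Unknown(f"No device with bus id {handle.bus_id!r}")


devices = [FakeDevice("0000:03:00.0"), FakeDevice("0000:83:00.0")]
print(device_index(devices, devices[1]))

try:
    device_index(devices, FakeDevice("0000:ff:00.0"))
except NVMLError as err:
    print(f"lookup failed: {err}")
```

Raising keeps a failed lookup from being mistaken for a valid index, since `-1` would otherwise select the last element of a list.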
Review comment: P.S. We can catch NVMLError instead of the base Exception; catching the base class may silently swallow native Python errors.
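
A minimal sketch of the difference the reviewer is pointing at: an `except NVMLError` clause tolerates driver-level failures while letting ordinary programming errors surface. The helper `query` and the two sample callables are hypothetical. The sketch also assumes the specific error classes derive from `NVMLError`; in the diff above, `NVMLError_Unknown` and `NVMLError_GpuIsLost` derive from `Exception` directly, so a narrowed `except NVMLError` would not catch them as written.

```python
class NVMLError(Exception):
    """Stand-in for the wrapper's base error class."""


class NVMLError_GpuIsLost(NVMLError):
    """Assumption: derives from NVMLError, unlike the class in the diff."""


def query(fn):
    """Run one metric query; tolerate driver errors, let real bugs propagate."""
    try:
        return fn()
    except NVMLError:
        return None  # treat the metric as unavailable on this device


def lost_gpu():
    raise NVMLError_GpuIsLost("ROCM Device is lost.")


def buggy():
    return None + 1  # a plain programming error; raises TypeError


print(query(lost_gpu))

try:
    query(buggy)
except TypeError as err:
    print(f"bug surfaced: {err}")
```

With `except Exception` in place of `except NVMLError`, the `TypeError` from `buggy` would be turned into `None` and the bug would go unnoticed.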