-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: forbid repeated deepspeed.initialize on training objects #6874
base: master
Are you sure you want to change the base?
Fix: forbid repeated deepspeed.initialize on training objects #6874
Conversation
@microsoft-github-policy-service agree |
This fix still has interference with existing unit tests. Let me double check before we proceed. |
…peedEngine propagates flag from the internal model
dc81325
to
d1e7777
Compare
The unit tests in I am not able to check other unit tests due to GPU memory constraint. |
deepspeed/__init__.py
Outdated
if _is_initialized(model): | ||
raise ValueError( | ||
"Model has already been initialized, please make sure to only call deepspeed.initialize on a model once.") | ||
if optimizer is not None and _is_initialized(optimizer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that optimizer could be a Callable, not an object
DeepSpeed/deepspeed/__init__.py
Line 71 in 4cd1d97
optimizer: Optional[Union[Optimizer, DeepSpeedOptimizerCallable]] = None, |
deepspeed/__init__.py
Outdated
raise ValueError( | ||
"Optimizer has already been initialized, please make sure to only call deepspeed.initialize on an optimizer once." | ||
) | ||
if lr_scheduler is not None and _is_initialized(lr_scheduler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for lr_scheduler
DeepSpeed/deepspeed/__init__.py
Line 74 in 4cd1d97
lr_scheduler: Optional[Union[_LRScheduler, DeepSpeedSchedulerCallable]] = None, |
deepspeed/__init__.py
Outdated
@@ -137,6 +181,10 @@ def initialize(args=None, | |||
zero.partition_parameters.shutdown_init_context() | |||
|
|||
assert model is not None, "deepspeed.initialize requires a model" | |||
# enforce that model, optimizer, and lr_scheduler have not been used in a previous deepspeed.initialize call | |||
_assert_trainobjs_not_inited(model, optimizer, lr_scheduler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this call should be moved into `_mark_trainobjs_initialized()
Thanks for the review @tjruwase.
Regarding,
I think If we still want to keep
|
Thanks for raising this important design issue. Below are my thoughts and/or clarifications.
Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.
What do you think? |
Got it. Yeah 3 is technically what should be implemented instead of letting flags just flying around. Will get the update EOD. |
7f6fc1f
to
2c5806b
Compare
@tjruwase Thanks for the suggestion. Please check the updated PR. |
@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do? |
@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures. |
@loadams Thanks for getting back so quickly and I apologize for the late reply. I inspected the 9 tests that failed. It appears to me that these tests failed because they initialized deepspeed using the same model multiple times, which is the exact behavior that this PR is trying to forbid. The PR itself is not buggy. For example:
As far as I see there are two solutions:
|
Previously closed PR:
#6848
Partially Fixes: #6772 #6771 #6770 by forbidding repeated initialization.
What are changed:
deepspeed.initialize
with the flagds_is_inited = True
.deepspeed.initialize
with the flagds_is_inited = True
.deepspeed.initialize
, raise an exception if detectedds_is_inited == True
in the inputmodel
,optimizer
orlr_scheduler
Expected Behavior:
Forbid repeated
deepspeed.initialize
invocations onmodel
,optimizer
andlr_scheduler
objects.