Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: forbid repeated deepspeed.initialize on training objects #6874

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

traincheck-team
Copy link

@traincheck-team traincheck-team commented Dec 16, 2024

Previously closed PR:
#6848

Partially Fixes: #6772 #6771 #6770 by forbidding repeated initialization.

What are changed:

  1. Marking 'model', 'optimizer' and 'lr_scheduler' in the arguments of deepspeed.initialize with the flag ds_is_inited = True.
  2. Marking 'engine', 'engine.optimizer' and 'engine.lr_scheduler' in the return values of deepspeed.initialize with the flag ds_is_inited = True.
  3. When calling deepspeed.initialize, raise an exception if detected ds_is_inited == True in the input model, optimizer or lr_scheduler

Expected Behavior:
Forbid repeated deepspeed.initialize invocations on model, optimizer and lr_scheduler objects.

@traincheck-team
Copy link
Author

@microsoft-github-policy-service agree

@traincheck-team
Copy link
Author

This fix still has interference with existing unit tests. Let me double check before we proceed.

…peedEngine propagates flag from the internal model
@traincheck-team traincheck-team force-pushed the fix-6848-forbid-repeated-init branch from dc81325 to d1e7777 Compare December 16, 2024 21:02
@traincheck-team
Copy link
Author

traincheck-team commented Dec 16, 2024

The unit tests in tests/unit/runtime/test_ds_initialize.py all passed.
The PR is ready for review @tjruwase.

I am not able to check other unit tests due to GPU memory constraint.

deepspeed/__init__.py Outdated Show resolved Hide resolved
deepspeed/__init__.py Outdated Show resolved Hide resolved
if _is_initialized(model):
raise ValueError(
"Model has already been initialized, please make sure to only call deepspeed.initialize on a model once.")
if optimizer is not None and _is_initialized(optimizer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that optimizer could be a Callable, not an object

optimizer: Optional[Union[Optimizer, DeepSpeedOptimizerCallable]] = None,

raise ValueError(
"Optimizer has already been initialized, please make sure to only call deepspeed.initialize on an optimizer once."
)
if lr_scheduler is not None and _is_initialized(lr_scheduler):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for lr_scheduler

lr_scheduler: Optional[Union[_LRScheduler, DeepSpeedSchedulerCallable]] = None,

@@ -137,6 +181,10 @@ def initialize(args=None,
zero.partition_parameters.shutdown_init_context()

assert model is not None, "deepspeed.initialize requires a model"
# enforce that model, optimizer, and lr_scheduler have not been used in a previous deepspeed.initialize call
_assert_trainobjs_not_inited(model, optimizer, lr_scheduler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call should be moved into `_mark_trainobjs_initialized()

@traincheck-team
Copy link
Author

traincheck-team commented Dec 19, 2024

Thanks for the review @tjruwase.

  • I added handling for callable types for optimizer and lr_scheduler. The handling is to only mark is_ds_inited for objects instead of callables, as the callables are not stateful and reuse should be allowed.

Regarding,

I think this call should be moved into _mark_trainobjs_initialized()

I think _assert_trainobjs_not_inited should still be separated from _mark_trainobjs_initialized as _mark_trainobjs_initialized is also called on the wrapped model and optimizers before exiting from deepspeed.initialize. The wrapped models may already have is_ds_inited being True since in DeepSpeedEngine all model flags will be passed through on the wrapper.

If we still want to keep _assert_trainobjs_not_inited inside _mark_trainobjs_initialized, we can do either of the three:

  1. add a flag to _mark_trainobjs_initialized to indicate whether the *not_inited` check should be performed
  2. add/check a flag using __dict__ rather than setattr/getattr
  3. check whether inited use type information as well, i.e. for types of DeepSpeedEngine we directly return inited == True instead of checking for flags.

@tjruwase
Copy link
Contributor

If we still want to keep _assert_trainobjs_not_inited inside _mark_trainobjs_initialized, we can do either of the three:

Thanks for raising this important design issue. Below are my thoughts and/or clarifications.

  1. DeepSpeed is designed to optimize PyTorch, and currently does this by wrapping PyTorch constructs, e.g., nn.Module, optim.Optimizer, optim.lr_scheduler, etc.
  2. To avoid nesting issues, DeepSpeed should not wrap DeepSpeed data structures, such as DeepSpeedEngine, DeepSpeedOptimizer, DeepSpeedDataLoader, etc.

Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.

3. check whether inited use type information as well, i.e. for types of DeepSpeedEngine we directly return inited == True instead of checking for flags.

What do you think?

@traincheck-team
Copy link
Author

Got it. Yeah 3 is technically what should be implemented instead of letting flags just flying around. Will get the update EOD.

@traincheck-team traincheck-team force-pushed the fix-6848-forbid-repeated-init branch from 7f6fc1f to 2c5806b Compare December 30, 2024 02:05
@traincheck-team
Copy link
Author

@tjruwase Thanks for the suggestion. Please check the updated PR.

@traincheck-team
Copy link
Author

@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do?

@loadams
Copy link
Contributor

loadams commented Jan 13, 2025

@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do?

@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures.

@traincheck-team
Copy link
Author

traincheck-team commented Jan 16, 2025

@loadams Thanks for getting back so quickly and I apologize for the late reply.

I inspected the 9 tests that failed. It appears to me that these tests failed because they initialized deepspeed using the same model multiple times, which is the exact behavior that this PR is trying to forbid. The PR itself is not buggy.

For example:

  1. in tests/unit/checkpoint/test_other_optimizer.TestOtherOptimizerCheckpoint::test_checkpoint_unfused_optimizer at
    checkpoint_correctness_verification(config_dict,
    models=models,
    hidden_dim=hidden_dim,
    tmpdir=tmpdir,
    load_optimizer_states=True)
    # Ignore optimizer states
    checkpoint_correctness_verification(config_dict,
    models=models,
    . Two checkpoint_correctness_verification were invoked with the same models argument.
  2. in tests/unit/runtime/half_precision/onebit/test_onebit.TestZeroOneAdamCheckpointing::test at
    model_1, optimizer_1, _, _ = deepspeed.initialize(
    config=config_dict,
    model=model,
    model_parameters=optimizer_grouped_parameters_1,
    )
    data_loader = random_dataloader(
    model=model_1,
    total_samples=10,
    hidden_dim=hidden_dim,
    device=model_1.device,
    )
    for n, batch in enumerate(data_loader):
    loss = model_1(batch[0], batch[1])
    model_1.backward(loss)
    model_1.step()
    # Test whether momentum mask still exist after saving checkpoint
    mask1 = mask1.to(device=optimizer_1.param_groups[0]["exp_avg_mask"].device)
    assert torch.allclose(optimizer_1.param_groups[0]["exp_avg_mask"], mask1,
    atol=1e-07), f"Incorrect momentum mask"
    save_folder = os.path.join(tmpdir, "saved_checkpoint")
    model_1.save_checkpoint(save_folder, tag=None)
    assert torch.allclose(optimizer_1.param_groups[0]["exp_avg_mask"], mask1,
    atol=1e-07), f"Momentum mask should not change after saving checkpoint"
    model_2, optimizer_2, _, _ = deepspeed.initialize(
    config=config_dict,
    model=model,
    model_parameters=optimizer_grouped_parameters_2,
    )
    . Two deepspeed.initialize were invoked with exactly the same model parameter.

As far as I see there are two solutions:

  1. Modify the unit tests to use separate model copies. But I don't think you guys may want to risk doing this, but I feel this is the "right" thing to do.
  2. **Optionally suppress the exceptions by env or config vars". This might be an easier and more practical solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs
3 participants