
[bnb] resume with more replicas test #198

Draft
stas00 wants to merge 4 commits into main from bnb-resume-2x

Conversation

@stas00 (Contributor) commented Nov 19, 2021

A new test to reproduce the issue with BNB when switching from 1 replica to 2 (i.e. the DP degree changes while the PP and TP degrees stay the same).

The original error is here:

Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/pretrain_gpt.py", line 268, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/training.py", line 135, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/training.py", line 397, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/checkpointing.py", line 272, in load_checkpoint
    loaded_dir, state_dict = model[0].load_checkpoint(load_dir)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/engine.py", line 2037, in load_checkpoint
    success = self._load_zero_checkpoint(
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/engine.py", line 2149, in _load_zero_checkpoint
    self.optimizer.load_state_dict(
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 2110, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 2059, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (256) must match the size of tensor b (128) at non-singleton dimension 0

That was when moving from 128 GPUs to 256 GPUs, but the test, which only moves from 1 GPU to 2 GPUs, hits the same error.
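
For context on where the mismatched shapes come from, here is a simplified sketch (hypothetical numbers, not DeepSpeed's actual implementation): ZeRO stage 2 gives each data-parallel rank a slice of the flattened optimizer state, so the per-rank slice length depends on the DP world size, and a checkpoint written under one DP degree no longer lines up when restored under another.

# Simplified sketch (hypothetical, not DeepSpeed internals): ZeRO-2 keeps one
# slice of the flattened optimizer state per data-parallel rank, so the slice
# length is a function of the DP world size.
import torch

def partition_numel(total_numel: int, dp_world_size: int) -> int:
    # each rank owns roughly 1/dp_world_size of the flat buffer (padding ignored)
    return (total_numel + dp_world_size - 1) // dp_world_size

total_numel = 256  # pretend the flattened exp_avg buffer has 256 elements

saved = torch.zeros(partition_numel(total_numel, 1))    # checkpoint written with DP=1
current = torch.zeros(partition_numel(total_numel, 2))  # live state after resuming with DP=2

# mirrors the copy_ in _restore_base_optimizer_state and fails the same way:
# RuntimeError: The size of tensor a (...) must match the size of tensor b (...)
current.copy_(saved)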

@TimDettmers, to run the test:

# skip if you already have apex built - it can be very slow to build
git clone https://github.com/NVIDIA/apex
cd apex
pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
cd ..

git clone https://github.com/bigscience-workshop/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt
gh pr checkout 198 # this PR

Now run the test:

pytest tests/test_training.py::MegDSTestTraining::test_training_bnb_resume_more_replicas -sv

Unfortunately Megatron wants to recompile its kernels every time the number of GPUs changes, so this test takes about 10 minutes to finish, since building the kernels takes forever.

Despite using just 1 and 2 GPUs here, I get the same error as when moving from 128 to 256 GPUs, so the error is not tied to the absolute number of GPUs; it correlates only with the multiplier, which is 2x in both cases.
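
A quick sanity check of that claim (again a rough sketch, assuming the per-rank ZeRO partition is about total/DP with padding ignored): the mismatch is set entirely by the change in DP degree, so 1->2 GPUs and 128->256 GPUs produce the same 2:1 disagreement.

# Hypothetical arithmetic: per-rank ZeRO-2 partition length ~ total / dp_degree,
# so only the DP multiplier matters, not the absolute GPU count.
total = 1 << 20  # pretend size of the flat optimizer buffer

for old_dp, new_dp in [(1, 2), (128, 256)]:
    old_part = total // old_dp
    new_part = total // new_dp
    print(f"DP {old_dp}->{new_dp}: saved {old_part} vs current {new_part} "
          f"({old_part // new_part}x mismatch)")  # prints a 2x mismatch for both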

The CI is failing on this test.

Comment on lines +320 to +329
launcher = get_launcher(1)
cmd = launcher + cmd_no_launcher
# keep for quick debug
# print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] +cmd)); die
execute_subprocess_async(cmd, env=self.get_env())

# 2. DP=2 TP=1 PP=1 num_gpus=2 resume from the checkpoint of DP=1
launcher = get_launcher(2)
cmd = launcher + cmd_no_launcher
execute_subprocess_async(cmd, env=self.get_env())
@stas00 (Contributor, Author)

Tim, the test is just this part.

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Aug 16, 2023
