
[bnb] resume with more replicas test #198

Draft
stas00 wants to merge 4 commits into main from bnb-resume-2x

Conversation

@stas00 (Contributor) commented Nov 19, 2021

A new test to reproduce the issue with BNB when switching from 1 replica to 2 (i.e. the DP degree changes while the PP and TP degrees stay the same).

The original error is here:

Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/pretrain_gpt.py", line 268, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/training.py", line 135, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/training.py", line 397, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-bnb/megatron/checkpointing.py", line 272, in load_checkpoint
    loaded_dir, state_dict = model[0].load_checkpoint(load_dir)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/engine.py", line 2037, in load_checkpoint
    success = self._load_zero_checkpoint(
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/engine.py", line 2149, in _load_zero_checkpoint
    self.optimizer.load_state_dict(
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 2110, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 2059, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (256) must match the size of tensor b (128) at non-singleton dimension 0

That was when moving from 128 GPUs to 256 GPUs, but the test, which only moves from 1 GPU to 2 GPUs, hits the same error.
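
For context on where the mismatched shapes come from, here is a simplified sketch (hypothetical numbers, not DeepSpeed's actual implementation): ZeRO stage 2 gives each data-parallel rank a slice of the flattened optimizer state, so the per-rank slice length depends on the DP world size, and a checkpoint written under one DP degree no longer lines up when restored under another.

# Simplified sketch (hypothetical, not DeepSpeed internals): ZeRO-2 keeps one
# slice of the flattened optimizer state per data-parallel rank, so the slice
# length is a function of the DP world size.
import torch

def partition_numel(total_numel: int, dp_world_size: int) -> int:
    # each rank owns roughly 1/dp_world_size of the flat buffer (padding ignored)
    return (total_numel + dp_world_size - 1) // dp_world_size

total_numel = 256  # pretend the flattened exp_avg buffer has 256 elements

saved = torch.zeros(partition_numel(total_numel, 1))    # checkpoint written with DP=1
current = torch.zeros(partition_numel(total_numel, 2))  # live state after resuming with DP=2

# mirrors the copy_ in _restore_base_optimizer_state and fails the same way:
# RuntimeError: The size of tensor a (...) must match the size of tensor b (...)
current.copy_(saved)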

@TimDettmers, to run the test:

# skip if you already have apex built - it can be very slow to build
git clone https://github.com/NVIDIA/apex
cd apex
pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
cd ..

git clone https://github.com/bigscience-workshop/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt
gh pr checkout 198 # this PR

Now run the test:

pytest tests/test_training.py::MegDSTestTraining::test_training_bnb_resume_more_replicas -sv

Unfortunately Megatron wants to recompile its kernels every time the number of GPUs changes, so this test takes about 10 minutes to finish, since building the kernels takes forever.

Despite using just 1 and 2 GPUs here, I get the same error as when moving from 128 to 256 GPUs, so the error is not tied to the absolute number of GPUs; it correlates only with the multiplier, which is 2x in both cases.
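
A quick sanity check of that claim (again a rough sketch, assuming the per-rank ZeRO partition is about total/DP with padding ignored): the mismatch is set entirely by the change in DP degree, so 1->2 GPUs and 128->256 GPUs produce the same 2:1 disagreement.

# Hypothetical arithmetic: per-rank ZeRO-2 partition length ~ total / dp_degree,
# so only the DP multiplier matters, not the absolute GPU count.
total = 1 << 20  # pretend size of the flat optimizer buffer

for old_dp, new_dp in [(1, 2), (128, 256)]:
    old_part = total // old_dp
    new_part = total // new_dp
    print(f"DP {old_dp}->{new_dp}: saved {old_part} vs current {new_part} "
          f"({old_part // new_part}x mismatch)")  # prints a 2x mismatch for both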

The CI is failing on this test.

Comment on lines +320 to +329
launcher = get_launcher(1)
cmd = launcher + cmd_no_launcher
# keep for quick debug
# print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] +cmd)); die
execute_subprocess_async(cmd, env=self.get_env())

# 2. DP=2 TP=1 PP=1 num_gpus=2 resume from the checkpoint of DP=1
launcher = get_launcher(2)
cmd = launcher + cmd_no_launcher
execute_subprocess_async(cmd, env=self.get_env())
@stas00 (Contributor, Author)

Tim, the test is just this part.

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Aug 16, 2023
