megatron: Update megatron-lm to core_r0.11.0 #392
Changes from all commits: afb9a79, 93f6a7e, 65e7ace, 0ef4e03, fcbf9b7, 4f85cfe, 601e090, 0afb3fe, 9e1f122, 102e924, dac9cb3, 7a45970
New file (Dockerfile):

@@ -0,0 +1,9 @@
+FROM verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+
+RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+
+RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
+
+# only config pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
+# unset for now
+RUN cd /opt/nvidia/Megatron-LM && pip3 install --no-deps -e .
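Since the image pins Megatron-LM with a `--single-branch` clone, one quick sanity check after building is to confirm which branch is checked out under `/opt/nvidia/Megatron-LM` inside the container. The snippet below is a minimal sketch, not part of the PR; it assumes the clone path from the Dockerfile above and that `core_r0.11.0` is checked out as a branch rather than a detached tag.

```python
# Minimal sketch (not part of this PR): confirm the Megatron-LM checkout pinned
# by the Dockerfile resolves to the expected release branch.
import subprocess


def pinned_megatron_branch(repo_path: str = "/opt/nvidia/Megatron-LM") -> str:
    """Return the branch name currently checked out in the cloned Megatron-LM repo."""
    out = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    branch = pinned_megatron_branch()
    assert branch == "core_r0.11.0", f"unexpected Megatron-LM branch: {branch}"
    print(f"Megatron-LM branch: {branch}")
```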
Python checkpoint utility (imports updated to the Megatron-LM core_r0.11.0 module layout):

@@ -12,17 +12,21 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import megatron
-from megatron.core import mpu
-from megatron.utils import print_rank_0, unwrap_model
-from megatron.model import Float16Module
-from megatron.model import DistributedDataParallel as LocalDDP
+import importlib
+from packaging.version import Version
 from torch.nn.parallel import DistributedDataParallel as torchDDP
 import torch
 import time
 from typing import Optional
 import torch.distributed as dist

+import megatron
+from megatron import get_args
+from megatron.core import mpu
+from megatron.core.transformer.module import Float16Module
+from megatron.core.distributed import DistributedDataParallel as LocalDDP
+
+from megatron.training.utils import print_rank_0, unwrap_model
Collaborator:
We copied several module and util functions from the Megatron-LM package into the …

Author:
sure, fixing this
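The collaborator's note above refers to vendoring small Megatron helper functions inside verl rather than importing them from release-specific paths. As an illustration only (not the code added in this PR), vendored copies of `print_rank_0` and `unwrap_model` usually look roughly like this:

```python
# Illustrative sketch only: vendored Megatron-style helpers. The actual copies
# in verl may differ in module location and exact signatures.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as torchDDP


def print_rank_0(message):
    """Print a message once, from global rank 0 only."""
    if dist.is_available() and dist.is_initialized():
        if dist.get_rank() == 0:
            print(message, flush=True)
    else:
        print(message, flush=True)


def unwrap_model(model, module_instances=(torchDDP,)):
    """Peel wrapper layers (e.g. DDP, Float16Module) off a model or a list of models."""
    return_list = True
    if not isinstance(model, list):
        model = [model]
        return_list = False
    unwrapped = []
    for m in model:
        while isinstance(m, module_instances):
            m = m.module
        unwrapped.append(m)
    return unwrapped if return_list else unwrapped[0]
```

Keeping such helpers in-tree avoids chasing import paths (`megatron.utils` vs `megatron.training.utils`) that move between Megatron-LM releases, which is exactly the churn visible in the import diff above.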
 def _megatron_calc_global_rank(tp_rank: int = 0, dp_rank: int = 0, pp_rank: int = 0):

@@ -77,7 +81,7 @@ def merge_megatron_ckpt_llama(wrapped_models, config, is_value_model=False, dtyp
     """Merge sharded parameters of a Megatron module into a merged checkpoint.

     Args:
-        wrapped_models (list of megatron.model.DistributedDataParallel):
+        wrapped_models (list of megatron.core.distributed.DistributedDataParallel):
             The local DDP wrapped megatron modules.
         dtype (str or None):
             The data type of state_dict. if None, the data type of the original parameters
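For readers unfamiliar with the helper shown above, `_megatron_calc_global_rank` maps parallel-grid coordinates to a global rank. The sketch below is a hypothetical illustration, not the PR's implementation: it assumes a plain TP/DP/PP grid in which tensor-parallel ranks vary fastest and pipeline-parallel ranks vary slowest, and it ignores context and expert parallelism.

```python
# Hypothetical sketch, not the function from this PR: flatten (tp, dp, pp)
# coordinates into a global rank on a [pp][dp][tp] process grid.
def calc_global_rank_sketch(tp_rank: int, dp_rank: int, pp_rank: int,
                            tp_size: int, dp_size: int, pp_size: int) -> int:
    assert 0 <= tp_rank < tp_size
    assert 0 <= dp_rank < dp_size
    assert 0 <= pp_rank < pp_size
    # Tensor-parallel ranks are adjacent, data-parallel groups stride by tp_size,
    # and pipeline stages stride by tp_size * dp_size.
    return (pp_rank * dp_size + dp_rank) * tp_size + tp_rank


# Example: tp=2, dp=2, pp=2 (world size 8); stage 1, dp group 1, tp rank 0 -> rank 6.
assert calc_global_rank_sketch(0, 1, 1, 2, 2, 2) == 6
```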
This file was deleted.

Review comment: could you keep the v0.4 patch file for now, in case others want to run v0.4 for comparison. Thanks!

Review comment: we can remove the v0.4 patch after the next stable release of verl.

Reply: OK, I will add that back.