
Mcore path FP8 training w/ blockwise quantization token_mult_prob_error is NaN when TP>2 #1164

@guyueh1

Description

Describe the bug

In the following setup, token_mult_prob_error becomes NaN starting from the 2nd step, and every generated sequence runs to the max length, indicating a bug in the refit:

  • FP8 training in the mcore path
  • fp8_param=True
  • TP ≥ 2 (the repro below uses TP=2)
  • blockwise quantization recipe

It works when TP=1, when fp8_param=False, or when the per-tensor quantization recipe is used.
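For quick triage, the combinations above reduce to the matrix below. This is a sketch: only `policy.megatron_cfg.tensor_model_parallel_size` is a key taken from the repro script; the other knobs are named loosely after the bullet list and their exact keys live in the YAML config.

```shell
# Hypothetical override matrix (actual recipe/param key names are in
# examples/configs/grpo_math_8B_megatron_fp8.yaml and may differ):
#
#   fp8_param=True  + blockwise recipe   + TP=2 -> NaN from step 2 (this bug)
#   fp8_param=True  + blockwise recipe   + TP=1 -> OK
#   fp8_param=False + blockwise recipe   + TP=2 -> OK
#   fp8_param=True  + per-tensor recipe  + TP=2 -> OK
uv run python examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B_megatron_fp8.yaml \
  policy.megatron_cfg.tensor_model_parallel_size=2
```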

Steps/Code to reproduce bug

export EXP_SUFFIX="grpo_math_8B_megatron_fp8_dev"
# Set up paths and names based on experiment suffix
export CHECKPOINT_DIR="results/${EXP_SUFFIX}"
export WANDB_NAME=${EXP_SUFFIX}

export RAY_DEDUP_LOGS=0
export BASE_LOG_DIR="logs/${EXP_SUFFIX}"

export CONTAINER=<container tag>

export MOUNTS="${PWD}:/opt/nemo-rl"

export NUM_ACTOR_NODES=${NUM_NODES:-1}

export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1

export COMMAND="\
uv run python examples/run_grpo_math.py \
--config examples/configs/grpo_math_8B_megatron_fp8.yaml \
policy.megatron_cfg.tensor_model_parallel_size=2 \
logger.wandb_enabled=true \
logger.wandb.project="nemo-rl-grpo-dev-guyueh" \
logger.wandb.name="${WANDB_NAME}" \
checkpointing.enabled=true \
checkpointing.checkpoint_dir="${CHECKPOINT_DIR}" \
cluster.num_nodes=${NUM_ACTOR_NODES}"

INTERACTIVE=${INTERACTIVE:-0}
if [ "$INTERACTIVE" -eq 1 ]; then
    export COMMAND=
fi

export PARTITION=batch

sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=coreai_dlalgo_nemorl \
    --job-name=coreai_dlalgo_nemorl-grpo.${EXP_SUFFIX} \
    --partition=${PARTITION} \
    --gres=gpu:8 \
    --time=04:00:00 \
    ray.sub

Expected behavior

token_mult_prob_error stays below 1.05 and the mean generated sequence length stays below 3000 for at least the first 5 steps.
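As an illustration of this pass/fail criterion, a small check over per-step metrics might look like the following. The metric key names here are hypothetical; adapt them to whatever the logger actually emits.

```python
import math

def check_early_steps(metrics, max_err=1.05, max_seqlen=3000, n_steps=5):
    """Return the first offending step number, or None if all early steps pass.

    `metrics` is a list of per-step dicts with keys 'token_mult_prob_error'
    and 'mean_gen_seqlen' (hypothetical names; adapt to the real log output).
    NaN fails the error check, matching the symptom described in this issue.
    """
    for step, m in enumerate(metrics[:n_steps], start=1):
        err = m["token_mult_prob_error"]
        if math.isnan(err) or err >= max_err:
            return step
        if m["mean_gen_seqlen"] >= max_seqlen:
            return step
    return None
```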

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
  • Method of install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

Labels

bug (Something isn't working)
