Describe the bug
In the following setup, token_mult_prob_error becomes NaN from the 2nd step onward and the generated sequences always reach the max length, indicating a bug in the refit:
- fp8 training in mcore path
- fp8_param=True
- TP>1 (TP=2 in the repro below)
- blockwise quantization recipe
It works when TP=1, when fp8_param=False, or when the per-tensor quantization recipe is used.
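For reference, a minimal sketch of the single-override variants that isolate the failing combination. The tensor_model_parallel_size key is the one used in the repro below; the fp8_param and recipe key names are my assumptions about the YAML layout and should be checked against examples/configs/grpo_math_8B_megatron_fp8.yaml:
# Each override, applied on its own to the failing setup, avoids the NaN.
# TP=1 (this key is confirmed by the repro script below):
uv run python examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B_megatron_fp8.yaml \
  policy.megatron_cfg.tensor_model_parallel_size=1
# Hypothetical key names for the other two workarounds (verify in the YAML):
#   policy.megatron_cfg.fp8_cfg.fp8_param=false        # keep params out of fp8 storage
#   policy.megatron_cfg.fp8_cfg.fp8_recipe=tensorwise  # per-tensor quantization recipe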
Steps/Code to reproduce bug
export EXP_SUFFIX="grpo_math_8B_megatron_fp8_dev"
# Set up paths and names based on experiment suffix
export CHECKPOINT_DIR="results/${EXP_SUFFIX}"
export WANDB_NAME=${EXP_SUFFIX}
export RAY_DEDUP_LOGS=0
export BASE_LOG_DIR="logs/${EXP_SUFFIX}"
export CONTAINER=<container tag>
export MOUNTS="${PWD}:/opt/nemo-rl"
export NUM_ACTOR_NODES=${NUM_NODES:-1}
# TransformerEngine: use FP32 scales for FP8 block scaling.
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export COMMAND="\
uv run python examples/run_grpo_math.py \
--config examples/configs/grpo_math_8B_megatron_fp8.yaml \
policy.megatron_cfg.tensor_model_parallel_size=2 \
logger.wandb_enabled=true \
logger.wandb.project="nemo-rl-grpo-dev-guyueh" \
logger.wandb.name="${WANDB_NAME}" \
checkpointing.enabled=true \
checkpointing.checkpoint_dir="${CHECKPOINT_DIR}" \
cluster.num_nodes=${NUM_ACTOR_NODES}"
INTERACTIVE=${INTERACTIVE:-0}
if [ "$INTERACTIVE" -eq 1 ]; then
  # Interactive run: submit the allocation without a command and attach manually.
  export COMMAND=
fi
export PARTITION=batch
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=coreai_dlalgo_nemorl \
--job-name=coreai_dlalgo_nemorl-grpo.${EXP_SUFFIX} \
--partition=${PARTITION} \
--gres=gpu:8 \
--time=04:00:00 \
ray.sub
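To submit, save the block above as a script (e.g. repro.sh, a name chosen here for illustration); both environment knobs it reads are optional:
# Batch run on 1 node (the default):
bash repro.sh
# Multi-node:
NUM_NODES=2 bash repro.sh
# Interactive: allocate the nodes without running the command, then attach manually:
INTERACTIVE=1 bash repro.sh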
Expected behavior
token_mult_prob_error stays below 1.05 and the mean generated sequence length stays below 3000 for at least the first 5 steps.
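As a rough pass/fail check, assuming the per-step metrics end up in plain-text logs under BASE_LOG_DIR (the exact log path and format are assumptions):
# Scan for the failure signature; any hit on a NaN value means the bug reproduced:
grep -n 'token_mult_prob_error' "${BASE_LOG_DIR}"/*.log | grep -i nan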
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
- Method of install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide the docker pull & docker run commands used
Environment details
If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model