Multigpu saving fails if model is small #231

@terrykong

Description

Problem

The unit test test_convert_dcp_to_hf shows that if the model is small enough, it does not save properly with the online method.

I see these logs when the test tries to load the checkpoint that the online saving method wrote to disk:

Some weights of the model checkpoint at /tmp/tmpgsnuh3yy/test_hf_and_dcp-hf were not used when initializing LlamaForCausalLM: {'_flat_param'}
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /tmp/tmpgsnuh3yy/test_hf_and_dcp-hf and are newly initialized: ['embed_tokens.weight', 'layers.0.input_layernorm.weight', 'layers.0.mlp.down_proj.weight', 'layers.0.mlp.gate_proj.weight', 'layers.0.mlp.up_proj.weight', 'layers.0.post_attention_layernorm.weight', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.o_proj.weight', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.v_proj.weight', 'layers.1.input_layernorm.weight', 'layers.1.mlp.down_proj.weight', 'layers.1.mlp.gate_proj.weight', 'layers.1.mlp.up_proj.weight', 'layers.1.post_attention_layernorm.weight', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.o_proj.weight', 'layers.1.self_attn.q_proj.weight', 'layers.1.self_attn.v_proj.weight', 'lm_head.weight', 'norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
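The warning above says the saved checkpoint exposes a single `_flat_param` key (FSDP's flattened parameter buffer) instead of the per-parameter keys `LlamaForCausalLM` expects, so every real weight gets freshly initialized. A minimal sketch (hypothetical helper, pure Python) of checking a state dict's key set for this symptom before handing it to `from_pretrained`:

```python
def find_fsdp_artifacts(state_dict_keys):
    """Return keys that look like FSDP internals rather than real parameters."""
    fsdp_markers = ("_flat_param", "_fsdp_wrapped_module.")
    return [k for k in state_dict_keys if any(m in k for m in fsdp_markers)]

# A broken checkpoint exposes only the flat buffer:
broken = ["_flat_param"]
# A healthy Llama checkpoint exposes per-parameter keys:
healthy = ["model.embed_tokens.weight", "model.layers.0.self_attn.q_proj.weight"]

print(find_fsdp_artifacts(broken))   # ['_flat_param']
print(find_fsdp_artifacts(healthy))  # []
```

If this check fires, the state dict was saved without first consolidating the sharded FSDP parameters back into their original names.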

Repro

  1. Change these model refs to
from tests.unit.conftest import TEST_ASSETS
TEST_ASSETS.TINY_LLAMA_MODEL_PATH # <-- replace with this
  2. Run
uv run --group test bash tests/run_unit.sh -k test_convert_dcp_to_hf

Workaround

Oddly, if you increase the TINY_LLAMA_MODEL_PATH config to:

config = LlamaConfig(
        num_hidden_layers=2*20,
        hidden_size=64*16,
        intermediate_size=32*16,
        num_attention_heads=2*16,
        vocab_size=128256,
        tie_word_embeddings=False,
        num_key_value_heads=None,
    )

the test passes. It's unexpected that checkpointing behavior depends on model size. AFAICT the offline conversion script works even for the small model.
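For scale, the passing config is roughly 30x the size of the tiny one. A rough parameter-count sketch (pure Python; the tiny config's values are an assumption inferred from the multipliers in the snippet above, i.e. 2 layers, hidden 64, intermediate 32, and the count approximates LlamaForCausalLM with plain multi-head attention, since num_key_value_heads=None defaults to num_attention_heads):

```python
def llama_param_count(layers, hidden, inter, vocab, tie_embeddings=False):
    """Approximate LlamaForCausalLM parameter count (MHA, untied lm_head by default)."""
    attn = 4 * hidden * hidden                      # q, k, v, o projections
    mlp = 3 * hidden * inter                        # gate, up, down projections
    norms = 2 * hidden                              # two RMSNorms per layer
    embed = vocab * hidden                          # input embeddings
    head = 0 if tie_embeddings else vocab * hidden  # untied lm_head
    final_norm = hidden                             # model-level RMSNorm
    return layers * (attn + mlp + norms) + embed + head + final_norm

tiny = llama_param_count(2, 64, 32, 128256)         # assumed pre-multiplier config, ~16.5M
large = llama_param_count(40, 1024, 512, 128256)    # the passing config above, ~493M
print(tiny, large)
```

Both counts are dominated by the 128256-token embedding and lm_head, but the save bug tracks the overall size, which makes a sharding threshold (e.g. FSDP deciding not to shard tiny flat params across all ranks) a plausible suspect.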


Labels

bug
