Multigpu saving fails if model is small #231

@terrykong

Description

Problem

The unit test test_convert_dcp_to_hf shows that if the model is small enough, it does not save properly with the online method.

I see these logs when the test tries to load the checkpoint that the online saving method wrote to disk:

Some weights of the model checkpoint at /tmp/tmpgsnuh3yy/test_hf_and_dcp-hf were not used when initializing LlamaForCausalLM: {'_flat_param'}
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /tmp/tmpgsnuh3yy/test_hf_and_dcp-hf and are newly initialized: ['embed_tokens.weight', 'layers.0.input_layernorm.weight', 'layers.0.mlp.down_proj.weight', 'layers.0.mlp.gate_proj.weight', 'layers.0.mlp.up_proj.weight', 'layers.0.post_attention_layernorm.weight', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.o_proj.weight', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.v_proj.weight', 'layers.1.input_layernorm.weight', 'layers.1.mlp.down_proj.weight', 'layers.1.mlp.gate_proj.weight', 'layers.1.mlp.up_proj.weight', 'layers.1.post_attention_layernorm.weight', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.o_proj.weight', 'layers.1.self_attn.q_proj.weight', 'layers.1.self_attn.v_proj.weight', 'lm_head.weight', 'norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
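The warning above says the saved checkpoint exposes a single `_flat_param` key (FSDP's flattened parameter buffer) instead of the per-parameter keys `LlamaForCausalLM` expects, so every real weight gets freshly initialized. A minimal sketch (hypothetical helper, pure Python) of checking a state dict's key set for this symptom before handing it to `from_pretrained`:

```python
def find_fsdp_artifacts(state_dict_keys):
    """Return keys that look like FSDP internals rather than real parameters."""
    fsdp_markers = ("_flat_param", "_fsdp_wrapped_module.")
    return [k for k in state_dict_keys if any(m in k for m in fsdp_markers)]

# A broken checkpoint exposes only the flat buffer:
broken = ["_flat_param"]
# A healthy Llama checkpoint exposes per-parameter keys:
healthy = ["model.embed_tokens.weight", "model.layers.0.self_attn.q_proj.weight"]

print(find_fsdp_artifacts(broken))   # ['_flat_param']
print(find_fsdp_artifacts(healthy))  # []
```

If this check fires, the state dict was saved without first consolidating the sharded FSDP parameters back into their original names.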

Repro

  1. Change these model refs to
from tests.unit.conftest import TEST_ASSETS
TEST_ASSETS.TINY_LLAMA_MODEL_PATH # <-- replace with this
  2. Run
uv run --group test bash tests/run_unit.sh -k test_convert_dcp_to_hf

Workaround

Oddly, if you increase the TINY_LLAMA_MODEL_PATH config to:

config = LlamaConfig(
        num_hidden_layers=2*20,
        hidden_size=64*16,
        intermediate_size=32*16,
        num_attention_heads=2*16,
        vocab_size=128256,
        tie_word_embeddings=False,
        num_key_value_heads=None,
    )

the test passes. It's unexpected that checkpointing behavior depends on model size. AFAICT the offline conversion script works even for the small model.
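For scale, the passing config is roughly 30x the size of the tiny one. A rough parameter-count sketch (pure Python; the tiny config's values are an assumption inferred from the multipliers in the snippet above, i.e. 2 layers, hidden 64, intermediate 32, and the count approximates LlamaForCausalLM with plain multi-head attention, since num_key_value_heads=None defaults to num_attention_heads):

```python
def llama_param_count(layers, hidden, inter, vocab, tie_embeddings=False):
    """Approximate LlamaForCausalLM parameter count (MHA, untied lm_head by default)."""
    attn = 4 * hidden * hidden                      # q, k, v, o projections
    mlp = 3 * hidden * inter                        # gate, up, down projections
    norms = 2 * hidden                              # two RMSNorms per layer
    embed = vocab * hidden                          # input embeddings
    head = 0 if tie_embeddings else vocab * hidden  # untied lm_head
    final_norm = hidden                             # model-level RMSNorm
    return layers * (attn + mlp + norms) + embed + head + final_norm

tiny = llama_param_count(2, 64, 32, 128256)         # assumed pre-multiplier config, ~16.5M
large = llama_param_count(40, 1024, 512, 128256)    # the passing config above, ~493M
print(tiny, large)
```

Both counts are dominated by the 128256-token embedding and lm_head, but the save bug tracks the overall size, which makes a sharding threshold (e.g. FSDP deciding not to shard tiny flat params across all ranks) a plausible suspect.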


Labels

bug
