model.save_pretrained error  #50

@chengchichu

I tried to run the fine-tuning tutorial, but I got an error in the last step (save_pretrained).

I have 8 GPUs in one node. Part of the saving appears to complete, but about half of them show error messages.

I'm not sure whether this is caused by memory or some other issue. I googled it but had no luck. I hope someone can help.

My command is below:
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=8 --master_port=9876 medalpaca/train.py \
    --model 'medalpaca/medalpaca-7b' \
    --data_path 'medical_meadow_small.json' \
    --output_dir './alpaca-7b' \
    --train_in_8bit False \
    --use_lora False \
    --bf16 False \
    --tf32 False \
    --fp16 True \
    --gradient_checkpointing True \
    --global_batch_size 256 \
    --per_device_batch_size 4 \
    --wandb_project 'medalpaca' \
    --prompt_template '/home/anpo/medAlpaca/medAlpaca/medalpaca/prompt_templates/medalpaca.json' \
    --use_wandb False

The error message is below:

Traceback (most recent call last):
File "/home/anpo/medAlpaca/medAlpaca/medalpaca/train.py", line 285, in <module>
fire.Fire(main)
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/medAlpaca/medAlpaca/medalpaca/train.py", line 281, in main
model.save_pretrained(output_dir)
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2376, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 475, in _flatten
return {
^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 479, in <dictcomp>
"data": _tobytes(v, k),
^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 404, in _tobytes
tensor = tensor.to("cpu")
^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
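
One thing that might be worth trying, since the traceback shows `model.save_pretrained(output_dir)` being called by every torchrun worker: let only the rank-0 process write the checkpoint. The following is a minimal sketch, not a confirmed fix for the CUDA error. It assumes plain data-parallel training where each rank holds a full copy of the weights (it would not be enough for sharded setups), and the helper names `is_main_process` and `save_on_main_process` are hypothetical, not part of medalpaca or transformers. It relies only on the `RANK` environment variable that torchrun sets for each worker.

```python
import os


def is_main_process() -> bool:
    """Return True on the rank-0 worker.

    torchrun exports RANK for every process it launches; when the
    variable is missing (single-process run) we treat it as rank 0.
    """
    return int(os.environ.get("RANK", "0")) == 0


def save_on_main_process(model, output_dir: str, **save_kwargs) -> bool:
    """Call model.save_pretrained(output_dir) on rank 0 only.

    Hypothetical helper: assumes every rank holds the full model, so
    one writer is sufficient. Returns True if this process saved.
    """
    if not is_main_process():
        return False
    model.save_pretrained(output_dir, **save_kwargs)
    return True
```

In `train.py`, the last step would then become `save_on_main_process(model, output_dir)`. Because the failing frame is inside safetensors, passing `safe_serialization=False` through `save_kwargs` (an existing `save_pretrained` parameter) could also help narrow down whether the problem is specific to that serialization path.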
