Description
I tried to run the fine-tuning tutorial, but I got an error in the last step (save_pretrained).
I have 8 GPUs in one node. Part of the save appears to complete, but about half of the processes show error messages.
I'm not sure whether this is caused by memory or by something else. I googled it with no luck. I hope someone can help.
My command is below:
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=8 --master_port=9876 medalpaca/train.py \
--model 'medalpaca/medalpaca-7b' \
--data_path 'medical_meadow_small.json' \
--output_dir './alpaca-7b' \
--train_in_8bit False \
--use_lora False \
--bf16 False \
--tf32 False \
--fp16 True \
--gradient_checkpointing True \
--global_batch_size 256 \
--per_device_batch_size 4 \
--wandb_project 'medalpaca' \
--prompt_template '/home/anpo/medAlpaca/medAlpaca/medalpaca/prompt_templates/medalpaca.json' \
--use_wandb False
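For reference, this is how I understand the batch-size flags to combine across the 8 processes. It is only a sketch of the usual data-parallel arithmetic, not code taken from train.py; the function name `gradient_accumulation_steps` is my own, and I am assuming the script divides the global batch evenly over all GPUs.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                per_device_batch_size: int,
                                world_size: int) -> int:
    """Number of micro-batches each process accumulates per optimizer step.

    Assumption: every process sees per_device_batch_size samples per
    forward pass, and the global batch is split evenly across processes.
    """
    samples_per_forward = per_device_batch_size * world_size
    if global_batch_size % samples_per_forward != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "per_device_batch_size * world_size"
        )
    return global_batch_size // samples_per_forward


# Values from the command above: 256 global, 4 per device, 8 GPUs.
steps = gradient_accumulation_steps(256, 4, 8)
print(steps)
```

With these settings each GPU would accumulate gradients over several small batches before every optimizer step, so the per-step memory footprint is set by `--per_device_batch_size`, not by `--global_batch_size`.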
The error message is below:
Traceback (most recent call last):
File "/home/anpo/medAlpaca/medAlpaca/medalpaca/train.py", line 285, in <module>
fire.Fire(main)
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/anpo/medAlpaca/medAlpaca/medalpaca/train.py", line 281, in main
model.save_pretrained(output_dir)
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2376, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 475, in _flatten
return {
^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 479, in <dictcomp>
"data": _tobytes(v, k),
^^^^^^^^^^^^^^
File "/home/anpo/.conda/envs/medAlpaca/lib/python3.11/site-packages/safetensors/torch.py", line 404, in _tobytes
tensor = tensor.to("cpu")
^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.