Description
How do I deal with the error below on 1x A100 PCIe 80 GB? I followed the instructions and got the error below. 2x A100 80 GB works fine but runs out of memory; I guess the code defaults to multiple GPUs. The only workable setup for me is 2x A100 80 GB with Qwen/Qwen2.5-1.5B. In my training, Qwen/Qwen2.5-1.5B does not give very good results, while Qwen/Qwen2.5-3B on 2x H200 trains very well.
Two GPUs:
export N_GPUS=2
export BASE_MODEL=Qwen/Qwen2.5-3B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
One GPU:
export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS
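Note that this one-GPU block omits `ROLLOUT_TP_SIZE`, which the two-GPU block does set, and the override list in the error below shows `actor_rollout_ref.rollout.tensor_model_parallel_size=` with an empty value. Assuming the launch script substitutes `$ROLLOUT_TP_SIZE` into that override, a likely fix is:

```shell
# Assumption: the launch script passes $ROLLOUT_TP_SIZE as
# actor_rollout_ref.rollout.tensor_model_parallel_size. With a single GPU,
# the tensor-parallel degree should be 1; leaving it unset produces an
# empty override value that reaches the code as a string.
export ROLLOUT_TP_SIZE=1
echo "ROLLOUT_TP_SIZE=$ROLLOUT_TP_SIZE"
```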
Actor use_remove_padding=False
Error executing job with overrides: ['data.train_files=Jiayi-Pan/Countdown-Tasks-3to4/train.parquet', 'data.val_files=Jiayi-Pan/Countdown-Tasks-3to4/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=Qwen/Qwen2.5-1.5B', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=1', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-1.5b', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 103, in main
ray.get(main_task.remote(config))
File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::main_task() (pid=2641, ip=172.19.0.2)
File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 188, in main_task
trainer.init_workers()
File "/workspace/TinyZero/verl/trainer/ppo/ray_trainer.py", line 514, in init_workers
self.actor_rollout_wg.init_model()
File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_init_model() (pid=2892, ip=172.19.0.2, actor_id=9b3727d88709f75f8ee9f78401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7a94a9f27d30>)
File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 399, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/workspace/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
return func(*args, **kwargs)
File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 332, in init_model
self.rollout, self.rollout_sharding_manager = self._build_rollout()
File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 254, in _build_rollout
dp = self.world_size // infer_tp
TypeError: unsupported operand type(s) for //: 'int' and 'str'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
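The traceback can be reproduced in isolation: the override `tensor_model_parallel_size=` carries no value, so the config field arrives as a string rather than an int, and the floor division in `_build_rollout` (`dp = self.world_size // infer_tp`) fails. A minimal sketch, with `world_size` and `infer_tp` standing in for the worker's attributes:

```python
# Reproduce the TypeError from _build_rollout: an empty Hydra-style
# override leaves infer_tp as a string, and int // str is not defined.
world_size = 1
infer_tp = ""  # what the empty `tensor_model_parallel_size=` override yields

try:
    dp = world_size // infer_tp
except TypeError as e:
    # unsupported operand type(s) for //: 'int' and 'str'
    print(e)

# Supplying a real value (e.g. ROLLOUT_TP_SIZE=1 on a single GPU) gives an
# int and the division succeeds:
infer_tp = int("1")
dp = world_size // infer_tp
print(dp)  # -> 1
```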