1 GPU is not working, 2 GPUs out of memory #5

@deter3

Description

How should I deal with the error below on a single A100 PCIe 80GB? I followed the instructions and still hit the error shown at the bottom. Two A100 80GB GPUs run without this error but go out of memory, so I guess the code defaults to a multi-GPU setup. The only workable configuration so far is 2x A100 80GB with Qwen/Qwen2.5-1.5B. For my task, Qwen/Qwen2.5-1.5B does not train very well, while Qwen/Qwen2.5-3B on 2x H200 gives very good training results.

Two GPUs:
export N_GPUS=2
export BASE_MODEL=Qwen/Qwen2.5-3B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
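
(Side note on the OOM with two GPUs: just a guess on my part, but lowering the rollout memory fraction and the micro-batch sizes, i.e. the same knobs that appear as overrides in the error output further down, might let the 3B model fit on 2x A100 80GB:

actor_rollout_ref.rollout.gpu_memory_utilization=0.3
actor_rollout_ref.actor.ppo_micro_batch_size=4
actor_rollout_ref.rollout.log_prob_micro_batch_size=4
critic.ppo_micro_batch_size=4

I have not verified these values; they are only a starting point to try.)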

One GPU:
export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

Actor use_remove_padding=False
Error executing job with overrides: ['data.train_files=Jiayi-Pan/Countdown-Tasks-3to4/train.parquet', 'data.val_files=Jiayi-Pan/Countdown-Tasks-3to4/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=Qwen/Qwen2.5-1.5B', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=1', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-1.5b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::main_task() (pid=2641, ip=172.19.0.2)
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/workspace/TinyZero/verl/trainer/ppo/ray_trainer.py", line 514, in init_workers
    self.actor_rollout_wg.init_model()
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_init_model() (pid=2892, ip=172.19.0.2, actor_id=9b3727d88709f75f8ee9f78401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7a94a9f27d30>)
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 332, in init_model
    self.rollout, self.rollout_sharding_manager = self._build_rollout()
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 254, in _build_rollout
    dp = self.world_size // infer_tp
TypeError: unsupported operand type(s) for //: 'int' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
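
Looking at the overrides in the error output, actor_rollout_ref.rollout.tensor_model_parallel_size= comes through empty for the single-GPU run, which would explain the "int // str" TypeError in _build_rollout (infer_tp arrives as a string instead of an integer). If I am reading the scripts right, the single-GPU setup probably just needs ROLLOUT_TP_SIZE exported as well, something like:

export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=1
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

This is only my guess; I have not confirmed it is the intended single-GPU configuration.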
