CUDA Out of Memory Error in Stage 2 Fine-Tuning with A100 80GB #7

@smhassanraza1

Description

Hi,

I’m currently fine-tuning trace-uni on a downstream task and have successfully completed stage 1 (tuning the MLP adapters). However, during stage 2, where I tune the backbone, I’m hitting a CUDA out-of-memory error even though I’m running on a single A100 80GB GPU, the highest-end GPU available to me.

Below is my training script:

#!/bin/bash 

############################################################################### 
# Stage 2: Training LLM Backbone
############################################################################### 

# Environment Variables 
WORLD_SIZE=1            # single node
NPROC_PER_NODE=1        # single A100 GPU
MASTER_ADDR="127.0.0.1" 
MASTER_PORT=16666 
RANK=0 

# Training Arguments 
GLOBAL_BATCH_SIZE=1  # effective (global) batch size in videos 
GRADIENT_ACCUMULATION_STEPS=1 
LOCAL_BATCH_SIZE=1 
echo "Local Batch Size: ${LOCAL_BATCH_SIZE}" 

# CUDA and Logging Arguments 
# Set max_split_size_mb to help with fragmentation and avoid OOM errors. 
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 
export WANDB_PROJECT=trace_greatest_hits 
export NCCL_P2P_LEVEL=NVL 
export HCCL_BUFFSIZE=1024 

RUN_NAME=greatest_hits_stage2 
DATA_DIR=datasets 
OUTP_DIR=/path/to/trace_finetune/stage2 

# Optionally, if you have control of the train script, add the following line 
# at the very start of your Python code to clear cached memory: 
#    import torch; torch.cuda.empty_cache() 

ASCEND_LAUNCH_BLOCKING=1 torchrun --nnodes "${WORLD_SIZE}" \
    --nproc_per_node "${NPROC_PER_NODE}" \
    --master_addr="${MASTER_ADDR}" \
    --master_port="${MASTER_PORT}" \
    --node_rank "${RANK}" \
    /path/to/trace_module/train_mt.py \
    --version v1_mistral \
    --vision_tower model/clip-vit-large-patch14-336 \
    --mm_projector_type spatial_slot \
    --freeze_mm_mlp_adapter True \
    --tune_mm_mlp_adapter False \
    --tune_mm_embed_head False \
    --tune_lm_embed_head False \
    --model_name_or_path /path/to/trace_finetune/stage1/model \
    --data_path /path/to/final_annotations.json \
    --data_folder /path/to/videos \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --downsample_num 1 \
    --image_aspect_ratio pad \
    --freeze_backbone False \
    --num_frames 128 \
    --bf16 True \
    --fp16 False \
    --output_dir "${OUTP_DIR}/model" \
    --num_train_epochs 5 \
    --per_device_train_batch_size "${LOCAL_BATCH_SIZE}" \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps "${GRADIENT_ACCUMULATION_STEPS}" \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 5000 \
    --save_total_limit 99 \
    --learning_rate 5e-6 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --run_name "${RUN_NAME}" \
    --lazy_preprocess True \
    --sample_scheme "rand" \
    2> "${OUTP_DIR}/stage2.err"

Despite configuring the environment with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128, I still get the following error during execution:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 79.14 GiB total capacity; 77.72 GiB already allocated; 21.62 MiB free; 78.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
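For what it’s worth, decomposing the numbers in that traceback suggests fragmentation is not the main problem (a quick back-of-the-envelope check using the values reported above):

```python
# Decompose the memory figures from the traceback (all values in GiB,
# as reported by PyTorch in the error message above).
total_capacity = 79.14      # GPU 0 total capacity
allocated = 77.72           # already allocated by live tensors
reserved = 78.46            # reserved in total by PyTorch's caching allocator
free = 21.62 / 1024         # 21.62 MiB free, converted to GiB

# Memory reserved by the allocator but not backing live tensors:
fragmentation_gap = reserved - allocated
# Memory held outside PyTorch (CUDA context, other processes, etc.):
outside_pytorch = total_capacity - reserved - free

print(f"fragmentation gap: {fragmentation_gap:.2f} GiB")  # 0.74 GiB
print(f"outside PyTorch:   {outside_pytorch:.2f} GiB")    # 0.66 GiB
```

Since reserved (78.46 GiB) sits only ~0.74 GiB above allocated (77.72 GiB), almost all reserved memory is backing live tensors, so it looks like the backbone’s weights, activations, gradients, and optimizer states simply don’t fit, and tuning max_split_size_mb alone may not help much.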

Could you please provide guidance on how to resolve this issue? Are there specific modifications or additional environment settings that might help avoid these OOM errors during stage 2 fine-tuning, especially when tuning the backbone with the current configuration?
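In case it helps the discussion, these are the allocator-related settings I am considering next (unverified on my side; expandable_segments requires a recent PyTorch build, and I have not confirmed that any of them resolves the OOM):

```shell
# Let the allocator grow segments instead of carving fixed-size blocks
# (PyTorch >= 2.0; replaces the max_split_size_mb approach):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Or keep the splitting approach but with a smaller cap than 128:
# export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64
```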

Thanks for your help and support!
