We provide an SFT training guide for Spatial-MLLM-Instruct-v1.1 models.
First, prepare the necessary pretrained model checkpoints and place them in the `checkpoints` directory.
```bash
mkdir -p checkpoints
# Download the Qwen2.5-VL-3B-Instruct and VGGT-1B checkpoints
hf download Qwen/Qwen2.5-VL-3B-Instruct --local-dir checkpoints/Qwen2.5-VL-3B-Instruct
hf download facebook/VGGT-1B --local-dir checkpoints/VGGT-1B
```

The Spatial-MLLM-v1.1-Instruct-135k model is trained on the following datasets:
- `spatial_mllm_mix_133k`: A mixture of our self-created data and ScanQA/SQA3D data. The annotations are available here.
- `route_plan_scannet_2k`: A subset of the route-planning data used in VLM-3R, containing around 2k samples from ScanNet.
The Spatial-MLLM-v1.1-Instruct-820k model is trained on the following datasets:
- `spatial_mllm_mix_203k`: A mixture of our self-created data and ScanQA/SQA3D data. The annotations are available here.
- `route_plan_4k`: Route-planning data used in VLM-3R.
- `vsi_590k`: The 590k dataset from Cambrian-S.
- `mindcube_21k`: The 21k dataset from MindCube.
For `spatial_mllm_mix_133k` and `spatial_mllm_mix_203k`, please download the annotations from the provided links and place them in the `datasets/annotations` directory, for example:
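A minimal sketch of this step, assuming the annotations are distributed as a Hugging Face dataset repo (the repo id below is a placeholder; use the one behind the links above):

```bash
mkdir -p datasets/annotations
# Placeholder repo id -- substitute the actual annotation repo from the links above.
hf download <org>/<annotation-repo> --repo-type dataset --local-dir datasets/annotations
```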
For the other annotation files, you may need to process them to align with our expected format (similar to this instruction). We provide some scripts in the `scripts/preprocess` directory for your reference.
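A conversion run might look like the sketch below; the script name and flags are hypothetical stand-ins, so check `scripts/preprocess` for the actual entry points:

```bash
# Hypothetical invocation -- the script name and arguments are illustrative only.
python scripts/preprocess/convert_annotations.py \
    --input datasets/annotations/raw/route_plan_4k.json \
    --output datasets/annotations/route_plan_4k.json
```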
The `vsi_590k` and `mindcube_21k` datasets already provide the corresponding visual data.
For the `spatial_mllm_mix` and `route_plan` data, you need to download and process the raw video data from ScanNet, ScanNet++, and ARKitScenes yourself, and place it in the `datasets/visuals` directory (see the layout sketch below).
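The exact layout depends on the dataloader; the subdirectory names below are an assumption, so verify them against the dataset configuration:

```bash
# Assumed layout -- confirm the expected subdirectory names in the dataset config.
mkdir -p datasets/visuals/{scannet,scannetpp,arkitscenes}
```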
Before starting training, you may need to modify the dataset configuration file to ensure `annotation_path` and `data_path` are set correctly, for instance:
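A hypothetical entry, assuming a YAML-style dataset config; the file layout and nesting are assumptions, so match the keys to the repo's actual config file:

```yaml
# Hypothetical dataset entry -- align key names and nesting with the real config file.
spatial_mllm_mix_133k:
  annotation_path: datasets/annotations/spatial_mllm_mix_133k.json
  data_path: datasets/visuals
```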
You can follow the instructions in `scripts/training/spatial_mllm_train_demo.sh` to start training, for example:
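In the simplest case, assuming the demo script is self-contained and the paths above are in place, launching SFT is just:

```bash
# Edit the demo script first to point at your checkpoints, data, and GPUs.
bash scripts/training/spatial_mllm_train_demo.sh
```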