This is the official implementation of VEON (ECCV2024).
This repo includes the reproduced version of VEON, and its extended journal variant VEON*. The models learn Vocabulary-Enhanced 3D representation for Open-vocabulary Occupancy PredictioN in the autonomous driving scenario.
The recipe of VEON is to assemble and adapt both a depth foundation model and a vision-language foundation model for 3D open-vocabulary occupancy prediction. Please refer to our paper for model details.
Suppose the VEON codebase path is ${VEON_HOME}. Then, follow the subsequent procedures.
Prepare the base environments (BEVDet & SAN & Depth).
Step 1.1 BEVDet Environment
Please prepare the environment as described in BEVDet. VEON directly adopts the BEVDet framework (v2.1) for development.
Step 1.2 SAN Environment
Please prepare the environment as described in SAN. VEON integrates SAN into the BEVDet framework for open-vocabulary recognition.
Then, download the pretrained SAN checkpoints (san_vit_b_16.pth and san_vit_large_14.pth) into folder ${VEON_HOME}/ckpts/clipsan, and run the following script to reformat them.
cd ${VEON_HOME}
python tools/misc/process_san_pth.py

The reformatted checkpoints are named SAN_ViT-B.pth and SAN_ViT-L.pth, also placed in folder ${VEON_HOME}/ckpts/clipsan. The paths and names of these checkpoints can be revised if you are familiar with the config files.
Note 1: If you have network problems automatically downloading the OpenAI CLIP backbones (ViT-B-16.pt and ViT-L-14-336px.pt) in the function open_clip.create_model_and_transforms(), you may need to manually download the pretrained weights and load them offline from disk.
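One workaround is to point open_clip at a checkpoint already on disk; recent open_clip_torch versions accept a local file path for the `pretrained` argument. The helper below is a minimal sketch of this fallback logic (the helper name and paths are our own, not part of the VEON codebase):

```python
import os

def resolve_clip_pretrained(local_path, fallback_tag="openai"):
    """Return a local checkpoint path if present, else a download tag.

    The result can be passed as the `pretrained` argument of
    open_clip.create_model_and_transforms(), which accepts either a
    hub tag (e.g. "openai") or a path to a weight file on disk.
    """
    return local_path if os.path.isfile(local_path) else fallback_tag

# Hypothetical usage:
# pretrained = resolve_clip_pretrained("ckpts/clip/ViT-B-16.pt")
# model, _, preprocess = open_clip.create_model_and_transforms(
#     "ViT-B-16", pretrained=pretrained)
```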
Note 2: Environments of SAN and BEVDet are basically compatible with each other, but you may need to install detectron2 with detectron2-xyz for compatibility in certain Python versions (e.g., Python 3.7).
Step 1.3 Depth Environment
Prepare the depth environment. There are two branches, depending on which depth foundation model you use: Branch 1.3.1 for ZoeDepth, and Branch 1.3.2 for DepthAnythingV2.
We use ZoeDepth variants for most experiments in our paper, but the DepthAnythingV2 variants are often more stable and better-performing. Thus, we strongly recommend using the DepthAnythingV2 variants.
Branch 1.3.1 MiDaS Environment
If you adopt MiDaS + ZoeDepth as the depth foundation model, please prepare the environment as described in ZoeDepth.
Then, download the pretrained ZoeDepth-NK model (ZoeD_M12_NK.pt) and place it into the folder ${VEON_HOME}/ckpts/zoedepth.
Run the following script to reformat it.
cd ${VEON_HOME}
python tools/misc/process_zoe_pth.py

The reformatted checkpoint is named ZoeD_M12_NK_p.pt and also placed in folder ${VEON_HOME}/ckpts/zoedepth.
Branch 1.3.2 DepthAnythingV2 Environment
If you adopt DepthAnythingV2 as the depth foundation model, please prepare the environment as described in Depth-Anything-V2/metric.
Then, download the pretrained DA-V2 models (outdoor metric models including depth_anything_v2_metric_vkitti_vitb.pth and depth_anything_v2_metric_vkitti_vitl.pth) and place them into the folder ${VEON_HOME}/ckpts/depthanythingv2.
Prepare the nuScenes dataset folder as introduced in nuscenes_det.md and create the pkl files for BEVDet by running the following script.
cd ${VEON_HOME}
python tools/create_data_bevdet.py

Please refer to the issues of BEVDet if you encounter any problems with nuScenes.
This repository supports both Occ3D-nuScenes close-set occupancy dataset and POP-3D language-driven retrieval benchmark.
Step 3.1 Occ3D-nuScenes Dataset
For the close-set Occ3D-nuScenes occupancy prediction task, download (only) the 'gts' from
CVPR2023-3D-Occupancy-Prediction
and arrange the nuScenes dataset folder ${VEON_HOME}/data/nuscenes as:
└── data
└── nuscenes
├── v1.0-trainval (existing)
├── sweeps (existing)
├── samples (existing)
└── gts (new)

Step 3.2 POP-3D Retrieval Benchmark
For the language-driven object retrieval task, please download the materials as introduced in POP-3D. The corresponding download script is download_retrieval_benchmark.sh. After downloading, place the materials in folder ${VEON_HOME}/data/nuscenes/retrieval_benchmark/. The folder structure is:
└── data
└── nuscenes
└── retrieval_benchmark
├── annotations
├── matching_points
├── retrieval_anns_all.csv
├── retrieval_anns_eval.csv
├── retrieval_anns_test.csv
├── retrieval_anns_train.csv
└── retrieval_anns_val.csv

First, the ${VEON_HOME}/ckpts folder should have the following structure before training:
├── clipsan (necessary)
│ ├── SAN_ViT-B.pth
│ └── SAN_ViT-L.pth
├── depth_pretrain (empty)
├── depthanythingv2 (branch 1.3.2)
│ ├── depth_anything_v2_metric_vkitti_vitb.pth
│ └── depth_anything_v2_metric_vkitti_vitl.pth
└── zoedepth (branch 1.3.1)
    └── ZoeD_M12_NK_p.pt

Second, the ${VEON_HOME}/data/nuscenes folder should be organized as follows:
└── data
└── nuscenes
├── bevdetv2-nuscenes_infos_train.pkl
├── bevdetv2-nuscenes_infos_val.pkl
├── gts
├── lidarseg
├── maps
├── retrieval_benchmark (optional)
├── samples
├── sweeps
├── v1.0-test
└── v1.0-trainval

Not all components of the nuScenes dataset are strictly necessary, but the above folder structure is sufficient.
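Before launching training, it can help to sanity-check the layout programmatically. The checker below is our own convenience sketch (not part of the repo); its entry list follows the structure above, treating retrieval_benchmark and v1.0-test as optional:

```python
import os

# Required entries under data/nuscenes, following the layout above.
REQUIRED = [
    "bevdetv2-nuscenes_infos_train.pkl",
    "bevdetv2-nuscenes_infos_val.pkl",
    "gts",
    "samples",
    "sweeps",
    "v1.0-trainval",
]

def missing_nuscenes_items(root):
    """Return the required entries that are absent under `root`."""
    return [n for n in REQUIRED if not os.path.exists(os.path.join(root, n))]

# Hypothetical usage:
# missing = missing_nuscenes_items("data/nuscenes")
# if missing:
#     raise SystemExit(f"Missing under data/nuscenes: {missing}")
```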
Now we introduce how to train and test the VEON models. Training is divided into two stages: Depth Pretraining (Stage 1) and Occupancy Prediction (Stage 2).
By default, we use 8 NVIDIA V100 GPUs with 32G memory each.
For adapting the depth foundation model, run the following script. We recommend the DepthAnythingV2 variants over the MiDaS variants, but we first take the ZoeDepth variants as an example.
# Branch 1.3.1: MiDaS + ZoeDepth variant
# Script format: bash ./tools/dist_train.sh $config $num_gpu
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-pretrain-zoedepth.py 8

The output checkpoints are stored in folder ${VEON_HOME}/work_dirs/veon-pretrain-zoedepth/.
Before starting training stage 2, you should: (1) select one checkpoint (e.g., epoch_48.pth); (2) place it in ${VEON_HOME}/ckpts/depth_pretrain; and (3) rename the checkpoint as zoedepth_pretrain.pth. Note: The file names of the adapted depth models can be revised in the config files of stage 2.
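Steps (1)-(3) above can be scripted. The helper below is a hypothetical convenience sketch (not part of the repo) that copies a selected Stage-1 checkpoint into place under the name expected by the Stage-2 configs:

```python
import shutil
from pathlib import Path

def stage_depth_checkpoint(src, dst_dir, dst_name="zoedepth_pretrain.pth"):
    """Copy a selected Stage-1 checkpoint into the depth_pretrain folder
    under the name expected by the Stage-2 configs."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / dst_name
    shutil.copy2(src, dst)
    return dst

# Hypothetical usage:
# stage_depth_checkpoint(
#     "work_dirs/veon-pretrain-zoedepth/epoch_48.pth",
#     "ckpts/depth_pretrain")
```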
Similarly, for the DepthAnythingV2 variants, the training script is:
# Branch 1.3.2: Depth-Anything-V2 variant
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-pretrain-depthanythingv2.py 8

Before starting training stage 2, you should also select one checkpoint (e.g., epoch_48.pth), place it in ${VEON_HOME}/ckpts/depth_pretrain/, and rename it to depthanythingv2_pretrain_large.pth.
After obtaining the finetuned depth estimator, we can start training stage 2 with the following script. We recommend using the DepthAnythingV2 variants instead of the MiDaS variants, but we first take the ZoeDepth variants as an example.
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-temporal-base-512x1408-zoe-nodepthcache.py 8

After training stage 2, all resulting VEON checkpoints will be stored in ${VEON_HOME}/work_dirs/veon-temporal-base-512x1408-zoe-nodepthcache/.
Similarly, for the DepthAnythingV2 variant, you can run the script as:
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-temporal-base-512x1408-dav2-nodepthcache.py 8

The resulting VEON checkpoints will be stored in ${VEON_HOME}/work_dirs/veon-temporal-base-512x1408-dav2-nodepthcache/.
After training stage 2, you can test the checkpoints stored in folder ${VEON_HOME}/work_dirs/ on specific tasks.
Testing Mode 1: Occ3D-nuScenes Dataset
To test and eval a single checkpoint (e.g. epoch_10.pth) on Occ3D-nuScenes, you can run the following script:
# Testing only epoch_10.pth on Occ3D-nuScenes Dataset
# Script format: ./tools/dist_test.sh $config $checkpoint $num_gpu --eval $metric
cd ${VEON_HOME}
bash ./tools/dist_test.sh configs/veon/veon-temporal-base-512x1408-zoe-nodepthcache.py work_dirs/veon-temporal-base-512x1408-zoe-nodepthcache/epoch_10.pth 8 --eval bbox

However, we strongly recommend testing all the resulting checkpoints within a certain epoch interval, e.g., epoch 5 to epoch 15. The script can be written as follows:
# Testing all checkpoints from epoch 5 to epoch 15 on Occ3D-nuScenes Dataset
# Script format: ./tools/dist_test_all.sh $config $checkpoint_folder $num_gpu $start_epoch $end_epoch --eval $metric
cd ${VEON_HOME}
bash ./tools/dist_test_all.sh configs/veon/veon-temporal-base-512x1408-zoe-nodepthcache.py work_dirs/veon-temporal-base-512x1408-zoe-nodepthcache 8 5 15 --eval bbox

Note: The corresponding config files for the DepthAnythingV2 variants are also provided (with the -dav2 suffix). Again, we recommend using the DepthAnythingV2 variants instead of the MiDaS variants.
Testing Mode 2: POP-3D Retrieval Benchmark
You can simply change the config file to eval a single checkpoint on the POP-3D retrieval benchmark. Here, the config file is different, but the checkpoint is kept the same.
# Testing only epoch_10.pth on POP-3D Retrieval Benchmark
cd ${VEON_HOME}
bash ./tools/dist_test.sh configs/veon/veon-temporal-base-512x1408-zoe-retrieval.py work_dirs/veon-temporal-base-512x1408-zoe-nodepthcache/epoch_10.pth 8 --eval bbox

Note: The corresponding config files for the DepthAnythingV2 variants are also provided (with the -dav2 suffix).
In training stage 2, as the depth estimator is frozen, we can cache the predicted depth on the whole training set, and thereby accelerate the training process.
Take the MiDaS + ZoeDepth version as an example. After obtaining the finetuned depth checkpoint (e.g., ${VEON_HOME}/ckpts/depth_pretrain/zoedepth_pretrain.pth), you can run the following script for exactly one complete epoch to cache all predicted depth maps on disk. Around 120 GB of free disk space is required.
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-depthcache-zoedepth.py 8

After one epoch, all depth maps will be stored in folder ${VEON_HOME}/data/nuscenes/depth_cache/depth/. Then, you can run training stage 2 with the following script. This saves not only training time but also GPU memory.
cd ${VEON_HOME}
bash tools/dist_train.sh configs/veon/veon-temporal-base-512x1408-zoe-withdepthcache.py 8

Note: The corresponding config files for the DepthAnythingV2 variants are also provided, and their depth cache folder is ${VEON_HOME}/data/nuscenes/depth_cache/depth_dav2/.
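Conceptually, the cache is just a frozen mapping from each camera frame to its predicted depth map. The sketch below illustrates the idea with a simple one-.npz-per-sample scheme; VEON's actual on-disk format and file naming may differ, so treat this only as an illustration:

```python
import os
import numpy as np

def save_depth(cache_dir, sample_id, depth):
    """Store one predicted depth map, compressed and in half precision."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{sample_id}.npz")
    np.savez_compressed(path, depth=depth.astype(np.float16))

def load_depth(cache_dir, sample_id):
    """Load a cached depth map back to float32 for training."""
    with np.load(os.path.join(cache_dir, f"{sample_id}.npz")) as f:
        return f["depth"].astype(np.float32)
```

Storing in float16 with compression keeps the footprint manageable at the scale of the full nuScenes training set while losing little metric-depth precision.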
In the journal version of VEON, we integrate surrounding images from multiple frames to exploit rich temporal information. You only need to revise one line in the config files to run the VEON-T{X} variants.
Take the VEON-L-T{X} variants as an example. You can find the config file, e.g. ./configs/veon/veon-temporal-base-512x1408-zoe-nodepthcache.py, and revise the following line:
# Original code: num_temporal = 1
num_temporal = 2 # 1, 2, 3, 4 are all ok for V100

This would support training and testing VEON-T2 with 2-frame inputs. The training and testing scripts are kept the same.
Note: We strongly recommend using the depth cache mechanism when num_temporal > 2; otherwise, a "GPU out of memory" error may occur on NVIDIA V100 GPUs.
This repository builds on multiple great open-source codebases. Thanks for their contributions to the community.
If this work is helpful for your research, please consider citing the following BibTeX entry.
@inproceedings{eccv24-veon,
title={VEON: Vocabulary-Enhanced Occupancy Prediction},
author={Zheng, Jilai and Tang, Pin and Wang, Zhongdao and Wang, Guoqing and Ren, Xiangxuan and Feng, Bailan and Ma, Chao},
booktitle={ECCV},
year={2024},
}