This is the PyTorch implementation of "NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation" (ICCV 2025)
Please follow DUET to prepare the Matterport3D simulator, the environment, and the data and features of REVERIE and R2R.
After that, please download the following files and put them under the `datasets` directory:
- The CLIP features of the image observations, provided by ScaleVLN (`clip_vit-b16_mp3d_original.hdf5`, `clip_vit-b16_mp3d_hm3d_gibson.hdf5`)
- The CLIP features of the textual descriptions (`text_fts_LangNavGlobal.npy`, `text_fts_LangNavGlobal_SVLN.npy`). The textual descriptions on MP3D are borrowed from LangNav, while those on the new ScaleVLN scenes are extracted using BLIP in a similar manner
- The `bert-base-uncased` model
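After downloading, the `datasets` directory is expected to contain the files above, roughly as sketched below (the exact nesting is an assumption; adjust it to match the paths referenced in the configs):

```
datasets/
├── clip_vit-b16_mp3d_original.hdf5
├── clip_vit-b16_mp3d_hm3d_gibson.hdf5
├── text_fts_LangNavGlobal.npy
├── text_fts_LangNavGlobal_SVLN.npy
└── bert-base-uncased/
```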
Also, please download these files related to the Q-Model training and put them under the `Q_pretrain/Q_files` directory.
To train the Q-Model, first enter the `Q_pretrain` directory and run the following commands.
```shell
python train_mae.py --out_dir=ckpt/MAE --use_SVLN=False --eval_interval=1000
python train.py --out_dir=ckpt/Q --use_SVLN=False --eval_interval=1000 --resume_ckpt=ckpt/MAE/ckpt20000.pt
```
To use the additional scenes provided by ScaleVLN, please run:
```shell
python train_mae.py --out_dir=ckpt/MAE-SVLN --use_SVLN=True --eval_interval=2000
python train.py --out_dir=ckpt/Q-SVLN --use_SVLN=True --eval_interval=2000 --resume_ckpt=ckpt/MAE-SVLN/ckpt100000.pt
```
Our pre-trained Q-Model can be found here.
Please modify the `WM_ckpt` entry in `pretrain_src/config/reverie_obj_model_config.json` to the path of the Q-Model you would like to use.
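For example, the relevant entry in `pretrain_src/config/reverie_obj_model_config.json` might look like the fragment below (the key name comes from the instructions above; the checkpoint path is a placeholder for your own Q-Model checkpoint):

```json
{
  "WM_ckpt": "../Q_pretrain/ckpt/Q/ckpt20000.pt"
}
```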
Then, run the following commands to start pretraining the agent.
```shell
cd pretrain_src
bash run_reverie.sh
```
Please modify `Q_ckpt` and `pretrain_ckpt` in `map_nav_src/scripts/run_reverie.sh` to the paths of the Q-Model and the pretrained model, respectively.
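For example, the relevant lines in `map_nav_src/scripts/run_reverie.sh` might look like the sketch below (the variable names come from the instructions above; both paths are placeholders to be replaced with your own checkpoints):

```shell
Q_ckpt=../Q_pretrain/ckpt/Q/ckpt20000.pt
pretrain_ckpt=../pretrain_src/ckpts/reverie/model_step_100000.pt
```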
Then, run the following commands to start fine-tuning the agent.
```shell
cd map_nav_src
bash scripts/run_reverie.sh
```
We have provided our trained model here.
This repository is built upon DUET. The structure and training of the Q model are largely inspired by nanoGPT. We also make use of the data provided by ScaleVLN and LangNav. We sincerely thank these works for their valuable contributions.