Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs
International Conference on Robotics and Automation (ICRA) 2025
Authors: Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, Qi Wu
Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closed-source large language models (LLMs) like GPT-4 to solve VLN tasks in a zero-shot manner, but face challenges related to expensive token costs and potential data breaches in real-world applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perception with fine-grained object and spatial knowledge to improve the LLM's reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.
☑️ Release OpenNav_R2R-CE_100 for quick and cost-effective testing in simulated environments.
☑️ Full implementation of Open-Nav available for both training and inference.
We recommend using Python 3.8 with a conda environment:
```bash
conda create -n opennav python=3.8
conda activate opennav
```

This project builds upon Discrete-Continuous-VLN. Please follow the steps below:
- Follow the Discrete-Continuous-VLN instructions to install `habitat-lab` and `habitat-sim`, per the official Habitat installation guide.
- We use Habitat `v0.1.7` in our experiments, the same version used in VLN-CE, to ensure compatibility.
- You may refer to `requirements.txt` or `environment.yml` in this repository for the exact package versions used.
ℹ️ Note: Our installation instructions are adapted from Discrete-Continuous-VLN.
OpenNav_R2R-CE_100: Download Here
Please place the downloaded files under:
data/datasets/R2R_VLNCE_v1-2_preprocessed/val_unseen/
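Once the file is in place, a quick way to sanity-check the download is to count the episodes in the gzipped JSON. This is a minimal sketch: the helper name is ours, and it assumes the standard VLN-CE format with a top-level `episodes` key.

```python
import gzip
import json

def count_episodes(path):
    """Load a VLN-CE style gzipped JSON file and return the episode count."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    return len(data.get("episodes", []))
```

Running it on `OpenNav_R2R-CE_100_bertidx.json.gz` should report 100 episodes if the subset downloaded correctly.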
We use Matterport3D (MP3D) scene reconstructions in this project.
You can obtain the dataset by following the instructions on the official Matterport3D project page. The download script download_mp.py is required to fetch the scenes.
To download the scenes:
⚠️ Requires Python 2.7.
```bash
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
```

Expected directory structure:
- data/
- scene_datasets/
- mp3d/
- {scene_id}/
- {scene_id}.glb
- {scene_id}_semantic.ply
- {scene_id}.house
- {scene_id}.navmesh
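After downloading, it can help to verify that each scene directory contains the four expected files before launching Habitat. A small sketch (the function name is ours):

```python
from pathlib import Path

# File suffixes expected per MP3D scene, per the structure above.
EXPECTED_SUFFIXES = [".glb", "_semantic.ply", ".house", ".navmesh"]

def missing_scene_files(root, scene_id):
    """Return the list of expected MP3D file suffixes missing for one scene."""
    scene_dir = Path(root) / scene_id
    return [s for s in EXPECTED_SUFFIXES
            if not (scene_dir / f"{scene_id}{s}").exists()]
```

An empty return list means the scene is complete; anything else names the missing files.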
We provide several pre-trained models to support waypoint prediction and visual encoding in the Open-Nav framework.
Path:
waypoint_prediction/checkpoints/check_val_best_avg_wayscore
These models are used to predict candidate waypoints in the environment from visual input.
Path:
data/pretrained_models/ddppo-models/gibson-2plus-resnet50.pth
- Download link: ResNet-50 pretrained on Gibson for DD-PPO
This ResNet-50 depth encoder is trained for PointGoal navigation on the Gibson dataset and used to extract visual features from depth images.
Some external models are required for Scene Perception:
Please refer to their respective repositories for model download and setup instructions. These models provide fine-grained object and spatial information to support the reasoning of the open-source LLMs.
Clone or place them under the root directory:
Path:
recognize_anything/
SpatialBot3B/
To run inference with Open-Nav, use the provided script:
```bash
bash run_OpenNav.bash
```

You can specify which LLM to use via the `--llm` argument in the script. Supported options include:
- `gpt4o` (default): Uses GPT-4o via the OpenAI API
- `Qwen2`, `Llama3.1`, `Gemma`, `Phi3`, etc.: Open-source LLMs (require local deployment)
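The split between API-based and locally deployed models can be pictured roughly as below. This is a hypothetical sketch: the function and set names are ours, and the actual script may organize the dispatch differently.

```python
# Models served through the OpenAI API vs. models you deploy locally.
API_MODELS = {"gpt4o"}
LOCAL_MODELS = {"Qwen2", "Llama3.1", "Gemma", "Phi3"}

def resolve_backend(llm: str) -> str:
    """Map an --llm value to the backend that would serve it."""
    if llm in API_MODELS:
        return "openai-api"   # remote call, requires an API key
    if llm in LOCAL_MODELS:
        return "local"        # requires the model deployed locally
    raise ValueError(f"Unsupported LLM: {llm}")
```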
To change the number of evaluation episodes, edit the following field in:
habitat_extensions/config/vlnce_task.yaml
Locate this section and modify EPISODES_TO_LOAD:
```yaml
DATASET:
  TYPE: VLN-CE-v1
  SPLIT: val_unseen
  DATA_PATH: data/datasets/R2R_VLNCE_v1-2_preprocessed/{split}/OpenNav_R2R-CE_100_bertidx.json.gz
  SCENES_DIR: data/scene_datasets/
  EPISODES_TO_LOAD: 1  # Change this to run more episodes
```

We acknowledge that some parts of our code are adapted from existing open-source projects. Specifically, we reference the following repositories: DiscussNav, Discrete-Continuous-VLN, SpatialBot, and RAM.
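If you prefer to script the episode-count change rather than edit the YAML by hand, a minimal sketch (the helper is ours, and it assumes `EPISODES_TO_LOAD` appears exactly once in the file):

```python
import re

def set_episodes_to_load(yaml_text, n):
    """Rewrite the EPISODES_TO_LOAD value, preserving any trailing comment."""
    return re.sub(r"(EPISODES_TO_LOAD:\s*)\d+", rf"\g<1>{n}", yaml_text, count=1)
```

A plain-text substitution like this keeps the rest of the config byte-for-byte identical, which avoids the key reordering a YAML round-trip can introduce.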
If you find this work useful, please cite our paper:
@inproceedings{qiao2025opennav,
author = {Yanyuan Qiao and Wenqi Lyu and Hui Wang and Zixu Wang and Zerui Li and Yuan Zhang and Mingkui Tan and Qi Wu},
title = {Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs},
booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
year = {2025}
}