Noriaki Hirose1, 2, Catherine Glossop1, Dhruv Shah3, Sergey Levine1
1 UC Berkeley (Berkeley AI Research), 2 Toyota Motor North America, 3 Princeton University
IEEE International Conference on Robotics and Automation (ICRA) 2026
Please set up a conda environment (see instructions in SETUP.md).
- Download our checkpoints and place them in this directory. "omnivla-original" contains the trained OmniVLA checkpoints used for the paper submission, "omnivla-original-balance" contains OmniVLA checkpoints trained to account for the data balance in the LeLaN dataset, and "omnivla-finetuned-cast" contains the checkpoints finetuned on the CAST dataset.

  ```
  git clone https://huggingface.co/NHirose/omnivla-original
  git clone https://huggingface.co/NHirose/omnivla-original-balance
  git clone https://huggingface.co/NHirose/omnivla-finetuned-cast
  ```
- Run OmniVLA using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure `1_ex.jpg`.

  ```
  python inference/run_omnivla.py
  ```
- Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 560.
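The modality switch can be sketched as follows. This is only an illustration of the idea: `select_goal` and its argument names are hypothetical, not the actual variables around line 560 of the script.

```python
# Hypothetical sketch of the goal-modality switch; the real code around
# line 560 of inference/run_omnivla.py uses its own variable names.
def select_goal(modality, language_prompt=None, goal_image_path=None, gps_pose=None):
    """Return the goal input for the chosen modality."""
    if modality == "language":
        return {"prompt": language_prompt}
    if modality == "image":
        return {"image_path": goal_image_path}
    if modality == "pose":
        return {"gps": gps_pose}
    raise ValueError(f"unknown modality: {modality}")

# Default behaviour: act on the language prompt.
goal = select_goal("language", language_prompt="go to the red door")
```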
- Run OmniVLA to control the real robot. Modify "run_omnivla.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.
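On a real robot, that step amounts to a closed loop around the policy. Everything below is a placeholder sketch, not the repository's API: `policy_step`, `get_obs`, and `send_velocity` stand in for your own robot interfaces.

```python
import time

# Placeholder sketch of a real-robot control loop around the policy; replace
# policy_step, get_obs, and send_velocity with your own robot interfaces.
def control_loop(policy_step, get_obs, send_velocity, steps=10, hz=3.0):
    """Query the policy and stream velocity commands at a fixed rate."""
    period = 1.0 / hz
    for _ in range(steps):
        obs = get_obs()              # e.g. {"image": ..., "gps": ...}
        v, w = policy_step(obs)      # linear and angular velocity
        send_velocity(v, w)
        time.sleep(period)

# Stub example: a policy that always drives straight ahead.
log = []
control_loop(lambda obs: (0.5, 0.0), lambda: {}, lambda v, w: log.append((v, w)),
             steps=3, hz=100.0)
```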
- To try the finetuned checkpoints with the CAST dataset, update the path and step number in "InferenceConfig" within "run_omnivla.py".
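Switching checkpoints could look roughly like this. The field names below are assumptions for illustration, so check the actual `InferenceConfig` definition in "run_omnivla.py".

```python
from dataclasses import dataclass

# Illustrative stand-in for InferenceConfig; the real field names in
# run_omnivla.py may differ.
@dataclass
class InferenceConfig:
    checkpoint_path: str = "./omnivla-original"
    checkpoint_step: int = 0  # placeholder; use the step of your checkpoint

# Point at the CAST-finetuned checkpoints instead:
cfg = InferenceConfig(checkpoint_path="./omnivla-finetuned-cast")
```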
- Download our checkpoints and place them in this directory.

  ```
  git clone https://huggingface.co/NHirose/omnivla-edge
  ```
- Run OmniVLA-edge using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure `1_ex_omnivla_edge.jpg`.

  ```
  python inference/run_omnivla_edge.py
  ```
- Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 425.
- Run OmniVLA-edge to control the real robot. Modify "run_omnivla_edge.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.
We provide the training code along with a sample dataloader to help you quickly understand the required data loading structure. Since preparing the full training dataset is resource-intensive, we include this simplified code base for convenience.
- Downloading the MBRA project codebase:

  ```
  cd ..
  git clone https://github.com/NHirose/Learning-to-Drive-Anywhere-with-MBRA.git
  ```
- Downloading the MBRA model:

  ```
  cd OmniVLA_internal
  git clone https://huggingface.co/NHirose/MBRA/
  ```
- You can set the training or debugging mode at line 10 in vla-scripts/train_omnivla.py. Note that even in debugging mode, the code requires at least 20 GB of GPU memory (we use an NVIDIA RTX 4090).
- You can configure visualization at line 11 in vla-scripts/train_omnivla.py. During training, it should be set to False.
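For reference, the two flags might look like this at the top of the training script; the names are illustrative, so check lines 10-11 of the actual file.

```python
# Illustrative flags; the real names at lines 10-11 of
# vla-scripts/train_omnivla.py may differ.
DEBUG_MODE = False      # line 10: True enables debugging (still needs ~20 GB GPU memory)
VISUALIZATION = False   # line 11: must stay False during training
```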
- Training our policy from OpenVLA checkpoints (please fill in each X):

  ```
  torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py --vla_path openvla/openvla-7b --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
  ```
- Finetuning our OmniVLA (please fill in each X):

  ```
  torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py --vla_path ./omnivla-original --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
  ```
We provide training code that supports multiple public datasets. Before following the full training process, please first ensure that you can run the example training with the sample dataloader.
- Download all datasets from their original websites (GNM, LeLaN, Frodobots, CAST). Please verify that the downloaded datasets work properly in their original codebases, except for the BDD dataset. Note that you should download the LeLaN dataset from this link instead of the original link: the updated dataset already includes the NoMaD trajectories used for collision-avoidance supervision, so you no longer need to run the NoMaD policy during training. Please carefully follow the usage procedure described in the LeLaN codebase when working with the dataset.
- Download the modified BDD dataset with MBRA annotations from here and extract it. The image sequences in the modified dataset remain subject to the original BDD license, while the additional MBRA annotations are released under the MIT license.
- Downloading the lerobot codebase for the Frodobots dataset dataloader:

  ```
  git clone https://github.com/huggingface/lerobot.git
  ```
- Edit the data paths in config_nav/mbra_and_dataset_config.yaml.
- Training our policy from OpenVLA checkpoints (please fill in each X):

  ```
  torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla_dataset.py --vla_path ./omnivla-original --dataset_name omnivla --wandb_entity "X" --wandb_project "omnivla"
  ```
In our training setup, we use 8 NVIDIA H100 GPUs (80 GB each) across 8 nodes. The batch sizes are configured as [LeLaN, GNM, Frodobots, BDD] = [4, 1, 1, 1], with gradient accumulation set to 4 steps. When finetuning with the CAST dataset, we set the batch sizes to [LeLaN, CAST, GNM, Frodobots, BDD] = [2, 2, 1, 1, 1]. To do so, you need to edit train_omnivla_dataset.py directly.
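As a sketch, the per-dataset batch sizes edited in the training script might be represented like this; the variable names are ours, not the script's, so locate the corresponding values in the actual file.

```python
# Hypothetical representation of the per-dataset batch sizes; edit the actual
# values directly in vla-scripts/train_omnivla_dataset.py.
pretrain_batch_sizes = {"lelan": 4, "gnm": 1, "frodobots": 1, "bdd": 1}
grad_accumulation_steps = 4

# When finetuning with the CAST dataset:
cast_batch_sizes = {"lelan": 2, "cast": 2, "gnm": 1, "frodobots": 1, "bdd": 1}
```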
We implement our ideas and design choices on top of the pretrained checkpoints. Our work builds upon the OpenVLA-OFT codebase, with additional code added to create OmniVLA. As such, our implementation leverages many components of the OpenVLA-OFT codebase. We sincerely appreciate the effort and contributions of the OpenVLA-OFT team!
@misc{hirose2025omnivla,
title={OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation},
author={Noriaki Hirose and Catherine Glossop and Dhruv Shah and Sergey Levine},
year={2025},
eprint={2509.19480},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.19480},
}