OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

Noriaki Hirose¹,², Catherine Glossop¹, Dhruv Shah³, Sergey Levine¹

¹ UC Berkeley (Berkeley AI Research), ² Toyota Motor North America, ³ Princeton University

IEEE International Conference on Robotics and Automation (ICRA) 2026

Installation

Please set up a conda environment (see instructions in SETUP.md).

Inference

Download our checkpoints and place them in the repository directory. "omnivla-original" contains the OmniVLA checkpoints trained for the paper submission. "omnivla-original-balance" contains OmniVLA checkpoints trained with data balancing for the LeLaN dataset. "omnivla-finetuned-cast" contains checkpoints finetuned on the CAST dataset.
```
git clone https://huggingface.co/NHirose/omnivla-original
git clone https://huggingface.co/NHirose/omnivla-original-balance
git clone https://huggingface.co/NHirose/omnivla-finetuned-cast
```
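If cloning with git is inconvenient (for example, when git-lfs is not installed), the same repositories can in principle be fetched with the `huggingface_hub` Python package. The following is a minimal sketch, not part of the OmniVLA code base; the helper names `local_dir_for` and `download_all` are hypothetical, and it assumes `huggingface_hub` is installed.

```python
# Repositories listed in the git clone commands above.
CHECKPOINT_REPOS = [
    "NHirose/omnivla-original",
    "NHirose/omnivla-original-balance",
    "NHirose/omnivla-finetuned-cast",
]


def local_dir_for(repo_id):
    """Mirror git clone's default: use the repository name as the folder name."""
    return repo_id.split("/")[-1]


def download_all(repos=CHECKPOINT_REPOS):
    """Download each checkpoint repository into a folder named after it."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id in repos:
        snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id))
```

Calling `download_all()` from the repository directory should leave the checkpoints in the same folder layout as the git clone commands.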
Run OmniVLA using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure 1_ex.jpg.
```
python inference/run_omnivla.py
```
Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 560.
Run OmniVLA to control the real robot. Modify "run_omnivla.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.
To try the finetuned checkpoints with the CAST dataset, update the path and step number in "InferenceConfig" within "run_omnivla.py".
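Before forwarding velocity commands to real hardware, it is usually prudent to clamp them to the platform's limits. The sketch below is one way to do this and is not taken from "run_omnivla.py": the function name `limit_velocity` and the default limits of 0.3 are hypothetical placeholders that you should replace with your robot's actual bounds. It scales the linear and angular velocity by a common factor, so the commanded turning radius is preserved.

```python
def limit_velocity(linear, angular, max_linear=0.3, max_angular=0.3):
    """Scale (linear, angular) by one factor so both stay within their limits.

    The limits are hypothetical defaults; set them for your own platform.
    """
    scale = 1.0
    if abs(linear) > max_linear:
        scale = min(scale, max_linear / abs(linear))
    if abs(angular) > max_angular:
        scale = min(scale, max_angular / abs(angular))
    # Using a single factor keeps the ratio linear/angular unchanged.
    return linear * scale, angular * scale


v, w = limit_velocity(0.6, 0.3)
print(v, w)
```

Scaling both components together, rather than clipping each independently, avoids changing the curvature of the commanded path when only one limit is exceeded.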

Inference: OmniVLA-edge

Download our checkpoints and place them in the repository directory.
```
git clone https://huggingface.co/NHirose/omnivla-edge
```
Run OmniVLA-edge using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure 1_ex_omnivla_edge.jpg.
```
python inference/run_omnivla_edge.py
```
Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 425.
Run OmniVLA-edge to control the real robot. Modify "run_omnivla_edge.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.

Training

We provide the training code along with a sample dataloader to help you quickly understand the required data loading structure. Since preparing the full training dataset is resource-intensive, we include this simplified code base for convenience.

Download the MBRA project code base:
```
cd ..
git clone https://github.com/NHirose/Learning-to-Drive-Anywhere-with-MBRA.git
```
Download the MBRA model:
```
cd OmniVLA_internal
git clone https://huggingface.co/NHirose/MBRA/
```

You can set the training or debugging mode at line 10 in vla-scripts/train_omnivla.py. Note that even in debugging mode, the code requires at least 20 GB of GPU memory (we use an NVIDIA RTX 4090).
You can configure visualization at line 11 in vla-scripts/train_omnivla.py. During training, it should be set to False.
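To check the memory requirement before launching a run, a small helper along the following lines may be useful. This is a sketch and not part of the training script: the function names are hypothetical, the 20 GB figure is interpreted as GiB (an assumption), and `check_first_gpu` assumes PyTorch is installed.

```python
REQUIRED_GIB = 20  # figure quoted above, interpreted here as GiB (an assumption)


def has_enough_memory(total_bytes, required_gib=REQUIRED_GIB):
    """Return True if a device with total_bytes of memory meets the requirement."""
    return total_bytes >= required_gib * 1024**3


def check_first_gpu():
    """Query GPU 0 through PyTorch and report whether it meets the requirement."""
    import torch  # imported lazily so has_enough_memory works without PyTorch

    if not torch.cuda.is_available():
        return False
    total = torch.cuda.get_device_properties(0).total_memory
    return has_enough_memory(total)
```

Note that this compares total device memory only; memory already in use by other processes is not taken into account.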

Training our policy from OpenVLA checkpoints (please fill in X):
```
torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py --vla_path openvla/openvla-7b --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
```
Finetuning our OmniVLA (please fill in X):
```
torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py --vla_path ./omnivla-original --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
```
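If you launch training from a Python wrapper, the X placeholders can be filled in programmatically. The sketch below only assembles the argument list using the flags shown in the commands above; the function name `build_train_command` and the example values (1 process, batch size 1, entity "my-team") are hypothetical.

```python
import shlex


def build_train_command(nproc_per_node, batch_size, wandb_entity,
                        vla_path="openvla/openvla-7b"):
    """Assemble the torchrun command from the README with the X values filled in."""
    return [
        "torchrun", "--standalone", "--nnodes", "1",
        "--nproc-per-node", str(nproc_per_node),
        "vla-scripts/train_omnivla.py",
        "--vla_path", vla_path,
        "--dataset_name", "omnivla",
        "--num_images_in_input", "2",
        "--batch_size", str(batch_size),
        "--wandb_entity", wandb_entity,
        "--wandb_project", "omnivla",
    ]


# Hypothetical example values; pass vla_path="./omnivla-original" to finetune.
cmd = build_train_command(1, 1, "my-team")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.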

Memo: finetuning our OmniVLA on our large navigation dataset (the environment name, path, and W&B entity below are from the author's machine; replace them with your own):
```
conda activate omnivla_2
cd /media/noriaki/Noriaki_Data/OmniVLA
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/train_omnivla_dataset.py --vla_path ./omnivla-original --dataset_name omnivla --wandb_entity "noriaki-hirose" --wandb_project "omnivla"
```

Training with GNM, LeLaN, Frodobots, BDD and CAST datasets

We provide training code that supports multiple public datasets. Before following the full training process, please first ensure that you can run the example training with the sample dataloader.

Download the GNM, LeLaN, Frodobots, and CAST datasets from their original websites, and verify that each one works properly in its original codebase (this check does not apply to the BDD dataset). Please download the LeLaN dataset from this link instead of the original link. The updated dataset already includes the NoMaD trajectories used for collision-avoidance supervision, so you no longer need to compute the NoMaD policy during training. Please carefully follow the usage procedure described in the LeLaN codebase when working with the dataset.
Download the modified BDD dataset with MBRA annotations from here and extract it. The image sequences in the modified dataset remain subject to the original BDD license, while the additional MBRA annotations are released under the MIT license.
Download the lerobot code base for the Frodobots dataset dataloader:
```
git clone https://github.com/huggingface/lerobot.git
```
Edit the data paths in config_nav/mbra_and_dataset_config.yaml to point to your downloaded datasets.

Training our policy from OpenVLA checkpoints (please fill in X):
```
torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla_dataset.py --vla_path ./omnivla-original --dataset_name omnivla --wandb_entity "X" --wandb_project "omnivla"
```

In our training setup, we use 8 Nvidia H100 GPUs (80 GB each) across 8 nodes. The batch sizes are configured as [LeLaN, GNM, Frodobots, BDD] = [4, 1, 1, 1], with gradient accumulation set to 4 steps. When finetuning with the CAST dataset, we set the batch sizes to [LeLaN, CAST, GNM, Frodobots, BDD] = [2, 2, 1, 1, 1]. To change these values, edit train_omnivla_dataset.py directly.
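The batch-size figures above can be combined into a rough count of samples per optimizer step. The sketch below is a worked example under stated assumptions, not a description of the training script: it assumes the listed batch sizes are per process and that samples from all datasets are combined in each forward pass. The function name `samples_per_optimizer_step` is hypothetical, and the number of processes is left as a parameter.

```python
def samples_per_optimizer_step(per_dataset_batch, grad_accum_steps=1, num_processes=1):
    """Samples contributing to one optimizer step under the stated assumptions."""
    per_forward = sum(per_dataset_batch.values())
    return per_forward * grad_accum_steps * num_processes


base = {"LeLaN": 4, "GNM": 1, "Frodobots": 1, "BDD": 1}
cast = {"LeLaN": 2, "CAST": 2, "GNM": 1, "Frodobots": 1, "BDD": 1}

# Both configurations use 7 samples per forward pass on each process.
print(sum(base.values()), sum(cast.values()))  # 7 7

# With 4 gradient accumulation steps: 7 * 4 = 28 samples per process.
print(samples_per_optimizer_step(base, grad_accum_steps=4))  # 28
```

Under these assumptions, the CAST configuration keeps the per-process total at 7 by moving two LeLaN slots to CAST, so the memory footprint per forward pass should stay roughly comparable.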

Acknowledgement

We implement our ideas and design choices on top of the pretrained checkpoints. Our work builds upon the OpenVLA-OFT codebase, with additional code added to create OmniVLA. As such, our implementation leverages many components of the OpenVLA-OFT codebase. We sincerely appreciate the effort and contributions of the OpenVLA-OFT team!

Citing

@misc{hirose2025omnivla,
      title={OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation}, 
      author={Noriaki Hirose and Catherine Glossop and Dhruv Shah and Sergey Levine},
      year={2025},
      eprint={2509.19480},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.19480}, 
}
