Skip to content

NHirose/OmniVLA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

Python License: MIT Static Badge

Noriaki Hirose1, 2, Catherine Glossop1, Dhruv Shah3, Sergey Levine1

1 UC Berkeley (Berkeley AI Research), 2 Toyota Motor North America, , 3 Princeton University

IEEE International Conference on Robotics and Automation (ICRA) 2026

Installation

Please set up a conda environment (see instructions in SETUP.md).

Inference

  1. Download our checkpoints and place them in our directory. "omnivla-original" is the trained checkpoints of the OmniVLA for paper submission. "omnivla-original-balance" contains the trained checkpoints of OmniVLA that account for the data balance in the LeLaN dataset. And "omnivla-finetuned-cast" is finetuned checkpoints with the CAST dataset.

    git clone https://huggingface.co/NHirose/omnivla-original
    git clone https://huggingface.co/NHirose/omnivla-original-balance    
    git clone https://huggingface.co/NHirose/omnivla-finetuned-cast
    
  2. Run OmniVLA using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure 1_ex.jpg.

    python inference/run_omnivla.py
    
  3. Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 560.

  4. Run OmniVLA to control the real robot. Modify "run_omnivla.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.

  5. To try the finetuned checkpoints with the CAST dataset, update the path and step number in "InferenceConfig" within "run_omnivla.py".

Inference: OmniVLA-edge

  1. Download our checkpoints and place them in our directory.

    git clone https://huggingface.co/NHirose/omnivla-edge
    
  2. Run OmniVLA-edge using a sample current image, goal images, GPS pose, and language prompt. You can view the generated trajectory in the output figure 1_ex_omnivla_edge.jpg.

    python inference/run_omnivla_edge.py
    
  3. Change the goal modality: by default, our code generates actions based on the language prompt. To use a different modality, you can modify the settings around line 425.

  4. Run OmniVLA to control the real robot. Modify "run_omnivla_edge.py" to update the robot’s state (camera image, GPS signal) and adjust the goal information accordingly. Then, feed the generated velocity commands to your robot.

Training

We provide the training code along with a sample dataloader to help you quickly understand the required data loading structure. Since preparing the full training dataset is resource-intensive, we include this simplified code base for convenience.

  1. Downloading MBRA project code base:

    cd ..
    git clone https://github.com/NHirose/Learning-to-Drive-Anywhere-with-MBRA.git
    
  2. Downloading MBRA model:

    cd OmniVLA_internal
    git clone https://huggingface.co/NHirose/MBRA/
    
  3. You can set the training or debugging mode at line 10 in vla-scripts/train_omnivla.py. Note that even in debugging mode, the code requires at least 20 GB of GPU memory (we use an NVIDIA RTX 4090).

  4. You can configure visualization at line 11 in vla-scripts/train_omnivla.py. During training, it should be set to False.

  5. Training our policy from OpenVLA checkpoints (Please fill X):

    torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py  --vla_path openvla/openvla-7b --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
    
  6. Finetuning our OmniVLA (Please fill X):

    torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla.py  --vla_path ./omnivla-original --dataset_name omnivla --num_images_in_input 2 --batch_size X --wandb_entity "X" --wandb_project "omnivla"
    
  7. Memo finetuning our OmniVLA on our large navigation dataset:

    conda activate omnivla_2
    cd /media/noriaki/Noriaki_Data/OmniVLA
    torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/train_omnivla_dataset.py  --vla_path ./omnivla-original --dataset_name omnivla --wandb_entity "noriaki-hirose"   --wandb_project "omnivla"
    

Training with GNM, LeLaN, Frodobots, BDD and CAST datasets

We provide training code that supports multiple public datasets. Before following the full training process, please first ensure that you can run the example training with the sample dataloader.

  1. Downloading all datasets from the original website. (GNM, LeLaN, Frodobots, CAST) Please verify that the downloaded datasets work properly in their original codebase, except BDD dataset. Note that please download the LeLaN dataset from this link instead of the original link. The updated dataset already includes the NoMaD trajectories used for collision-avoidance supervision, you no longer need to compute the NoMaD policy during training. Please carefully follow the usage procedure described in the LeLaN codebase when working with the dataset.

  2. Downloading the modified BDD dataset with MBRA annotations from here and extract it. The image sequences in the modified dataset remain subject to the original BDD license, while the additional MBRA annotations are released under the MIT license.

  3. Downloading the lerobot code base for the Frodobots dataset dataloader:

    git clone https://github.com/huggingface/lerobot.git 
    
  4. Edit the data path in config_nav/mbra_and_dataset_config.yaml:

  5. Training our policy from OpenVLA checkpoints (Please fill X):

    torchrun --standalone --nnodes 1 --nproc-per-node X vla-scripts/train_omnivla_dataset.py  --vla_path ./omnivla-original --dataset_name omnivla --wandb_entity "X"   --wandb_project "omnivla"
    

In our training setup, we use 8 Nvidia H100 GPUs (80 GB each) across 8 nodes. The batch sizes are configured as [LeLaN, GNM, Frodobots, BDD] = [4, 1, 1, 1], with gradient accumulation set to 4 steps. When finetuning with CAST dataset, we set the batch size as [LeLaN, CAST, GNM, Frodobots, BDD] = [2, 2, 1, 1, 1]. To do so, you need to directly edit train_omnivla_dataset.py.

Acknowledgement

We implement our ideas and design choices on top of the pretrained checkpoints. Our work builds upon the OpenVLA-OFT codebase, with additional code added to create OmniVLA. As such, our implementation leverages many components of the OpenVLA-OFT codebase. We sincerely appreciate the effort and contributions of the OpenVLA-OFT team!

Citing

@misc{hirose2025omnivla,
      title={OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation}, 
      author={Noriaki Hirose and Catherine Glossop and Dhruv Shah and Sergey Levine},
      year={2025},
      eprint={2509.19480},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.19480}, 
}

About

Official repository for OmniVLA training and inference code

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Languages