
3D Aware Region Prompted Vision Language Model (ICLR'26)


An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

SR-3D-Teaser.mp4

💡 Introduction

SR-3D introduces a canonical positional representation shared across single-view and multi-view inputs. This unified representation enables large-scale single-view pretraining and supports the transfer of learned spatial priors to multi-view settings.

Installation

  1. Create the environment
./environment_setup.sh sr-3d

This script will:

  • create a Conda environment named sr-3d,
  • install PyTorch + CUDA and core dependencies,
  • install this repo as a package in editable mode.
  2. Activate the environment
conda activate sr-3d

Data

  1. SR-3D data & annotations — Download from our Hugging Face link.
  2. EmbodiedScan — Follow the EmbodiedScan instructions to obtain the annotations (.pkl) and the scene RGB-D frames.
  3. ScanNet++ (optional) — Only needed for VSI-Bench evaluation.

Expected directory structure

SR-3D-Data/
├── data/                    # Scene RGB-D data per dataset
│   ├── 3rscan/
│   ├── arkitscenes/
│   ├── matterport3d/
│   ├── scannet/
│   └── scannetpp/           # Optional; for VSI-Bench only
├── processed/               # Preprocessed inputs
├── svila_processed/         # VILA-style processed (e.g. SR-3D-Bench)
│   └── sr3d_bench_v1.json
├── embodiedscan/            # EmbodiedScan annotations & assets
│   └── embodiedscan_infos_train.pkl
└── metadata/                # Dataset metadata (e.g. GT boxes)
    └── scannet_train_gt_box.json

Place all data under a single root (e.g. SR-3D-Data/). Before running any scan/video-related evaluation scripts, export the SR3D_VIDEO_ROOT environment variable:

export SR3D_VIDEO_ROOT="/path/to/SR-3D-Data"

All scan/video-related evaluation scripts (scripts/v1_5/eval/*.sh) use this variable to locate data files.
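Before launching a long evaluation run, it can help to confirm the data root actually matches the tree shown above. The sketch below is a hypothetical sanity check, not part of the official tooling; the subdirectory names come from the expected directory structure in this README.

```python
# Sanity-check that SR3D_VIDEO_ROOT points at the expected SR-3D-Data layout.
# Subdirectory names follow the tree shown above; this helper is illustrative.
import os

EXPECTED_SUBDIRS = ["data", "processed", "svila_processed", "embodiedscan", "metadata"]

def check_layout(root):
    """Return the expected subdirectories that are missing under `root`."""
    return [d for d in EXPECTED_SUBDIRS if not os.path.isdir(os.path.join(root, d))]

root = os.environ.get("SR3D_VIDEO_ROOT", "SR-3D-Data")
missing = check_layout(root)
print("layout OK" if not missing else f"missing under {root}: {missing}")
```

Note that optional pieces (e.g. scannetpp/ inside data/) are only needed for VSI-Bench, so this only checks the top-level directories.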

Model Checkpoints

We release three pretrained checkpoints, each optimized for different use cases:

| Checkpoint | Description | Use Case |
| --- | --- | --- |
| a8cheng/sr3d-nvila-8b-singleview-pretrain | Single-view spatial pretraining with strong 2D spatial understanding and region capabilities | Recommended for fine-tuning on downstream tasks |
| a8cheng/sr3d-nvila-8b-multiview-scans | Trained on Scan data series (ScanQA, SQA3D, Scan2Cap, etc.) | Multi-view scene understanding benchmarks |
| a8cheng/sr3d-nvila-8b-multiview-videos | Optimized for VSI-Bench | Video-based spatial reasoning (VSI-Bench) |

Note: We found that ScanQA data is noisy and hurts VSI-Bench performance, so we train separate checkpoints for Scan-based benchmarks vs. video-based benchmarks.

Evaluation

Our evaluation scripts are wrapped with vila-eval. Each benchmark writes:

  • predictions: runs/eval/<model_name>/<task>/*_output.json
  • metrics: runs/eval/<model_name>/<task>/metrics.json
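Given that layout, per-task results can be gathered into one summary after a run. This is a minimal sketch assuming the runs/eval/<model_name>/<task>/metrics.json layout described above; the metric key names inside each metrics.json are not specified here, so the example treats them as opaque.

```python
# Aggregate the per-task metrics.json files written by vila-eval.
# Path layout (runs/eval/<model_name>/<task>/metrics.json) is from this README.
import json
import os
from glob import glob

def collect_metrics(eval_root, model_name):
    """Map each task name to the parsed contents of its metrics.json."""
    summary = {}
    for path in glob(os.path.join(eval_root, model_name, "*", "metrics.json")):
        task = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            summary[task] = json.load(f)
    return summary

# e.g. collect_metrics("runs/eval", "sr3d-nvila-8b-multiview-scans")
```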

Multi-view

SR-3D-Bench

sr3d-bench

SR-3D-Bench is a benchmark for evaluating spatial region understanding and 3D-aware reasoning in vision-language models. It focuses on reasoning over explicitly marked regions across single-view and multi-view visual inputs, requiring models to ground spatial relations, geometry, and semantics jointly.

Evaluation Setup

SR-3D-Bench requires LLM-based evaluation. Please set your OpenAI API key before running the evaluation:

export OPENAI_API_KEY="YOUR_API_KEY"
vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-scans --conv-mode auto --tasks sr3dbench -n 8

This evaluation protocol is aligned with prior region-aware spatial benchmarks such as SpatialRGPT, and follows a free-form QA setting where the model must generate grounded spatial answers.
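To make the LLM-judged step concrete, the sketch below builds a grading prompt for one free-form QA pair. The exact rubric and prompt used by SR-3D-Bench are not shown in this README, so the wording here is hypothetical and illustrative only.

```python
# Hypothetical judge-prompt builder for the free-form QA setting.
# The real SR-3D-Bench grading prompt may differ; this only shows the shape.
def build_judge_prompt(question, reference, prediction):
    """Assemble a grading prompt for an LLM judge (illustrative wording)."""
    return (
        "You are grading a spatial-reasoning answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with 1 if the model answer matches the reference, else 0."
    )
```

The returned string would then be sent to the OpenAI API authenticated with the OPENAI_API_KEY exported above.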

Important

💡 To improve compatibility with standard VLM evaluation toolchains, we also provide a multiple-choice (MC) version of SR-3D-Bench.

  • Each video is pre-annotated with a set-of-marks, explicitly specifying the target regions.
  • Questions are converted into multiple-choice format, which can be evaluated without LLM.
  • The evaluation protocol closely follows VSI-Bench, making it easy to integrate with existing VLM benchmarks and tooling (e.g. lmms-eval, VLMEvalKit).

Check out the MC & SoM variant here.
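Because the MC variant needs no LLM judge, scoring reduces to matching an extracted option letter against the ground truth. The heuristic below is an assumption for illustration, not the official SR-3D-Bench MC protocol.

```python
# LLM-free multiple-choice scoring sketch for the MC variant described above.
# The option-letter extraction heuristic is an assumption, not the official protocol.
import re

def extract_choice(prediction):
    """Pull the first standalone option letter (A-D) from a model response."""
    m = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return m.group(1) if m else None

def mc_accuracy(predictions, answers):
    """Fraction of predictions whose extracted letter matches the ground truth."""
    correct = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers) if answers else 0.0
```

VSI-Bench-style harnesses such as lmms-eval apply the same kind of letter matching, which is why the MC variant slots into those toolchains directly.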

ScanQA, SQA3D, Scan2Cap

vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-scans --conv-mode auto --tasks scanqa,sqa3d,scan2cap -n 8

VSI-Bench

vsibench
vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-videos --conv-mode auto --tasks vsibench -n 8

Single-view Spatial

You can run SAT, EmbSpatial, or BLINK_S separately with the --tasks argument:

vila-eval --model-path a8cheng/sr3d-nvila-8b-singleview-pretrain --conv-mode auto --tasks blink_val -n 8

or run all single-view spatial benchmarks with:

vila-eval --model-path a8cheng/sr3d-nvila-8b-singleview-pretrain --conv-mode auto --tags-include single_view_spatial -n 8

📜 Citation

@inproceedings{cheng2026sr3d,
  title={3D Aware Region Prompted Vision Language Model},
  author={An-Chieh Cheng and Yang Fu and Yukang Chen and Zhijian Liu and Xiaolong Li and Subhashree Radhakrishnan and Song Han and Yao Lu and Jan Kautz and Pavlo Molchanov and Hongxu Yin and Xiaolong Wang and Sifei Liu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

🙏 Acknowledgement

Our codebase uses code snippets from several repositories, especially VILA, NVILA, Video3DLLM, LLaVA-3D, and EmbodiedScan. We would like to thank the authors of these repositories for their excellent work.
