
3D Aware Region Prompted Vision Language Model (ICLR'26)


An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

SR-3D-Teaser.mp4

💡 Introduction

SR-3D introduces a canonical positional representation shared across single-view and multi-view inputs. This unified representation enables large-scale single-view pretraining and supports the transfer of learned spatial priors to multi-view settings.

Installation

  1. Create the environment
./environment_setup.sh sr-3d

This script will:

  • create a Conda environment named sr-3d,
  • install PyTorch + CUDA and core dependencies,
  • install this repo as a package in editable mode.
  2. Activate the environment
conda activate sr-3d

Data

  1. SR-3D data & annotations — Download from our Hugging Face link.
  2. EmbodiedScan — Follow the EmbodiedScan instructions to obtain the annotations (.pkl) and the scene RGB-D frames.
  3. ScanNet++ (optional) — Only needed for VSI-Bench evaluation.

Expected directory structure

SR-3D-Data/
├── data/                    # Scene RGB-D data per dataset
│   ├── 3rscan/
│   ├── arkitscenes/
│   ├── matterport3d/
│   ├── scannet/
│   └── scannetpp/           # Optional; for VSI-Bench only
├── processed/               # Preprocessed inputs
├── svila_processed/         # VILA-style processed (e.g. SR-3D-Bench)
│   └── sr3d_bench_v1.json
├── embodiedscan/            # EmbodiedScan annotations & assets
│   └── embodiedscan_infos_train.pkl
└── metadata/                # Dataset metadata (e.g. GT boxes)
    └── scannet_train_gt_box.json

Place all data under a single root (e.g. SR-3D-Data/). Before running any scan/video-related evaluation scripts, export the SR3D_VIDEO_ROOT environment variable:

export SR3D_VIDEO_ROOT="/path/to/SR-3D-Data"

All scan/video-related evaluation scripts (scripts/v1_5/eval/*.sh) use this variable to locate data files.
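Before launching a long evaluation run, it can help to confirm the data root actually matches the tree shown above. The sketch below is a hypothetical sanity check, not part of the official tooling; the subdirectory names come from the expected directory structure in this README.

```python
# Sanity-check that SR3D_VIDEO_ROOT points at the expected SR-3D-Data layout.
# Subdirectory names follow the tree shown above; this helper is illustrative.
import os

EXPECTED_SUBDIRS = ["data", "processed", "svila_processed", "embodiedscan", "metadata"]

def check_layout(root):
    """Return the expected subdirectories that are missing under `root`."""
    return [d for d in EXPECTED_SUBDIRS if not os.path.isdir(os.path.join(root, d))]

root = os.environ.get("SR3D_VIDEO_ROOT", "SR-3D-Data")
missing = check_layout(root)
print("layout OK" if not missing else f"missing under {root}: {missing}")
```

Note that optional pieces (e.g. scannetpp/ inside data/) are only needed for VSI-Bench, so this only checks the top-level directories.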

Model Checkpoints

We release three pretrained checkpoints, each optimized for different use cases:

| Checkpoint | Description | Use Case |
| --- | --- | --- |
| a8cheng/sr3d-nvila-8b-singleview-pretrain | Single-view spatial pretraining with strong 2D spatial understanding and region capabilities | Recommended for fine-tuning on downstream tasks |
| a8cheng/sr3d-nvila-8b-multiview-scans | Trained on Scan data series (ScanQA, SQA3D, Scan2Cap, etc.) | Multi-view scene understanding benchmarks |
| a8cheng/sr3d-nvila-8b-multiview-videos | Optimized for VSI-Bench | Video-based spatial reasoning (VSI-Bench) |

Note: We found that ScanQA data is noisy and hurts VSI-Bench performance, so we train separate checkpoints for Scan-based benchmarks vs. video-based benchmarks.

Evaluation

Our evaluation scripts are wrapped with vila-eval. Each benchmark writes:

  • predictions: runs/eval/<model_name>/<task>/*_output.json
  • metrics: runs/eval/<model_name>/<task>/metrics.json
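Given that layout, per-task results can be gathered into one summary after a run. This is a minimal sketch assuming the runs/eval/<model_name>/<task>/metrics.json layout described above; the metric key names inside each metrics.json are not specified here, so the example treats them as opaque.

```python
# Aggregate the per-task metrics.json files written by vila-eval.
# Path layout (runs/eval/<model_name>/<task>/metrics.json) is from this README.
import json
import os
from glob import glob

def collect_metrics(eval_root, model_name):
    """Map each task name to the parsed contents of its metrics.json."""
    summary = {}
    for path in glob(os.path.join(eval_root, model_name, "*", "metrics.json")):
        task = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            summary[task] = json.load(f)
    return summary

# e.g. collect_metrics("runs/eval", "sr3d-nvila-8b-multiview-scans")
```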

Multi-view

SR-3D-Bench

sr3d-bench

SR-3D-Bench is a benchmark for evaluating spatial region understanding and 3D-aware reasoning in vision-language models. It focuses on reasoning over explicitly marked regions across single-view and multi-view visual inputs, requiring models to ground spatial relations, geometry, and semantics jointly.

Evaluation Setup

SR-3D-Bench requires LLM-based evaluation. Please set your OpenAI API key before running the evaluation:

export OPENAI_API_KEY="YOUR_API_KEY"
vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-scans --conv-mode auto --tasks sr3dbench -n 8

This evaluation protocol is aligned with prior region-aware spatial benchmarks such as SpatialRGPT, and follows a free-form QA setting where the model must generate grounded spatial answers.
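To make the LLM-judged step concrete, the sketch below builds a grading prompt for one free-form QA pair. The exact rubric and prompt used by SR-3D-Bench are not shown in this README, so the wording here is hypothetical and illustrative only.

```python
# Hypothetical judge-prompt builder for the free-form QA setting.
# The real SR-3D-Bench grading prompt may differ; this only shows the shape.
def build_judge_prompt(question, reference, prediction):
    """Assemble a grading prompt for an LLM judge (illustrative wording)."""
    return (
        "You are grading a spatial-reasoning answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with 1 if the model answer matches the reference, else 0."
    )
```

The returned string would then be sent to the OpenAI API authenticated with the OPENAI_API_KEY exported above.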

Important

💡 To improve compatibility with standard VLM evaluation toolchains, we also provide a multiple-choice (MC) version of SR-3D-Bench.

  • Each video is pre-annotated with a set-of-marks, explicitly specifying the target regions.
  • Questions are converted into multiple-choice format, which can be evaluated without LLM.
  • The evaluation protocol closely follows VSI-Bench, making it easy to integrate with existing VLM benchmarks and tooling (e.g. lmms-eval, VLMEvalKit).

Check out the MC & SoM variant here.
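Because the MC variant needs no LLM judge, scoring reduces to matching an extracted option letter against the ground truth. The heuristic below is an assumption for illustration, not the official SR-3D-Bench MC protocol.

```python
# LLM-free multiple-choice scoring sketch for the MC variant described above.
# The option-letter extraction heuristic is an assumption, not the official protocol.
import re

def extract_choice(prediction):
    """Pull the first standalone option letter (A-D) from a model response."""
    m = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return m.group(1) if m else None

def mc_accuracy(predictions, answers):
    """Fraction of predictions whose extracted letter matches the ground truth."""
    correct = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers) if answers else 0.0
```

VSI-Bench-style harnesses such as lmms-eval apply the same kind of letter matching, which is why the MC variant slots into those toolchains directly.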

ScanQA, SQA3D, Scan2Cap

vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-scans --conv-mode auto --tasks scanqa,sqa3d,scan2cap -n 8

VSI-Bench

vsibench
vila-eval --model-path a8cheng/sr3d-nvila-8b-multiview-videos --conv-mode auto --tasks vsibench -n 8

Single-view Spatial

You can run SAT, EmbSpatial, or BLINK_S separately with the --tasks argument:

vila-eval --model-path a8cheng/sr3d-nvila-8b-singleview-pretrain --conv-mode auto --tasks blink_val -n 8

or run all single-view spatial benchmarks with:

vila-eval --model-path a8cheng/sr3d-nvila-8b-singleview-pretrain --conv-mode auto --tags-include single_view_spatial -n 8

📜 Citation

@inproceedings{cheng2026sr3d,
  title={3D Aware Region Prompted Vision Language Model},
  author={An-Chieh Cheng and Yang Fu and Yukang Chen and Zhijian Liu and Xiaolong Li and Subhashree Radhakrishnan and Song Han and Yao Lu and Jan Kautz and Pavlo Molchanov and Hongxu Yin and Xiaolong Wang and Sifei Liu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

🙏 Acknowledgement

Our codebase uses code snippets from several repositories, especially VILA, NVILA, Video3DLLM, LLaVA-3D, and EmbodiedScan. We would like to thank the authors of these repositories for their excellent work.
