Pure PyTorch reimplementation of BEVFusion (MIT HAN Lab) for inference-only use. Loads the official pretrained weights and runs 3D object detection on nuScenes data, producing 10-class detections (car, truck, bus, pedestrian, etc.) in bird's-eye view.
This is not a fork of the original repo. The model architecture was rebuilt from scratch using PyTorch, timm, and spconv v2, with a custom weight-loading layer that maps the official checkpoint keys to the new structure.
```
Camera (6x) ──► Swin-T ──► FPN ──► LSS Depth ──► Camera BEV
                                                     │
                                                     ▼
                                                 ConvFuser ──► SECOND Backbone ──► SECOND FPN ──► TransFusion Head ──► Detections
                                                     ▲
                                                     │
LiDAR ──► Voxelize ──► Sparse 3D Conv Encoder ──► LiDAR BEV
```
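The fusion junction in the diagram (ConvFuser) is channel concatenation followed by a Conv2d + BN + ReLU block. A minimal sketch, assuming hypothetical channel counts; the class name `ConvFuserSketch` and the default shapes are illustrative, not the exact `src/fuser.py` API:

```python
import torch
from torch import nn

class ConvFuserSketch(nn.Module):
    """Illustrative camera-LiDAR BEV fusion: concat -> Conv2d -> BN -> ReLU."""

    def __init__(self, cam_ch=80, lidar_ch=256, fused_ch=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, fused_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Both branches must produce BEV grids of the same spatial size.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```

The fused map is what the SECOND backbone/FPN and TransFusion head consume downstream.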
| Component | Implementation |
|---|---|
| Camera backbone | Swin Transformer Tiny via timm (swin_tiny_patch4_window7_224, 256x704 input) |
| Camera neck | Generalized LSS FPN (multi-scale feature fusion) |
| Camera-to-BEV | Lift-Splat-Shoot with 118 depth bins (1.0-60.0m), scatter_add pooling |
| LiDAR voxelization | Pure PyTorch hash-based hard voxelization |
| LiDAR encoder | Sparse 3D convolutions via spconv v2 (4 stages: 16→32→64→128 channels) |
| Fusion | Channel concatenation + Conv2d + BN + ReLU |
| BEV decoder | SECOND backbone + SECOND FPN |
| Detection head | TransFusion (heatmap proposals + transformer decoder, 200 queries) |
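The camera-to-BEV step relies on `scatter_add`-style pooling: each lifted camera feature point is assigned a BEV cell, and features that land in the same cell are summed. A self-contained sketch under assumed shapes — the function name and signature here are illustrative, not the exact `src/ops.py` API, which also handles batching and out-of-range points:

```python
import torch

def bev_pool_scatter(features, coords, grid_h, grid_w):
    """Sum per-point features into a BEV grid via scatter_add_.

    features: (N, C) features for N lifted points kept in range
    coords:   (N, 2) integer (row, col) BEV cell index per point
    returns:  (C, grid_h, grid_w) BEV feature map
    """
    n, c = features.shape
    flat = coords[:, 0] * grid_w + coords[:, 1]          # (N,) flattened cell index
    out = features.new_zeros(grid_h * grid_w, c)
    out.scatter_add_(0, flat.unsqueeze(1).expand(n, c), features)
    return out.view(grid_h, grid_w, c).permute(2, 0, 1)
```

Summation (rather than max) keeps the operation differentiable and order-independent, which is why LSS-style pipelines use it.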
This was built and tested on the following setup:
| Item | Details |
|---|---|
| OS | Windows 11 + WSL2 (Ubuntu, kernel 6.6.87.2-microsoft-standard-WSL2) |
| GPU | NVIDIA GPU with CUDA 13.0, driver 581.42 |
| Python | 3.10 |
| PyTorch | 2.7.1+cu126 |
| spconv | spconv-cu120 (v2) |
| timm | latest |
| GPU memory | ~1.7 GB for single-sample inference |
```bash
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
uv pip install timm nuscenes-devkit pyquaternion matplotlib Pillow
uv pip install spconv-cu120
uv pip install -e .
```

Then fetch the pretrained weights:

```bash
bash scripts/download_models.sh
```

This downloads `bevfusion-det.pth` (~157 MB) into `pretrained/`.
Download nuScenes mini from https://www.nuscenes.org/nuscenes#download and extract:
```bash
mkdir -p /path/to/nuscenes
cd /path/to/nuscenes
tar xzf v1.0-mini.tgz
```

The directory should contain `v1.0-mini/`, `samples/`, and `sweeps/`.
Single sample:

```bash
python scripts/run_inference.py --dataroot /path/to/nuscenes --num-samples 1
```

Batch across all scenes (3 samples per scene):

```bash
python scripts/run_batch.py --dataroot /path/to/nuscenes --per-scene 3
```

Inference + visualization in one shot:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --sample-idx 2
```

Visualize from saved `.npz` results:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --result outputs/sample_0002.npz
```

Batch-visualize all results:

```bash
python scripts/visualize_all.py --dataroot /path/to/nuscenes
```

Outputs are saved to `outputs/` as `vis_XXXX.png` (camera views + BEV plot) and `sample_XXXX.npz` (raw detections).
```bash
pytest tests/ -v
```

21 tests total: 15 component-level unit tests + 6 integration tests (model instantiation, weight loading, end-to-end forward pass). Requires a CUDA GPU.
```
src/
  config.py           # BEVFusionConfig dataclass (all hyperparameters)
  model.py            # BEVFusion model + checkpoint key remapping
  camera_encoder.py   # Swin-T backbone, FPN neck, LSS depth transform
  lidar_encoder.py    # Sparse 3D conv encoder (spconv v2)
  fuser.py            # Camera-LiDAR BEV fusion
  decoder.py          # SECOND backbone + FPN
  head.py             # TransFusion detection head
  ops.py              # BEV pooling (scatter_add), voxelization
scripts/
  run_inference.py    # Full inference pipeline with nuScenes data loading
  run_batch.py        # Batch inference across multiple scenes
  visualize.py        # 3D box projection on camera views + BEV plot
  visualize_all.py    # Batch visualization from saved results
  test_pretrained.py  # Quick smoke test with synthetic data
  download_models.sh  # Download pretrained weights
tests/
  test_components.py  # Unit tests for each module
  test_model.py       # Integration tests
```
The pretrained checkpoint uses mmcv/mmdet3d conventions. The custom remapping handles:
- mmcv Swin → timm Swin: `stages.{i}` → `layers_{i}`, `attn.w_msa` → `attn`, `ffn.layers` → `mlp.fc{1,2}`
- spconv v1 → v2: 5D kernel weights transposed from `(k,k,k,in,out)` to `(out,k,k,k,in)`
- mmcv ConvModule: checkpoint uses `.conv.`/`.bn.` sub-keys, matched by a custom `ConvModule` class
Result: 564/568 parameters matched (99.3%). The 4 unmatched are computed buffers (bev_pos, voxelization grid constants).
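As an illustration of the Swin key translation, the renaming can be expressed as a few regex rules. These patterns are a hedged sketch: the real remapping in `src/model.py` covers more cases (patch embedding, relative position tables, norm layers), and the exact `ffn` sub-indices follow mmcv's FFN layout:

```python
import re

# Hypothetical subset of the mmcv -> timm Swin key translation rules.
RULES = [
    (r"\bstages\.(\d+)\.", r"layers_\1."),   # stages.{i} -> layers_{i}
    (r"\battn\.w_msa\.", "attn."),           # unwrap mmcv's WindowMSA wrapper
    (r"\bffn\.layers\.0\.0\.", "mlp.fc1."),  # first FFN Linear -> mlp.fc1
    (r"\bffn\.layers\.1\.", "mlp.fc2."),     # second FFN Linear -> mlp.fc2
]

def remap_key(mmcv_key: str) -> str:
    """Translate one mmcv Swin checkpoint key into timm naming."""
    for pattern, repl in RULES:
        mmcv_key = re.sub(pattern, repl, mmcv_key)
    return mmcv_key
```

The spconv fixup in the second bullet is a single permute of each 5D kernel, e.g. `w.permute(4, 0, 1, 2, 3)` to go from `(k,k,k,in,out)` to `(out,k,k,k,in)`.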