# BEVFusion Inference

Pure PyTorch reimplementation of BEVFusion (MIT HAN Lab) for inference-only use. Loads the official pretrained weights and runs 3D object detection on nuScenes data, producing 10-class detections (car, truck, bus, pedestrian, etc.) in bird's-eye view.

This is not a fork of the original repo. The model architecture was rebuilt from scratch using PyTorch, timm, and spconv v2, with a custom weight-loading layer that maps the official checkpoint keys to the new structure.

## Architecture

```
Camera (6x) ──► Swin-T ──► FPN ──► LSS Depth ──► Camera BEV
                                                       │
                                                       ▼
                                                    ConvFuser ──► SECOND Backbone ──► SECOND FPN ──► TransFusion Head ──► Detections
                                                       ▲
                                                       │
LiDAR ──► Voxelize ──► Sparse 3D Conv Encoder ──► LiDAR BEV
```
| Component | Implementation |
|---|---|
| Camera backbone | Swin Transformer Tiny via timm (`swin_tiny_patch4_window7_224`, 256x704 input) |
| Camera neck | Generalized LSS FPN (multi-scale feature fusion) |
| Camera-to-BEV | Lift-Splat-Shoot with 118 depth bins (1.0-60.0 m), scatter_add pooling |
| LiDAR voxelization | Pure PyTorch hash-based hard voxelization |
| LiDAR encoder | Sparse 3D convolutions via spconv v2 (4 stages: 16→32→64→128 channels) |
| Fusion | Channel concatenation + Conv2d + BN + ReLU |
| BEV decoder | SECOND backbone + SECOND FPN |
| Detection head | TransFusion (heatmap proposals + transformer decoder, 200 queries) |
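The camera-to-BEV row refers to scatter_add pooling: every frustum feature whose projected point lands in the same BEV cell is summed. A minimal pure-Python sketch of that idea (grid size and inputs are made up for illustration; the actual implementation in `src/ops.py` does this with scatter_add on flattened cell indices of GPU tensors):

```python
# Pure-Python sketch of scatter_add-style BEV pooling (illustrative only):
# features whose frustum points fall into the same BEV cell are accumulated.

def bev_pool(cells, feats, grid_w, grid_h):
    """cells: list of (x, y) BEV cell indices; feats: matching list of floats."""
    bev = [[0.0] * grid_w for _ in range(grid_h)]
    for (x, y), f in zip(cells, feats):
        if 0 <= x < grid_w and 0 <= y < grid_h:  # drop points outside the grid
            bev[y][x] += f                       # scatter-add into the cell
    return bev

# Two points fall into cell (1, 0) and are summed: 0.5 + 0.25 = 0.75.
grid = bev_pool([(1, 0), (1, 0), (3, 1)], [0.5, 0.25, 1.0], grid_w=4, grid_h=2)
```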

## Verified Environment

This was built and tested on the following setup:

| Item | Version / details |
|---|---|
| OS | Windows 11 + WSL2 (Ubuntu, kernel 6.6.87.2-microsoft-standard-WSL2) |
| GPU | NVIDIA GPU with CUDA 13.0, driver 581.42 |
| Python | 3.10 |
| PyTorch | 2.7.1+cu126 |
| spconv | spconv-cu120 (v2) |
| timm | latest |
| GPU memory | ~1.7 GB for single-sample inference |

## Quick Start

### 1. Environment setup

```bash
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
uv pip install timm nuscenes-devkit pyquaternion matplotlib Pillow
uv pip install spconv-cu120
uv pip install -e .
```

### 2. Download pretrained weights

```bash
bash scripts/download_models.sh
```

This downloads `bevfusion-det.pth` (~157 MB) into `pretrained/`.

### 3. Get nuScenes mini dataset

Download nuScenes mini from https://www.nuscenes.org/nuscenes#download and extract:

```bash
mkdir -p /path/to/nuscenes
cd /path/to/nuscenes
tar xzf v1.0-mini.tgz
```

The directory should contain `v1.0-mini/`, `samples/`, and `sweeps/`.

### 4. Run inference

Single sample:

```bash
python scripts/run_inference.py --dataroot /path/to/nuscenes --num-samples 1
```

Batch across all scenes (3 samples per scene):

```bash
python scripts/run_batch.py --dataroot /path/to/nuscenes --per-scene 3
```

### 5. Visualize

Inference + visualization in one shot:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --sample-idx 2
```

Visualize from saved `.npz` results:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --result outputs/sample_0002.npz
```

Batch-visualize all results:

```bash
python scripts/visualize_all.py --dataroot /path/to/nuscenes
```

Outputs are saved to `outputs/` as `vis_XXXX.png` (camera views + BEV plot) and `sample_XXXX.npz` (raw detections).
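The saved `.npz` files can also be inspected directly with NumPy. A hedged sketch of the idea, using fabricated in-memory data so the snippet is self-contained; the array key names (`boxes`, `scores`, `labels`) and the 9-value box layout are assumptions here, not the repo's confirmed format — check `scripts/run_inference.py` for the real ones:

```python
import io

import numpy as np

# Fabricate a tiny result in memory so the snippet is self-contained; in
# practice you would call np.load("outputs/sample_0002.npz") instead.
# Key names and the 9-value box layout are assumptions, not the repo's API.
buf = io.BytesIO()
np.savez(buf,
         boxes=np.zeros((3, 9)),           # e.g. x, y, z, dx, dy, dz, yaw, vx, vy
         scores=np.array([0.9, 0.2, 0.5]),
         labels=np.array([0, 3, 1]))
buf.seek(0)

det = np.load(buf)
keep = det["scores"] > 0.3                 # simple confidence filter
boxes, labels = det["boxes"][keep], det["labels"][keep]
```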

## Tests

```bash
pytest tests/ -v
```

21 tests total: 15 component-level unit tests + 6 integration tests (model instantiation, weight loading, end-to-end forward pass). Requires a CUDA GPU.

## Project Structure

```
src/
  config.py           # BEVFusionConfig dataclass (all hyperparameters)
  model.py            # BEVFusion model + checkpoint key remapping
  camera_encoder.py   # Swin-T backbone, FPN neck, LSS depth transform
  lidar_encoder.py    # Sparse 3D conv encoder (spconv v2)
  fuser.py            # Camera-LiDAR BEV fusion
  decoder.py          # SECOND backbone + FPN
  head.py             # TransFusion detection head
  ops.py              # BEV pooling (scatter_add), voxelization
scripts/
  run_inference.py    # Full inference pipeline with nuScenes data loading
  run_batch.py        # Batch inference across multiple scenes
  visualize.py        # 3D box projection on camera views + BEV plot
  visualize_all.py    # Batch visualization from saved results
  test_pretrained.py  # Quick smoke test with synthetic data
  download_models.sh  # Download pretrained weights
tests/
  test_components.py  # Unit tests for each module
  test_model.py       # Integration tests
```

## Weight Loading

The pretrained checkpoint uses mmcv/mmdet3d conventions. The custom remapping handles:

- mmcv Swin → timm Swin: `stages.{i}` → `layers_{i}`, `attn.w_msa` → `attn`, `ffn.layers` → `mlp.fc{1,2}`
- spconv v1 → v2: 5D kernel weights transposed from `(k,k,k,in,out)` to `(out,k,k,k,in)`
- mmcv ConvModule: checkpoint uses `.conv.`/`.bn.` sub-keys, matched by a custom `ConvModule` class

Result: 564/568 parameters matched (99.3%). The 4 unmatched are computed buffers (`bev_pos`, voxelization grid constants).
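The first two rules can be sketched as follows. The key strings and substitutions are simplified examples, not the repo's full mapping table (see `src/model.py` for the actual remapping):

```python
import re

import numpy as np

def remap_swin_key(key: str) -> str:
    # Simplified mmcv-Swin -> timm-Swin renames: the stage index and the
    # attention submodule names change. The real mapping handles more cases
    # (e.g. ffn.layers -> mlp.fc{1,2}); this shows only the pattern.
    key = re.sub(r"stages\.(\d+)", r"layers_\1", key)
    return key.replace("attn.w_msa.", "attn.")

print(remap_swin_key("stages.0.blocks.1.attn.w_msa.qkv.weight"))
# -> layers_0.blocks.1.attn.qkv.weight

# spconv v1 stores conv kernels as (k, k, k, in, out); v2 expects
# (out, k, k, k, in), so the output-channel axis is moved to the front.
w_v1 = np.zeros((3, 3, 3, 16, 32))
w_v2 = np.transpose(w_v1, (4, 0, 1, 2, 3))
print(w_v2.shape)  # -> (32, 3, 3, 3, 16)
```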
