Pure PyTorch reimplementation of BEVFusion (MIT HAN Lab) for inference-only use. Loads the official pretrained weights and runs 3D object detection on nuScenes data, producing 10-class detections (car, truck, bus, pedestrian, etc.) in bird's-eye view.
This is not a fork of the original repo. The model architecture was rebuilt from scratch using PyTorch, timm, and spconv v2, with a custom weight-loading layer that maps the official checkpoint keys to the new structure.
```
Camera (6x) ──► Swin-T ──► FPN ──► LSS Depth ──► Camera BEV
                                                     │
                                                     ▼
                                                 ConvFuser ──► SECOND Backbone ──► SECOND FPN ──► TransFusion Head ──► Detections
                                                     ▲
                                                     │
LiDAR ──► Voxelize ──► Sparse 3D Conv Encoder ──► LiDAR BEV
```
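The fusion junction in the diagram (ConvFuser) is channel concatenation followed by a Conv2d + BN + ReLU block. A minimal sketch, assuming hypothetical channel counts; the class name `ConvFuserSketch` and the default shapes are illustrative, not the exact `src/fuser.py` API:

```python
import torch
from torch import nn

class ConvFuserSketch(nn.Module):
    """Illustrative camera-LiDAR BEV fusion: concat -> Conv2d -> BN -> ReLU."""

    def __init__(self, cam_ch=80, lidar_ch=256, fused_ch=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, fused_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Both branches must produce BEV grids of the same spatial size.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```

The fused map is what the SECOND backbone/FPN and TransFusion head consume downstream.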
| Component | Implementation |
|---|---|
| Camera backbone | Swin Transformer Tiny via timm (swin_tiny_patch4_window7_224, 256x704 input) |
| Camera neck | Generalized LSS FPN (multi-scale feature fusion) |
| Camera-to-BEV | Lift-Splat-Shoot with 118 depth bins (1.0-60.0m), scatter_add pooling |
| LiDAR voxelization | Pure PyTorch hash-based hard voxelization |
| LiDAR encoder | Sparse 3D convolutions via spconv v2 (4 stages: 16→32→64→128 channels) |
| Fusion | Channel concatenation + Conv2d + BN + ReLU |
| BEV decoder | SECOND backbone + SECOND FPN |
| Detection head | TransFusion (heatmap proposals + transformer decoder, 200 queries) |
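The camera-to-BEV step relies on `scatter_add`-style pooling: each lifted camera feature point is assigned a BEV cell, and features that land in the same cell are summed. A self-contained sketch under assumed shapes — the function name and signature here are illustrative, not the exact `src/ops.py` API, which also handles batching and out-of-range points:

```python
import torch

def bev_pool_scatter(features, coords, grid_h, grid_w):
    """Sum per-point features into a BEV grid via scatter_add_.

    features: (N, C) features for N lifted points kept in range
    coords:   (N, 2) integer (row, col) BEV cell index per point
    returns:  (C, grid_h, grid_w) BEV feature map
    """
    n, c = features.shape
    flat = coords[:, 0] * grid_w + coords[:, 1]          # (N,) flattened cell index
    out = features.new_zeros(grid_h * grid_w, c)
    out.scatter_add_(0, flat.unsqueeze(1).expand(n, c), features)
    return out.view(grid_h, grid_w, c).permute(2, 0, 1)
```

Summation (rather than max) keeps the operation differentiable and order-independent, which is why LSS-style pipelines use it.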
This was built and tested on the following setup:
| Item | Details |
|---|---|
| OS | Windows 11 + WSL2 (Ubuntu, kernel 6.6.87.2-microsoft-standard-WSL2) |
| GPU | NVIDIA GPU with CUDA 13.0, driver 581.42 |
| Python | 3.10 |
| PyTorch | 2.7.1+cu126 |
| spconv | spconv-cu120 (v2) |
| timm | latest |
| GPU memory | ~1.7 GB for single-sample inference |
```bash
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
uv pip install timm nuscenes-devkit pyquaternion matplotlib Pillow
uv pip install spconv-cu120
uv pip install -e .
```

Then fetch the pretrained weights:

```bash
bash scripts/download_models.sh
```

This downloads `bevfusion-det.pth` (~157 MB) into `pretrained/`.
Download nuScenes mini from https://www.nuscenes.org/nuscenes#download and extract:
```bash
mkdir -p /path/to/nuscenes
cd /path/to/nuscenes
tar xzf v1.0-mini.tgz
```

The directory should contain `v1.0-mini/`, `samples/`, and `sweeps/`.
Single sample:

```bash
python scripts/run_inference.py --dataroot /path/to/nuscenes --num-samples 1
```

Batch across all scenes (3 samples per scene):

```bash
python scripts/run_batch.py --dataroot /path/to/nuscenes --per-scene 3
```

Inference + visualization in one shot:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --sample-idx 2
```

Visualize from saved `.npz` results:

```bash
python scripts/visualize.py --dataroot /path/to/nuscenes --result outputs/sample_0002.npz
```

Batch-visualize all results:

```bash
python scripts/visualize_all.py --dataroot /path/to/nuscenes
```

Outputs are saved to `outputs/` as `vis_XXXX.png` (camera views + BEV plot) and `sample_XXXX.npz` (raw detections).
```bash
pytest tests/ -v
```

21 tests total: 15 component-level unit tests + 6 integration tests (model instantiation, weight loading, end-to-end forward pass). Requires a CUDA GPU.
```
src/
  config.py           # BEVFusionConfig dataclass (all hyperparameters)
  model.py            # BEVFusion model + checkpoint key remapping
  camera_encoder.py   # Swin-T backbone, FPN neck, LSS depth transform
  lidar_encoder.py    # Sparse 3D conv encoder (spconv v2)
  fuser.py            # Camera-LiDAR BEV fusion
  decoder.py          # SECOND backbone + FPN
  head.py             # TransFusion detection head
  ops.py              # BEV pooling (scatter_add), voxelization
scripts/
  run_inference.py    # Full inference pipeline with nuScenes data loading
  run_batch.py        # Batch inference across multiple scenes
  visualize.py        # 3D box projection on camera views + BEV plot
  visualize_all.py    # Batch visualization from saved results
  test_pretrained.py  # Quick smoke test with synthetic data
  download_models.sh  # Download pretrained weights
tests/
  test_components.py  # Unit tests for each module
  test_model.py       # Integration tests
```
The pretrained checkpoint uses mmcv/mmdet3d conventions. The custom remapping handles:
- mmcv Swin → timm Swin: `stages.{i}` → `layers_{i}`, `attn.w_msa` → `attn`, `ffn.layers` → `mlp.fc{1,2}`
- spconv v1 → v2: 5D kernel weights transposed from `(k,k,k,in,out)` to `(out,k,k,k,in)`
- mmcv ConvModule: checkpoint uses `.conv.`/`.bn.` sub-keys, matched by a custom `ConvModule` class
Result: 564/568 parameters matched (99.3%). The 4 unmatched are computed buffers (bev_pos, voxelization grid constants).
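As an illustration of the Swin key translation, the renaming can be expressed as a few regex rules. These patterns are a hedged sketch: the real remapping in `src/model.py` covers more cases (patch embedding, relative position tables, norm layers), and the exact `ffn` sub-indices follow mmcv's FFN layout:

```python
import re

# Hypothetical subset of the mmcv -> timm Swin key translation rules.
RULES = [
    (r"\bstages\.(\d+)\.", r"layers_\1."),   # stages.{i} -> layers_{i}
    (r"\battn\.w_msa\.", "attn."),           # unwrap mmcv's WindowMSA wrapper
    (r"\bffn\.layers\.0\.0\.", "mlp.fc1."),  # first FFN Linear -> mlp.fc1
    (r"\bffn\.layers\.1\.", "mlp.fc2."),     # second FFN Linear -> mlp.fc2
]

def remap_key(mmcv_key: str) -> str:
    """Translate one mmcv Swin checkpoint key into timm naming."""
    for pattern, repl in RULES:
        mmcv_key = re.sub(pattern, repl, mmcv_key)
    return mmcv_key
```

The spconv fixup in the second bullet is a single permute of each 5D kernel, e.g. `w.permute(4, 0, 1, 2, 3)` to go from `(k,k,k,in,out)` to `(out,k,k,k,in)`.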