
WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang1,2*   Chenjie Cao2*   Tengfei Wang2†   Xuhui Zuo2   Junta Wu2   Jianke Zhu1‡   Chunchao Guo2

1Zhejiang University    2Tencent Hunyuan
*Equal Contribution   †Project Lead   ‡Corresponding Author

arXiv HuggingFace CVPR 2026


📅 News


✅ TODO List

  • Release inference code and model weights of WorldStereo 2.0
  • Release data pre-processing pipelines for panoramic and multi-trajectory scenes

📖 Abstract

We propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules:

  • Global-Geometric Memory (GGM) enables precise camera control while injecting coarse structural priors through incrementally updated point clouds via a ControlNet branch.
  • Spatial-Stereo Memory (SSM) constrains the model's attention receptive fields with 3D correspondences to focus on fine-grained details from the memory bank.

Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, WorldStereo shows impressive efficiency by leveraging a distribution-matching distilled (DMD) VDM backbone without joint training.
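As a toy illustration of the SSM idea (not the actual implementation), attention can be restricted so that each query token only attends to the key indices selected for it, e.g. keys whose 3D points correspond to the query's region. The function and all shapes below are illustrative:

```python
import numpy as np

def masked_attention(q, k, v, allowed):
    """Toy single-head attention where each query may only attend to the
    key indices listed in `allowed` (e.g. keys picked via 3D
    correspondences). Shapes: q (Nq, d), k/v (Nk, d);
    allowed: one index array per query."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Nq, Nk)
    mask = np.full(scores.shape, False)
    for i, idx in enumerate(allowed):
        mask[i, idx] = True
    scores = np.where(mask, scores, -1e9)            # block disallowed keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
# query 0 may see keys {0, 1}; query 1 may see keys {3, 4}
out = masked_attention(q, k, v, allowed=[np.array([0, 1]), np.array([3, 4])])
```

Restricting the receptive field this way means each target token aggregates details only from memory regions that are geometrically relevant to it.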


🎬 Results

3D Reconstruction from a Single Image

Given a single reference image, WorldStereo generates multi-view-consistent videos and reconstructs a dense 3D point cloud. The Kitchen scene below is one example.

Scene: Kitchen   |   Input image → Point cloud (5 views)

Camera Control

The first three columns measure camera accuracy; the remaining four measure visual quality.

| Methods | RotErr ↓ | TransErr ↓ | ATE ↓ | Q-Align ↑ | CLIP-IQA+ ↑ | Laion-Aes ↑ | CLIP-I ↑ |
|---|---|---|---|---|---|---|---|
| SEVA | 1.690 | 1.578 | 2.879 | 3.232 | 0.479 | 4.623 | 77.16 |
| Gen3C | 0.944 | 1.580 | 2.789 | 3.353 | 0.489 | 4.863 | 82.33 |
| WorldStereo | 0.762 | 1.245 | 2.141 | 4.149 | 0.547 | 5.257 | 89.05 |
| WorldStereo 2.0 | **0.492** | **0.968** | **1.768** | **4.205** | 0.544 | **5.266** | **89.43** |

Single-View-Generated Reconstruction

**Tanks-and-Temples**

| Methods | Precision ↑ | Recall ↑ | F1-Score ↑ | AUC ↑ |
|---|---|---|---|---|
| SEVA | 33.59 | 35.34 | 36.73 | 51.03 |
| Gen3C | 46.73 | 25.51 | 31.24 | 42.44 |
| Lyra | 50.38 | 28.67 | 32.54 | 43.05 |
| FlashWorld | 26.58 | 20.72 | 22.29 | 30.45 |
| WorldStereo 2.0 | 43.62 | 41.02 | 41.43 | 58.19 |
| WorldStereo 2.0 (DMD) | 40.41 | 44.41 | 43.16 | 60.09 |

**MipNeRF360**

| Methods | Precision ↑ | Recall ↑ | F1-Score ↑ | AUC ↑ |
|---|---|---|---|---|
| SEVA | 22.38 | 55.63 | 28.75 | 46.81 |
| Gen3C | 23.28 | 75.37 | 35.26 | 52.10 |
| Lyra | 30.02 | 58.60 | 36.05 | 49.89 |
| FlashWorld | 35.97 | 53.77 | 42.60 | 53.86 |
| WorldStereo 2.0 | 43.19 | 65.32 | 51.27 | 65.79 |
| WorldStereo 2.0 (DMD) | 42.34 | 64.83 | 50.52 | 65.64 |

🆕 WorldStereo 2.0 vs. 1.0

WorldStereo 2.0 introduces four key improvements over the original version:

| | WorldStereo 1.0 | WorldStereo 2.0 |
|---|---|---|
| Latent Space | Standard video latent space | **Keyframe latent space** — encodes each frame independently, substantially improving the visual quality of generated novel views and fully supporting parallel encoding/decoding |
| Memory Mechanism | Cross-attention to retrieved reference frames | **Stereo stitching in the main branch** — reference views are spatially concatenated with target frames along the width dimension of the main DiT branch, enabling stronger and more direct memory fusion |
| Backbone Fine-tuning | Frozen backbone | **Partial backbone fine-tuning** — selected backbone weights are updated to adapt to the keyframe latent space and improve overall generation quality |
| Training Data | Limited camera trajectories | **Expanded UE rendering data** — significantly more Unreal Engine rendered scenes with diverse, precise camera motions, yielding stronger camera control and memory capabilities |

More details on WorldStereo 2.0 can be found in HY-World 2.0.
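Because a keyframe latent space encodes every frame independently, frames can be sharded across workers and encoded in any order, and the results concatenated. A minimal numpy sketch of this property (the `encode_frame` pooling stand-in is purely illustrative, not the real VAE):

```python
import numpy as np

def encode_frame(frame):
    """Stand-in for a per-frame encoder (toy 2x2 average pooling).
    Because it sees one frame at a time, frames can be encoded in any
    order or in parallel with no cross-frame dependency."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

video = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)  # (T, H, W)

# sequential encode
seq = np.stack([encode_frame(f) for f in video])

# "parallel" encode: shard frames across two workers, then concatenate
w0, w1 = video[:2], video[2:]
par = np.concatenate([np.stack([encode_frame(f) for f in w0]),
                      np.stack([encode_frame(f) for f in w1])])
```

A temporally coupled video latent (where each latent depends on neighboring frames) would not admit this shard-and-concatenate decomposition.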


⚙️ Installation

1. Clone the repository:

```bash
git clone https://github.com/FuchengSu/WorldStereo.git
cd WorldStereo
```

2. Install core dependencies:

```bash
conda create -n worldstereo python=3.11
conda activate worldstereo
pip install -r requirements.txt
```

3. Install PyTorch3D (required for point cloud rendering):

```bash
pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"
```

4. Install MoGe (monocular depth estimation):

```bash
pip install git+https://github.com/microsoft/MoGe.git@0286b495230a074aadf1c76cc5c679e943e5d1c6
```

5. (Optional) Install third-party reconstruction module for WorldMirror reconstruction:

```bash
mkdir third_party
cd third_party
git clone https://github.com/Tencent-Hunyuan/HY-World-2.0.git
pip install -r HY-World-2.0/requirements.txt
```

Note: third_party/HY-World-2.0 is required only for apply_worldmirror post-processing (multi-view depth consistency and Gaussian Splatting reconstruction). You can skip it for basic video generation.


🚀 Quick Start

Model Variants

WorldStereo ships three model variants, each suited to a different use case:

| Model Type | Entry Point | Description |
|---|---|---|
| `worldstereo-camera` | `run_camera_control.py` | Camera control only; single-view input |
| `worldstereo-memory` | `run_camera_control.py` / `run_multi_traj.py` | Full model with GGM + SSM; multi-view-consistent generation; best quality |
| `worldstereo-memory-dmd` | `run_camera_control.py` / `run_multi_traj.py` | DMD-distilled variant; 4-step inference; fastest |

Models are automatically downloaded from HuggingFace Hub (`hanshanxue/WorldStereo`) on first run.


Single-View Camera Control

Generate a video from a single image along a specified camera trajectory:

```bash
python run_camera_control.py \
    --model_type worldstereo-camera \
    --input_path examples/images \
    --output_path outputs \
    --seed 1024
```

Multi-GPU Inference (Sequence Parallel)

Scale to multiple GPUs using Sequence Parallelism (SP) and FSDP:

```bash
torchrun --nproc_per_node=8 run_camera_control.py \
    --model_type worldstereo-memory \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp
```

Multi-Trajectory Inference (Panorama / Reconstruction)

For panoramic scene generation or 3D reconstruction from multiple trajectories:

```bash
# Panoramic scene generation
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# Panoramic scene generation (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type panorama \
    --input_path examples/panorama \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp

# 3D scene reconstruction (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
    --model_type worldstereo-memory-dmd \
    --task_type reconstruction \
    --input_path examples/reconstruction \
    --output_path outputs \
    --fsdp
```

WorldMirror 3D Reconstruction (Optional)

After running run_multi_traj.py, the memory bank is automatically exported to a WorldMirror-compatible format under <output_path>/<scene>/world_mirror_data/<model_type>/. You can then run feedforward 3D reconstruction with HY-World 2.0:

```bash
# Requires: pip install -r third_party/HY-World-2.0/requirements.txt

cd third_party/HY-World-2.0
torchrun --nproc_per_node=8 -m hyworld2.worldrecon.pipeline \
          --input_path ../../outputs/<scene>/world_mirror_data/<model_type>/images \
          --prior_cam_path ../../outputs/<scene>/world_mirror_data/<model_type>/cameras.json \
          --strict_output_path ../../outputs/<scene>/world_mirror_data/<model_type>/results \
          --target_size 832 --use_fsdp --enable_bf16 --no_save_normal --no_save_gs --no_sky_mask \
          --apply_edge_mask --apply_confidence_mask --confidence_percentile 15.0 --compress_pts --no_interactive \
          --disable_heads gs points
```

This produces metric-scale depth, surface normals, camera poses, a dense point cloud (.ply), and optionally Gaussian Splat renderings from the generated multi-view frames.


Python API

```python
import torch
from models.worldstereo_wrapper import WorldStereo

device = torch.device("cuda:0")

worldstereo = WorldStereo.from_pretrained(
    "hanshanxue/WorldStereo",
    subfolder="worldstereo-memory",   # or "worldstereo-camera" / "worldstereo-memory-dmd"
    sp_world_size=1,
    fsdp=False,
    device=device,
)

# pipeline_inputs: a dict of prepared conditioning inputs
# (see the entry scripts for how it is constructed)
output = worldstereo(**pipeline_inputs)
```

CLI Reference

run_camera_control.py

| Flag | Default | Description |
|---|---|---|
| `--model_type` | `worldstereo-camera` | Model variant to use |
| `--input_path` | `examples/images` | Input scene directory |
| `--output_path` | `outputs` | Output directory |
| `--local_files_only` | `False` | Use locally cached weights instead of downloading |
| `--fsdp` | `False` | Enable FSDP model sharding |
| `--seed` | `1024` | Random seed |

run_multi_traj.py (additional flags)

| Flag | Default | Description |
|---|---|---|
| `--task_type` | `panorama` | `panorama` or `reconstruction` |
| `--align_nframe` | `8` | Frames per clip saved when updating the memory bank |

📂 Input Data Format

Camera-Only Inference (examples/images/)

```
<scene>/
├── image.png                 # reference image
├── prompt.json               # text descriptions at three verbosity levels
│   # {"short caption": ..., "medium caption": ..., "long caption": ...}
└── camera.json               # camera trajectory
    # {"motion_list": [...], "extrinsic": [...], "intrinsic": [...]}
```
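A small loader for this layout can double as a sanity check. Only the file names and top-level JSON keys below come from the format above; `load_scene` itself is a hypothetical helper, and the demo scene is built on the fly:

```python
import json
import tempfile
from pathlib import Path

REQUIRED_PROMPT = ("short caption", "medium caption", "long caption")
REQUIRED_CAMERA = ("motion_list", "extrinsic", "intrinsic")

def load_scene(scene_dir):
    """Load and sanity-check prompt.json and camera.json for one scene.
    Illustrative helper; only file names and keys come from the docs."""
    scene = Path(scene_dir)
    prompt = json.loads((scene / "prompt.json").read_text())
    camera = json.loads((scene / "camera.json").read_text())
    missing = [k for k in REQUIRED_PROMPT if k not in prompt]
    missing += [k for k in REQUIRED_CAMERA if k not in camera]
    if missing:
        raise ValueError(f"scene {scene.name}: missing keys {missing}")
    return prompt, camera

# build a minimal demo scene in a temporary directory
demo = Path(tempfile.mkdtemp()) / "kitchen"
demo.mkdir()
(demo / "prompt.json").write_text(json.dumps({k: "a kitchen" for k in REQUIRED_PROMPT}))
(demo / "camera.json").write_text(json.dumps(
    {"motion_list": ["forward"], "extrinsic": [[[1.0, 0.0, 0.0, 0.0]]], "intrinsic": [[[0.5]]]}))

prompt, camera = load_scene(demo)
```

Validating inputs up front gives clearer errors than a failure deep inside the pipeline.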

Memory-Augmented Multi-Trajectory (examples/panorama/, examples/reconstruction/)

```
<scene>/
├── panorama.png              # (optional) full panorama — triggers VLM single-path inference
├── meta_info.json            # {"scene_type": "perspective" | "panorama"}
├── start_frame.png           # reference start image for depth initialization
└── render_results/
    └── <view_id>/
        └── <traj_id>/
            ├── render.mp4         # pre-rendered geometry video (point cloud warp)
            ├── render_mask.mp4    # binary occlusion mask video
            └── camera.json        # {"extrinsic": [...], "intrinsic": [...]}
```
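Trajectories in this layout can be enumerated and checked with a short walk over `render_results/`. The helper name is hypothetical; only the directory structure and file names come from the format above:

```python
import json
import tempfile
from pathlib import Path

def list_trajectories(scene_dir):
    """Enumerate (view_id, traj_id) pairs under render_results/ and verify
    each trajectory folder contains the three expected files."""
    found = []
    for traj in sorted((Path(scene_dir) / "render_results").glob("*/*")):
        for name in ("render.mp4", "render_mask.mp4", "camera.json"):
            if not (traj / name).exists():
                raise FileNotFoundError(traj / name)
        found.append((traj.parent.name, traj.name))
    return found

# build a minimal fake scene for demonstration
scene = Path(tempfile.mkdtemp()) / "demo"
traj = scene / "render_results" / "view0" / "traj0"
traj.mkdir(parents=True)
for name in ("render.mp4", "render_mask.mp4"):
    (traj / name).touch()
(traj / "camera.json").write_text(json.dumps({"extrinsic": [], "intrinsic": []}))
```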

🔧 Architecture

Model Variants

WorldStereo defines two transformer architectures in models/worldstereo.py, both extending WanTransformer3DModel from diffusers:

  • WorldStereoModel — Wan DiT backbone + ControlNet. Used by worldstereo-camera. The ControlNet encodes rendered point cloud geometry and camera embeddings, injecting residuals at each transformer block.
  • WorldStereoRefSModel — Extends WorldStereoModel with WanTransformerSparseSpatialBlock layers. These SSM blocks perform sparse attention over retrieved reference frames, guided by 3D correspondences. Used by worldstereo-memory and worldstereo-memory-dmd.
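The per-block residual injection used by the ControlNet branch can be sketched with toy linear maps standing in for the frozen DiT blocks and the control branch (everything below is illustrative, not the real architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BLOCKS, TOKENS, DIM = 3, 4, 8

# Stand-ins for frozen DiT blocks and a ControlNet branch that produces
# one residual per block from the rendered-geometry condition.
blocks = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(NUM_BLOCKS)]
ctrl   = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(NUM_BLOCKS)]

def forward(x, cond):
    """At each transformer block, add the ControlNet residual computed
    from the geometry condition to the backbone hidden states."""
    for W, C in zip(blocks, ctrl):
        x = x + x @ W          # backbone block (toy residual map)
        x = x + cond @ C       # injected ControlNet residual
    return x

x = rng.normal(size=(TOKENS, DIM))       # noisy video tokens
cond = rng.normal(size=(TOKENS, DIM))    # encoded point-cloud render + camera
out = forward(x, cond)
```

The key design point: the backbone stays untouched while per-block residuals steer generation toward the rendered geometry.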

Inference Pipelines

Three pipelines are provided under models/pipelines/, selected automatically based on model_type in the config:

| Pipeline | Class | Mode |
|---|---|---|
| `pipeline_pcd_keyframe.py` | `KFPCDControllerPipeline` | Camera; standard DDIM sampling |
| `pipeline_ref_keyframe.py` | `KFPCDControllerRefPipeline` | Camera + GGM + SSM; standard DDIM sampling |
| `pipeline_dmd_keyframe.py` | `RefKFDMDGeneratorPipeline` | Camera + GGM + SSM; 4-step DMD distillation |
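The model-type-to-pipeline mapping can be pictured as a simple dispatch table. This is a sketch of the mapping above; the real selection happens via the model configs, not necessarily a dict:

```python
# Hypothetical dispatch table mirroring the table above.
PIPELINES = {
    "worldstereo-camera":     "KFPCDControllerPipeline",
    "worldstereo-memory":     "KFPCDControllerRefPipeline",
    "worldstereo-memory-dmd": "RefKFDMDGeneratorPipeline",
}

def pipeline_for(model_type):
    """Resolve a model_type string to its pipeline class name."""
    try:
        return PIPELINES[model_type]
    except KeyError:
        raise ValueError(f"unknown model_type: {model_type!r}") from None
```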

3D Memory Bank

The memory bank (src/retrieval_wm.py) manages the growing 3D representation across trajectories:

  1. Init — MoGe depth estimation on the start frame lifts it to a point cloud.
  2. Retrieve — For each new target trajectory, the most relevant reference frames are selected via FOV-overlap scoring combined with DINOv2 image features and quality-aware furthest-point sampling.
  3. Update — After generation, new frames and their estimated depths are appended to the bank.
  4. Reconstruction — Feedforward reconstruction via HY-World 2.0 WorldMirror enforces multi-view depth consistency; final global alignment produces a unified point cloud.
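The retrieval step (2) above can be sketched as a scoring-and-sampling routine: combine FOV overlap with feature similarity, then greedily pick diverse candidates in a furthest-point style. The weighting `alpha` and the exact greedy rule are assumptions for illustration:

```python
import numpy as np

def retrieve(target_feat, ref_feats, fov_overlap, k=2, alpha=0.5):
    """Score each stored reference frame by a weighted sum of FOV overlap
    with the target view and cosine feature similarity, then greedily pick
    k high-scoring frames that are far apart in feature space."""
    sim = ref_feats @ target_feat / (
        np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(target_feat) + 1e-8)
    score = alpha * fov_overlap + (1 - alpha) * sim
    chosen = [int(np.argmax(score))]
    while len(chosen) < k:
        # furthest-point step: prefer frames far from everything selected
        d = np.min(np.linalg.norm(
            ref_feats[:, None] - ref_feats[chosen][None], axis=-1), axis=1)
        d[chosen] = -np.inf                 # never re-pick a chosen frame
        chosen.append(int(np.argmax(score + d)))
    return chosen

rng = np.random.default_rng(0)
refs = rng.normal(size=(6, 16))                  # stored frame features
target = refs[2] + 0.01 * rng.normal(size=16)    # ref 2 is the best match
overlap = np.array([0.1, 0.2, 0.9, 0.3, 0.0, 0.1])
picked = retrieve(target, refs, overlap, k=2)
```

The diversity term prevents the retrieval from returning k near-duplicate views of the same region.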

Distributed Inference

WorldStereo supports two parallelism strategies:

  • Sequence Parallel (SP) — The sequence dimension is sharded across the SP group at each attention layer (models/attention.py). Controlled by torchrun --nproc_per_node.
  • FSDP — Fully Sharded Data Parallel wraps both the transformer and the auxiliary encoders. Enabled with --fsdp. Requires a device_mesh with ("rep", "shard") dimensions.
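A toy, single-process sketch of sequence parallelism for attention: each "rank" holds a shard of the query tokens, attends against the full key/value set, and the shards are concatenated. This is a simplification; the real layer shards K/V too and communicates via collectives:

```python
import numpy as np

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def sp_attention(q, k, v, world_size):
    """Sequence-parallel sketch: shard the query tokens across 'ranks',
    compute attention per shard against the full K/V, concatenate."""
    outs = [attention(q_shard, k, v)
            for q_shard in np.array_split(q, world_size)]
    return np.concatenate(outs)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
full = attention(q, k, v)
sharded = sp_attention(q, k, v, world_size=4)
```

Because softmax is computed per query row, sharding the sequence dimension leaves the result bit-for-bit unchanged while splitting memory and compute across GPUs.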

🤝 Acknowledgements

WorldStereo builds upon the following excellent works:

  • Wan — Video DiT backbone
  • HunyuanVideo-1.5 — Sequence-parallel components and video generation model
  • MoGe — Monocular geometry estimation
  • HY-World 2.0 — WorldMirror reconstruction module
  • diffusers — Pipeline and model utilities

📝 Citation

If you find WorldStereo useful in your research, please cite:

```bibtex
@article{zhang2026worldstereo,
  title={WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories},
  author={Zhang, Yisu and Cao, Chenjie and Wang, Tengfei and Zuo, Xuhui and Wu, Junta and Zhu, Jianke and Guo, Chunchao},
  journal={arXiv preprint arXiv:2603.02049},
  year={2026}
}
```

About

[CVPR 2026] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories (WorldExpand of HY-World 2.0)
