WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
Yisu Zhang1,2* Chenjie Cao2* Tengfei Wang2† Xuhui Zuo2 Junta Wu2 Jianke Zhu1‡ Chunchao Guo2
1Zhejiang University
2Tencent Hunyuan
*Equal Contribution †Project Lead ‡Corresponding Author
[2026.02]🎉 WorldStereo is accepted by CVPR 2026![2026.03]📄 Paper is now available on arXiv: https://arxiv.org/abs/2602.24233[2026.04]🚀 Code and model weights of WorldStereo 2.0 are now released![2026.04]🚀 HY-World 2.0 are now released: https://github.com/Tencent-Hunyuan/HY-World-2.0 !
- Release inference code and model weights of WorldStereo 2.0
- Release data pre-processing pipelines for panoramic and multi-trajectory scenes
We propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules:
- Global-Geometric Memory (GGM) enables precise camera control while injecting coarse structural priors through incrementally updated point clouds via a ControlNet branch.
- Spatial-Stereo Memory (SSM) constrains the model's attention receptive fields with 3D correspondences to focus on fine-grained details from the memory bank.
Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, WorldStereo shows impressive efficiency by leveraging a distribution-matching distilled (DMD) VDM backbone without joint training.
Given a single reference image, WorldStereo generates multi-view consistent videos and reconstructs a dense 3D point cloud. Below are example results on two scenes.
Scene: Kitchen | Input image → Point cloud (5 views)
| Methods | Camera Metrics | Visual Quality | |||||
|---|---|---|---|---|---|---|---|
| RotErr ↓ | TransErr ↓ | ATE ↓ | Q-Align ↑ | CLIP-IQA+ ↑ | Laion-Aes ↑ | CLIP-I ↑ | |
| SEVA | 1.690 | 1.578 | 2.879 | 3.232 | 0.479 | 4.623 | 77.16 |
| Gen3C | 0.944 | 1.580 | 2.789 | 3.353 | 0.489 | 4.863 | 82.33 |
| WorldStereo | 0.762 | 1.245 | 2.141 | 4.149 | 0.547 | 5.257 | 89.05 |
| WorldStereo 2.0 | 0.492 | 0.968 | 1.768 | 4.205 | 0.544 | 5.266 | 89.43 |
| Methods | Tanks-and-Temples | MipNeRF360 | ||||||
|---|---|---|---|---|---|---|---|---|
| Precision ↑ | Recall ↑ | F1-Score ↑ | AUC ↑ | Precision ↑ | Recall ↑ | F1-Score ↑ | AUC ↑ | |
| SEVA | 33.59 | 35.34 | 36.73 | 51.03 | 22.38 | 55.63 | 28.75 | 46.81 |
| Gen3C | 46.73 | 25.51 | 31.24 | 42.44 | 23.28 | 75.37 | 35.26 | 52.10 |
| Lyra | 50.38 | 28.67 | 32.54 | 43.05 | 30.02 | 58.60 | 36.05 | 49.89 |
| FlashWorld | 26.58 | 20.72 | 22.29 | 30.45 | 35.97 | 53.77 | 42.60 | 53.86 |
| WorldStereo 2.0 | 43.62 | 41.02 | 41.43 | 58.19 | 43.19 | 65.32 | 51.27 | 65.79 |
| WorldStereo 2.0 (DMD) | 40.41 | 44.41 | 43.16 | 60.09 | 42.34 | 64.83 | 50.52 | 65.64 |
WorldStereo 2.0 introduces four key improvements over the original version:
| WorldStereo 1.0 | WorldStereo 2.0 | |
|---|---|---|
| Latent Space | Standard video latent space | Keyframe latent space — encodes each frame independently, substantially improving visual quality of generated novel views and completely supporting parallel encoding/decoding |
| Memory Mechanism | Cross-attention to retrieved reference frames | Stereo stitching in the main branch — reference views are spatially concatenated with target frames along the width dimension in the main DiT branch, enabling stronger and more direct memory fusion |
| Backbone Fine-tuning | Frozen backbone | Partial backbone fine-tuning — backbone weights are selectively updated to adapt to the keyframe latent space and improve overall generation quality |
| Training Data | Limited camera trajectories | Expanded UE rendering data — significantly more Unreal Engine rendered scenes with diverse and precise camera motions, leading to stronger camera control and memory capabilities |
More details of WorldStereo 2.0 are shown in HY-World 2.0.
1. Clone the repository:
git clone https://github.com/FuchengSu/WorldStereo.git
cd WorldStereo2. Install core dependencies:
conda create -n worldstereo python=3.11
conda activate worldstereo
pip install -r requirements.txt3. Install PyTorch3D (required for point cloud rendering):
pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"4. Install MoGe (monocular depth estimation):
pip install git+https://github.com/microsoft/MoGe.git@0286b495230a074aadf1c76cc5c679e943e5d1c65. (Optional) Install third-party reconstruction module for WorldMirror reconstruction:
mkdir third_party
cd third_party
git clone https://github.com/Tencent-Hunyuan/HY-World-2.0.git
pip install -r HY-World-2.0/requirements.txtNote:
third_party/HY-World-2.0is required only forapply_worldmirrorpost-processing (multi-view depth consistency and Gaussian Splatting reconstruction). You can skip it for basic video generation.
WorldStereo ships three model variants, each suited to a different use case:
| Model Type | Entry Point | Description |
|---|---|---|
worldstereo-camera |
run_camera_control.py |
Camera control only; single-view input |
worldstereo-memory |
run_camera_control.py / run_multi_traj.py |
Full model with GGM + SSM; multi-view consistent generation; best quality |
worldstereo-memory-dmd |
run_camera_control.py / run_multi_traj.py |
DMD distillation variant; 4-step inference, fastest |
Models are automatically downloaded from HuggingFace Hub (hanshanxue/WorldStereo) on first run.
Generate a video from a single image along a specified camera trajectory:
python run_camera_control.py \
--model_type worldstereo-camera \
--input_path examples/images \
--output_path outputs \
--seed 1024Scale to multiple GPUs using Sequence Parallelism (SP) and FSDP:
torchrun --nproc_per_node=8 run_camera_control.py \
--model_type worldstereo-memory \
--input_path examples/panorama \
--output_path outputs \
--fsdpFor panoramic scene generation or 3D reconstruction from multiple trajectories:
# Panoramic scene generation
torchrun --nproc_per_node=8 run_multi_traj.py \
--model_type worldstereo-memory \
--task_type panorama \
--input_path examples/panorama \
--output_path outputs \
--fsdp
# Panoramic scene generation (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
--model_type worldstereo-memory-dmd \
--task_type panorama \
--input_path examples/panorama \
--output_path outputs \
--fsdp
# 3D scene reconstruction
torchrun --nproc_per_node=8 run_multi_traj.py \
--model_type worldstereo-memory \
--task_type reconstruction \
--input_path examples/reconstruction \
--output_path outputs \
--fsdp
# 3D scene reconstruction (DMD fast variant)
torchrun --nproc_per_node=8 run_multi_traj.py \
--model_type worldstereo-memory-dmd \
--task_type reconstruction \
--input_path examples/reconstruction \
--output_path outputs \
--fsdpAfter running run_multi_traj.py, the memory bank is automatically exported to a WorldMirror-compatible format under <output_path>/<scene>/world_mirror_data/<model_type>/. You can then run feedforward 3D reconstruction with HY-World 2.0:
# Requires: pip install -r third_party/HY-World-2.0/requirements.txt
cd third_party/HY-World-2.0
torchrun --nproc_per_node=8 -m hyworld2.worldrecon.pipeline --input_path ../../outputs/<scene>/world_mirror_data/<model_type>/images \
--prior_cam_path ../../outputs/<scene>/world_mirror_data/<model_type>/cameras.json \
--strict_output_path ../../outputs/<scene>/world_mirror_data/<model_type>/results \
--target_size 832 --use_fsdp --enable_bf16 --no_save_normal --no_save_gs --no_sky_mask \
--apply_edge_mask --apply_confidence_mask --confidence_percentile 15.0 --compress_pts --no_interactive \
--disable_heads gs pointsThis produces metric-scale depth, surface normals, camera poses, a dense point cloud (.ply), and optionally Gaussian Splat renderings from the generated multi-view frames.
import torch
from models.worldstereo_wrapper import WorldStereo
device = torch.device("cuda:0")
worldstereo = WorldStereo.from_pretrained(
"hanshanxue/WorldStereo",
subfolder="worldstereo-memory", # or "worldstereo-camera" / "worldstereo-memory-dmd"
sp_world_size=1,
fsdp=False,
device=device,
)
output = worldstereo(**pipeline_inputs)run_camera_control.py
| Flag | Default | Description |
|---|---|---|
--model_type |
worldstereo-camera |
Model variant to use |
--input_path |
examples/images |
Input scene directory |
--output_path |
outputs |
Output directory |
--local_files_only |
False |
Use locally cached weights instead of downloading |
--fsdp |
False |
Enable FSDP model sharding |
--seed |
1024 |
Random seed |
run_multi_traj.py (additional flags)
| Flag | Default | Description |
|---|---|---|
--task_type |
panorama |
panorama or reconstruction |
--align_nframe |
8 |
Frames per clip saved for updating the memory bank |
<scene>/
├── image.png # reference image
├── prompt.json # text descriptions at three verbosity levels
│ # {"short caption": ..., "medium caption": ..., "long caption": ...}
└── camera.json # camera trajectory
# {"motion_list": [...], "extrinsic": [...], "intrinsic": [...]}
<scene>/
├── panorama.png # (optional) full panorama — triggers VLM single-path inference
├── meta_info.json # {"scene_type": "perspective" | "panorama"}
├── start_frame.png # reference start image for depth initialization
└── render_results/
└── <view_id>/
└── <traj_id>/
├── render.mp4 # pre-rendered geometry video (point cloud warp)
├── render_mask.mp4 # binary occlusion mask video
└── camera.json # {"extrinsic": [...], "intrinsic": [...]}
WorldStereo defines two transformer architectures in models/worldstereo.py, both extending WanTransformer3DModel from diffusers:
WorldStereoModel— Wan DiT backbone + ControlNet. Used byworldstereo-camera. The ControlNet encodes rendered point cloud geometry and camera embeddings, injecting residuals at each transformer block.WorldStereoRefSModel— ExtendsWorldStereoModelwithWanTransformerSparseSpatialBlocklayers. These SSM blocks perform sparse attention over retrieved reference frames, guided by 3D correspondences. Used byworldstereo-memoryandworldstereo-memory-dmd.
Three pipelines are provided under models/pipelines/, selected automatically based on model_type in the config:
| Pipeline | Class | Mode |
|---|---|---|
pipeline_pcd_keyframe.py |
KFPCDControllerPipeline |
Camera; standard DDIM sampling |
pipeline_ref_keyframe.py |
KFPCDControllerRefPipeline |
Camera + GGM + SSM; standard DDIM sampling |
pipeline_dmd_keyframe.py |
RefKFDMDGeneratorPipeline |
Camera + GGM + SSM; 4-step DMD distillation |
The memory bank (src/retrieval_wm.py) manages the growing 3D representation across trajectories:
- Init — MoGe depth estimation on the start frame lifts it to a point cloud.
- Retrieve — For each new target trajectory, the most relevant reference frames are selected via FOV-overlap scoring combined with DINOv2 image features and quality-aware furthest-point sampling.
- Update — After generation, new frames and their estimated depths are appended to the bank.
- Reconstruction — Feedforward reconstruction via HY-World 2.0 WorldMirror enforces multi-view depth consistency; final global alignment produces a unified point cloud.
WorldStereo supports two parallelism strategies:
- Sequence Parallel (SP) — The sequence dimension is sharded across the SP group at each attention layer (
models/attention.py). Controlled bytorchrun --nproc_per_node. - FSDP — Full-Sharded Data Parallel wraps both the transformer and auxiliary encoders. Enabled with
--fsdp. Requires adevice_meshwith("rep", "shard")dimensions.
WorldStereo builds upon the following excellent works:
- Wan — Video DiT backbone
- HunyuanVideo-1.5 — Components of sequence parallel and video generation model
- MoGe — Monocular geometry estimation
- HY-World 2.0 — WorldMirror reconstruction module
- diffusers — Pipeline and model utilities
If you find WorldStereo useful in your research, please cite:
@article{zhang2026worldstereo,
title={WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories},
author={Zhang, Yisu and Cao, Chenjie and Wang, Tengfei and Zuo, Xuhui and Wu, Junta and Zhu, Jianke and Guo, Chunchao},
journal={arXiv preprint arXiv:2603.02049},
year={2026}
}




