A diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action.
- Action Simulation Fidelity
- Smoothness Evaluation
- Generation Consistency Evaluation
- Simulative Reasoning & Planning
This section evaluates the ability of world models to simulate actions faithfully based on multi-round prompts.
We evaluate multiple state-of-the-art world models, categorized into local models and API-based models:
- Cosmos-Predict1-14B-Video2World - GitHub
- Cosmos-Predict2-14B-Video2World - GitHub
- WAN 2.1-I2V-14B - GitHub
- WAN 2.2-I2V-A14B - GitHub
- Gen-3
- KLING
- MiniMax-Hailuo
- PAN - Our proprietary model (requires custom endpoint, not yet publicly released)
For each model, create a dedicated conda environment and follow the installation instructions from their respective repositories:
- Cosmos-Predict1: Follow setup instructions at GitHub (env name: cosmos-predict1)
- Cosmos-Predict2: Follow setup instructions at GitHub (env name: cosmos-predict2)
- WAN 2.1: Follow setup instructions at GitHub (env name: wan2_1)
- WAN 2.2: Follow setup instructions at GitHub (env name: wan2_2)
Download the corresponding checkpoints for each model and place them in the respective directories:
thirdparty/
├── cosmos-predict1/checkpoints/
├── cosmos-predict2/checkpoints/
├── wan2_1/checkpoints/
└── wan2_2/checkpoints/
Execute the generation scripts using SLURM for local models:
# Local Models. Example:
sbatch action_simulation_fidelity_scripts/cosmos1.sh

For models that use API calls:
conda create -n video-api python=3.10 -y
conda activate video-api
conda install -c conda-forge ffmpeg
pip install -r requirements_api.txt

Execute the generation scripts for API-based models:
# API-based Models. Example:
bash action_simulation_fidelity_scripts/gen3.sh

After generating videos for each model, evaluate their action simulation fidelity using GPT-4o:
python action_simulation_fidelity_scripts/action_simulation_fidelity_eval.py \
--openai_api_key YOUR_OPENAI_API_KEY \
--base_path outputs/action_simulation_fidelity/MODEL_NAME \
--dataset_json datasets/action_simulation_fidelity_subset/samples_subset.json \
--save_name MODEL_NAME

Results will be saved in outputs/action_simulation_fidelity/MODEL_NAME/MODEL_NAME_results.json.
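As a minimal sketch of consuming the results file, the snippet below averages per-sample scores. The per-sample `score` field and list layout are assumptions for illustration, not the schema actually emitted by action_simulation_fidelity_eval.py.

```python
import json

def mean_fidelity(results_json: str) -> float:
    """Average the per-sample GPT-4o fidelity scores (assumed "score" field)."""
    results = json.loads(results_json)
    return sum(r["score"] for r in results) / len(results)

# Toy payload standing in for a real MODEL_NAME_results.json:
payload = '[{"id": "s1", "score": 4}, {"id": "s2", "score": 2}]'
```

On the toy payload above, `mean_fidelity(payload)` returns 3.0.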
This section evaluates the temporal smoothness of multi-round generated videos using optical flow. Consecutive frame pairs are processed with SEA-RAFT to compute velocity and acceleration magnitudes, which are combined into a smoothness score (vmag × exp(−λ × amag)).
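The score above can be sketched in a few lines. This is an illustrative reduction over per-frame-pair flow magnitudes, with an assumed value of λ; compute_smoothness_scores.py may aggregate differently.

```python
import math

LAM = 0.1  # decay weight for the acceleration penalty (assumed value)

def smoothness_score(velocities):
    """velocities: mean optical-flow magnitude per consecutive frame pair
    (as produced by SEA-RAFT). Returns the mean of vmag * exp(-lambda * amag)."""
    scores = []
    for i in range(1, len(velocities)):
        vmag = velocities[i]                            # velocity magnitude
        amag = abs(velocities[i] - velocities[i - 1])   # acceleration magnitude
        scores.append(vmag * math.exp(-LAM * amag))
    return sum(scores) / len(scores)
```

For perfectly constant motion the acceleration term vanishes, so the score reduces to the mean velocity magnitude.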
datasets/smoothness_eval/samples.json contains 100 photorealistic outdoor scenes, each with a 10-round sequential prompt list. A small subset of reference images is uploaded — set IMAGE_ROOT in the generation scripts to point to your local copy of the WorldScore-Dataset.
wget https://huggingface.co/datasets/memcpy/SEA-RAFT/resolve/main/Tartan-C-T-TSKH-spring540x960-M.pth \
-O thirdparty/SEA-RAFT/checkpoints/Tartan-C-T-TSKH-spring540x960-M.pth

Scripts for all supported models are in smoothness_eval_scripts/. Example using PAN:
# Edit IMAGE_ROOT inside the script first, then:
bash smoothness_eval_scripts/pan.sh

Generated videos are saved under outputs/smoothness_eval/pan/{instance_id}/rounds/.
python smoothness_eval_scripts/compute_smoothness_scores.py \
--videos_dir outputs/smoothness_eval/pan \
--output_dir outputs/smoothness_eval/pan_scores \
--raft_ckpt thirdparty/SEA-RAFT/checkpoints/Tartan-C-T-TSKH-spring540x960-M.pth \
--num_workers 4

Per-instance results are written to outputs/smoothness_eval/pan_scores/{instance_id}/smoothness.json. An aggregate summary.json is written once all instances are scored. For multi-node SLURM evaluation, set MODEL_NAME in smoothness_eval_scripts/eval.sh and run sbatch smoothness_eval_scripts/eval.sh.
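A hedged sketch of how an aggregate summary could be derived from the per-instance files is shown below; the `"smoothness"` key and the summary field names are assumptions, not the script's actual schema.

```python
import statistics

def summarize(per_instance):
    """per_instance: {instance_id: parsed smoothness.json dict}.
    Returns an aggregate in the spirit of summary.json (field names assumed)."""
    vals = [d["smoothness"] for d in per_instance.values()]
    return {"num_instances": len(vals), "mean_smoothness": statistics.mean(vals)}

summary = summarize({
    "0001": {"smoothness": 0.8},
    "0002": {"smoothness": 0.6},
})
```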
This section evaluates video generation models on 7 aspects of multi-round generation consistency using the WorldScore benchmark framework (MIT License).
| Aspect | Metric | Key dependency |
|---|---|---|
| **camera_control** | camera reprojection error | DROID-SLAM |
| **object_control** | object detection score | GroundingDINO + SAM2 |
| content_alignment | CLIP score | CLIP |
| **3d_consistency** | reprojection error | DROID-SLAM |
| photometric_consistency | optical flow AEPE | SEA-RAFT |
| style_consistency | Gram matrix distance | VGG |
| subjective_quality | CLIP-IQA+, MUSIQ | QAlign, MUSIQ |
Each score is a list of per-round values rather than a single scalar. The bold aspects require heavy thirdparty dependencies (see WorldScore's own setup guide). The remaining four aspects (content_alignment, photometric_consistency, style_consistency, subjective_quality) can be run on any GPU without those dependencies.
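Since each aspect is reported as a list of per-round values, a common reduction is a round-wise mean across instances. A minimal sketch (function and input names are illustrative, not from WorldScore):

```python
def roundwise_mean(per_instance_scores):
    """per_instance_scores: equal-length per-round score lists, one per instance.
    Returns the mean score at each round across all instances."""
    n = len(per_instance_scores)
    rounds = len(per_instance_scores[0])
    return [sum(scores[r] for scores in per_instance_scores) / n
            for r in range(rounds)]
```

This keeps the per-round structure visible, so degradation across rounds is not hidden by a single scalar.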
git submodule update --init thirdparty/WorldScore
pip install -e thirdparty/WorldScore

Follow WorldScore's setup instructions for the thirdparty dependencies (DROID-SLAM, GroundingDINO, SAM2) if you need all 7 aspects.
bash generation_consistency_eval_scripts/install_patches.sh

This copies the modified evaluator into the WorldScore submodule.
Edit IMAGE_ROOT in the script to point to your local WorldScore-Dataset, then run:
bash generation_consistency_eval_scripts/pan.sh

Generated videos are saved under outputs/generation_consistency_eval/pan/.
python generation_consistency_eval_scripts/prepare_worldscore_dirs.py \
--videos_root outputs/generation_consistency_eval/pan \
--dataset_json datasets/generation_consistency_eval/samples.json \
--output_root outputs/generation_consistency_eval/pan_eval

python generation_consistency_eval_scripts/run_evaluate_multiround.py \
--model_name pan \
--visual_movement static \
--runs_root outputs/generation_consistency_eval/pan_eval \
--num_jobs 24 \
--use_slurm True \
--slurm_partition main \
--slurm_qos wm

Results are written to outputs/generation_consistency_eval/pan_eval/worldscore_output/worldscore_multiround.json. For SLURM-based end-to-end runs, set MODEL_NAME in generation_consistency_eval_scripts/eval.sh and run sbatch generation_consistency_eval_scripts/eval.sh.
This section evaluates whether a world model can serve as an internal simulator that enables an agent to reason about actions and plan toward a goal.
All models (Cosmos-Predict1, Cosmos-Predict2, V-JEPA2, and PAN) need to be fine-tuned on task-specific datasets before evaluation:
| Task Type | Dataset | Models to Fine-tune |
|---|---|---|
| Step-Wise Simulation, Open-Ended Simulation Planning | AgiBot World Colosseo – "A large-scale manipulation platform for scalable and intelligent embodied systems" (Bu et al., 2025) | Cosmos-Predict1, Cosmos-Predict2, V-JEPA2, PAN |
| Structured Simulation Planning | Language Table – "Interactive language: Talking to robots in real time" (Lynch et al., 2023) | Cosmos-Predict1, Cosmos-Predict2, V-JEPA2, PAN |
Fine-tuning process:

1. Follow the respective model repository instructions for fine-tuning:
   - Cosmos-Predict1 Fine-tuning
   - Cosmos-Predict2 Fine-tuning
   - V-JEPA2 Conda Env Creation and Fine-tuning
   - PAN (yet to be released)
2. Replace the original checkpoints with your fine-tuned versions in the thirdparty/*/checkpoints/ directories.
This task measures whether a world model can accurately predict the immediate consequence of a given action within a manipulation context. Run the evaluation scripts for all models (Cosmos-Predict1, Cosmos-Predict2, V-JEPA2, and PAN):
# Example: Cosmos-Predict1
sbatch simulative_reasoning_planning_scripts/step_wise_simulation_scripts/cosmos1.sh

Evaluation Methods:
- Video Generation Models (Cosmos-Predict1, Cosmos-Predict2, PAN): Manually evaluate whether the generated videos in outputs/simulative_reasoning_planning/step_wise_simulation/{model_name}/ fulfill the given prompts.
- V-JEPA2: Check the quantitative results in outputs/simulative_reasoning_planning/step_wise_simulation/vjepa2/subset.jsonl to see if the inference predictions match the ground truth answers.
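The V-JEPA2 check above can be sketched as a simple accuracy pass over the JSONL file. The `"prediction"` and `"answer"` field names are assumptions about the file's schema, used here only for illustration.

```python
import json

def accuracy(jsonl_text: str) -> float:
    """Fraction of records whose prediction matches the ground-truth answer."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / len(records)

# Toy stand-in for subset.jsonl:
sample = '{"prediction": "A", "answer": "A"}\n{"prediction": "B", "answer": "C"}'
```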
This setting evaluates goal-directed manipulation of diverse objects in open-ended environments. The VLM-only baseline uses pure VLM reasoning (e.g., GPT-o3), which lets us measure how much world models enhance performance. Run the evaluation scripts for the different model configurations:
VLM-only Baseline:
# Pure VLM reasoning
python simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM_only.py --openai_key your_api_key

VLM + World Model Combinations:
# VLM + V-JEPA2
python simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_vjepa2.py --openai_key your_api_key
# VLM + Cosmos-Predict1
sbatch simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_cosmos1.sh
# VLM + Cosmos-Predict2
sbatch simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_cosmos2.sh
# VLM + PAN
sbatch simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_pan.sh

Result Analysis:
After execution, check the results in:
outputs/simulative_reasoning_planning/open_ended_simulation_planning/[task_name]/[model_name]/[task_name]_refined.json
Manually analyze the generated action sequences to determine whether each model successfully completed the given tasks.
This setting focuses on precise, language-grounded manipulation in highly structured tabletop environments containing regular objects such as colored cubes and spheres.
The evaluation procedure follows the same methodology as described in the Open-Ended Simulation and Planning section.
