Benchmarking Video World Models

WorldOlympiad: Can Your World Model Survive a Triathlon?

A unified benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity.

Yuke Zhao1*, Wangbo Zhao3*, Weijie Wang1*, Zeyu Zhang2*†, Dakai An3, Akide Liu4, Yinghao Yu5, Jiasheng Tang2‡, Fan Wang2, Wei Wang3, Bohan Zhuang1‡

1Zhejiang University · 2DAMO Academy, Alibaba Group · 3HKUST · 4Monash University · 5TRE, Alibaba Group

*Equal contribution. · †Project lead. · ‡Corresponding authors.

1,000 Benchmark Videos: Gaming, robotics, and real-world videos
8 Generation Pipelines: OpenWorldLib evaluation suite
3 Diagnostic Tracks: Physical, 3D, and interaction
0.95 Human Alignment: Spearman correlation with human rank

TL;DR

WorldOlympiad evaluates whether long-video generators can act as reliable world models, not just visually plausible video synthesizers.

WorldOlympiad decomposes world-model evaluation into three complementary tracks: whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons.

The benchmark covers 400 robotics videos, 400 gaming videos, and 200 real-world videos. Each video is standardized through chunking, captioning, and refinement before model rollout and automatic evaluation.

The main result is diagnostic: current models can be strong in physical regularity or interaction following, but geometric consistency and long-horizon state preservation remain unresolved.


Physical Faithfulness

Rule-level judging over mechanics, thermodynamics, and material behavior.

3D Consistency

Gaussian-splat reconstruction, meta-view quality, and camera trajectory alignment.

Interaction Fidelity

Chunk following, transition smoothness, and long-range global coherence.

Data Composition

The test set is built to cover three downstream settings with complementary world-modeling demands.

Robotics: 400 videos

RoboCOIN bimanual manipulation videos, manually filtered for object contact, gripper motion, and physically grounded interaction.

Gaming: 400 videos

GameGen-X open-world gameplay videos sampled from OGameData_50K and split into evaluation chunks of up to 60 seconds.

Real-world: 200 videos

LVD-2M long-take videos selected for a duration over 60 seconds and a motion score greater than 50.

1. Chunk: Identify the main continuous execution interval and split it into at most six contiguous chunks.

2. Caption: Generate chunk-level action labels and English captions describing scene, entities, events, and outcomes.

3. Refine: Use full-video context to correct hallucinations, standardize terminology, and validate camera-movement labels.
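
The chunking step can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the page only states that the main interval is split into at most six contiguous chunks, so the equal-length split, the function name `split_into_chunks`, and the `target_len_s` parameter are all assumptions made for the example.

```python
import math
from typing import List, Tuple


def split_into_chunks(
    start_s: float,
    end_s: float,
    max_chunks: int = 6,         # "at most six" comes from the page
    target_len_s: float = 10.0,  # hypothetical target chunk length
) -> List[Tuple[float, float]]:
    """Split [start_s, end_s] into equal, contiguous chunks (assumed policy)."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    duration = end_s - start_s
    # Use as many chunks as the target length suggests, capped at max_chunks.
    n = min(max_chunks, max(1, math.ceil(duration / target_len_s)))
    step = duration / n
    # Shared boundaries guarantee that chunk i ends exactly where chunk i+1 starts.
    bounds = [start_s + i * step for i in range(n)] + [end_s]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]


if __name__ == "__main__":
    for chunk in split_into_chunks(5.0, 65.0):
        print(chunk)
```

A real pipeline would more likely place boundaries at action or scene changes; the equal split here only demonstrates the "contiguous, at most six" constraint.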

Leaderboard

Scores are normalized to the 0-1 range.

Rank  Model               Task       Physical  3D Cons.  CLIP   Interact.  All
1     LingBot-World       All Tasks  0.942     0.373     0.314  0.734      0.683
2     Cosmos-Predict-2.5  All Tasks  0.906     0.399     0.313  0.707      0.671
3     Rolling Forcing     All Tasks  0.873     0.321     0.327  0.636      0.610
4     Yume-1.5            All Tasks  0.863     0.301     0.306  0.649      0.604
5     LongLive            All Tasks  0.863     0.363     0.323  0.526      0.584
6     Hunyuan-WorldPlay   All Tasks  0.692     0.424     0.290  0.316      0.477
7     WoW                 All Tasks  0.708     0.250     0.267  0.345      0.434
8     Matrix-Game 2.0     All Tasks  0.325     0.255     0.237  0.113      0.231
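
The page does not state how the "All" column is computed. The listed values are, however, consistent with an unweighted mean of the Physical, 3D, and Interaction scores, with CLIP reported separately. The short check below illustrates that reading; the aggregation rule is inferred from the numbers and should be treated as an assumption, and `track_mean` is a name chosen for this example.

```python
# Leaderboard rows: (physical, 3d, clip, interaction, listed_overall).
LEADERBOARD = {
    "LingBot-World":      (0.942, 0.373, 0.314, 0.734, 0.683),
    "Cosmos-Predict-2.5": (0.906, 0.399, 0.313, 0.707, 0.671),
    "Rolling Forcing":    (0.873, 0.321, 0.327, 0.636, 0.610),
    "Yume-1.5":           (0.863, 0.301, 0.306, 0.649, 0.604),
    "LongLive":           (0.863, 0.363, 0.323, 0.526, 0.584),
    "Hunyuan-WorldPlay":  (0.692, 0.424, 0.290, 0.316, 0.477),
    "WoW":                (0.708, 0.250, 0.267, 0.345, 0.434),
    "Matrix-Game 2.0":    (0.325, 0.255, 0.237, 0.113, 0.231),
}


def track_mean(physical: float, three_d: float, interaction: float) -> float:
    """Unweighted mean of the three track scores (assumed aggregation rule)."""
    return (physical + three_d + interaction) / 3


if __name__ == "__main__":
    for name, (phys, geo, _clip, inter, listed) in LEADERBOARD.items():
        mean = track_mean(phys, geo, inter)
        print(f"{name:20s} mean={mean:.3f} listed={listed:.3f}")
```

Under this reading, a model's overall rank depends on the three diagnostic tracks only, which is why Rolling Forcing's leading CLIP score does not lift its aggregate.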

Result Diagnostics

Aggregate charts from the paper show overall ranking, metric profiles, domain robustness, and human preference alignment.

Overall Ranking: Eight pipelines compared under the same aggregate score.
Metric Profile: Physical, 3D, CLIP, interaction, and total behavior.
Domain Robustness: Fine-grained gaming, robotics, and general-world transfer patterns.
Human Alignment: Pairwise preference agreement with automatic ranking.

Metric Profiles

Compact score matrix across physical faithfulness, 3D consistency, CLIP, interaction, and aggregate score.

LingBot-World is strongest on the aggregate score. Scores are normalized to [0, 1].

Domain Robustness

Fine-grained domain heatmaps for physical, 3D, and interaction diagnostics.

Physical detail heatmap.

Human Alignment

Automatic scores align with pairwise human preference rankings.

Spearman rho=0.95 between automatic and human ranks.
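
For readers unfamiliar with the statistic, Spearman's rho for two tie-free rankings of n items is 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-item rank difference. The sketch below applies that formula to a hypothetical pair of rankings over eight models. The human ranks are invented for illustration and are not the study's annotation data.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    automatic = [1, 2, 3, 4, 5, 6, 7, 8]
    # Hypothetical human ranking: two adjacent pairs swapped.
    human = [1, 2, 4, 3, 5, 6, 8, 7]
    print(round(spearman_rho(automatic, human), 3))  # 0.952
```

With eight items, two swapped adjacent pairs already give a value near 0.95, so a correlation at this level means the two rankings disagree only on a few close neighbors.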

Key Findings

WorldOlympiad is designed for diagnosis, not just a single aggregate score.

1. Model scale correlates with stronger world-model performance. LingBot-World is a 14B-activated model and ranks first overall, with strong physical faithfulness (0.942) and interaction fidelity (0.734). It also opens a clear aggregate gap over smaller 1.3B-scale models such as LongLive, improving the overall score from 0.584 to 0.683.

2. Physical regularity is emerging, but uneven. Several recent pipelines score highly on common physical behavior, yet thermodynamics and material-level rules remain more fragile than frequent mechanics patterns.

3. The geometry-simulation gap remains open. Hunyuan-WorldPlay is strongest on 3D consistency (0.424), but every other model falls in the 0.25-0.40 range, indicating unstable reconstruction, meta-view quality, or camera trajectory alignment.

4. Sustained single-domain training can support broad generalization. LingBot-World and Cosmos-Predict-2.5 are specialized around gaming and robotics, yet both remain strong across all three domains, indicating that domain-focused training can learn transferable world knowledge. However, this behavior is not guaranteed: WoW reaches 0.502 on robotics videos but drops to 0.368 on gaming and 0.415 on general real-world videos. How to reliably obtain stronger cross-domain generalization from specialized training remains an open direction.

5. Automatic judging tracks human preference. Pairwise human preference rankings and WorldOlympiad automatic rankings are highly consistent, with Spearman rho=0.95 across the annotated model set.

Evaluation Metrics

Each track exposes a different failure mode in video world models.

Physical Track

Object-centric masks and MLLM-as-judge compare generated videos against interpretable physical rules.

  • Mechanics: gravity, buoyancy, compression, impact.
  • Thermodynamics: melting, sublimation, vaporization, condensation, deposition, freezing.
  • Material: color mixing, solubility, hardness, combustibility.
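
The rule taxonomy above maps naturally onto a small lookup table. The sketch below encodes the three categories as listed and aggregates per-rule judge verdicts into per-category pass rates. The taxonomy comes from the page; the pass-rate aggregation, the `(rule, passed)` verdict format, and the function name are assumptions for illustration, since the page does not describe how judge outputs are scored.

```python
from typing import Dict, List, Tuple

# Rule taxonomy as listed for the Physical Track.
PHYSICAL_RULES: Dict[str, List[str]] = {
    "mechanics": ["gravity", "buoyancy", "compression", "impact"],
    "thermodynamics": ["melting", "sublimation", "vaporization",
                       "condensation", "deposition", "freezing"],
    "material": ["color mixing", "solubility", "hardness", "combustibility"],
}

# Reverse index: rule name -> category.
RULE_TO_CATEGORY = {
    rule: category
    for category, rules in PHYSICAL_RULES.items()
    for rule in rules
}


def category_pass_rates(verdicts: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of passed verdicts per category (hypothetical scoring rule)."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for rule, ok in verdicts:
        if rule not in RULE_TO_CATEGORY:
            raise ValueError(f"unknown rule: {rule}")
        category = RULE_TO_CATEGORY[rule]
        total[category] = total.get(category, 0) + 1
        passed[category] = passed.get(category, 0) + int(ok)
    return {c: passed[c] / total[c] for c in total}


if __name__ == "__main__":
    # Hypothetical judge outputs for one generated video.
    sample = [("gravity", True), ("impact", False),
              ("melting", True), ("hardness", True)]
    print(category_pass_rates(sample))
```

Keeping scores separate per category is what makes findings such as "thermodynamics is more fragile than mechanics" visible instead of being averaged away.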

3D Track

Generated videos are reconstructed and rendered to test whether the scene remains spatially coherent.

  • GS: Gaussian-splat video reconstruction quality.
  • Meta: novel meta-view image quality.
  • Camera Motion: generated camera trajectory versus reference trajectory.

Interaction Track

Chunk-generated long videos are judged locally and globally for action-conditioned continuity.

  • Chunk: per-chunk action and caption following.
  • Transition: adjacent boundary smoothness.
  • Global: long-range scene, identity, and semantic consistency.
  • CLIP: scalable semantic adherence signal.
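
CLIP-based adherence scores are commonly computed as the cosine similarity between a text embedding and image embeddings, averaged over frames. The sketch below shows that computation on plain vectors. It assumes the embeddings have already been produced by some CLIP encoder; the page does not specify the exact formulation used, and the function names and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(caption_emb: Sequence[float],
                     frame_embs: Sequence[Sequence[float]]) -> float:
    """Mean caption-to-frame cosine similarity for one chunk (assumed scoring)."""
    if not frame_embs:
        raise ValueError("need at least one frame embedding")
    return sum(cosine(caption_emb, f) for f in frame_embs) / len(frame_embs)


if __name__ == "__main__":
    caption = [1.0, 0.0]               # toy caption embedding
    frames = [[1.0, 0.0], [0.0, 1.0]]  # toy frame embeddings
    print(clip_style_score(caption, frames))  # 0.5
```

Such a score is cheap to compute at scale, but it measures only semantic agreement between caption and frames and says nothing about transitions or long-range state, which is why the track lists it alongside the judged metrics.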

Qualitative Compare Cases

Good vs Bad comparisons selected from outputs_batch/cases.md. Each paired clip uses the same scenario; the left side is the stronger rollout and the right side shows the failure.


Acknowledgement

WorldOlympiad acknowledges the generation infrastructure and foundation models that support the benchmark workflow.

We thank OpenWorldLib for providing the generation pipeline infrastructure used in our video world model rollouts. We also acknowledge Qwen3-VL, Depth Anything 3, and SAM 3, whose capabilities support video understanding, 3D reconstruction, and segmentation in our workflow.

License

WorldOlympiad is released under a permissive open-source license.

The WorldOlympiad project website and code are made available under the MIT License. You may use, modify, and redistribute the project materials under the terms of the MIT License.

Citation

If WorldOlympiad is useful for your research, please cite our work.

BibTeX

@misc{zhao2026worldolympiad,
  title        = {WorldOlympiad: Can Your World Model Survive a Triathlon?},
  author       = {Yuke Zhao and Wangbo Zhao and Weijie Wang and
                  Zeyu Zhang and Dakai An and Akide Liu and
                  Yinghao Yu and Jiasheng Tang and Fan Wang and
                  Wei Wang and Bohan Zhuang},
  year         = {2026},
  howpublished = {\url{https://alibaba-damo-academy.github.io/WorldOlympiad}},
  note         = {Project website and benchmark}
}