A unified benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity.
¹Zhejiang University · ²DAMO Academy, Alibaba Group · ³HKUST · ⁴Monash University · ⁵TRE, Alibaba Group
*Equal contribution. · †Project lead. · ‡Corresponding authors.
WorldOlympiad evaluates whether long-video generators can act as reliable world models, not just visually plausible video synthesizers.
WorldOlympiad decomposes world-model evaluation into three complementary tracks: whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons.
The benchmark covers 400 robotics videos, 400 gaming videos, and 200 real-world videos. Each video is standardized through chunking, captioning, and refinement before model rollout and automatic evaluation.
The main result is diagnostic: current models can be strong in physical regularity or interaction following, but geometric consistency and long-horizon state preservation remain unresolved.
Physical faithfulness: rule-level judging over mechanics, thermodynamics, and material behavior.
Geometric consistency: Gaussian-splat reconstruction, meta-view quality, and camera trajectory alignment.
Interaction fidelity: chunk following, transition smoothness, and long-range global coherence.
The test set is built to cover three downstream settings with complementary world-modeling demands.
Chunking: identify the main continuous execution interval and split it into at most six contiguous chunks.
Captioning: generate chunk-level action labels and English captions describing scene, entities, events, and outcomes.
Refinement: use full-video context to correct hallucinations, standardize terminology, and validate camera-movement labels.
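The chunking constraint in the first step (one main interval, at most six contiguous chunks) can be sketched as follows. This is only an illustration: the function name `split_into_chunks`, the equal-length split, and the `min_chunk_seconds` parameter are assumptions, because the page does not say how chunk boundaries are actually chosen.

```python
from typing import List, Tuple

MAX_CHUNKS = 6  # upper bound stated in the pipeline description


def split_into_chunks(
    start: float,
    end: float,
    min_chunk_seconds: float = 2.0,  # hypothetical parameter, not from the source
) -> List[Tuple[float, float]]:
    """Split the main interval [start, end] into at most MAX_CHUNKS contiguous chunks.

    Equal-length splitting is an assumption made for illustration only.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    duration = end - start
    # Use as many chunks as fit without dropping below the minimum length, capped at six.
    n_chunks = max(1, min(MAX_CHUNKS, int(duration // min_chunk_seconds)))
    step = duration / n_chunks
    bounds = [start + i * step for i in range(n_chunks)] + [end]
    # Consecutive boundaries share endpoints, so the chunks are contiguous.
    return list(zip(bounds[:-1], bounds[1:]))


chunks = split_into_chunks(0.0, 30.0)
print(len(chunks), chunks[0], chunks[-1])
```

Under these assumptions a 30-second interval yields six 5-second chunks, while an interval shorter than the minimum length stays as a single chunk.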
Scores are normalized to the 0-1 range.
| Rank | Model | Task | Physical | 3D Cons. | CLIP | Interact. | All |
|---|---|---|---|---|---|---|---|
| 1 | LingBot-World | All Tasks | 0.942 | 0.373 | 0.314 | 0.734 | 0.683 |
| 2 | Cosmos-Predict-2.5 | All Tasks | 0.906 | 0.399 | 0.313 | 0.707 | 0.671 |
| 3 | Rolling Forcing | All Tasks | 0.873 | 0.321 | 0.327 | 0.636 | 0.610 |
| 4 | Yume-1.5 | All Tasks | 0.863 | 0.301 | 0.306 | 0.649 | 0.604 |
| 5 | LongLive | All Tasks | 0.863 | 0.363 | 0.323 | 0.526 | 0.584 |
| 6 | Hunyuan-WorldPlay | All Tasks | 0.692 | 0.424 | 0.290 | 0.316 | 0.477 |
| 7 | WoW | All Tasks | 0.708 | 0.250 | 0.267 | 0.345 | 0.434 |
| 8 | Matrix-Game 2.0 | All Tasks | 0.325 | 0.255 | 0.237 | 0.113 | 0.231 |
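The page does not state how the All column is computed. On the published numbers it coincides, to three decimals, with the unweighted mean of the Physical, 3D Cons., and Interact. columns, with CLIP left out. The sketch below checks that observation and shows how the ranking reorders under a single criterion; the dictionary keys and the helper names `mean_of_tracks` and `rank_by` are illustrative, and the averaging rule is an inference from the table rather than a confirmed formula.

```python
from typing import Dict, List

# Scores copied from the leaderboard (All Tasks rows).
LEADERBOARD: Dict[str, Dict[str, float]] = {
    "LingBot-World":      {"physical": 0.942, "cons3d": 0.373, "clip": 0.314, "interact": 0.734, "all": 0.683},
    "Cosmos-Predict-2.5": {"physical": 0.906, "cons3d": 0.399, "clip": 0.313, "interact": 0.707, "all": 0.671},
    "Rolling Forcing":    {"physical": 0.873, "cons3d": 0.321, "clip": 0.327, "interact": 0.636, "all": 0.610},
    "Yume-1.5":           {"physical": 0.863, "cons3d": 0.301, "clip": 0.306, "interact": 0.649, "all": 0.604},
    "LongLive":           {"physical": 0.863, "cons3d": 0.363, "clip": 0.323, "interact": 0.526, "all": 0.584},
    "Hunyuan-WorldPlay":  {"physical": 0.692, "cons3d": 0.424, "clip": 0.290, "interact": 0.316, "all": 0.477},
    "WoW":                {"physical": 0.708, "cons3d": 0.250, "clip": 0.267, "interact": 0.345, "all": 0.434},
    "Matrix-Game 2.0":    {"physical": 0.325, "cons3d": 0.255, "clip": 0.237, "interact": 0.113, "all": 0.231},
}


def mean_of_tracks(scores: Dict[str, float]) -> float:
    """Unweighted mean of the three track scores (assumed rule; CLIP is excluded)."""
    return (scores["physical"] + scores["cons3d"] + scores["interact"]) / 3


def rank_by(criterion: str) -> List[str]:
    """Model names sorted from best to worst on one column."""
    return sorted(LEADERBOARD, key=lambda m: LEADERBOARD[m][criterion], reverse=True)


# Largest difference between the assumed mean and the published All score.
max_gap = max(abs(mean_of_tracks(s) - s["all"]) for s in LEADERBOARD.values())
print(f"largest gap to published All score: {max_gap:.4f}")
print("top three on 3D consistency:", rank_by("cons3d")[:3])
```

Every gap stays below the rounding error of a three-decimal score, and sorting on 3D consistency moves Hunyuan-WorldPlay from sixth place overall to first.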
Aggregate charts from the paper show overall ranking, metric profiles, domain robustness, and human preference alignment.
Switch the diagnostic track to inspect how the eight pipelines reorder under each criterion.
Compact score matrix across physical faithfulness, 3D consistency, CLIP, interaction, and aggregate score.
Fine-grained domain heatmaps for physical, 3D, and interaction diagnostics.
Automatic scores align with pairwise human preference rankings.
WorldOlympiad is designed for diagnosis, not just a single aggregate score.
Model scale correlates with stronger world-model performance. LingBot-World is a 14B-activated model and ranks first overall, with strong physical faithfulness (0.942) and interaction fidelity (0.734). It also opens a clear aggregate gap over smaller 1.3B-scale models such as LongLive, improving the overall score from 0.584 to 0.683.
Physical regularity is emerging, but uneven. Several recent pipelines score highly on common physical behavior, yet thermodynamics and material-level rules remain more fragile than frequent mechanics patterns.
The geometry-simulation gap remains open. Hunyuan-WorldPlay is strongest on 3D consistency (0.424), but the other seven models fall in the 0.25-0.40 range, indicating unstable reconstruction, meta-view quality, or camera trajectory alignment.
Sustained single-domain training can support broad generalization. LingBot-World and Cosmos-Predict-2.5 are specialized around gaming and robotics, yet both remain strong across all three domains, indicating that domain-focused training can learn transferable world knowledge. However, this behavior is not guaranteed: WoW reaches 0.502 on robotics videos but drops to 0.368 on gaming videos and 0.415 on real-world videos. How to reliably obtain stronger cross-domain generalization from specialized training remains an open direction.
Automatic judging tracks human preference. Pairwise human preference rankings and WorldOlympiad automatic rankings are highly consistent, with Spearman's ρ = 0.95 across the annotated model set.
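For readers unfamiliar with the statistic, Spearman's ρ compares two rankings of the same items. The human preference annotations are not reproduced on this page, so the sketch below does not recompute the 0.95 figure. It applies the same statistic to two columns of the leaderboard, correlating the 3D consistency ranking with the overall ranking. The helper names `ranks` and `spearman_rho` are illustrative, and the closed-form formula used here assumes no tied scores, which holds for these two columns.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score; assumes all scores are distinct."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman_rho(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    if set(a) != set(b):
        raise ValueError("both score tables must cover the same models")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


# 3D Cons. and All columns copied from the leaderboard.
cons3d = {
    "LingBot-World": 0.373, "Cosmos-Predict-2.5": 0.399, "Rolling Forcing": 0.321,
    "Yume-1.5": 0.301, "LongLive": 0.363, "Hunyuan-WorldPlay": 0.424,
    "WoW": 0.250, "Matrix-Game 2.0": 0.255,
}
overall = {
    "LingBot-World": 0.683, "Cosmos-Predict-2.5": 0.671, "Rolling Forcing": 0.610,
    "Yume-1.5": 0.604, "LongLive": 0.584, "Hunyuan-WorldPlay": 0.477,
    "WoW": 0.434, "Matrix-Game 2.0": 0.231,
}

rho = spearman_rho(cons3d, overall)
print(round(rho, 3))  # 0.524
```

A value near 0.52 means the 3D consistency ranking only loosely follows the overall ranking, which illustrates why the benchmark reports the tracks separately.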
Each track exposes a different failure mode in video world models.
Physical faithfulness: object-centric masks and MLLM-as-judge compare generated videos against interpretable physical rules.
Geometric consistency: generated videos are reconstructed and rendered to test whether the scene remains spatially coherent.
Interaction fidelity: chunk-generated long videos are judged locally and globally for action-conditioned continuity.
Good vs Bad comparisons selected from outputs_batch/cases.md. Each paired clip uses the same scenario; the left side is the stronger rollout and the right side shows the failure.
WorldOlympiad acknowledges the generation infrastructure and foundation models that support the benchmark workflow.
We thank OpenWorldLib for providing the generation pipeline infrastructure used in our video world model rollouts, and we also acknowledge Qwen3-VL, Depth Anything 3, and SAM 3 for their foundational model capabilities that support video understanding, 3D reconstruction, and segmentation in our workflow.
WorldOlympiad is released under a permissive open-source license.
The WorldOlympiad project website and code are made available under the MIT License. You may use, modify, and redistribute the project materials under the terms of the MIT License.
If WorldOlympiad is useful for your research, please cite our work.
@misc{zhao2026worldolympiad,
title = {WorldOlympiad: Can Your World Model Survive a Triathlon?},
author = {Yuke Zhao and Wangbo Zhao and Weijie Wang and
Zeyu Zhang and Dakai An and Akide Liu and
Yinghao Yu and Jiasheng Tang and Fan Wang and
Wei Wang and Bohan Zhuang},
year = {2026},
howpublished = {\url{https://alibaba-damo-academy.github.io/WorldOlympiad}},
note = {Project website and benchmark}
}