Benchmarking Video World Models

WorldOlympiad: Can Your World Model Survive a Triathlon?

A unified benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity.

Yuke Zhao¹*, Wangbo Zhao³*, Weijie Wang¹*, Zeyu Zhang²*†, Dakai An³, Akide Liu⁴, Yinghao Yu⁵, Jiasheng Tang²‡, Fan Wang², Wei Wang³, Bohan Zhuang¹‡

¹Zhejiang University · ²DAMO Academy, Alibaba Group · ³HKUST · ⁴Monash University · ⁵TRE, Alibaba Group

*Equal contribution. · †Project lead. · ‡Corresponding authors.


1,000 Benchmark Videos: gaming, robotics, and real-world videos

8 Generation Pipelines: OpenWorldLib evaluation suite

3 Diagnostic Tracks: physical, 3D, and interaction

0.95 Human Alignment: Spearman correlation with human rank

TL;DR

WorldOlympiad evaluates whether long-video generators can act as reliable world models, not just visually plausible video synthesizers.

WorldOlympiad decomposes world-model evaluation into three complementary tracks: whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons.

The benchmark covers 400 robotics videos, 400 gaming videos, and 200 real-world videos. Each video is standardized through chunking, captioning, and refinement before model rollout and automatic evaluation.

The main result is diagnostic: current models can be strong in physical regularity or interaction following, but geometric consistency and long-horizon state preservation remain unresolved.

Demo clips: gaming simulator; triangle bread (target-object following); real-world simulator.

Physical Faithfulness

Rule-level judging over mechanics, thermodynamics, and material behavior.

3D Consistency

Gaussian-splat reconstruction, meta-view quality, and camera trajectory alignment.

Interaction Fidelity

Chunk following, transition smoothness, and long-range global coherence.

Data Composition

The test set is built to cover three downstream settings with complementary world-modeling demands.

Robotics

400

RoboCOIN bimanual manipulation videos, manually filtered for object contact, gripper motion, and physically grounded interaction.

Gaming

400

GameGen-X open-world gameplay videos sampled from OGameData_50K and split into evaluation chunks of up to 60 seconds.

Real-world

200

LVD-2M long-take videos, selected for a duration over 60 seconds and a motion score above 50.

Chunk

Identify the main continuous execution interval and split it into at most six contiguous chunks.

Caption

Generate chunk-level action labels and English captions describing scene, entities, events, and outcomes.

Refine

Use full-video context to correct hallucinations, standardize terminology, and validate camera-movement labels.
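The chunking step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the main execution interval is already known as start and end times in seconds, and it assumes equal-length chunks. The function name `split_interval` and the parameter `max_chunk_seconds` are hypothetical; only the cap of six contiguous chunks comes from the description above.

```python
import math


def split_interval(start, end, max_chunks=6, max_chunk_seconds=10.0):
    """Split [start, end] into at most `max_chunks` contiguous, equal-length chunks.

    `max_chunk_seconds` is a hypothetical target length. When the interval is
    too long for `max_chunks` chunks of that size, the chunks grow to cover it.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    duration = end - start
    n = min(max_chunks, max(1, math.ceil(duration / max_chunk_seconds)))
    step = duration / n
    # Build the boundaries once so each chunk ends exactly where the next begins.
    bounds = [start + i * step for i in range(n)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


chunks = split_interval(2.0, 62.0)
for index, (chunk_start, chunk_end) in enumerate(chunks, start=1):
    print(f"chunk {index}: {chunk_start:.1f}s to {chunk_end:.1f}s")
```

Sharing boundaries between neighbouring chunks keeps the split gap-free, which matters later when transitions between adjacent chunks are judged.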

Leaderboard

Scores are normalized to the 0-1 range.


Rank	Model	Task	Physical	3D Cons.	CLIP	Interact.	All
1	LingBot-World	All Tasks	0.942	0.373	0.314	0.734	0.683
2	Cosmos-Predict-2.5	All Tasks	0.906	0.399	0.313	0.707	0.671
3	Rolling Forcing	All Tasks	0.873	0.321	0.327	0.636	0.610
4	Yume-1.5	All Tasks	0.863	0.301	0.306	0.649	0.604
5	LongLive	All Tasks	0.863	0.363	0.323	0.526	0.584
6	Hunyuan-WorldPlay	All Tasks	0.692	0.424	0.290	0.316	0.477
7	WoW	All Tasks	0.708	0.250	0.267	0.345	0.434
8	Matrix-Game 2.0	All Tasks	0.325	0.255	0.237	0.113	0.231
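
The aggregation rule is not stated on this page, but the All column is numerically consistent with the unweighted mean of the Physical, 3D, and Interaction scores, with CLIP left out. The sketch below checks that observation against the rows above. Treat it as an inference from the table, not as the benchmark's documented formula; the helper name `track_mean` is hypothetical.

```python
# (physical, 3d_consistency, interaction, reported_all) copied from the leaderboard.
ROWS = {
    "LingBot-World": (0.942, 0.373, 0.734, 0.683),
    "Cosmos-Predict-2.5": (0.906, 0.399, 0.707, 0.671),
    "Rolling Forcing": (0.873, 0.321, 0.636, 0.610),
    "Yume-1.5": (0.863, 0.301, 0.649, 0.604),
    "LongLive": (0.863, 0.363, 0.526, 0.584),
    "Hunyuan-WorldPlay": (0.692, 0.424, 0.316, 0.477),
    "WoW": (0.708, 0.250, 0.345, 0.434),
    "Matrix-Game 2.0": (0.325, 0.255, 0.113, 0.231),
}


def track_mean(physical, three_d, interaction):
    """Unweighted mean of the three diagnostic tracks (assumed aggregation)."""
    return (physical + three_d + interaction) / 3


# Largest difference between the assumed mean and the reported All score.
max_gap = max(
    abs(track_mean(p, g, i) - reported) for p, g, i, reported in ROWS.values()
)
print(f"largest gap to the reported All score: {max_gap:.4f}")
```

A gap below 0.0005 for every row is what three-decimal rounding alone would produce.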

Result Diagnostics

Aggregate charts from the paper show overall ranking, metric profiles, domain robustness, and human preference alignment.

Overall Ranking: eight pipelines compared under the same aggregate score.

Metric Profile: physical, 3D, CLIP, interaction, and total behavior.

Domain Robustness: fine-grained gaming, robotics, and general-world transfer patterns.

Human Alignment: pairwise preference agreement with automatic ranking.

Main Overall Ranking

The All score ranks the eight pipelines on the complete benchmark. Ranking by a single diagnostic track reorders them; for example, Hunyuan-WorldPlay is sixth overall but first on 3D consistency.

Metric Profiles

Compact score matrix across physical faithfulness, 3D consistency, CLIP, interaction, and aggregate score.

LingBot-World is strongest on the aggregate score. Scores are normalized to [0, 1].

Domain Robustness

Fine-grained domain heatmaps for physical, 3D, and interaction diagnostics.

Physical detail heatmap.

Human Alignment

Automatic scores align with pairwise human preference rankings.

Spearman rho=0.95 between automatic and human ranks.

Key Findings

WorldOlympiad is designed for diagnosis, not just a single aggregate score.

Model scale correlates with stronger world-model performance. LingBot-World is a 14B-activated model and ranks first overall, with strong physical faithfulness (0.942) and interaction fidelity (0.734). It also opens a clear aggregate gap over smaller 1.3B-scale models such as LongLive, improving the overall score from 0.584 to 0.683.

Physical regularity is emerging, but uneven. Several recent pipelines score highly on common physical behavior, yet thermodynamics and material-level rules remain more fragile than frequent mechanics patterns.

The geometry-simulation gap remains open. Hunyuan-WorldPlay is strongest on 3D consistency (0.424), but most models remain in the 0.25-0.40 range, indicating unstable reconstruction, meta-view quality, or camera trajectory alignment.

Sustained single-domain training can support broad generalization. LingBot-World and Cosmos-Predict-2.5 are specialized around gaming and robotics, yet both remain strong across all three domains, indicating that domain-focused training can learn transferable world knowledge. However, this behavior is not guaranteed: WoW reaches 0.502 on robotics videos but drops to 0.368 on gaming and 0.415 on general videos. How to reliably obtain stronger cross-domain generalization from specialized training remains an open direction.

Automatic judging tracks human preference. Pairwise human preference rankings and WorldOlympiad automatic rankings are highly consistent, with Spearman rho=0.95 across the annotated model set.
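
For readers unfamiliar with the statistic, here is a minimal sketch of Spearman's rho for two tie-free rankings, using the closed form 1 - 6·Σd² / (n(n² - 1)). The human ranking below is a made-up placeholder chosen only to exercise the function; it is not the annotation data from the study, and the size of the annotated model set is not stated on this page.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length of at least 2")
    # Sum of squared rank differences, item by item.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Automatic leaderboard order for eight models.
automatic_rank = [1, 2, 3, 4, 5, 6, 7, 8]
# HYPOTHETICAL human order with two adjacent swaps; placeholder values only.
placeholder_human_rank = [1, 2, 4, 3, 5, 6, 8, 7]

rho = spearman_rho(automatic_rank, placeholder_human_rank)
print(f"rho = {rho:.3f}")
```

With only a handful of ranked models, a single adjacent swap moves rho by a visible amount, so the statistic is best read alongside the pairwise preference data itself.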

Evaluation Metrics

Each track exposes a different failure mode in video world models.

Physical Track

Object-centric masks and an MLLM-as-judge are used to check generated videos against interpretable physical rules.

Mechanics: gravity, buoyancy, compression, impact.
Thermodynamics: melting, sublimation, vaporization, condensation, deposition, freezing.
Material: color mixing, solubility, hardness, combustibility.
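
The rule taxonomy above maps naturally onto a small lookup table. The sketch below encodes the three categories and their rules exactly as listed; the per-category averaging in `category_means` is a hypothetical aggregation added for illustration, since the page does not say how rule-level judgments are combined.

```python
# Categories and rules as listed for the Physical Track.
PHYSICAL_RULES = {
    "mechanics": ["gravity", "buoyancy", "compression", "impact"],
    "thermodynamics": [
        "melting", "sublimation", "vaporization",
        "condensation", "deposition", "freezing",
    ],
    "material": ["color mixing", "solubility", "hardness", "combustibility"],
}


def category_of(rule):
    """Return the category a rule belongs to, or raise KeyError if unknown."""
    for category, rules in PHYSICAL_RULES.items():
        if rule in rules:
            return category
    raise KeyError(rule)


def category_means(rule_scores):
    """Average per-rule scores within each category (assumed aggregation)."""
    buckets = {}
    for rule, score in rule_scores.items():
        buckets.setdefault(category_of(rule), []).append(score)
    return {category: sum(s) / len(s) for category, s in buckets.items()}


# Illustrative scores only; these are not benchmark results.
example = category_means({"gravity": 1.0, "impact": 0.5, "melting": 0.0})
print(example)
```

Keeping rule-level scores separate before averaging is what makes it possible to see, for instance, thermodynamics lagging behind mechanics.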

3D Track

Generated videos are reconstructed and rendered to test whether the scene remains spatially coherent.

GS: Gaussian-splat video reconstruction quality.
Meta: novel meta-view image quality.
Camera Motion: generated camera trajectory versus reference trajectory.

Interaction Track

Chunk-generated long videos are judged locally and globally for action-conditioned continuity.

Chunk: per-chunk action and caption following.
Transition: adjacent boundary smoothness.
Global: long-range scene, identity, and semantic consistency.
CLIP: scalable semantic adherence signal.

Qualitative Compare Cases

Good vs Bad comparisons selected from outputs_batch/cases.md. Each paired clip uses the same scenario; the left side is the stronger rollout and the right side shows the failure.


Acknowledgement

WorldOlympiad acknowledges the generation infrastructure and foundation models that support the benchmark workflow.

We thank OpenWorldLib for the generation pipeline infrastructure used in our video world model rollouts. We also acknowledge Qwen3-VL, Depth Anything 3, and SAM 3, whose capabilities support video understanding, 3D reconstruction, and segmentation in our workflow.

License

WorldOlympiad is released under a permissive open-source license.

The WorldOlympiad project website and code are made available under the MIT License. You may use, modify, and redistribute the project materials under the terms of the MIT License.

Citation

If WorldOlympiad is useful for your research, please cite our arXiv preprint.

BibTeX

@misc{zhao2026worldolympiadworldmodelsurvive,
  title         = {WorldOlympiad: Can Your World Model Survive a Triathlon?},
  author        = {Yuke Zhao and Wangbo Zhao and Weijie Wang and Zeyu Zhang and
                   Dakai An and Akide Liu and Yinghao Yu and Jiasheng Tang and
                   Fan Wang and Wei Wang and Bohan Zhuang},
  year          = {2026},
  eprint        = {2606.11129},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.11129}
}