Performance of various vision-language models in the multi-image setting across six task categories: Time and Calendar, Space and Location, Geometry and Shapes, Objects and Motion, Reasoning and Observation, and Organization and Pattern.
Models | Avg. | Time and Calendar | | Space and Location | | | Geometry and Shapes | | | | | Objects and Motion | | Reasoning and Observation | | Organization and Pattern | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Subtask | | Calendar | Clock | Direction | Location | Place | Angle | Quad | Rectangular | Shape | Triangle | Cube | Move | Reasoning | Observe | Organize | Pattern | Weight |
Random Guess | 29.83 | 33.33 | 32.78 | 25.00 | 29.81 | 33.33 | 31.00 | 27.63 | 29.17 | 31.84 | 29.01 | 28.37 | 29.35 | 33.33 | 29.41 | 30.17 | 31.32 | 33.33 |
Human | 93.30 | 100 | 96.00 | 100 | 93.85 | 96.67 | 95.60 | 96.84 | 95.00 | 94.02 | 94.07 | 97.67 | 94.63 | 100 | 93.59 | 93.20 | 95.52 | 100 |
Models | ||||||||||||||||||
Gemini2.0-Flash 🥇 | 49.77 | 100.00 | 70.00 | 20.00 | 57.69 | 66.67 | 70.00 | 68.42 | 53.57 | 61.14 | 70.37 | 44.52 | 68.75 | 40.00 | 35.53 | 74.00 | 46.55 | 100.00 |
Qwen2.5-VL-72B-Instruct 🥈 | 48.08 | 0.00 | 40.67 | 0.00 | 53.85 | 50.00 | 68.00 | 68.42 | 53.57 | 55.02 | 74.07 | 58.22 | 60.66 | 60.00 | 35.53 | 76.00 | 43.10 | 100.00 |
LLaVA-OneVision-72B 🥉 | 47.67 | 0.00 | 33.33 | 0.00 | 26.92 | 66.67 | 61.20 | 57.89 | 57.14 | 60.70 | 51.85 | 41.10 | 60.29 | 100.00 | 38.24 | 82.00 | 41.38 | 80.00 |
Qwen-VL-Max | 47.03 | 0.00 | 46.67 | 0.00 | 42.31 | 66.67 | 74.00 | 52.63 | 42.86 | 54.15 | 66.67 | 56.16 | 60.66 | 50.00 | 35.27 | 68.00 | 39.66 | 100.00 |
Claude-3.7-Sonnet | 46.63 | 100.00 | 50.00 | 100.00 | 53.85 | 50.00 | 58.00 | 63.16 | 57.14 | 60.70 | 59.26 | 40.41 | 67.28 | 100.00 | 31.27 | 76.40 | 53.45 | 100.00 |
GPT-4o | 40.29 | 100.00 | 40.00 | 20.00 | 30.77 | 66.67 | 46.00 | 57.89 | 28.57 | 50.22 | 51.85 | 37.67 | 50.37 | 90.00 | 31.27 | 76.00 | 37.93 | 80.00 |
QVQ-72B-Preview | 39.13 | 100.00 | 43.33 | 0.00 | 46.15 | 83.33 | 58.00 | 42.11 | 46.43 | 44.10 | 62.96 | 36.30 | 48.16 | 50.00 | 28.55 | 78.00 | 48.28 | 100.00 |
Gemma3-27B-it | 38.02 | 100.00 | 50.00 | 0.00 | 38.46 | 83.33 | 48.40 | 31.58 | 25.00 | 41.92 | 40.74 | 32.88 | 47.79 | 50.00 | 32.82 | 54.00 | 31.03 | 80.00 |
InternVL2.5-78B | 37.56 | 20.00 | 31.33 | 100.00 | 42.31 | 66.67 | 54.00 | 47.37 | 46.43 | 53.28 | 55.56 | 33.56 | 40.44 | 50.00 | 28.04 | 76.00 | 31.03 | 100.00 |
Kimi-VL-A3B-Instruct | 37.33 | 0.00 | 46.67 | 0.00 | 30.77 | 83.33 | 44.00 | 47.37 | 39.29 | 43.23 | 33.33 | 34.93 | 44.49 | 50.00 | 31.31 | 58.00 | 36.21 | 0.00 |
LLaVA-OneVision-7B | 36.63 | 0.00 | 40.00 | 0.00 | 11.54 | 83.33 | 44.00 | 36.84 | 32.14 | 37.99 | 48.15 | 30.82 | 46.69 | 50.00 | 32.56 | 58.00 | 29.31 | 100.00 |
LLaVA-Interleave-7B | 35.47 | 0.00 | 36.67 | 20.00 | 19.23 | 83.33 | 46.00 | 26.32 | 57.14 | 39.74 | 29.63 | 30.82 | 33.46 | 50.00 | 33.46 | 62.00 | 31.03 | 100.00 |
GPT-4o-mini | 34.88 | 80.00 | 60.66 | 0.00 | 38.46 | 53.33 | 38.40 | 21.05 | 53.57 | 37.99 | 55.56 | 32.19 | 38.24 | 0.00 | 28.68 | 60.00 | 41.38 | 100.00 |
Kimi-VL-A3B-Thinking | 34.13 | 100.00 | 26.67 | 100.00 | 30.77 | 33.33 | 48.00 | 36.84 | 28.57 | 49.78 | 33.33 | 30.14 | 41.91 | 0.00 | 25.32 | 68.00 | 27.59 | 100.00 |
Mistral-Small-3.1-24B | 31.34 | 20.00 | 40.00 | 0.00 | 30.77 | 30.00 | 38.00 | 31.58 | 35.71 | 29.26 | 51.85 | 30.82 | 31.62 | 50.00 | 29.59 | 38.00 | 34.48 | 20.00 |
Mantis-CLIP | 30.23 | 0.00 | 30.00 | 80.00 | 50.00 | 66.67 | 14.00 | 15.79 | 35.71 | 38.43 | 37.04 | 19.86 | 32.35 | 40.00 | 28.04 | 52.40 | 22.41 | 100.00 |
Qwen2.5-VL-7B-Instruct | 29.24 | 100.00 | 13.33 | 0.00 | 19.23 | 50.00 | 20.00 | 31.58 | 25.00 | 30.13 | 51.85 | 32.19 | 40.81 | 0.00 | 25.19 | 30.00 | 27.59 | 0.00 |
Llama-3.2-90B-Vision-Instruct | 25.41 | 20.00 | 24.67 | 100.00 | 11.54 | 16.67 | 26.40 | 31.58 | 32.14 | 26.20 | 22.22 | 27.40 | 25.37 | 0.00 | 25.58 | 12.00 | 29.31 | 20.00 |
InternVL2.5-8B | 24.71 | 0.00 | 33.33 | 0.00 | 34.62 | 50.00 | 34.00 | 31.58 | 50.00 | 35.81 | 37.04 | 23.29 | 25.74 | 0.00 | 18.99 | 38.00 | 6.90 | 0.00 |
Phi-3.5-vision-instruct | 22.73 | 0.00 | 23.33 | 100.00 | 19.23 | 66.67 | 16.00 | 15.79 | 28.57 | 27.07 | 33.33 | 22.60 | 22.79 | 0.00 | 21.32 | 34.00 | 12.07 | 0.00 |
DeepSeek-VL2 | 15.47 | 0.00 | 23.33 | 0.00 | 23.08 | 16.67 | 14.00 | 10.53 | 14.29 | 29.69 | 14.81 | 6.85 | 18.38 | 10.00 | 9.43 | 44.00 | 20.69 | 0.00 |
Idefics3-8B | 12.91 | 0.00 | 3.33 | 20.00 | 15.38 | 33.33 | 11.60 | 10.53 | 17.86 | 23.14 | 3.70 | 9.59 | 16.91 | 0.00 | 9.69 | 8.40 | 15.52 | 0.00 |
Emu2-Chat | 6.05 | 0.00 | 13.33 | 0.00 | 3.85 | 0.00 | 4.00 | 10.53 | 10.71 | 12.66 | 3.70 | 8.90 | 6.99 | 0.00 | 3.62 | 0.00 | 3.45 | 20.00 |
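
One way to read the Avg. column is relative to the two reference rows, Random Guess (29.83) and Human (93.30). The sketch below is only an illustration of that comparison, not part of the original evaluation: it copies a handful of Avg. values from the table, and the helper names (`margin_over_chance`, `human_gap`, `summarize`) are hypothetical.

```python
# Avg. values copied from the table for a few representative rows.
RANDOM_GUESS_AVG = 29.83
HUMAN_AVG = 93.30

avg_scores = {
    "Gemini2.0-Flash": 49.77,
    "Qwen2.5-VL-72B-Instruct": 48.08,
    "GPT-4o": 40.29,
    "Qwen2.5-VL-7B-Instruct": 29.24,
    "Llama-3.2-90B-Vision-Instruct": 25.41,
    "Emu2-Chat": 6.05,
}


def margin_over_chance(score, chance=RANDOM_GUESS_AVG):
    """Points above (positive) or below (negative) the random-guess row."""
    return round(score - chance, 2)


def human_gap(score, human=HUMAN_AVG):
    """Points separating a model from the human row."""
    return round(human - score, 2)


def summarize(scores):
    """Return (model, margin over chance, gap to human), best Avg. first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(m, margin_over_chance(s), human_gap(s)) for m, s in ranked]


if __name__ == "__main__":
    for model, margin, gap in summarize(avg_scores):
        flag = "below chance" if margin < 0 else "above chance"
        print(f"{model:32s} {margin:+7.2f} vs. chance ({flag}), {gap:6.2f} below human")
```

Under this reading, even the top-ranked row sits much closer to the random-guess average than to the human average, and several rows fall below the random-guess average.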