Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

1DAMO Academy, Alibaba Group, 2Hupan Lab, 3Zhejiang University, 4Singapore University of Technology and Design
*Equal contribution

We introduce VCBench, a benchmark for evaluating explicit visual dependency in multimodal mathematical reasoning: every question is paired with the images it depends on, so it cannot be answered from the text alone.

Comparison of VCBench and Other Visual Math Benchmarks

Comparison of VCBench with widely adopted visual math benchmarks. VCBench not only covers all required skills and supports multi-image questions, but also has a higher image-to-question ratio, indicating richer visual context per question than other benchmarks.
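
As a concrete illustration, the image-to-question ratio above can be computed directly from a benchmark's annotation file. The snippet below is a minimal sketch assuming a hypothetical JSON layout (a top-level "questions" list whose entries each carry an "images" list); VCBench's actual schema may differ.

import json

# Minimal sketch of the image-to-question ratio mentioned above.
# The annotation layout ("questions" list, each with an "images" list)
# is an assumed schema, not necessarily VCBench's actual format.
def image_to_question_ratio(path: str) -> float:
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)["questions"]
    return sum(len(q["images"]) for q in questions) / len(questions)

print(f"{image_to_question_ratio('vcbench.json'):.2f} images per question")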

Data analysis of VCBench reflects a rich diversity of question categories, image sources, and evaluation dimensions.


Visual samples from VCBench

Main Results

Performance of various vision-language models in the multi-image setting across six task groups: Time and Calendar, Space and Location, Geometry and Shapes, Objects and Motion, Reasoning and Observation, and Organization and Pattern. All numbers are accuracy (%).

Column groups: Time and Calendar (Calendar, Clock); Space and Location (Direction, Location, Place); Geometry and Shapes (Angle, Quad, Rectangular, Shape, Triangle); Objects and Motion (Cube, Move); Reasoning and Observation (Reasoning, Observe); Organization and Pattern (Organize, Pattern, Weight). Random Guess and Human are reference baselines.

| Model | Avg. | Calendar | Clock | Direction | Location | Place | Angle | Quad | Rectangular | Shape | Triangle | Cube | Move | Reasoning | Observe | Organize | Pattern | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Guess | 29.83 | 33.33 | 32.78 | 25.00 | 29.81 | 33.33 | 31.00 | 27.63 | 29.17 | 31.84 | 29.01 | 28.37 | 29.35 | 33.33 | 29.41 | 30.17 | 31.32 | 33.33 |
| Human | 93.30 | 100.00 | 96.00 | 100.00 | 93.85 | 96.67 | 95.60 | 96.84 | 95.00 | 94.02 | 94.07 | 97.67 | 94.63 | 100.00 | 93.59 | 93.20 | 95.52 | 100.00 |
| Gemini2.0-Flash 🥇 | 49.77 | 100.00 | 70.00 | 20.00 | 57.69 | 66.67 | 70.00 | 68.42 | 53.57 | 61.14 | 70.37 | 44.52 | 68.75 | 40.00 | 35.53 | 74.00 | 46.55 | 100.00 |
| Qwen2.5-VL-72B-Instruct 🥈 | 48.08 | 0.00 | 40.67 | 0.00 | 53.85 | 50.00 | 68.00 | 68.42 | 53.57 | 55.02 | 74.07 | 58.22 | 60.66 | 60.00 | 35.53 | 76.00 | 43.10 | 100.00 |
| LLaVA-OneVision-72B 🥉 | 47.67 | 0.00 | 33.33 | 0.00 | 26.92 | 66.67 | 61.20 | 57.89 | 57.14 | 60.70 | 51.85 | 41.10 | 60.29 | 100.00 | 38.24 | 82.00 | 41.38 | 80.00 |
| Qwen-VL-Max | 47.03 | 0.00 | 46.67 | 0.00 | 42.31 | 66.67 | 74.00 | 52.63 | 42.86 | 54.15 | 66.67 | 56.16 | 60.66 | 50.00 | 35.27 | 68.00 | 39.66 | 100.00 |
| Claude-3.7-Sonnet | 46.63 | 100.00 | 50.00 | 100.00 | 53.85 | 50.00 | 58.00 | 63.16 | 57.14 | 60.70 | 59.26 | 40.41 | 67.28 | 100.00 | 31.27 | 76.40 | 53.45 | 100.00 |
| GPT-4o | 40.29 | 100.00 | 40.00 | 20.00 | 30.77 | 66.67 | 46.00 | 57.89 | 28.57 | 50.22 | 51.85 | 37.67 | 50.37 | 90.00 | 31.27 | 76.00 | 37.93 | 80.00 |
| QVQ-72B-Preview | 39.13 | 100.00 | 43.33 | 0.00 | 46.15 | 83.33 | 58.00 | 42.11 | 46.43 | 44.10 | 62.96 | 36.30 | 48.16 | 50.00 | 28.55 | 78.00 | 48.28 | 100.00 |
| Gemma3-27B-it | 38.02 | 100.00 | 50.00 | 0.00 | 38.46 | 83.33 | 48.40 | 31.58 | 25.00 | 41.92 | 40.74 | 32.88 | 47.79 | 50.00 | 32.82 | 54.00 | 31.03 | 80.00 |
| InternVL2.5-78B | 37.56 | 20.00 | 31.33 | 100.00 | 42.31 | 66.67 | 54.00 | 47.37 | 46.43 | 53.28 | 55.56 | 33.56 | 40.44 | 50.00 | 28.04 | 76.00 | 31.03 | 100.00 |
| Kimi-VL-A3B-Instruct | 37.33 | 0.00 | 46.67 | 0.00 | 30.77 | 83.33 | 44.00 | 47.37 | 39.29 | 43.23 | 33.33 | 34.93 | 44.49 | 50.00 | 31.31 | 58.00 | 36.21 | 0.00 |
| LLaVA-OneVision-7B | 36.63 | 0.00 | 40.00 | 0.00 | 11.54 | 83.33 | 44.00 | 36.84 | 32.14 | 37.99 | 48.15 | 30.82 | 46.69 | 50.00 | 32.56 | 58.00 | 29.31 | 100.00 |
| LLaVA-Interleave-7B | 35.47 | 0.00 | 36.67 | 20.00 | 19.23 | 83.33 | 46.00 | 26.32 | 57.14 | 39.74 | 29.63 | 30.82 | 33.46 | 50.00 | 33.46 | 62.00 | 31.03 | 100.00 |
| GPT-4o-mini | 34.88 | 80.00 | 60.66 | 0.00 | 38.46 | 53.33 | 38.40 | 21.05 | 53.57 | 37.99 | 55.56 | 32.19 | 38.24 | 0.00 | 28.68 | 60.00 | 41.38 | 100.00 |
| Kimi-VL-A3B-Thinking | 34.13 | 100.00 | 26.67 | 100.00 | 30.77 | 33.33 | 48.00 | 36.84 | 28.57 | 49.78 | 33.33 | 30.14 | 41.91 | 0.00 | 25.32 | 68.00 | 27.59 | 100.00 |
| Mistral-Small-3.1-24B | 31.34 | 20.00 | 40.00 | 0.00 | 30.77 | 30.00 | 38.00 | 31.58 | 35.71 | 29.26 | 51.85 | 30.82 | 31.62 | 50.00 | 29.59 | 38.00 | 34.48 | 20.00 |
| Mantis-CLIP | 30.23 | 0.00 | 30.00 | 80.00 | 50.00 | 66.67 | 14.00 | 15.79 | 35.71 | 38.43 | 37.04 | 19.86 | 32.35 | 40.00 | 28.04 | 52.40 | 22.41 | 100.00 |
| Qwen2.5-VL-7B-Instruct | 29.24 | 100.00 | 13.33 | 0.00 | 19.23 | 50.00 | 20.00 | 31.58 | 25.00 | 30.13 | 51.85 | 32.19 | 40.81 | 0.00 | 25.19 | 30.00 | 27.59 | 0.00 |
| Llama-3.2-90B-Vision-Instruct | 25.41 | 20.00 | 24.67 | 100.00 | 11.54 | 16.67 | 26.40 | 31.58 | 32.14 | 26.20 | 22.22 | 27.40 | 25.37 | 0.00 | 25.58 | 12.00 | 29.31 | 20.00 |
| InternVL2.5-8B | 24.71 | 0.00 | 33.33 | 0.00 | 34.62 | 50.00 | 34.00 | 31.58 | 50.00 | 35.81 | 37.04 | 23.29 | 25.74 | 0.00 | 18.99 | 38.00 | 6.90 | 0.00 |
| Phi-3.5-vision-instruct | 22.73 | 0.00 | 23.33 | 100.00 | 19.23 | 66.67 | 16.00 | 15.79 | 28.57 | 27.07 | 33.33 | 22.60 | 22.79 | 0.00 | 21.32 | 34.00 | 12.07 | 0.00 |
| DeepSeek-VL2 | 15.47 | 0.00 | 23.33 | 0.00 | 23.08 | 16.67 | 14.00 | 10.53 | 14.29 | 29.69 | 14.81 | 6.85 | 18.38 | 10.00 | 9.43 | 44.00 | 20.69 | 0.00 |
| Idefics3-8B | 12.91 | 0.00 | 3.33 | 20.00 | 15.38 | 33.33 | 11.60 | 10.53 | 17.86 | 23.14 | 3.70 | 9.59 | 16.91 | 0.00 | 9.69 | 8.40 | 15.52 | 0.00 |
| Emu2-Chat | 6.05 | 0.00 | 13.33 | 0.00 | 3.85 | 0.00 | 4.00 | 10.53 | 10.71 | 12.66 | 3.70 | 8.90 | 6.99 | 0.00 | 3.62 | 0.00 | 3.45 | 20.00 |
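
For reference, scores like these boil down to exact-match accuracy per sub-category plus an overall average. The sketch below illustrates this under an assumed record format (dicts with "category", "prediction", and "answer" fields); it is not VCBench's official evaluation harness.

from collections import defaultdict

# Minimal sketch of the scoring behind the table above: exact-match
# accuracy per sub-category plus a micro-averaged overall score.
# The record format is an assumption, not VCBench's official code.
def score(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            hits[r["category"]] += 1
    per_category = {c: 100.0 * hits[c] / totals[c] for c in totals}
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    return overall, per_category

overall, per_category = score([
    {"category": "Clock", "prediction": "B", "answer": "B"},
    {"category": "Clock", "prediction": "C", "answer": "A"},
    {"category": "Cube", "prediction": "D", "answer": "D"},
])
print(overall, per_category)  # ~66.67, {'Clock': 50.0, 'Cube': 100.0}

Note that the Avg. column appears to be micro-averaged over questions rather than macro-averaged over the 17 sub-columns: the plain mean of Random Guess's category scores is roughly 30.5, not the reported 29.83, as expected when small sub-categories such as Direction carry few questions.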

BibTeX

@misc{wang2025vcbench,
  author    = {Zhikai Wang and Jiashuo Sun and Wenqi Zhang and Zhiqiang Hu and Xin Li and Fan Wang and Deli Zhao},
  title     = {Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency},
  year      = {2025},
  eprint    = {2504.18589},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2504.18589}
}