Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Performance of vision-language models in the multi-image setting across six task categories: Time and Calendar, Space and Location, Geometry and Shapes, Objects and Motion, Reasoning and Observation, and Organization and Pattern.

Models	Avg.	Time and Calendar		Space and Location			Geometry and Shapes					Objects and Motion		Reasoning and Observation		Organization and Pattern
Models	Avg.	Calendar	Clock	Direction	Location	Place	Angle	Quad	Rectangular	Shape	Triangle	Cube	Move	Reasoning	Observe	Organize	Pattern	Weight
Random Guess	29.83	33.33	32.78	25.00	29.81	33.33	31.00	27.63	29.17	31.84	29.01	28.37	29.35	33.33	29.41	30.17	31.32	33.33
Human	93.30	100	96.00	100	93.85	96.67	95.60	96.84	95.00	94.02	94.07	97.67	94.63	100	93.59	93.20	95.52	100
Models
Gemini2.0-Flash 🥇	49.77	100.00	70.00	20.00	57.69	66.67	70.00	68.42	53.57	61.14	70.37	44.52	68.75	40.00	35.53	74.00	46.55	100.00
Qwen2.5-VL-72B-Instruct 🥈	48.08	0.00	40.67	0.00	53.85	50.00	68.00	68.42	53.57	55.02	74.07	58.22	60.66	60.00	35.53	76.00	43.10	100.00
LLaVA-OneVision-72B 🥉	47.67	0.00	33.33	0.00	26.92	66.67	61.20	57.89	57.14	60.70	51.85	41.10	60.29	100.00	38.24	82.00	41.38	80.00
Qwen-VL-Max	47.03	0.00	46.67	0.00	42.31	66.67	74.00	52.63	42.86	54.15	66.67	56.16	60.66	50.00	35.27	68.00	39.66	100.00
Claude-3.7-Sonnet	46.63	100.00	50.00	100.00	53.85	50.00	58.00	63.16	57.14	60.70	59.26	40.41	67.28	100.00	31.27	76.40	53.45	100.00
GPT-4o	40.29	100.00	40.00	20.00	30.77	66.67	46.00	57.89	28.57	50.22	51.85	37.67	50.37	90.00	31.27	76.00	37.93	80.00
QVQ-72B-Preview	39.13	100.00	43.33	0.00	46.15	83.33	58.00	42.11	46.43	44.10	62.96	36.30	48.16	50.00	28.55	78.00	48.28	100.00
Gemma3-27B-it	38.02	100.00	50.00	0.00	38.46	83.33	48.40	31.58	25.00	41.92	40.74	32.88	47.79	50.00	32.82	54.00	31.03	80.00
InternVL2.5-78B	37.56	20.00	31.33	100.00	42.31	66.67	54.00	47.37	46.43	53.28	55.56	33.56	40.44	50.00	28.04	76.00	31.03	100.00
Kimi-VL-A3B-Instruct	37.33	0.00	46.67	0.00	30.77	83.33	44.00	47.37	39.29	43.23	33.33	34.93	44.49	50.00	31.31	58.00	36.21	0.00
LLaVA-OneVision-7B	36.63	0.00	40.00	0.00	11.54	83.33	44.00	36.84	32.14	37.99	48.15	30.82	46.69	50.00	32.56	58.00	29.31	100.00
LLaVA-Interleave-7B	35.47	0.00	36.67	20.00	19.23	83.33	46.00	26.32	57.14	39.74	29.63	30.82	33.46	50.00	33.46	62.00	31.03	100.00
GPT-4o-mini	34.88	80.00	60.66	0.00	38.46	53.33	38.40	21.05	53.57	37.99	55.56	32.19	38.24	0.00	28.68	60.00	41.38	100.00
Kimi-VL-A3B-Thinking	34.13	100.00	26.67	100.00	30.77	33.33	48.00	36.84	28.57	49.78	33.33	30.14	41.91	0.00	25.32	68.00	27.59	100.00
Mistral-Small-3.1-24B	31.34	20.00	40.00	0.00	30.77	30.00	38.00	31.58	35.71	29.26	51.85	30.82	31.62	50.00	29.59	38.00	34.48	20.00
Mantis-CLIP	30.23	0.00	30.00	80.00	50.00	66.67	14.00	15.79	35.71	38.43	37.04	19.86	32.35	40.00	28.04	52.40	22.41	100.00
Qwen2.5-VL-7B-Instruct	29.24	100.00	13.33	0.00	19.23	50.00	20.00	31.58	25.00	30.13	51.85	32.19	40.81	0.00	25.19	30.00	27.59	0.00
Llama-3.2-90B-Vision-Instruct	25.41	20.00	24.67	100.00	11.54	16.67	26.40	31.58	32.14	26.20	22.22	27.40	25.37	0.00	25.58	12.00	29.31	20.00
InternVL2.5-8B	24.71	0.00	33.33	0.00	34.62	50.00	34.00	31.58	50.00	35.81	37.04	23.29	25.74	0.00	18.99	38.00	6.90	0.00
Phi-3.5-vision-instruct	22.73	0.00	23.33	100.00	19.23	66.67	16.00	15.79	28.57	27.07	33.33	22.60	22.79	0.00	21.32	34.00	12.07	0.00
DeepSeek-VL2	15.47	0.00	23.33	0.00	23.08	16.67	14.00	10.53	14.29	29.69	14.81	6.85	18.38	10.00	9.43	44.00	20.69	0.00
Idefics3-8B	12.91	0.00	3.33	20.00	15.38	33.33	11.60	10.53	17.86	23.14	3.70	9.59	16.91	0.00	9.69	8.40	15.52	0.00
Emu2-Chat	6.05	0.00	13.33	0.00	3.85	0.00	4.00	10.53	10.71	12.66	3.70	8.90	6.99	0.00	3.62	0.00	3.45	20.00
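
As a rough reading aid for the table above, the following sketch compares each model's Avg. score with the Random Guess and Human rows. The numbers are copied from the Avg. column; the names `avg_scores` and `below_baseline` are hypothetical helpers introduced for illustration and are not part of any released benchmark code.

```python
# Avg. column copied from the multi-image results table above.
RANDOM_GUESS = 29.83
HUMAN = 93.30

avg_scores = {
    "Gemini2.0-Flash": 49.77,
    "Qwen2.5-VL-72B-Instruct": 48.08,
    "LLaVA-OneVision-72B": 47.67,
    "Qwen-VL-Max": 47.03,
    "Claude-3.7-Sonnet": 46.63,
    "GPT-4o": 40.29,
    "QVQ-72B-Preview": 39.13,
    "Gemma3-27B-it": 38.02,
    "InternVL2.5-78B": 37.56,
    "Kimi-VL-A3B-Instruct": 37.33,
    "LLaVA-OneVision-7B": 36.63,
    "LLaVA-Interleave-7B": 35.47,
    "GPT-4o-mini": 34.88,
    "Kimi-VL-A3B-Thinking": 34.13,
    "Mistral-Small-3.1-24B": 31.34,
    "Mantis-CLIP": 30.23,
    "Qwen2.5-VL-7B-Instruct": 29.24,
    "Llama-3.2-90B-Vision-Instruct": 25.41,
    "InternVL2.5-8B": 24.71,
    "Phi-3.5-vision-instruct": 22.73,
    "DeepSeek-VL2": 15.47,
    "Idefics3-8B": 12.91,
    "Emu2-Chat": 6.05,
}


def below_baseline(scores, baseline):
    """Return the names of models scoring under the baseline, lowest first."""
    return sorted(
        (name for name, score in scores.items() if score < baseline),
        key=scores.get,
    )


below = below_baseline(avg_scores, RANDOM_GUESS)
best_model = max(avg_scores, key=avg_scores.get)
gap_to_human = round(HUMAN - avg_scores[best_model], 2)

print(f"Best model: {best_model} ({avg_scores[best_model]:.2f})")
print(f"Gap between best model and human average: {gap_to_human:.2f} points")
print(f"Models below random guess ({RANDOM_GUESS}): {len(below)} of {len(avg_scores)}")
for name in below:
    print(f"  {name}: {avg_scores[name]:.2f}")
```

Only the Avg. column is used here; the per-category columns could be added to the dictionary in the same way if a finer comparison is wanted.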

VC-Bench

Comparison of VCBench and Other Visual Math Benchmarks

Visual samples from our VC-Bench

Main Results

BibTeX