Benchmark of 2,500 English questions (1,000 text-only + 1,500 multimodal). Metrics: Answer Correctness (AC) and Process Score (PS), reported per subset and overall (higher is better); the Overall columns combine the two subsets (see the sketch after the table). Type: VLM = vision-language model, UM = unified model, LLM = text-only language model; the Thinking column marks models run with explicit reasoning enabled.

| # | Model | Link | Version | #Params | Type | Thinking | Overall (AC) | Overall (PS) | Text (AC) | Text (PS) | Multimodal (AC) | Multimodal (PS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Qwen3-VL-235B-A22B-Thinking | Link | - | 235B | VLM | ✓ | 66.8 | 81.0 | 58.9 | 77.4 | 72.1 | 83.4 |
| - | Qwen3-VL-235B-A22B-Instruct | Link | - | 235B | VLM | ✗ | 65.0 | 80.1 | 59.4 | 77.8 | 68.8 | 81.6 |
| - | Gemini-2.5-Pro | Link | - | - | VLM | ✓ | 64.7 | 80.8 | 58.7 | 77.9 | 68.7 | 82.8 |
| - | Gemini-2.5-Flash | Link | 2025-06-17 | - | VLM | ✓ | 60.5 | 78.4 | 57.0 | 77.5 | 62.9 | 79.0 |
| - | o3 | Link | 2025-04-16 | - | VLM | ✓ | 59.3 | 76.4 | 52.9 | 72.9 | 63.7 | 78.6 |
| - | Seed-1.6-Thinking | Link | 2025-06-15 | - | VLM | ✓ | 58.4 | 75.2 | 53.0 | 73.0 | 62.0 | 76.6 |
| - | Nano Banana | Link | 2025-08-26 | - | UM | ✗ | 53.4 | 73.8 | 49.1 | 72.3 | 56.3 | 74.7 |
| - | Gemini-2.5-Flash-No-Thinking | Link | 2025-06-17 | - | VLM | ✗ | 52.3 | 73.7 | 44.6 | 70.9 | 57.5 | 75.5 |
| - | GLM-4.5V | Link | - | 108B | VLM | ✓ | 49.6 | 69.7 | 48.0 | 70.5 | 50.6 | 69.1 |
| - | MiMo-VL-7B-RL | Link | 2508 | 7B | VLM | ✓ | 48.3 | 68.8 | 43.5 | 68.4 | 51.3 | 69.0 |
| - | InternVL-3.5-8B | Link | - | 8B | VLM | ✓ | 40.8 | 62.8 | 38.5 | 64.0 | 42.2 | 62.0 |
| - | GPT-4.1-mini | Link | - | - | VLM | ✗ | 33.3 | 60.0 | 33.3 | 62.0 | 33.3 | 58.6 |
| - | GLM-4.1V-9B | Link | - | 9B | VLM | ✓ | 29.0 | 53.4 | 27.8 | 54.4 | 29.9 | 52.7 |
| - | Claude-Sonnet-4 | Link | 2025-05-23 | - | VLM | ✗ | 28.1 | 56.4 | 31.5 | 60.9 | 25.8 | 53.4 |
| - | GPT-4.1 | Link | - | - | VLM | ✗ | 26.0 | 53.9 | 26.6 | 56.5 | 25.6 | 52.2 |
| - | CodePlot-CoT | Link | - | 32B | VLM | ✗ | 22.1 | 47.0 | 31.6 | 53.8 | 15.8 | 42.4 |
| - | Gemini-2.0-Flash | Link | - | - | VLM | ✗ | 20.6 | 50.7 | 24.1 | 56.1 | 18.3 | 47.0 |
| - | Keye-VL-1.5 | Link | - | 8B | VLM | ✗ | 17.3 | 38.2 | 20.2 | 44.4 | 15.4 | 34.0 |
| - | Gemma-3 | Link | - | 27B | VLM | ✗ | 16.1 | 44.8 | 19.2 | 50.8 | 14.1 | 40.8 |
| - | Qwen-2.5-VL-72B | Link | - | 72B | VLM | ✗ | 13.7 | 40.8 | 15.3 | 44.6 | 12.7 | 38.2 |
| - | Bagel-Zebra-CoT | Link | - | 7B | UM | ✗ | 10.1 | 34.1 | 13.9 | 41.5 | 7.6 | 29.1 |
| - | Qwen-2.5-VL-32B | Link | - | 32B | VLM | ✗ | 10.0 | 33.7 | 10.6 | 36.9 | 9.6 | 31.5 |
| - | GPT-4.1-nano | Link | - | - | VLM | ✗ | 9.1 | 38.5 | 13.1 | 45.9 | 6.4 | 33.6 |
| - | InternVL-3.5-8B-No-Thinking | Link | - | 8B | VLM | ✗ | 7.9 | 31.4 | 9.2 | 35.6 | 7.0 | 28.6 |
| - | Bagel | Link | - | 7B | UM | ✗ | 7.6 | 27.6 | 8.5 | 32.9 | 7.0 | 24.0 |
| - | Qwen-2.5-VL-3B | Link | - | 3B | VLM | ✗ | 5.3 | 27.5 | 7.9 | 33.4 | 3.6 | 23.6 |
| - | GPT-4o | Link | 2024-11-20 | - | VLM | ✗ | 4.3 | 30.4 | 5.7 | 34.6 | 3.4 | 27.6 |
| - | Qwen-2.5-VL-7B | Link | - | 7B | VLM | ✗ | 3.0 | 13.8 | 4.5 | 18.0 | 2.0 | 11.0 |
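
The Overall columns above are consistent with a question-count-weighted average of the two subsets (1,000 text + 1,500 multimodal questions). A minimal sketch of that aggregation, assuming this is the weighting used:

```python
# Minimal sketch: combine subset scores into an overall score, assuming the
# Overall columns are the question-count-weighted average of the two subsets.
N_TEXT, N_MULTIMODAL = 1000, 1500  # question counts from the benchmark

def overall(text_score: float, multimodal_score: float) -> float:
    """Weighted average of subset scores by number of questions."""
    return (N_TEXT * text_score + N_MULTIMODAL * multimodal_score) / (N_TEXT + N_MULTIMODAL)

# Example: Qwen3-VL-235B-A22B-Thinking, AC column.
print(f"{overall(58.9, 72.1):.1f}")  # 66.8, matching the Overall (AC) cell
```
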
Text-only LLM baseline, evaluated on the 1,000 text questions only:

| # | Model | Link | Version | #Params | Type | Thinking | Text (AC) | Text (PS) |
|---|---|---|---|---|---|---|---|---|
| - | DeepSeek-R1 | Link | - | 671B | LLM | ✓ | 49.5 | 69.9 |