Benchmark of 2,500 English questions (1,000 text-only + 1,500 multimodal). Metrics: Answer Correctness (AC) and Process Score (PS), reported per subset and overall (higher is better); the Overall columns combine the two subsets (see the sketch after the table). Type: VLM = vision-language model, UM = unified model, LLM = text-only language model; the Thinking column marks models run with explicit reasoning enabled.

| # | Model | Link | Version | #Params | Type | Thinking | Overall (AC) | Overall (PS) | Text (AC) | Text (PS) | Multimodal (AC) | Multimodal (PS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Qwen3-VL-235B-A22B-Thinking | Link | - | 235B | VLM | ✓ | 66.8 | 81.0 | 58.9 | 77.4 | 72.1 | 83.4 |
| - | Qwen3-VL-235B-A22B-Instruct | Link | - | 235B | VLM | ✗ | 65.0 | 80.1 | 59.4 | 77.8 | 68.8 | 81.6 |
| - | Gemini-2.5-Pro | Link | - | - | VLM | ✓ | 64.7 | 80.8 | 58.7 | 77.9 | 68.7 | 82.8 |
| - | Gemini-2.5-Flash | Link | 2025-06-17 | - | VLM | ✓ | 60.5 | 78.4 | 57.0 | 77.5 | 62.9 | 79.0 |
| - | o3 | Link | 2025-04-16 | - | VLM | ✓ | 59.3 | 76.4 | 52.9 | 72.9 | 63.7 | 78.6 |
| - | Seed-1.6-Thinking | Link | 2025-06-15 | - | VLM | ✓ | 58.4 | 75.2 | 53.0 | 73.0 | 62.0 | 76.6 |
| - | Nano Banana | Link | 2025-08-26 | - | UM | ✗ | 53.4 | 73.8 | 49.1 | 72.3 | 56.3 | 74.7 |
| - | Gemini-2.5-Flash-No-Thinking | Link | 2025-06-17 | - | VLM | ✗ | 52.3 | 73.7 | 44.6 | 70.9 | 57.5 | 75.5 |
| - | GLM-4.5V | Link | - | 108B | VLM | ✓ | 49.6 | 69.7 | 48.0 | 70.5 | 50.6 | 69.1 |
| - | MiMo-VL-7B-RL | Link | 2508 | 7B | VLM | ✓ | 48.3 | 68.8 | 43.5 | 68.4 | 51.3 | 69.0 |
| - | InternVL-3.5-8B | Link | - | 8B | VLM | ✓ | 40.8 | 62.8 | 38.5 | 64.0 | 42.2 | 62.0 |
| - | GPT-4.1-mini | Link | - | - | VLM | ✗ | 33.3 | 60.0 | 33.3 | 62.0 | 33.3 | 58.6 |
| - | GLM-4.1V-9B | Link | - | 9B | VLM | ✓ | 29.0 | 53.4 | 27.8 | 54.4 | 29.9 | 52.7 |
| - | Claude-Sonnet-4 | Link | 2025-05-23 | - | VLM | ✗ | 28.1 | 56.4 | 31.5 | 60.9 | 25.8 | 53.4 |
| - | GPT-4.1 | Link | - | - | VLM | ✗ | 26.0 | 53.9 | 26.6 | 56.5 | 25.6 | 52.2 |
| - | CodePlot-CoT | Link | - | 32B | VLM | ✗ | 22.1 | 47.0 | 31.6 | 53.8 | 15.8 | 42.4 |
| - | Gemini-2.0-Flash | Link | - | - | VLM | ✗ | 20.6 | 50.7 | 24.1 | 56.1 | 18.3 | 47.0 |
| - | Keye-VL-1.5 | Link | - | 8B | VLM | ✗ | 17.3 | 38.2 | 20.2 | 44.4 | 15.4 | 34.0 |
| - | Gemma-3 | Link | - | 27B | VLM | ✗ | 16.1 | 44.8 | 19.2 | 50.8 | 14.1 | 40.8 |
| - | Qwen-2.5-VL-72B | Link | - | 72B | VLM | ✗ | 13.7 | 40.8 | 15.3 | 44.6 | 12.7 | 38.2 |
| - | Bagel-Zebra-CoT | Link | - | 7B | UM | ✗ | 10.1 | 34.1 | 13.9 | 41.5 | 7.6 | 29.1 |
| - | Qwen-2.5-VL-32B | Link | - | 32B | VLM | ✗ | 10.0 | 33.7 | 10.6 | 36.9 | 9.6 | 31.5 |
| - | GPT-4.1-nano | Link | - | - | VLM | ✗ | 9.1 | 38.5 | 13.1 | 45.9 | 6.4 | 33.6 |
| - | InternVL-3.5-8B-No-Thinking | Link | - | 8B | VLM | ✗ | 7.9 | 31.4 | 9.2 | 35.6 | 7.0 | 28.6 |
| - | Bagel | Link | - | 7B | UM | ✗ | 7.6 | 27.6 | 8.5 | 32.9 | 7.0 | 24.0 |
| - | Qwen-2.5-VL-3B | Link | - | 3B | VLM | ✗ | 5.3 | 27.5 | 7.9 | 33.4 | 3.6 | 23.6 |
| - | GPT-4o | Link | 2024-11-20 | - | VLM | ✗ | 4.3 | 30.4 | 5.7 | 34.6 | 3.4 | 27.6 |
| - | Qwen-2.5-VL-7B | Link | - | 7B | VLM | ✗ | 3.0 | 13.8 | 4.5 | 18.0 | 2.0 | 11.0 |
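
The Overall columns above are consistent with a question-count-weighted average of the two subsets (1,000 text + 1,500 multimodal questions). A minimal sketch of that aggregation, assuming this is the weighting used:

```python
# Minimal sketch: combine subset scores into an overall score, assuming the
# Overall columns are the question-count-weighted average of the two subsets.
N_TEXT, N_MULTIMODAL = 1000, 1500  # question counts from the benchmark

def overall(text_score: float, multimodal_score: float) -> float:
    """Weighted average of subset scores by number of questions."""
    return (N_TEXT * text_score + N_MULTIMODAL * multimodal_score) / (N_TEXT + N_MULTIMODAL)

# Example: Qwen3-VL-235B-A22B-Thinking, AC column.
print(f"{overall(58.9, 72.1):.1f}")  # 66.8, matching the Overall (AC) cell
```
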
Text-only LLM baseline, evaluated on the 1,000 text questions only:

| # | Model | Link | Version | #Params | Type | Thinking | Text (AC) | Text (PS) |
|---|---|---|---|---|---|---|---|---|
| - | DeepSeek-R1 | Link | - | 671B | LLM | ✓ | 49.5 | 69.9 |