SeePhys

Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

1Sun Yat-sen University, 2ETH Zurich,
3Huawei Noah’s Ark Lab, 4The University of Hong Kong

Introduction

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains of the physics discipline and incorporates 21 categories of highly heterogeneous diagrams. In contrast to prior work, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-Pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.

2nd AI for Math Workshop at ICML 2025

SeePhys is now open for submissions through the ICML 2025 Challenge on Automated Math Reasoning and Extensions! To evaluate your model, please submit your benchmark results to our website following the official guidelines.

We strongly encourage all participants to concurrently submit their technical reports to the ICML 2025 AI for Math Workshop.


Leaderboard on SeePhys (2,000 samples)

Accuracy scores of LLMs:

# LLMs Mid High BO AO UG SUG MA PhD Total
1 DeepSeek-R1🥇 54.9 46.9 47.7 31.9 49.9 34.2 49.0 41.2 42.2
2 DeepSeek-V3🥈 53.9 42.6 36.4 22.8 45.4 29.7 35.9 37.5 36.0
3 Qwen3-235B-A22B🥉 47.1 33.7 31.8 20.4 41.2 25.1 31.7 30.7 31.1
4 QwQ-32B 47.1 42.2 44.9 15.5 40.0 20.1 32.4 24.0 29.7
5 R1-Distilled-Llama-70B 48.0 41.4 34.6 14.2 31.5 16.0 28.9 25.9 26.9
6 Llama-4-Scout-17B 48.0 36.5 31.8 11.3 28.5 14.2 28.3 26.1 24.8
7 Qwen2.5-72B 41.2 40.2 25.2 8.2 26.8 12.8 18.6 17.8 21.1
8 Gemma3-27B 21.6 36.5 30.8 5.1 23.1 9.1 15.2 11.9 16.9
9 Llama-3.1-8B 26.5 15.7 17.8 3.9 7.6 3.7 10.3 8.4 9.2

Accuracy scores of MLLMs:

# MLLMs Mid High BO AO UG SUG MA PhD Total
1 Gemini-2.5-Pro🥇 69.6 66.7 64.5 46.7 64.2 50.2 53.8 44.2 54.9
2 o4-mini🥈 66.7 61.8 56.1 41.8 53.8 45.7 51.0 53.4 51.9
3 o1🥉 60.8 56.6 50.5 32.5 54.4 40.6 52.4 40.4 45.6
4 Doubao-1.5-pro 70.6 58.2 49.5 29.2 56.6 34.7 40.7 37.5 43.9
5 o3-mini 47.1 46.2 39.3 28.3 47.0 36.1 48.3 42.3 40.3
6 GPT-4.1 51.0 52.6 41.1 17.0 39.7 31.1 42.1 35.6 35.3
7 Claude-3.7-Sonnet 52.9 51.8 43.0 16.7 41.4 26.5 33.8 32.4 34.6
8 Qwen2.5-VL-72B-Inst 61.8 42.2 29.0 10.4 29.9 14.6 18.6 19.4 24.2
9 QVQ-72b-preview 38.2 36.5 30.8 11.3 25.9 14.2 26.2 20.2 22.5
10 GPT-4o 37.3 39.0 34.6 7.5 23.4 15.5 24.1 21.8 21.9
11 Llama-3.2-90B-Vision 21.6 25.7 22.4 3.9 9.3 10.0 12.4 8.9 11.7
12 Qwen2.5-VL-7B-Inst 39.2 25.3 21.5 4.2 8.7 5.9 10.3 7.3 11.6
13 Qwen2.5-VL-3B-Inst 30.4 21.3 13.1 2.9 10.4 7.3 6.2 6.2 9.8
14 Qwen2-VL-7B-Inst 24.5 17.3 14.0 4.4 8.5 4.6 10.3 7.0 9.2
15 LLaVA-NeXT-7B 14.5 12.7 11.2 5.5 13.2 8.2 11.0 9.4 8.7
16 Llama3.2-11B-Vision 23.5 18.5 14.0 4.2 5.4 3.7 4.8 7.5 8.3
17 Phi-4-multimodal 20.6 12.4 12.1 4.4 7.0 5.0 8.3 4.9 7.6
18 InternVL2.5-8B 17.6 12.4 9.3 2.9 5.6 3.2 4.1 5.1 6.2
19 LLaVA-OneVision-7B 20.6 10.8 12.1 2.7 5.4 2.3 6.2 5.4 6.1

Accuracy (%) of different LLMs/MLLMs by knowledge level. Mid: Middle school; High: High school; BO: Beginner Olympiad; AO: Advanced Olympiad; UG: Undergraduate; SUG: Senior undergraduate; MA: Master's; PhD: PhD qualifying exams. The top-ranked models in each table are marked with medal icons.
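
For reference, the per-level scores above can be reproduced from graded model outputs with a simple aggregation. The sketch below assumes each prediction record carries hypothetical `level` and `correct` fields; it is illustrative, not the official evaluation script.

```python
from collections import defaultdict

def per_level_accuracy(records):
    """Aggregate accuracy (%) per knowledge level and overall.

    `records` is an iterable of dicts with hypothetical keys:
      - "level":   one of "Mid", "High", "BO", "AO", "UG", "SUG", "MA", "PhD"
      - "correct": bool, whether the model's final answer was judged correct
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    scores = {lvl: 100.0 * hits[lvl] / totals[lvl] for lvl in totals}
    scores["Total"] = 100.0 * sum(hits.values()) / sum(totals.values())
    return scores

# Toy illustration (not real benchmark data):
demo = [{"level": "Mid", "correct": True}, {"level": "PhD", "correct": False}]
print(per_level_accuracy(demo))  # {'Mid': 100.0, 'PhD': 0.0, 'Total': 50.0}
```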

SeePhys Dataset

Overview


Overview of SeePhys. It encompasses 7 core physics domains and 21 diagram types, spanning the full knowledge spectrum from middle school to the PhD qualifying exam level.

SeePhys comprises 2,000 rigorously validated questions covering a comprehensive range of knowledge levels, from middle school to PhD qualifying exams. The questions span 7 major fields of both classical and modern physics. To assess the extent to which different models rely on visual information for reasoning, we curate two subsets with different levels of visual information enrichment and additionally compile a supplementary set of 2,000 purely visual instances, in which the entire problem statement is rendered as an image. Through meticulous selection of 21 diagram types by domain experts, each problem challenges frontier MLLMs to integrate domain knowledge with visual understanding of physics diagrams (e.g., Feynman diagrams for particle interactions and circuit diagrams for electromagnetism).

You can download the dataset from the Hugging Face Datasets hub.
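
A minimal loading sketch using the `datasets` library is shown below; the dataset ID and split name are assumptions, so please confirm the exact identifiers on the Hugging Face page.

```python
# pip install datasets
from datasets import load_dataset

# NOTE: the dataset ID and split name below are assumptions; check the
# Hugging Face dataset page for the exact identifiers before use.
seephys = load_dataset("SeePhys/SeePhys", split="train")

print(len(seephys))       # expected: 2,000 validated questions
print(seephys[0].keys())  # inspect the available fields
```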

Dataset statistics


Statistics of the SeePhys dataset.

The above figure shows the detailed statistics of SeePhys. The questions span 7 core physics fields and are stratified across 8 knowledge levels, from middle school to PhD qualifying exams. Notably, 18.6% of problems target PhD-level reasoning, while 22.6% represent advanced Olympiad challenges. The benchmark emphasizes multimodal reasoning: 75% of questions are Vision-Essential, meaning diagram interpretation is necessary for solving them (e.g., analyzing Feynman diagrams), while 25% are Vision-Optional, where visuals supplement the text. Questions are language-balanced (1,039 English vs. 961 Chinese), and 88% carry expert-validated multi-step reasoning annotations. Visual diversity is ensured through 21 diagram types (e.g., circuit schematics, free-body diagrams) curated by domain experts. The dataset's composition supports granular evaluation of MLLMs' physics understanding across textual, visual, and reasoning dimensions, as sketched below.
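
As an illustration of how these splits could be tallied programmatically, the sketch below assumes hypothetical metadata fields such as `vision_relevance`, `language`, and `level`; the real column names may differ.

```python
from collections import Counter

def summarize_splits(dataset):
    """Tally the metadata splits described above.

    Assumes each example exposes hypothetical fields "vision_relevance"
    (e.g., "essential" / "optional"), "language" ("en" / "zh"), and "level";
    the real column names may differ.
    """
    return {
        "vision": Counter(ex["vision_relevance"] for ex in dataset),
        "language": Counter(ex["language"] for ex in dataset),
        "level": Counter(ex["level"] for ex in dataset),
    }

# Example with toy rows standing in for dataset examples:
toy = [
    {"vision_relevance": "essential", "language": "en", "level": "PhD"},
    {"vision_relevance": "optional", "language": "zh", "level": "Mid"},
]
print(summarize_splits(toy))
```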

Dataset Examples

Experiment Results

Results on Existing Foundation Models

Error Analysis Examples

BibTeX


      @article{xiang2025seephys,
        title={SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning},
        author={Kun Xiang and Heng Li and Terry Jingchen Zhang and Yinya Huang and Zirong Liu and Peixin Qu and Jixi He and Jiaqi Chen and Yu-Jie Yuan and Jianhua Han and Hang Xu and Hanhui Li and Mrinmaya Sachan and Xiaodan Liang},
        journal={arXiv preprint arXiv:2505.19099},
        year={2025}
      }