PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

  • Xinyu Zhang
  • Yuxuan Dong
  • Yanrui Wu
  • Jiaxing Huang
  • Chengyou Jia
  • Basura Fernando
  • Mike Zheng Shou
  • Lingling Zhang
  • Jun Liu
  • Xi'an Jiaotong University
  • Ministry of Education Key Laboratory of Intelligent Networks and Network Security
  • Agency for Science, Technology and Research, Singapore
  • National University of Singapore
  • Shaanxi Province Key Laboratory of Big Data Knowledge Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 citations (Scopus)

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logic reasoning. However, current evaluations overlook physics-based reasoning, a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identify four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason/.

Original language: English
Title of host publication: Long Papers
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher: Association for Computational Linguistics (ACL)
Pages: 16593-16615
Number of pages: 23
ISBN (Electronic): 9798891762510
Publication status: Published - 2025
Event: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume: 1
ISSN (Print): 0736-587X

Conference

Conference: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/Territory: Austria
City: Vienna
Period: 27/07/25 to 1/08/25
