LLM-Feynman: Leveraging Large Language Models for Universal Scientific Formula and Theory Discovery

Zhilong Song, Qionghua Zhou, Chunjin Ren, Chongyi Ling, Minggang Ju, Jinlan Wang

arXiv:2503.06512 · cond-mat.mtrl-sci · Published 2025-03-09 · Updated 2025-07-25

Distilling underlying principles from data has historically driven scientific breakthroughs. However, conventional data-driven machine learning often produces complex models that lack interpretability and generalization due to insufficient domain expertise. Here, we present LLM-Feynman, a novel framework that leverages large language models (LLMs) alongside systematic optimization to derive concise, interpretable formulas from data and domain knowledge. Our method integrates automated feature engineering, LLM-guided symbolic regression with self-evaluation, and Monte Carlo tree search to enhance formula discovery and clarity. Embedding domain knowledge simplifies the formulas, while self-evaluation based on this knowledge further reduces prediction errors, surpassing conventional symbolic regression in accuracy and interpretability. LLM-Feynman successfully rediscovered over 90% of fundamental physical formulas and demonstrated its efficacy in key materials science applications: classification of the synthesizability of two-dimensional materials and perovskites, determination of Green's function and screened Coulomb interaction (GW) bandgaps, and prediction of ionic conductivity in lithium solid-state electrolytes. By going beyond mere data fitting through the integration of deep domain knowledge, LLM-Feynman offers a transformative paradigm for the automated discovery of generalizable scientific formulas and theories across disciplines.
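The trade-off between accuracy and conciseness that the abstract describes can be illustrated with a toy example. The following is a minimal sketch, not the LLM-Feynman implementation: the candidate formulas are hard-coded stand-ins for what an LLM might propose, and the names `rmse`, `complexity`, `rank`, the operator-count complexity measure, and the `penalty` weight are all assumptions made for illustration. It ranks candidates on synthetic data generated from F = m * a.

```python
import math

def rmse(formula, rows):
    """Root-mean-square error of formula(m, a) against the target column."""
    squared = [(formula(m, a) - f) ** 2 for m, a, f in rows]
    return math.sqrt(sum(squared) / len(squared))

def complexity(expr):
    """Crude complexity proxy (assumption): count operator characters."""
    return sum(expr.count(op) for op in "+-*/")

def rank(candidates, rows, penalty=0.01):
    """Sort candidates by error plus a small complexity penalty (lower is better)."""
    scored = []
    for expr, fn in candidates.items():
        scored.append((rmse(fn, rows) + penalty * complexity(expr), expr))
    return sorted(scored)

# Synthetic data generated from the known law F = m * a.
ROWS = [(m, a, m * a) for m in (1.0, 2.0, 3.0) for a in (0.5, 1.5, 2.5)]

# Hypothetical candidates standing in for LLM proposals.
CANDIDATES = {
    "m * a": lambda m, a: m * a,
    "m + a": lambda m, a: m + a,
    "m * a**2": lambda m, a: m * a ** 2,
    "m * a + a - a": lambda m, a: m * a + a - a,
}

best_score, best_expr = rank(CANDIDATES, ROWS)[0]
print(best_expr)
```

Here the redundant candidate `m * a + a - a` fits the data as well as `m * a` but loses on the complexity penalty, which is the sense in which a simpler formula is preferred when errors are comparable.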

Topics: Generative Models & Discovery

Tags: symbolic-regression

arXiv categories: cond-mat.mtrl-sci
