Evaluating GPT-4 and other LLMs on Computational Physics Challenges

Overview

This blog examines the performance of large language models (LLMs) on graduate-level to research-level computational physics problems. The underlying study finds that GPT-4 is the strongest LLM evaluated for this task, but that it still cannot autonomously generate code solutions at the graduate/research level.

Key Findings

  • Key observations reveal that the performance of LLMs may be limited by the availability of open-source packages and code examples for pre-training.
  • The research provides only a preliminary and limited view of the general computational physics capabilities of LLMs, and some of the identified failure modes may reflect limitations of the physics world models learned by GPT-4.
  • Future work should focus on improving LLM performance on computational physics problems, potentially by leveraging more advanced evaluation methods and more challenging problems.
  • Future work could also extend to general simulation efforts in computational biology, chemistry, or engineering, and to fine-tuning LLMs on numerical simulation tasks to elicit their capabilities in computational science.
  • The findings have practical implications for building AI systems that can simulate and predict the physical world, and that can assist in computational physics and other scientific domains; the identified limitations could inform more effective training strategies and evaluation methodologies.

Evaluation Of LLMs In Computational Physics

The blog evaluates the capabilities of Large Language Models (LLMs) in solving PhD-level to research-level computational physics problems. The author uses well-documented and widely-used packages such as REBOUND, MESA, Dedalus, and SciPy to elicit coding capabilities in the physics and astrophysics domains. The evaluation focuses on the ability of LLMs to generate code to simulate complex physical scenarios, which requires a combination of specialized skills in coding and physics.


The author evaluates LLM performance using soft metrics, including counts of lines that contain different types of problems (coding errors, physics errors, and lines failing necessity or sufficiency) and a more educational Pass-Fail metric that captures the salient physical ingredients of the problem. The results show that the current state-of-the-art LLM, GPT-4, fails most of the problems, although about 40% of the solutions could plausibly earn a passing grade.


The article identifies several failure modes of GPT-4 in the computational physics domain, including poor physical unit handling, poor code versioning, the tendency to hallucinate plausible sub-modules, a lack of physical justification for global run parameters, and an inability to define steady-state or stopping conditions reliably.

Evaluating Physics Problems For LLMs

The benchmark consists of 47 original physics problems designed to test the code-generation capabilities of large language models (LLMs). The problems cover four sub-domains: stellar physics with MESA, celestial mechanics with REBOUND, 1D fluid dynamics with Dedalus, and non-linear dynamics with SciPy. The problems are designed to elicit physics generalization capabilities from LLMs and are crafted to minimize the risk of data contamination.

The author used a simple prompt formulation that approximates how the problem would be posed in an academic setting. The prompt specifies the physics problem and requests a full code solution using a specific software package and version.
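For illustration, a prompt in this style might read as follows (a hypothetical example written in the spirit of the description, not one of the study's actual prompts):

```text
Using the REBOUND N-body package (version 3.x) in Python, write a complete,
runnable script that simulates the Sun-Jupiter-Saturn system for 10,000 years
and plots the eccentricity of Saturn as a function of time. Include all
necessary imports, unit choices, and post-processing code.
```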

To evaluate the code solutions generated by LLMs, the study used soft metrics, including:

  • Correctness of code syntax, logic, and semantics
  • Correctness of physics reasoning behind the code solution
  • Necessity and sufficiency of code lines as solution elements
  • Overall academic-like grade for the solution (Fail, Pass, or Pass+)

Key observations revealed that none of the code segments generated by GPT-4 were fully satisfactory solutions to the problems posed. The solutions nevertheless had partial value: about 70-90% of code lines were necessary, sufficient, and correct. The solutions typically contained 25-50 lines of code and exhibited several physics and coding errors, along with some unnecessary or insufficient code lines.

Flaws In GPT-4’s Physics Code Generation

GPT-4 exhibits several flaws and errors in generating physics code, including:

  • Poor performance in dealing with physical units, including errors in conversions and confusion between code units and physical units.
  • Poor handling of code versioning, arbitrarily picking versions and inconsistently invoking features.
  • Tendency to hallucinate plausible but non-existent sub-modules and functions, as well as physical formulas.
  • Lack of physical justification for global run parameters, such as simulation time and time-stepping choices.
  • Inability to define steady-state or stopping conditions reliably for time-dependent simulations.
  • Tendency to use common approximate equations beyond their physical applicability range.
These flaws and errors are demonstrated through code excerpts from the different problem classes, including Dedalus, REBOUND, MESA, and SciPy; the unit-handling failure mode is sketched in the example below.
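As an illustration of the unit-handling failure mode, here is a minimal REBOUND sketch (hypothetical, not an excerpt from the study). REBOUND defaults to code units with G = 1, so silently mixing SI masses into a solar-system setup produces meaningless dynamics unless a unit system is declared:

```python
import rebound

# Failure mode: mixed unit systems. REBOUND defaults to code units with G = 1,
# so combining a stellar mass in solar masses with a planetary mass in
# kilograms runs without error but produces meaningless physics:
#   bad = rebound.Simulation()
#   bad.add(m=1.0)               # intended as 1 solar mass (code units)
#   bad.add(m=5.97e24, a=1.0)    # Earth's mass in kg: inconsistent units

# Correct handling: declare a consistent unit system up front so that G and
# all masses, lengths, and times are interpreted coherently.
sim = rebound.Simulation()
sim.units = ('yr', 'AU', 'Msun')   # sets G appropriately for these units
sim.add(m=1.0)                     # Sun: 1 solar mass
sim.add(m=3.0e-6, a=1.0)           # Earth: ~3e-6 solar masses at 1 AU
sim.integrate(100.0)               # integrate for 100 years
print(sim.particles[1].a, sim.particles[1].e)  # check the orbit stayed sane
```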

Future Directions of LLMs In Computational Physics

The study highlights the potential of LLMs in computational physics but also identifies areas for improvement.
Future work can focus on:

  1. Conducting a more extensive and systematic study to understand LLM errors in specific computational physics domains.
  2. Exploring problems that go beyond the narrow ones covered in the study, such as multi-dimensionality, radiation transport, and other common ingredients in research-level simulations.
  3. Correlating LLM performance with the availability of open-source packages and code examples for pre-training.
  4. Improving the prompt strategy to elicit better performance from LLMs.
  5. Extending the benchmark to more conversational, human-in-the-loop settings or with the availability of extra tools for more agentic setups.
  6. Expanding to cover domains such as solid-state, quantum physics, or relativistic physics, and potentially to computational biology, chemistry, or engineering.
  7. Constructing training datasets or reward functions for reinforcement learning to fine-tune LLMs for computational science tasks.
  8. Exploring in-context learning for improved performance on computational science tasks.


Computational Physics Problems

The text presents a series of computational physics problems that require solutions using the MESA stellar physics code and the Dedalus PDE solver Python package.

The problems involve various scenarios, including:

  • Stellar evolution: calculating the lifetime of a one-solar-mass star made entirely of helium, the convective mass of a 3.14-solar-mass star, and the minimal stellar mass needed to disperse a nebula.
  • Diffusion and advection: solving one-dimensional diffusion and advection problems with various initial conditions, boundary conditions, and source terms.
  • Acoustic waves: simulating non-linear one-dimensional acoustic wave problems with polynomial wave velocity.
  • Hyperdiffusion: comparing the diffusion of a field using 2nd-order and 4th-order hyperdiffusion problems.
  • Antidiffusion: building a one-dimensional anti-diffusion PDE system that reaches a steady state with a constant loss term.

Each problem requires providing complete inputs to the MESA code or the Dedalus PDE solver, together with Python data-analysis and post-processing code, to generate a complete final solution. Further problems use additional Python packages, including SciPy and REBOUND, and involve solving partial differential equations (PDEs) and ordinary differential equations (ODEs), as well as running N-body simulations; a minimal sketch of the diffusion problem class appears below.

Each problem is accompanied by a fully working code solution using the respective Python package.
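To give a flavor of the diffusion/advection problem class, here is a minimal sketch that solves a 1D diffusion equation via the method of lines. The benchmark formulates such problems in Dedalus; this SciPy stand-in is only illustrative, and the grid size, diffusivity, and initial condition are arbitrary choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

# 1D diffusion: du/dt = D * d2u/dx2 on [0, 1], with u = 0 at both boundaries.
D = 0.01                         # diffusivity (arbitrary illustrative value)
N = 101                          # number of grid points
x = np.linspace(0.0, 1.0, N)
dx = x[1] - x[0]

def rhs(t, u):
    """Second-order central differences; Dirichlet boundaries held at zero."""
    dudt = np.zeros_like(u)
    dudt[1:-1] = D * (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
    return dudt

u0 = np.exp(-((x - 0.5) ** 2) / 0.005)   # Gaussian initial condition
sol = solve_ivp(rhs, (0.0, 1.0), u0, method='BDF',
                t_eval=np.linspace(0.0, 1.0, 11))

print(sol.y[:, -1].max())   # peak amplitude after diffusion has acted
```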

Code Analysis And Performance

The code provided is a Python script that uses the galpy package to estimate the stability timescale of a massive star orbiting at 20 parsecs from the center of the Milky Way as a function of the gravitational potential used. The script uses three different gravitational potentials: MWPotential2014, NFWPotential, and HernquistPotential.

The code is well-structured and easy to follow, with clear comments explaining the physics and coding choices made. The script defines the initial conditions of the star, defines the potentials, and then loops over the potentials to integrate the orbit and estimate the stability timescale.
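A minimal sketch of such a script is shown below. It is a reconstruction in the spirit of the description, not the study's actual code: the tangential velocity, halo parameters, and integration span are illustrative assumptions, and stability is measured with the 10% radial-drift criterion discussed later in the text.

```python
import numpy as np
from astropy import units as u
from galpy.orbit import Orbit
from galpy.potential import MWPotential2014, NFWPotential, HernquistPotential

# Initial conditions [R, vR, vT, z, vz, phi] for a star 20 pc from the
# Galactic center; the tangential velocity is an illustrative guess.
init = [20.0 * u.pc, 0.0 * u.km / u.s, 50.0 * u.km / u.s,
        0.0 * u.pc, 0.0 * u.km / u.s, 0.0 * u.rad]

potentials = {
    'MWPotential2014': MWPotential2014,
    'NFWPotential': NFWPotential(amp=1e12 * u.Msun, a=16.0 * u.kpc),
    'HernquistPotential': HernquistPotential(amp=1e12 * u.Msun, a=10.0 * u.kpc),
}

ts = np.linspace(0.0, 1.0, 10001) * u.Gyr

for name, pot in potentials.items():
    orbit = Orbit(init)
    orbit.integrate(ts, pot)
    R = orbit.R(ts)                          # radial distance along the orbit
    drifted = np.abs(R - R[0]) > 0.1 * R[0]  # "unstable": 10% radial change
    t_stab = ts[np.argmax(drifted)] if drifted.any() else ts[-1]
    print(name, 'stability timescale ~', t_stab)
```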

The output of the code shows the stability timescale for each potential, with the MWPotential2014 having the shortest stability timescale and the NFWPotential having the longest.

The key theme of this code is how the choice of gravitational potential affects galpy's estimate of the stability timescale of a massive star orbiting the Milky Way, with the differences in timescales arising from the different potentials used.


Galactic Potentials And Orbital Stability

The stability of an orbit depends on the balance between the gravitational pull of the mass distribution and the centrifugal force due to the star’s motion. The NFW profile has a cuspy central region, leading to a stronger gravitational pull near the center, which can destabilize orbits and lead to a shorter stability timescale.
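For reference, the NFW density profile (a standard result, not quoted from the text) is

\[
\rho_{\mathrm{NFW}}(r) = \frac{\rho_0}{\left(r/r_s\right)\left(1 + r/r_s\right)^2},
\]

which diverges as 1/r at small radii; this central "cusp" is what produces the stronger gravitational pull near the center.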

The MWPotential2014 is a model of the Milky Way’s gravitational potential, including contributions from the bulge, disk, and halo. The sinusoidal potential is a highly simplified and non-physical model of a galaxy’s gravitational potential, with a periodic shape that can cause rapid deviations from the initial conditions. The stability timescale of an orbit in a given potential depends on the shape of the potential and the initial conditions of the orbit.

The definition of orbital stability as “the time it takes for the radial distance to change by 10%” is a simplification and may not be appropriate for all types of potentials, especially those with periodic or quasi-periodic orbits. Other ways to define or assess orbital stability include Lyapunov stability, energy conservation, frequency analysis, and Poincaré sections.

Orbit Analysis And Stability

  • Poincaré sections: graphical representations of the intersection of a trajectory with a lower-dimensional subspace, which can help identify stable or chaotic orbits.
  • Recurrence plots: graphical representations of how often a trajectory revisits the same region of phase space, which can help reveal patterns in the orbit.
  • Lyapunov stability: a method for determining the stability of an orbit by analyzing how small perturbations evolve over time.

The text also provides examples of how to implement these methods using the galpy Python package, including code for integrating orbits, estimating Lyapunov stability, and plotting the results. It also discusses the importance of choosing the right potential and initial conditions for the orbit, and provides examples of how to modify the code to include a massive black hole or a 2D disk potential.
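As a hedged sketch of the Lyapunov-style analysis (not the text's actual code), one can integrate two nearby orbits in galpy and track how their separation grows; the perturbation size and potential are illustrative, and the result is a crude finite-time divergence estimate rather than a rigorous Lyapunov exponent:

```python
import numpy as np
from astropy import units as u
from galpy.orbit import Orbit
from galpy.potential import MWPotential2014

ts = np.linspace(0.0, 1.0, 10001) * u.Gyr

# Two orbits whose initial conditions differ by a tiny velocity kick.
o1 = Orbit([8.0 * u.kpc, 0.0 * u.km / u.s, 220.0 * u.km / u.s,
            0.0 * u.kpc, 0.0 * u.km / u.s, 0.0 * u.rad])
o2 = Orbit([8.0 * u.kpc, 0.0 * u.km / u.s, 220.001 * u.km / u.s,
            0.0 * u.kpc, 0.0 * u.km / u.s, 0.0 * u.rad])
o1.integrate(ts, MWPotential2014)
o2.integrate(ts, MWPotential2014)

# Separation in the orbital plane; exponential growth of the separation
# is the signature of chaotic (Lyapunov-unstable) behavior.
dx = o1.x(ts) - o2.x(ts)
dy = o1.y(ts) - o2.y(ts)
sep = np.sqrt(dx**2 + dy**2)

# Crude finite-time estimate: log growth of the separation over the run
# (sep[0] is zero because only the velocity was perturbed, so start at sep[1]).
print('log separation growth over 1 Gyr:', np.log(sep[-1] / sep[1]))
```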

Orbit Integration And Potential Models

The text discusses the integration of orbits in different potential models using the galpy Python package. The KeplerPotential represents a point-like mass distribution, resulting in a stable, elliptical orbit for the Earth. The MiyamotoNagaiPotential represents a disk-like mass distribution, which can result in a more complex and potentially unstable orbit for the Earth.

The initial conditions for the orbit might not be appropriate for the MiyamotoNagaiPotential, and adjusting them might be necessary to obtain a stable orbit. The text also explores the integration of orbits for all 8 planets in 3D orbits using a Keplerian potential and discusses the importance of following step-by-step instructions to obtain accurate results.
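A hedged sketch of this comparison follows; the Miyamoto-Nagai scale lengths are arbitrary stand-ins chosen only to illustrate the disk-like geometry, not values taken from the text:

```python
import numpy as np
from astropy import units as u
from galpy.orbit import Orbit
from galpy.potential import KeplerPotential, MiyamotoNagaiPotential

# Earth-like initial conditions: 1 AU and the circular velocity ~29.78 km/s.
earth = [1.0 * u.AU, 0.0 * u.km / u.s, 29.78 * u.km / u.s,
         0.0 * u.AU, 0.0 * u.km / u.s, 0.0 * u.rad]

ts = np.linspace(0.0, 100.0, 10001) * u.yr

# Point mass (Kepler): produces a closed, stable, near-circular orbit.
kepler = KeplerPotential(amp=1.0 * u.Msun)

# Disk-like mass distribution with the same total mass; the flattened
# geometry changes the radial force law away from 1/r^2.
disk = MiyamotoNagaiPotential(amp=1.0 * u.Msun, a=0.5 * u.AU, b=0.1 * u.AU)

for name, pot in [('KeplerPotential', kepler), ('MiyamotoNagaiPotential', disk)]:
    o = Orbit(earth)
    o.integrate(ts, pot)
    R = o.R(ts)
    print(name, 'radial range:', R.min(), 'to', R.max())
```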
