Can an LLM Be a Black-Box Optimizer?


At Terra AI, I have a great collaborator who describes himself as “optimization-brained” and always comes up with the best strategies for optimization problems, whether he’s finding the best linear-algebra way of performing a calculation or thinking about the best way to apply a set of constraints. This is in contrast to my experience in my PhD, where I mostly applied the classic “graduate student descent” algorithm, manually fiddling with parameters and generally trying to get an intuition for how a system works. In my defence, this was how most of my colleagues did optimization for Earth-system models. Of course, if you’re solving a well-defined problem like regularized inversion of a geophysics problem, you’d use a robust optimizer like Gauss-Newton or L-BFGS. But often you’re working with something like a complex-system model with many interacting components, and you probably don’t have access to gradients or adjoints. An example could be a model of the dynamics of a geologic system that couples weathering, erosion, sediment transport, and biological processes, where you’re trying to fit the model to noisy data.

In Earth sciences, we end up in this situation all the time: you have this huge, complicated simulation code (probably written in MATLAB), and you have to fiddle with some input parameters to get the outputs to match your field data. Oh, and also, the simulation takes 2 hours to run on your laptop. Are you really going to wrap this in an optimizer loop? Chris Rackauckas wrote a really nice blog post about this kind of thing in The nonlinear effect of code speed on productivity; Richard Hamming also wrote about it, drawing on his experience in the 1950s, in his book “The Art of Doing Science and Engineering.” Of course you could use an approach like Bayesian optimization, but we’re Earth scientists! We don’t know about sophisticated techniques like that! So instead, people typically end up taking an approach to optimization that is part “grad-student descent” and part “intuition-guided.”

But this got me thinking: a lot of human intuition is (for better or worse) encoded in written language. What if we could use the intuition latent in a large language model (LLM) to guide an optimization process? I’ve recently been getting better at using LLM APIs (I won an award for it!), so this seemed like a fun project to try out! I bet someone’s already done this, but I wanted to see where I could get on my own; that’s why it’s called research!

Problem setup

For the optimization setup, I decided to use the classic 2D Rosenbrock function as a first test function. The function is defined as:

f(x) = (1 - x_0)^2 + 100 (x_1 - x_0^2)^2,

where x_0 and x_1 are the two independent variables. The aim of the optimization is to find the global minimum of f, which is at x = (1, 1) with f(x) = 0. To keep things simple, I chose to optimize the function over the range x_0, x_1 ∈ [-3, 3].
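In code, the test problem is tiny. Here’s a minimal sketch in Python (the function and bounds are the standard ones described above; the names are mine, not necessarily those used in the actual project):

```python
def rosenbrock(x):
    """2D Rosenbrock function; global minimum f(1, 1) = 0."""
    x0, x1 = x
    return (1.0 - x0) ** 2 + 100.0 * (x1 - x0 ** 2) ** 2

# Search domain for the optimization.
BOUNDS = [(-3.0, 3.0), (-3.0, 3.0)]
```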

LLM Optimizer

How do you turn an LLM into an optimizer? Well, as I’ve found in previous projects, you can just tell an LLM that it has that persona in its system prompt!

You are an optimizer proposing the next point to minimize a 2D function. Always return ONLY a JSON array of floats representing the next candidate point, e.g., [x0, x1]. No extra text.

This is combined with a user prompt that specifies the history of past sample points and function values:

Problem: Minimize a 2D function. Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
  (point=[2.5, 2.5], value=934.1)
  (point=[3, 3], value=2713.16)
  (point=[2, 2], value=197.052)
]
Return ONLY the next point as a JSON array of floats within bounds.

Both of these are supplied to the LLM API, and I get back a response like this:
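(The actual reply isn’t reproduced here, but it’s nothing more than a bare JSON array that can be parsed directly; something along the lines of:)

```json
[1.5, 1.5]
```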

So all I needed to do now was wrap this in a loop and make a new API call at each step of the evaluation loop! I implemented the LLM optimizer in a stateless mode, meaning the LLM is given the full history of the optimization process at each step, but without any other context. An alternative approach would be to supply each point sequentially, in a chat-dialogue fashion. I didn’t end up doing this because I thought the context might get a bit messy by the later evaluation steps. I originally started using o3-mini as my reasoning model of choice, mostly due to its lower cost and higher rate limits. However, after changing my API call scheduler a little bit, I was also able to experiment with a bigger model: gpt-5-mini.
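To make that concrete, here’s a minimal sketch of the stateless loop, assuming the official OpenAI Python client. The helper names and exact prompt formatting are my paraphrase of the prompts shown above, not the project’s actual code:

```python
import json
from openai import OpenAI

client = OpenAI()
BOUNDS = [(-3.0, 3.0), (-3.0, 3.0)]

SYSTEM_PROMPT = (
    "You are an optimizer proposing the next point to minimize a 2D function. "
    "Always return ONLY a JSON array of floats representing the next candidate "
    "point, e.g., [x0, x1]. No extra text."
)

def format_user_prompt(history):
    # Render the full (point, value) history into the user prompt each step.
    entries = "\n".join(f"  (point={p}, value={v:g})" for p, v in history)
    return (
        f"Problem: Minimize a 2D function. Bounds: {BOUNDS}.\n"
        f"Evaluation history: [\n{entries}\n]\n"
        "Return ONLY the next point as a JSON array of floats within bounds."
    )

def llm_minimize(func, n_evals=20, history=None):
    """Stateless LLM optimizer: every call sees the whole history, nothing else."""
    history = list(history or [])
    while len(history) < n_evals:
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": format_user_prompt(history)},
            ],
        )
        point = json.loads(response.choices[0].message.content)
        history.append((point, func(point)))
    return min(history, key=lambda pv: pv[1])
```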

Evaluation

Of course, I needed to be able to judge how well this LLM optimizer is doing, so I also implemented two other optimizers: Nelder-Mead, and Bayesian optimization.
Nelder-Mead is a direct-search technique that moves the vertices of a simplex toward better points, whereas Bayesian optimization fits a Gaussian process model to the objective function and then chooses the next point to evaluate based on the model’s predictions and uncertainty.
For the evaluation metric, I’ll judge the optimizers on the minimum objective function value they find, given a fixed budget of 20 function evaluations.
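For reference, both baselines are essentially one-liners on top of off-the-shelf libraries. A sketch assuming SciPy and scikit-optimize (I’m not claiming these are the exact packages or settings in the original code), reusing the rosenbrock function from above:

```python
import numpy as np
from scipy.optimize import minimize
from skopt import gp_minimize

BOUNDS = [(-3.0, 3.0), (-3.0, 3.0)]
BUDGET = 20  # fixed budget of function evaluations
x0 = np.random.uniform(-3.0, 3.0, size=2)  # random starting point

# Nelder-Mead: simplex-based direct search, capped at BUDGET evaluations.
nm = minimize(rosenbrock, x0, method="Nelder-Mead", options={"maxfev": BUDGET})

# Bayesian optimization: Gaussian-process surrogate over the bounded domain.
bo = gp_minimize(rosenbrock, BOUNDS, n_calls=BUDGET, random_state=0)

print("Nelder-Mead best:", nm.fun)
print("Bayesian optimization best:", bo.fun)
```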
Before I show the results, what do you think will happen?
Personally, I thought that Bayesian optimization would do the best, since it’s typically pretty sample-efficient, and I thought that the LLM optimizer would do second-best, ahead of gradient descent and Nelder-Mead.

A Slight Mis-Step

Almost immediately, I found a mistake with my problem setup. The LLM would immediately guess that the minimum would either be at the origin, or at $(1, 1)$.

=== LLM (stateless) prompt ===
You are an optimizer proposing the next point to minimize a 2D function. The minimum is not always at the origin, so be careful. Make large moves when necessary, as you have few function evaluations available. Always return ONLY a JSON array of floats representing the next candidate point, e.g., [x0, x1]. No extra text.
Problem: Minimize a 2D function. Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1.01, 0.99], value=0.090701)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
  (point=[1, 1], value=0)
]
Return ONLY the next point as a JSON array of floats within bounds.
=== LLM response ===
[1.0, 1.0]

I believe the reason for this is that these are the locations where minima typically sit in commonly used optimization test functions.
I fixed this problem with three tweaks: (1) I added a small random translation to the function, (2) set the LLM optimization routine to start from a random initial point, and (3) altered the system prompt to encourage the LLM to explore the domain more.

System Prompt: You are an optimizer proposing the next point to minimize a 2D function. The minimum is not always at the origin, so be careful. Make large moves when necessary, as you have few function evaluations available. Think carefully about your next move, and be sure to consider the history of points and values. Always return ONLY a JSON array of floats representing the next candidate point, e.g., [x0, x1]. No extra text.
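Tweaks (1) and (2) only take a few lines. A sketch (the magnitude of the random translation here is illustrative):

```python
import numpy as np

rng = np.random.default_rng()

# (1) Translate the function by a random offset so the minimum is no
# longer at the "textbook" location (1, 1).
shift = rng.uniform(-1.0, 1.0, size=2)

def shifted_rosenbrock(x):
    x0, x1 = np.asarray(x) - shift
    return (1.0 - x0) ** 2 + 100.0 * (x1 - x0 ** 2) ** 2

# (2) Seed the LLM optimizer's history with a random initial point.
start = rng.uniform(-3.0, 3.0, size=2)
history = [(start.tolist(), float(shifted_rosenbrock(start)))]
```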

Results

With that bug cleared up, I was able to run the evaluation suite for all the optimization methods! Before looking at the full results, I think it’s cool to look at an example of the LLM’s responses.

=== LLM (stateless) prompt ===
<system prompt not shown>
Problem: Minimize a 2D function. Bounds: [(-3.0, 3.0), (-3.0, 3.0)].
Evaluation history: [
  (point=[-1.43248, -0.969005], value=963.06)
  (point=[2.5, 2.5], value=934.1)
  (point=[3, 3], value=2713.16)
  (point=[2, 2], value=197.052)
  (point=[2.5, 1.5], value=1644.72)
  (point=[2.5, 2], value=1264.41)
  (point=[1.5, 1.5], value=6.34255)
  (point=[1, 1], value=16.301)
  (point=[1.55, 1.55], value=11.8555)
  (point=[1.45, 1.45], value=2.71255)
  (point=[1.4, 1.4], value=0.702304)
  (point=[1.35, 1.35], value=0.063643)
  (point=[1.3, 1.3], value=0.563394)
  (point=[1.345, 1.345], value=0.066107)
  (point=[1.36, 1.36], value=0.0930568)
  (point=[1.3486, 1.3486], value=0.0631964)
  (point=[1.348, 1.348], value=0.0632766)
  (point=[1.37, 1.33], value=0.577212)
  (point=[1.349, 1.349], value=0.0632333)
]
Return ONLY the next point as a JSON array of floats within bounds.
=== LLM response ===
[1.3485, 1.3485]

Completed with 20 evaluations; final best: 0.0631964

I found this so cool to see! You can sort of make out a thinking process here, where the LLM first explores widely, and then gets more and more precise.
When running experiments, I’d track the API responses in a terminal window, and I often found myself making the same next guess as the LLM: “OK, we’re going along this line and the value is still decreasing, so let’s take another step in that direction. Oh, now the value is higher, so let’s step back and move orthogonally instead…” and so on.

Anyway, let’s see how the LLM approach compares to the other optimizers. The figure below shows an example of the optimization paths for each method.

[Figure: example optimization paths for each method (paths6-gpt5)]

In this example, Nelder-Mead was not able to get close to the global minimum in the fixed budget of 20 function evaluations.
Although Bayesian optimization made a good first start, it didn’t make great progress past the 9th function evaluation.
However, as luck would have it, the LLM optimizer performed better than all the other optimizers, getting fairly close to the global minimum!
I thought this was probably a fluke though, and ran another attempt. This time, I’m visualizing the objective function per iteration step, for each optimization method.

[Figure: objective function value per iteration for each method (convergence6-gpt5)]

Again, the LLM optimizer ended up doing the best! Still, I wasn’t convinced, so I ran 10 experiments (each starting from a random initial point) and took the average of the results. Would the LLM optimizer be statistically better than the other optimizers?
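(The aggregation itself is nothing fancy; roughly something like the sketch below, where run_optimizer is a hypothetical stand-in for any of the three methods and returns the best value it found.)

```python
import numpy as np

def benchmark(run_optimizer, n_trials=10, budget=20):
    # Average the best objective value found across repeated runs,
    # each starting from a different random initial point.
    bests = []
    for seed in range(n_trials):
        rng = np.random.default_rng(seed)
        start = rng.uniform(-3.0, 3.0, size=2)
        bests.append(run_optimizer(start, budget))  # hypothetical interface
    return float(np.mean(bests)), float(np.std(bests))
```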

[Figure: convergence averaged over 10 runs with random initial points (convergence5-gpt5)]

It seems that GPT-5 is actually quite a fine optimizer, in a latency-agnostic, limited function evaluation setting!

Conclusions

So what can I gather from this? Well, it’s been making me think about the different choices one makes when deciding on an optimization (or decision-making) strategy. I think a big deciding factor is robustness. Although approaches like Nelder-Mead don’t come with guarantees of converging to a minimum, for practical purposes they will “work,” because they were constructed for that purpose. A prompt-guided LLM could really do anything: hallucinate, get caught in a loop, refuse to return formatted outputs, etc. For a safety-critical application, you might prefer a more robust method over a “fuzzy” LLM approach. Another big factor is latency. LLM API calls are (usually) slow, and in some settings you just can’t afford that. (On the topic of affordability, LLMs aren’t free! These experiments cost me around $10 in API calls.) Overall this makes me think of another colleague I work with, who works on the Curiosity rover, where decisions are made by committee and latency is minutes to hours. I don’t see LLMs being used in settings like that, but there are similarities!

Will I actually use this in my work? Probably not in its current form. But if someone made a really robust LLM optimizer, I’d actually consider using it for real problems. I think in cases where you don’t have access to gradients, you have a small budget for function evaluations, and you don’t want to wrap your code in an optimizer loop, this could actually be a valid approach!
I could imagine this working through an interface where I relay inputs and outputs to an LLM that is set up as a kind of hybrid online-offline optimizer, and I could see myself using it as a good addition to “intuition-guided” optimization.

I think some interesting future work could be to experiment with different formulations of the prompting, e.g., a stateful dialogue. It would also be good to compare the performance of reasoning vs. non-reasoning LLMs, as well as to test on other optimization problems (even some real settings)! If you’re interested in the code for this project, you can check it out in this GitHub repo!
