It's Not What You Think: LLMs Like Obvious Answers


GSM8K is an AI benchmark from late 2021 consisting of over 8,000 grade school math problems. Back then, LLMs were still pretty bad at math: a fine-tuned version of GPT-3 got only about 33% of the problems right. You know what happened next, of course. LLMs were close to acing it even before reasoning models hit the scene.

Still, no model ever quite got 100%. Why not? Some researchers looked into it and found that 1-2% of the problems had issues: usually the question was poorly written, and sometimes the answer key was simply wrong. They took a sample of 300 problems and carefully removed the bad ones. Funnily enough, there's a single question in this cleaned-up sample that most LLMs still get wrong. Here it is:

A landscaping company is delivering flagstones to a customer’s yard. Each flagstone weighs 75 pounds. If the delivery trucks can carry a total weight of 2000 pounds, how many trucks will be needed to transport 80 flagstones in one trip?

I’ll confess that, when I first saw a model’s answer, I missed the mistake. Here’s o4-mini-high, via ChatGPT:

The obvious calculation is 80*75 = 6000 pounds of flagstone, divided by a 2000-pound truck capacity: three trucks. That might sound fine, even obvious, if you're not reading closely. The grade-school style doesn't help: it sounds like the sort of problem that couldn't possibly be tricky. But, no, the flagstones are discrete: 26 of them weigh 75*26 = 1950 pounds, while 27 would weigh 2025, so a truck can only carry 26. Three trucks carry 26*3 = 78 flagstones, and the remaining 2 need a fourth truck. The right answer is 4.
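To make the trap concrete, here's a minimal sketch of the two calculations (the code and names are mine, not from any model's answer):

```python
import math

FLAGSTONE_WEIGHT = 75   # pounds per flagstone
TRUCK_CAPACITY = 2000   # pounds per truck
FLAGSTONES = 80

# The "obvious" calculation: divide the total weight by the truck capacity.
naive_trucks = math.ceil(FLAGSTONES * FLAGSTONE_WEIGHT / TRUCK_CAPACITY)

# The correct calculation: flagstones are discrete, so first work out how
# many whole flagstones fit on one truck, then how many trucks that implies.
per_truck = TRUCK_CAPACITY // FLAGSTONE_WEIGHT      # 26 flagstones (1950 lb)
correct_trucks = math.ceil(FLAGSTONES / per_truck)  # ceil(80 / 26) = 4

print(naive_trucks, correct_trucks)  # 3 4
```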

Why is this interesting? It’s well-known that LLMs are superhuman at some tasks while surprisingly deficient at others, and that, due to their statistical nature, these deficiencies can crop up unexpectedly. But I find this problem more interesting because it suggests a partial explanation: maybe models are inclined toward obvious answers. In other words, maybe they fall for “trick questions”: questions which have answers that are intuitive, appealing, but wrong. This would be very human of them!

One of the biggest questions about LLMs is how much they generalize beyond their training distribution. They clearly respond quite well to many inputs they haven’t literally seen before. But, how far off the beaten path can they go? This question is impossible to quantify in any broad way, but it looms large in many big-picture discussions, e.g. about whether models can come up with novel ideas.

On such big questions, we just have to collect our data points here and there and weigh them up as best we can. In my own personal running tally, I find the above example noteworthy. It's a simple suggestion that, at least so far, going off the beaten path is a weakness. And, perhaps, it gives us something to measure so we can tell when this ceases to be the case.

Here’s another bit of evidence for this hypothesis, in the form of a harder—though still quite elementary—math problem.

There is an ant at one of the corners of a 1x1x2 box. The ant can only move by walking on the surface of the box. What is the distance to the point on the box that the ant will have to travel the farthest to reach?

If it were a 1x1x1 box, the desired point would be the opposite corner. If it were just “as the crow flies” distance, i.e., standard Euclidean distance in 3D, the desired point would be the opposite corner. The opposite corner is the intuitive answer—but it’s wrong. I won’t go into detail on the right answer, but I liked the visuals on this site, as well as its suggestion to consider a 1x1x10 box to build a correct intuition.
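To show why the surface makes things subtle, here's a small sketch (my own, not from the post or the linked site) that compares the straight-line 3D distance to the opposite corner with the shortest surface path to it, computed by unfolding two adjacent faces into a plane. It only handles the corner-to-corner case; the surprise is that, on the 1x1x2 box, the genuinely farthest point is somewhere else.

```python
import math
from itertools import permutations

def euclidean_corner_distance(a, b, c):
    """Straight-line ("as the crow flies") distance between opposite corners."""
    return math.sqrt(a * a + b * b + c * c)

def surface_corner_distance(a, b, c):
    """Shortest walk along the surface between opposite corners of an a x b x c box.

    Unfolding the two faces a route crosses into a plane turns the route into a
    straight line; the shortest route is the minimum over the three unfoldings.
    """
    return min(math.sqrt((x + y) ** 2 + z ** 2) for x, y, z in permutations((a, b, c)))

for dims in [(1, 1, 1), (1, 1, 2)]:
    print(dims,
          round(euclidean_corner_distance(*dims), 4),
          round(surface_corner_distance(*dims), 4))
# (1, 1, 1): straight line ~1.7321, surface ~2.2361 (= sqrt(5))
# (1, 1, 2): straight line ~2.4495, surface ~2.8284 (= 2*sqrt(2))
# The post's claim is that on the 1x1x2 box some other surface point is
# farther than 2*sqrt(2) from the starting corner, so the opposite corner
# is not the answer.
```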

This problem is not only off the beaten path because intuition points to the wrong answer: it also doesn’t appear to be discussed online much. If you search for it, you mostly get results for the 1x1x1 case. Or, that’s for English, anyway: the problem appears to be due to a Japanese mathematician, and there’s a Japanese Wikipedia page about it.

Whatever the cause, this seems to be a problem with which the LLMs are unfamiliar: they all think the answer is the opposite corner.

As usual, my purpose here is not to laugh at the LLMs’ mistakes but to find something we can pay attention to in order to understand the trajectory of their math capabilities. I think this suggests a decent angle: questions with surprising answers. Not only is it interesting in its own right, but it suggests a way to get a sense of how much they’re capable of seeing beyond what we’d all expect to be true.

Contamination is a problem, of course. No LLM would give the wrong answer to the Monty Hall problem. But, for instance, the ant-on-a-box problem apparently made it into Martin Gardner’s famous Mathematical Games column, in Scientific American. If LLMs aren’t getting it, maybe that means the column isn’t so prominent in their training data. Maybe we can make a benchmark out of the less-famous problems that appear there, say, those without an English Wikipedia article.

Even if not, it’s one more hard-to-define concept to nonetheless keep an eye on. If you have a favorite counterintuitive math problem, let me know how LLMs do on it!
