Large Language Models (LLMs) are remarkable tools with major limitations. In the two years since ChatGPT made them famous, their use has spread to search engines, coding assistants, document editors, even WhatsApp.
Unfortunately, as their use has spread, clarity about how they work has not. Over the past two years as tech lead of the Guardian’s fledgling Newsroom AI team, I’ve learned a lot about LLMs. But I’ve also noticed that the applications built on top of them conceal what’s really going on, leading to widespread misunderstanding.
In this piece, I want to cut through the confusion by busting five LLM myths: misconceptions that naturally arise when using applications built on LLMs. Through this, we’ll learn about how LLMs really work and how they are made.
Let’s start by looking at how LLMs appear to work. Consider a typical conversation with Claude: you ask it questions, it answers them, and it can refer back to things said earlier in the chat (for example, when asked “What was the first thing I said to you?”).
What might you assume about language models, told only that the chatbot is powered by an LLM?
- It looks like it can hold a conversation. That is, it has a notion of me (chatbot) vs you (user), and an intrinsic notion of structured turn-based conversations in which it follows user instructions and answers user questions.
- It looks like it accepts text — i.e. a string of characters — as input from the user. (If you’re a software engineer and you pop open the dev tools, you would indeed see a UTF-8 string being sent to the backend.)
- It looks like it outputs text, of an arbitrary length. If you’re paying close attention, you may notice that the output appears incrementally on the screen, meaning the longer the output, the longer you have to wait for the complete response.
- If you put the chatbot through its paces, you’ll realise that it has knowledge and abilities across many domains. For instance, it can write code, write a sonnet, translate, summarise, answer technical questions, etc. You might reasonably conclude that these different abilities required specialised types of domain-specific training or programming.
- It looks like it can remember. “The first thing I said to you” requires remembering what the user has said previously.
In fact, none of these are true. Let’s go through them one by one.
Truth: LLMs predict likely continuations of conversations
When we talk to an LLM-powered chatbot, it looks like we’re interacting with something that can hold the thread of a conversation, understand whose turn it is to speak, keep track of its own thoughts independently of those of its conversational partner, and so on.
In fact, this chatbot facade is an illusion, and under the hood, the real task of a language model is simply to predict continuations of text. Given a text prompt as input, an LLM will output text that is likely to follow the prompt, given what the model has seen in its training data.
For instance, given the input “The capital of France is”, a likely continuation is “Paris”.
In order to play the role of assistant in a chatbot context, the LLM is fed a “screenplay” of user/assistant dialogue up to the point of its next line, and it then generates a probable continuation from this point:
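As a rough sketch, the wrapping application might flatten the conversation into a single text prompt along the lines of the Python snippet below. The template is illustrative only: real chat formats vary from model to model and use special tokens to mark turns.

```python
# A minimal sketch: real chat templates differ between models and use special
# tokens to mark the start and end of each turn.
def build_prompt(turns):
    """Flatten (speaker, text) turns into a single "screenplay" prompt, ending
    with an open Assistant: cue for the model to continue from."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append("Assistant:")  # the model's job is to predict what follows this cue
    return "\n".join(lines)

prompt = build_prompt([
    ("User", "What is the capital of France?"),
    ("Assistant", "The capital of France is Paris."),
    ("User", "What was the first thing I said to you?"),
])
print(prompt)
```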
It could equally be given some input that finishes halfway through a user turn, and it would then complete from there, putting words in the user’s mouth. Or the input could put words in the assistant’s mouth, forcing it to carry on its own response from that point.
The key thing is that there really is no intrinsic notion of user and assistant, just a notion of completion based on text seen in training. This is why the second stage of training an LLM — supervised fine-tuning — uses documents that have this user/assistant structure. More on this later.
Truth: LLMs operate on tokens (≈ words) represented by vectors
What you type into a chatbot is indeed a string of Unicode-encoded characters. But the model never sees these characters. First, this input string must be tokenised: chopped up into units that are roughly word-sized. The model will have a fixed-size vocabulary of such tokens, typically containing tens to hundreds of thousands of entries.¹
For instance, using the gpt-4o tokeniser, which has a vocabulary of around 200,000 tokens, “the capital of France is” gets chopped up into units that all happen to be words:
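You can see this for yourself with OpenAI’s open-source tiktoken library; the snippet below is a small sketch, and the exact splits you see will depend on the tokeniser version:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # the ~200k-token gpt-4o vocabulary
token_ids = enc.encode("the capital of France is")
print(token_ids)  # a list of integer token ids

for token_id in token_ids:
    # decode each id back to the text fragment it represents
    print(token_id, repr(enc.decode([token_id])))
```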
But this need not always be the case: less common or unusual words are often chopped into several sub-word chunks.
The fact that LLMs see tokens instead of characters is the root of some of their most surprising failure modes, such as their difficulty counting the number of r’s in the word strawberry.²
Tokens are represented by vectors that encode meaning
Each of these tokens corresponds to a numerical representation which gets looked up by id once the tokeniser has identified the tokens present in the prompt.
These numerical representations are vectors, and a sequence of these vectors is what the model actually receives as input.
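Conceptually, this lookup is just indexing into a big table with one row of numbers per token in the vocabulary. The sizes and token ids below are made up for illustration:

```python
import numpy as np

# Tiny illustrative sizes -- real models have vocabularies of ~200k tokens
# and vectors with thousands of dimensions.
vocab_size, embedding_dim = 50_000, 256

embedding_table = np.random.rand(vocab_size, embedding_dim)  # one learned vector per token

token_ids = [1012, 40824, 328, 49999, 382]   # hypothetical ids for "the capital of France is"
input_vectors = embedding_table[token_ids]   # look up one row per token id

print(input_vectors.shape)  # (5, 256): a sequence of five vectors is what the model actually receives
```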
What are vectors? The terminology may be intimidating if you don’t have much maths background, but like a lot of technical terms, it’s actually hiding something very simple.
A vector is a list of numbers, and it is an arrow in a geometric space.
How can it be both simultaneously?
Because in 2D, every arrow can be represented with two numbers: one for how far to go across (the x-axis), and another for how far to go up (the y-axis).
In 3D, you would need three numbers: along (x), up (y), and “out” (z). Beyond 3D, it becomes impossible to visualise, but luckily mathematicians can prove that geometric properties that we understand in 2D and 3D generalise to these higher unvisualisable dimensions.
The arrows (vectors) that represent words in language models have been learned from data in such a way that their geometric properties reflect the meaning of the words they represent. That is, arrows pointing in similar directions will represent words with similar meanings, arrows pointing in opposite directions will represent words with opposite meaning, and arrows at right angles will represent words that are unrelated:
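As a toy sketch of this geometry, here are some two-dimensional vectors with made-up values (chosen purely for illustration, not real embeddings) and their cosine similarities:

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = at right angles, -1.0 = opposite directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2D vectors purely for illustration -- real embeddings are learned from data.
vectors = {
    "happy": np.array([0.9, 0.8]),
    "glad":  np.array([0.85, 0.75]),
    "sad":   np.array([-0.9, -0.8]),
    "table": np.array([0.8, -0.9]),
}

print(cosine_similarity(vectors["happy"], vectors["glad"]))   # close to 1: similar meaning
print(cosine_similarity(vectors["happy"], vectors["sad"]))    # close to -1: opposite meaning
print(cosine_similarity(vectors["happy"], vectors["table"]))  # close to 0: unrelated
```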
Vectors that have this magical property are known as embeddings. The above example is in two dimensions for ease of visualisation, but the vectors used in LLMs have hundreds or even thousands of dimensions, allowing them to model the many nuanced axes along which word meanings can differ.
Embeddings are created from the information inherent in any text about what words occur near each other, using this to produce vector representations for words that are similar if the words tend to have similar neighbours.³ For instance, lots of examples of the sentences “I’d be glad to help” and “I’d be happy to help” might lead to the vectors of “happy” and “glad” being similar because they frequently occur as the blank in “I’d be ____ to help”.
Typically, the vectors themselves are effectively a byproduct of training a model to perform a specific language task. For instance, the token embeddings in an LLM are actually created as the model learns to predict the next word in a sequence. As it gets better at this task, the vectors that it uses internally to represent words will acquire this desirable property of being similar if they appear in similar contexts.⁴
Truth: LLMs output probabilities for the next token
Despite appearances, an LLM does not actually output text.
Instead, it outputs a probability distribution over its vocabulary of tokens. That is, for every possible next token, it assigns a number between 0 and 1 which represents the likelihood that the token would follow the input, such that the likelihoods of all tokens add up to 1.
In this sense, LLMs are much more like classifier models — which output probabilities of the input belonging to one of a fixed set of categories — than they at first appear.
For the user to get an intelligible response to their prompt, a token needs to be plucked from this weighted deck of cards. The strategy used to choose a token is known as the decoding method. In one of the most popular decoding methods, a setting called temperature controls the way that the token probabilities affect the chance of actually using a given token.
With a temperature of 0, we will always get the most probable token — in our case, Paris. With a non-zero temperature, we may get a different token, and the higher the temperature, the higher the chance of picking a less probable token.⁵ For instance, we might pick “one” as our next token despite the fact that the model predicts it’s only 2% likely to follow our prompt:
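Here is a minimal sketch of temperature-based sampling over a handful of candidate tokens; the logits (unnormalised scores) are invented for illustration, and real decoding methods often add further steps such as top-k or nucleus sampling:

```python
import numpy as np

# Invented logits (unnormalised scores) for a handful of candidate next tokens.
tokens = [" Paris", " one", " the", " located"]
logits = np.array([5.0, 1.5, 1.0, 0.5])

def sample_next_token(logits, temperature):
    if temperature == 0:
        return tokens[int(np.argmax(logits))]  # greedy: always the most probable token
    scaled = logits / temperature              # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())      # softmax (shifted for numerical stability)
    probs /= probs.sum()
    return np.random.choice(tokens, p=probs)

print(sample_next_token(logits, temperature=0))    # always " Paris"
print(sample_next_token(logits, temperature=1.0))  # usually " Paris", occasionally something else
print(sample_next_token(logits, temperature=2.0))  # less probable tokens become more likely
```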
A single-token response is not very useful, especially if we’ve picked “one”, which strongly implies that more should follow. How can we generate longer outputs?
The solution is for the model to consume its own output in order to continue predicting. The word “one” is appended to the input, and the whole thing is fed back in as a prompt to predict the next word.
To generate a long string of text, this has to be repeated over and over again, until the model decides to stop by generating a special stop token, or until it hits a maximum number of generated tokens configured by the caller. This process of predicting based on previous predicted values is known as autoregression, and in the context of language generation as autoregressive decoding.
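Sketched in Python, the loop looks something like this; predict_next_token stands in for a real model call and STOP_TOKEN for the model’s special end-of-sequence token, both hypothetical names:

```python
MAX_NEW_TOKENS = 100
STOP_TOKEN = "<|endoftext|>"  # placeholder for the model's special stop token

def generate(prompt_tokens, predict_next_token):
    """Autoregressive decoding: repeatedly predict, sample and append a token."""
    output = []
    for _ in range(MAX_NEW_TOKENS):
        # The model sees the original prompt plus everything generated so far.
        next_token = predict_next_token(prompt_tokens + output)
        if next_token == STOP_TOKEN:
            break
        output.append(next_token)
        # This is the natural point to stream the new token back to the user.
    return output
```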
Autoregressive decoding is the reason that you see words stream rapidly one by one across the screen when using ChatGPT, instead of receiving the complete output all at once. It makes sense to stream each token back as soon as it is generated, because the time to generate the whole response increases with its length⁶ and would often be more than users would tolerate.
Truth: LLMs (mostly) learn by filling in the blank
Despite the astonishing range of abilities displayed by LLMs, they almost all arise from one simple training objective: guess the next word.
It works something like this. We take some text scraped from the internet, blank out a word, and ask the model to come up with a predicted probability distribution for the blank. The model is initialised with random parameters, so its first guess will probably look pretty strange.
We then uncover the blank, revealing the true probability distribution of 100% for the real token and 0% for everything else.
The difference between the guess and the real answer is known as error (or loss), and is used to update the model’s parameters (the weights of the connections in its neural network) such that the next guess will be less wrong.
In the case of LLMs, this “how wrong was my guess?” is calculated using something known as cross-entropy loss.
Cross-entropy is a measure of difference between two probability distributions, in our case between the model’s guess and the true distribution. The true distribution will always be 0% for every token except the correct one, which drastically simplifies⁷ the calculation to
-log(probability(correct_token)) = -log(probability(“paris”))
= -log(0.064) = 2.748872…
The perfect guess from the model would be 100% (i.e. 1), and indeed -log(1) gives 0 loss for that prediction. The worst guess from the model would be 0%, which for mathematical reasons is not actually a possible prediction, but the closer it gets to 0, the more severely it is penalised as the negative log tends to infinity.
With a clever use of a 350-year-old technique from calculus⁸, we can feed this error signal backwards through the neural network in a process called backpropagation. This process gives us something called a gradient, which tells us how to adjust the weights of the network such that the next guess will be less wrong, e.g. the model might give Paris a 12% chance instead of a 6% chance.
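Here is a minimal sketch of a single training step, written with PyTorch and a deliberately tiny made-up model; the sizes and token ids are invented purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 1000, 32  # toy sizes purely for illustration

# A deliberately tiny "language model": embed the last token, project to vocabulary logits.
model = nn.Sequential(nn.Embedding(vocab_size, embedding_dim),
                      nn.Linear(embedding_dim, vocab_size))
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

context_token = torch.tensor([42])      # stand-in for "the capital of France is"
correct_next_token = torch.tensor([7])  # stand-in for " Paris"

logits = model(context_token)  # the model's guess: one score per vocabulary token
loss = nn.functional.cross_entropy(logits, correct_next_token)  # -log(predicted probability of " Paris")

optimiser.zero_grad()
loss.backward()    # backpropagation: compute the gradient of the loss
optimiser.step()   # nudge the weights so the next guess is less wrong
```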
From completions to chat
What I’ve described above is known as pretraining, and is one of three stages typically used to train language models:
- Pretraining
- Supervised fine-tuning (SFT)
- Reinforcement learning from human feedback (RLHF)
Pretraining is by far the most compute-intensive and expensive stage. Stages 2 and 3 are relatively insignificant in terms of cost but can make a huge difference to the usefulness of the model.
At the end of pretraining, the model’s response to the input “What is the capital of France?” may well not be an answer at all. The model simply generates what is most likely to follow the question given what it’s seen in its training data of random internet text, which might just as easily be a list of further questions as a helpful reply.
To get it to do something more useful, supervised fine-tuning involves continuing the “guess the word” game, but this time on a specially created set of documents that look like a very boring screenplay featuring just two characters, “user” and “assistant”. These contain examples, written by humans, of high quality responses by the “assistant” to queries and instructions by the “user”.⁹
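A single training document might look something like the record sketched below; the structure and field names are illustrative rather than any particular vendor’s format:

```python
# One illustrative fine-tuning record; real SFT datasets contain many thousands
# of these, and the exact field names and format vary between vendors.
sft_example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```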
After this process, the model is far more likely to respond to the same question with a direct, helpful answer along the lines of “The capital of France is Paris.”
Stage 3 involves reinforcement learning, which is a fundamentally different type of training in which the model learns by receiving differing rewards for its actions. In this case, the actions are generated responses to a prompt, and the reward comes from a separate model trained based on human rankings of alternative model outputs. I won’t say much about this type of training here but this excellent blog post from Hugging Face has more detail.
Truth: LLMs are frozen in time
Because chatbot interfaces do their best to hide it from you, it may be surprising that LLMs have no memory beyond their training data.
To illustrate what I mean by this, imagine a conversation in which you ask “What is the capital of France?”, receive the answer “Paris”, and then follow up with “What is its population?”
The chat interface might give you the impression that the model receives each new message on its own: that you simply send “What is its population?” and it answers from its memory of the conversation so far.
In order to know what “its” refers to, the LLM would need to remember previous interactions. But as we saw above, the LLM cannot even remember what it has said previously, let alone what a user has said previously. The only way to simulate a conversation is to provide the whole script from the very beginning, meaning the interactions actually look like this:
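For example, with the OpenAI Python SDK (the model name and messages here are illustrative), every request has to resend the entire conversation so far:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The whole conversation so far is resent on every turn -- the model itself keeps no state.
history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is its population?"},  # "its" only makes sense given the lines above
]

response = client.chat.completions.create(model="gpt-4o", messages=history)
print(response.choices[0].message.content)
```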
Holding a stateful conversation via a series of stateless interactions has cost and performance implications.
For instance, interactions with models via APIs are typically billed per token of input and output. But this means that as conversations grow linearly, the cumulative cost will grow quadratically:
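A back-of-the-envelope sketch makes this concrete, assuming (purely for illustration) that every message in the conversation is 100 tokens long:

```python
TOKENS_PER_MESSAGE = 100  # illustrative assumption: every message is the same length

cumulative_input_tokens = 0
for turn in range(1, 11):
    # On turn n, the input contains all 2(n-1) previous messages plus the new user message.
    input_tokens = (2 * (turn - 1) + 1) * TOKENS_PER_MESSAGE
    cumulative_input_tokens += input_tokens
    print(f"turn {turn:2d}: input {input_tokens:5d} tokens, cumulative {cumulative_input_tokens:6d}")

# Input per turn grows linearly, so the cumulative total grows quadratically.
```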
This growth curve is dramatically flattened by prompt caching, which can reduce the cost of input tokens by half (OpenAI) or even by an order of magnitude (Anthropic), but it will still be quadratic because input tokens grow linearly with each conversational turn.
There are also implications for the quality of the model response as conversations grow. Models can only consider input up to a certain length, known as the context window, when generating a response. If the conversation grows beyond this size, the model simply will not be able to see the earlier messages, and the conversation may become incoherent.
Recent models have context windows of up to a million tokens, making this limit harder to breach. But research has shown that, even within the context window, longer inputs may degrade the quality of model responses, because LLMs exhibit a U-shaped performance curve: they access information well from the beginning or end of their context, but struggle significantly with information in the middle. As conversations lengthen, critical earlier context gets pushed into this poorly-accessed middle region.
“What year is this? Who’s the president!??”
Another consequence of the amnesiac nature of language models is that, when interacting with a model directly via an API, it will lack even the most basic knowledge about the present moment.
If we ask OpenAI’s GPT-4o what day it is today, directly via the API rather than via ChatGPT, it simply has no way of knowing.
Even if you provide some help, it may not be enough if you haven’t considered exactly when the model finished training. In July 2025, if we tell it that the date is 28th July, GPT-4o assumes it’s still 2023, because it finished training in October 2023.
Of course, if you talk to the model via a UI like a chatbot, this limitation will not be obvious, because information like the current date is invisibly passed in for you in the system prompt.
The picture is further complicated by the fact that model vendors have begun adding memory features to their chatbot apps. For instance, ChatGPT now has the ability to save memories when instructed by the user, but also to reference “insights ChatGPT gathers from past chats to improve future ones”.
But it’s important to remember that these memory capabilities are provided by layers of software wrapping around the model, not by the language model itself. For the time being, LLMs are static artefacts that are incapable of learning continuously in response to new data, though this is an active area of research (see, for instance, this work on Self-Adapting Language Models.)
Fifty First Context Windows
I’ve found a useful analogy for the amnesiac state of LLMs: the movie Fifty First Dates. In this movie, Drew Barrymore plays a woman with a head injury that leaves her unable to form new memories. Each day she wakes up believing it to be the day of her accident without any memory of the days that have passed since then.
Her love interest Adam Sandler eventually comes up with a solution: record a video of all the key events from her life post-accident, and play it to her every morning when she wakes up. This enables her to function in the world despite her impairment.
The situation for poor old language models is even worse! At least Drew Barrymore had a day’s buffer of memories which got wiped every night. For LLMs, every moment they “wake up” to perform a prediction, it’s as if they’re waking up on the day their training ended. Like Adam Sandler, you need to feed the model everything it can’t remember but needs to know in order to function.
Let’s return to our initial list of assumptions based on a chatbot interaction:
- LLMs can hold a conversation. FALSE. In fact, they simply output a likely continuation of their input, which may happen to be a conversation between a human user and a chatbot assistant.
- LLMs accept text as input. FALSE. They only see tokens, which are represented as vector embeddings.
- LLMs output text, of an arbitrary length. FALSE. They output a probability distribution one token at a time. Autoregressive decoding is the process of using this probability distribution to choose a token, appending it to the input to generate the next token, and repeating until a stop token is generated.
- LLMs’ different abilities require specialised types of domain-specific training or programming. FALSE. Almost all of LLM training involves nothing more than playing “guess the next word” (though reinforcement learning is often used subsequently to tweak behaviour.)
- LLMs can remember information seen after training. FALSE. Learning stops at the end of the training process and the model is then frozen in time. Every prediction is completely stateless, meaning everything needed to predict the next token must be provided as input each time. That includes things previously said by the user, things previously generated by the LLM, and any other world knowledge or context about recent events.
¹ OpenAI’s GPT-3 had about 50k tokens in its vocabulary, GPT-4 has 100k and GPT-4o has 200k. [back]
² Though as others have pointed out, the true surprise is that a character-blind model can spell anything at all. [back]
³ This idea, known as distributional semantics, has deep roots in linguistics and philosophy, traceable at least back to Wittgenstein’s 1930 claim that “the meaning of a word is its use in the language” and Firth’s “you shall know a word by the company it keeps” (1950). More recently, it has been applied to the study of child language acquisition in Lila Gleitman’s “Verbs of a feather flock together”.
To learn more about how this idea is applied in machine learning to produce embeddings, check out Jay Alammar’s The Illustrated Word2Vec. [back]
⁴ Other examples of training objectives that can be used to create embeddings include predicting a word given surrounding words (Continuous Bag Of Words, or CBOW), and predicting surrounding words given a word (Skip-gram), both from the original word2vec paper.
Later models like BERT, which formed the foundation of many popular embedding models, were trained to predict a masked word in a sequence, or to predict whether a given sentence follows another. [back]
⁵ Technically, the model produces logits (unnormalised scores) which are converted to a probability distribution using a softmax function with the temperature parameter. The temperature determines how much we sharpen or flatten the probabilities relative to those that the logits would directly produce.
A temperature of 1 creates probabilities exactly in line with the model’s prediction. Moving the temperature towards zero skews the distribution towards the highest-probability token, until at zero itself it becomes 100% for that token and 0% for all others. A temperature above 1 starts to flatten out the distribution, such that very high temperatures would make the token choice essentially random (throwing away all the information the model has learned!).
See this Hugging Face article for more details and the temperature equation. [back]
⁶ Without optimisations, it would grow quadratically, because we need to re-input the growing output each time (“The capital of France is one”, “The capital of France is one of”, etc…). So the total work to generate n output tokens is proportional to the nth triangular number, n(n+1)/2, making generation O(n²).
However, key–value caching eliminates the need to re-compute earlier steps for each new token, so the cumulative work grows linearly in practice. Further optimisations mean that even this linear growth may not be apparent in latency when using a model via an API, at least at shorter output lengths. [back]
⁷ The full cross-entropy loss equation is actual*log(predicted) summed across all actual/predicted pairs of probabilities in the distribution and then negated: -Σ actual * log(predicted).
So if we had a two-word vocabulary yes/no, and we predicted “yes” at 30% and “no” at 70%, but their actual probabilities were 20% and 80%, we’d do (using the natural logarithm, as in the example above):
-((0.2 * log(0.3)) + (0.8 * log(0.7))) = 0.526134…
For a 200k vocabulary this would get laborious. But because the actual probability is zero for all but one token, and that one token has a probability of one, the equation reduces to simply -log(predicted). [back]
⁸ The chain rule was first recorded as being mentioned by Leibniz in 1676. [back]
⁹ To see some real examples of the ideal responses written by humans and the effect on the model responses of fine-tuning on these, check out OpenAI’s 2022 paper “Training language models to follow instructions with human feedback” — a relic from the long-forgotten era when OpenAI published research openly.
There are also many other examples here: https://github.com/raunak-agarwal/instruction-datasets?tab=readme-ov-file [back]