LLM hallucinations are compression failures, and we can detect them


In-context learning (ICL) often behaves like Bayesian updating: the model “reads” a prompt, forms something like posterior beliefs, and predicts accordingly. Yet on exchangeable tasks (where order shouldn’t matter), modern LLMs systematically change predictions when you permute the same evidence, violating the permutation invariance / martingale properties required by Bayesian posteriors. In the previous post we showed that:

  • Log-loss equals MDL: Transformers minimising -log P(x) + λ||θ||² during training are actually minimising the total description length (the bits needed to encode both the data given the model and the model itself).

  • MDL requires sufficient statistics: To achieve optimal compression (entropy-level coding), the model must learn to extract the sufficient statistics T(X); it's information-theoretically impossible to compress optimally without them.

  • Statistics parameterise posteriors: The extracted T(X) parameterises the model's output distribution, which approximates the Bayesian posterior P(next token | T(X)) for exchangeable data.

When positional encoding prevents full extraction of T(X), the model gets incomplete statistics T'(X). Now it must parameterise a posterior with insufficient information. The model still needs to output a valid probability distribution, so it fills in the gaps with learned biases, which we call hallucinations. Part 2 discusses how we formalise and detect this in our new paper.

Consider how you would reason whether a medical diagnosis of appendicitis is likely given a patient summary:

[Patient reports right lower quadrant pain (started 6 hours ago)… Temperature: 38.5°C (mild fever)… White blood cell count: 14,000/μL (elevated)… Nausea with one episode of vomiting… Pain originally started around the umbilicus then migrated to RLQ… Rebound tenderness present on physical exam… Patient ate normally yesterday, no appetite today]

You'd probably follow a Bayesian reasoning approach without even knowing it. You'd extract the most important (sufficient) statistics to base your reasoning on and consider the following (a toy code sketch follows the list):

  1. Migration score: Did pain migrate from periumbilical to RLQ? (Yes = 1, No = 0)

  2. Inflammatory marker: How many are positive? (fever + elevated WBC = 2 out of 3)

  3. Symptom scores: RLQ pain + nausea/vomiting + anorexia (3 out of 3 present)
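
As a toy illustration (invented field names and thresholds, not the scoring from our paper, and certainly not clinical advice), extracting these order-invariant statistics might look like this:

```python
# Toy sketch: extracting order-invariant statistics from unordered clinical
# facts. Field names, thresholds, and weights are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Findings:
    pain_migrated_to_rlq: bool
    temperature_c: float
    wbc_per_ul: int
    rlq_pain: bool
    nausea_or_vomiting: bool
    anorexia: bool

def sufficient_stats(f: Findings) -> dict:
    """Collapse the note into the three counts above. Because these are
    simple functions of the facts, reading the facts in any order gives
    the same summary."""
    return {
        "migration": int(f.pain_migrated_to_rlq),
        "inflammatory": int(f.temperature_c >= 38.0) + int(f.wbc_per_ul > 10_000),
        "symptoms": int(f.rlq_pain) + int(f.nausea_or_vomiting) + int(f.anorexia),
    }

note = Findings(pain_migrated_to_rlq=True, temperature_c=38.5, wbc_per_ul=14_000,
                rlq_pain=True, nausea_or_vomiting=True, anorexia=True)
print(sufficient_stats(note))  # {'migration': 1, 'inflammatory': 2, 'symptoms': 3}
```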

These sufficient statistics capture more or less everything needed for the diagnosis. Whether you read "fever" first or last doesn't matter; what matters is that fever is present and contributes to the inflammatory marker count. Transformers behave in much the same way, but with one massive caveat.

Imagine how a transformer model might answer the same question if it were fine-tuned on thousands of medical notes that put chief complaints first. Consider two types of prompt orderings to the model:

Ordering 1 (symptoms early): "Patient presents with RLQ pain and rebound tenderness. Temperature 38.5°C..."

  • Strong extraction: Symptoms provided early on, captures migration pattern, inflammatory markers, physical findings

  • High confidence in appendicitis diagnosis

Ordering 2 (symptoms late): "Patient is a 25-year-old who came in today. Previously healthy. Works as a teacher. No medications. Family history unremarkable. Ate normally yesterday. Today noted some discomfort. Eventually developed RLQ pain. Later found to have fever 38.5°C..."

  • Critical symptoms appear at positions 50+ where the model expects background information

  • Weak extraction: might miss the migration pattern, underweight the fever

  • Low confidence, might suggest gastroenteritis instead

The Bayesian posterior P(appendicitis | symptoms) should be identical in both cases because the sufficient statistics are the same, but the transformer gives different predictions based on where these facts appear in the sequence.

This is the Extraction Pathway Problem: the same evidence [A, B, C, D] gets processed through different position-dependent pathways depending on its ordering, and each pathway has a different extraction strength based on what the model learned during training.
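
To see the pathway problem in miniature, here is a small, self-contained probe. The `score` function is a toy stand-in that mimics position-dependent extraction (it is not our model or any real API); in practice you would swap in a call that returns your model's probability for the diagnosis given a prompt built from the ordered facts:

```python
# Toy probe for order sensitivity. `score` mimics position-dependent
# extraction: evidence at position i is weighted by 0.85**i, so the same
# facts yield different scores depending on where they appear.
import random

FACTS = [
    "pain migrated from the umbilicus to the RLQ",   # diagnostic
    "temperature 38.5 C",                            # diagnostic
    "WBC 14,000/uL",                                 # diagnostic
    "rebound tenderness on exam",                    # diagnostic
    "previously healthy, works as a teacher",        # background
    "no regular medications",                        # background
]
DIAGNOSTIC = set(FACTS[:4])

def score(ordered_facts: list[str]) -> float:
    """Position-biased toy 'model' standing in for a real scoring call."""
    extracted = sum(0.85 ** i for i, f in enumerate(ordered_facts) if f in DIAGNOSTIC)
    best_case = sum(0.85 ** i for i in range(len(DIAGNOSTIC)))
    return extracted / best_case   # 1.0 only when diagnostic facts come first

rng = random.Random(0)
predictions = []
for _ in range(12):
    order = FACTS[:]
    rng.shuffle(order)
    predictions.append(score(order))
print(f"spread across orderings: {max(predictions) - min(predictions):.2f}")
# An exchangeable (Bayesian) reader of the same facts would show a spread of ~0.
```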

Now let's understand exactly what price we pay for having positional encodings on top of the transformer's attention mechanism. The total prediction error (log-loss) can be split into two parts:

Total Error = Irreducible Error + Position-Induced Error

  • Irreducible error H(Y|T): the fundamental uncertainty that even a perfect Bayesian reasoner faces; it is the entropy of the outcome given the sufficient statistics. No model can do better than this: it is the information-theoretic limit.

  • Position-induced error: How much worse we do because positional encodings prevent clean extraction of T(X).

Mathematically, the whole identity is a risk decomposition: it shows how the expected log-loss of a position-aware predictor relates to the entropy of the task plus a residual KL divergence term.

\(\mathbb{E}_{X,Y,\pi}\!\left[-\log p_\theta\!\left(Y \mid \Gamma_\pi(X)\right)\right] = H(Y \mid T) \;+\; \mathbb{E}_{X,\pi}\, \mathrm{KL}\!\big(P(\cdot \mid T(X)) \,\|\, p_\theta(\cdot \mid \Gamma_\pi(X))\big).\)

The KL term on the right-hand side is where it gets interesting: it measures how far the model's predictions are from the true, permutation-invariant Bayesian predictive. From this we derive the heart of the paper.
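
Before moving on, here is a small numerical sanity check of the identity on a made-up discrete task (binary outcome, invented probabilities, two orderings). Under the sufficiency assumption that Y depends on X only through T(X), the two sides match exactly:

```python
# Numerical check of the risk decomposition on a made-up discrete task.
# x indexes an evidence set, pi an ordering; all probabilities are invented.
import math

T = {"x1": "t_a", "x2": "t_a", "x3": "t_b"}          # sufficient statistic T(x)
P_x = {"x1": 0.5, "x2": 0.25, "x3": 0.25}            # P(X = x)
P_y1 = {"t_a": 0.7, "t_b": 0.2}                      # true posterior P(Y=1 | T)
P_pi = {"pi1": 0.5, "pi2": 0.5}                      # uniform over orderings
q = {("x1", "pi1"): 0.80, ("x1", "pi2"): 0.60,       # position-aware model
     ("x2", "pi1"): 0.75, ("x2", "pi2"): 0.55,       # p_theta(Y=1 | Gamma_pi(x))
     ("x3", "pi1"): 0.30, ("x3", "pi2"): 0.15}

def H(p):      # Bernoulli entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def KL(p, r):  # KL(Ber(p) || Ber(r)) in nats
    return p * math.log(p / r) + (1 - p) * math.log((1 - p) / (1 - r))

def xent(p, r):  # expected log-loss E_{Y~Ber(p)}[-log Ber(r)(Y)] in nats
    return -p * math.log(r) - (1 - p) * math.log(1 - r)

lhs = sum(P_x[x] * P_pi[pi] * xent(P_y1[T[x]], q[x, pi]) for x in P_x for pi in P_pi)
rhs = (sum(P_x[x] * H(P_y1[T[x]]) for x in P_x)
       + sum(P_x[x] * P_pi[pi] * KL(P_y1[T[x]], q[x, pi]) for x in P_x for pi in P_pi))
print(f"E[-log p_theta] = {lhs:.6f} nats,  H(Y|T) + E[KL] = {rhs:.6f} nats")
```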

The risk decomposition above gives rise to a positional Jensen penalty that quantifies this inefficiency: the extra compression cost of giving different answers for equivalent evidence. It represents model capacity wasted on position-specific patterns rather than on robust extraction of the sufficient statistics that actually matter.

\(J_\Gamma(P,\theta) \;=\; \underbrace{\mathbb{E}_\pi\, \mathrm{KL}\!\big(P(\cdot \mid T(X)) \,\|\, p_\theta(\cdot \mid \Gamma_\pi(X))\big)}_{\text{average KL over orderings}} \;-\; \underbrace{\mathrm{KL}\!\big(P(\cdot \mid T(X)) \,\|\, \bar{p}_\theta(\cdot \mid X)\big)}_{\text{KL to the averaged model}},\)

where \(\bar{p}_\theta(\cdot \mid X) = \mathbb{E}_\pi\, p_\theta(\cdot \mid \Gamma_\pi(X))\) is the permutation-averaged (mixture) prediction.

In our paper, the KL divergence measures the "cost" (in nats) of using the wrong probability distribution.

If the true probability of appendicitis is 0.7 but your model says 0.5, you pay a penalty of KL(0.7 || 0.5) ≈ 0.08 nats. If your model says 0.3, you are further from the truth and pay a higher cost: KL(0.7 || 0.3) ≈ 0.34 nats.

The first term measures the average cost across all possible orderings. With our appendicitis example, let's consider three different orderings of the information in the prompt:

  • Ordering π₁ (symptoms early): model predicts 0.8. The model overshoots because the symptoms appear in "expected" positions where its extraction pathways are strongest. The KL divergence from truth to model is KL(0.7 || 0.8) ≈ 0.028 nats

  • Ordering π₂ (symptoms middle): model predicts 0.6. The model undershoots slightly because some symptoms are in suboptimal positions. The KL divergence is KL(0.7 || 0.6) ≈ 0.022 nats

  • Ordering π₃ (symptoms late): model predicts 0.3. The model severely undershoots because critical symptoms appear in positions where extraction pathways are weak. The KL divergence is KL(0.7 || 0.3) ≈ 0.339 nats

Average KL = (0.028 + 0.022 + 0.339) / 3 ≈ 0.130 nats. This 0.130 nats represents the average inefficiency when the model uses position-dependent predictions instead of consistently extracting the sufficient statistics.

For the second term, instead of looking at each ordering separately, we first average the model's predictions: p̄_θ = (0.8 + 0.6 + 0.3) / 3 ≈ 0.567. Then we measure the KL divergence from truth to this average: KL(0.7 || 0.567) ≈ 0.038 nats. This is the cost if we could somehow always use the averaged prediction.

Putting Term 1 and Term 2 back into our original formula: J_Γ ≈ 0.130 - 0.038 ≈ 0.092 nats. This 0.092 nats is the extra cost we pay because the model gives different answers for different orderings.
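
The arithmetic is easy to reproduce: a few lines computing the Bernoulli KL in nats recover both terms and the gap (0.8, 0.6 and 0.3 are the same invented predictions as above).

```python
# Reproduce the worked Jensen-gap numbers (all values in nats).
import math

def kl_bern(p: float, r: float) -> float:
    """KL(Ber(p) || Ber(r)) in nats."""
    return p * math.log(p / r) + (1 - p) * math.log((1 - p) / (1 - r))

truth = 0.7                      # true posterior P(appendicitis | T(X))
preds = [0.8, 0.6, 0.3]          # model outputs under the three orderings

term1 = sum(kl_bern(truth, q) for q in preds) / len(preds)   # ~0.130 (average KL)
q_bar = sum(preds) / len(preds)                              # ~0.567
term2 = kl_bern(truth, q_bar)                                # ~0.038 (KL to average)
print(f"Jensen penalty ~ {term1 - term2:.3f} nats")          # ~0.092
```

Note that if the model always answered with the averaged prediction p̄_θ, Term 1 would collapse to Term 2 and the penalty would be zero; that is the structural point behind the permutation mixtures discussed next.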

Why does any of this matter for our work? First, because we established that the positional Jensen penalty measures the inefficiency of position-dependent predictions, and when you implement permutation mixtures (through the multi-branch approach) this penalty structurally vanishes. This is because:

  • Term 1: The average KL over orderings stays roughly constant

  • Term 2: The KL to the averaged model decreases as the average gets better

The key takeaway: by including more permutations, the penalty shrinks toward zero.

Our paper shows that, in expectation over permutations, this mixture achieves a compression rate of H(Y|T) + o(1). Since the o(1) term vanishes as we add more data, the compression rate converges to H(Y|T), which is MDL-optimal.

This shows that transformers achieve optimal compression in expectation. But what happens for specific events, especially rare ones? When does the gap between expectation and realisation force the model to hallucinate? This is where we shift from theoretical optimality to practical failure modes.

Transformers achieve optimal compression H(Y|T) when averaged over permutations. But MDL optimality means efficient information use, not high confidence for everything. A model correctly predicting 10% probability for a rare disease is being optimal by accurately representing how uncertain it is.

The problem arises when applications demand higher confidence than evidence supports. If medical diagnosis requires 95% certainty but the symptoms only justify 10%, we face an unbridgeable gap. The model must either abstain or hallucinate details unsupported by evidence.

The Expectation-level Decompression Law (EDFL) quantifies exactly when this gap forces hallucination, establishing the precise information threshold for any target reliability.

At its core, hallucination is a binary outcome: either the model generates content grounded in the extracted evidence (success) or it fabricates information (failure). We formalise this using Bernoulli distributions Ber(p), where p represents the probability of success.

In our framework:

  • Ber(p): Target reliability (e.g., we want 95% confidence the answer is correct)

  • Ber(q̄): Model's average confidence across permutations (e.g., model is only 10% confident for a rare event)

EDFL tells us exactly how many bits of information are needed to bridge this gap from q̄ to p. When positional extraction failures leave the model with insufficient information to reach the target reliability, it faces an impossible task: maintain fluent generation without the evidence to support it. This information shortfall is what forces hallucination.

Theorem 5 in our paper establishes a fundamental lower bound on the information needed for reliable prediction.

\(\Delta := \mathbb{E}_\pi[\mathrm{KL}(P \| S_\pi)] \geq \mathrm{KL}(\mathrm{Ber}(p) \| \mathrm{Ber}(\bar{q}))\)

This expectation includes the variation caused by the Jensen gap. When the Jensen gap is large:

  1. Predictions vary more across orderings (high variance in S_π)

  2. Some orderings will be very far from P (large KL for those orderings)

  3. The average Δ increases, making it harder to achieve target reliability

For rare events where q̄ is small, even small position-induced variance (from the Jensen gap) can push Δ below the critical threshold, forcing hallucination.

Concrete example:

You want 95% confidence that the answer is correct, Ber(0.95), but the model's average confidence is only 20%, Ber(0.20), so the KL divergence is KL(Ber(0.95) || Ber(0.20)) ≈ 1.34 nats.

This means you need at least 1.34 nats of additional information to lift the model's confidence from its average of 20% to your target of 95%. If the information budget Δ is less than 1.34 nats (due to poor positional extraction), the model cannot achieve reliable performance and must hallucinate the continuation.
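
Here is a minimal sketch of that check; the permutation-averaged confidence q̄ and the information budget Δ are assumed to come from your own measurement pipeline, and the numbers below are purely illustrative:

```python
# Minimal EDFL-style check: does the information budget cover the gap from
# the averaged confidence q_bar to the target reliability? q_bar and
# delta_nats are assumed to come from your own measurement pipeline.
import math

def kl_bern(p: float, r: float) -> float:
    """KL(Ber(p) || Ber(r)) in nats."""
    return p * math.log(p / r) + (1 - p) * math.log((1 - p) / (1 - r))

target_p, q_bar = 0.95, 0.20
required = kl_bern(target_p, q_bar)       # ~1.34 nats needed to reach 95%
delta_nats = 0.9                          # illustrative budget only
print(f"need {required:.2f} nats, have {delta_nats:.2f} nats")
print("answer" if delta_nats >= required else "abstain (or risk hallucinating)")
```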

The profound insight from EDFL is that a single ordering can be deeply misleading about whether the model has enough information to avoid hallucination. Here's why averaging over permutations solves this fundamental problem:

The Single-Order Trap

When you look at just one ordering of your evidence, you're seeing the result of a positional lottery. If critical evidence happens to land in positions where extraction pathways are strong, the model might output q_π = 0.8 confidence. You might think "great, we're well above the threshold for reliable generation!" But this is position-induced overconfidence: the actual information content might only justify 0.2 confidence.

Why Averaging Reveals Truth

Averaging over permutations does something remarkable: it cancels out the positional noise to reveal the MDL-optimal information content the model actually extracted from the evidence. The average q̄ represents what the model can reliably extract regardless of how the evidence happens to be ordered. This gives you:

  1. A guaranteed lower bound: The bits-to-trust calculation using q̄ tells you the minimum information needed across all possible orderings, not just the one you happened to see.

  2. Robustness to reformatting: Minor changes to prompt structure won't suddenly cause hallucinations if your information budget meets the averaged threshold.

  3. A principled decision boundary: If your information budget Δ is below KL(Ber(p) || Ber(q̄)), then some orderings will inevitably fail, even if your current ordering looks fine.

The Practical Implementation

The remarkable finding is that you don't need all n! permutations to get a reliable estimate. Even 8-20 random shuffles shrink the estimation error roughly like k^(-1/2), quickly converging to a stable q̄ that reveals the true information content. This makes the approach computationally feasible while maintaining theoretical rigour.
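
A sketch of that k-shuffle estimate is below; `model_confidence` is a placeholder for whatever call returns your model's confidence for one ordering of the evidence, not an API from the paper:

```python
# Sketch of the k-shuffle estimate of q_bar. `model_confidence` is a
# placeholder for whatever call scores one ordering of the evidence.
import math
import random
import statistics

def estimate_q_bar(facts, model_confidence, k=16, seed=0):
    """Average the model's confidence over k random orderings.
    The standard error of the estimate shrinks roughly like k**-0.5."""
    rng = random.Random(seed)
    confidences = []
    for _ in range(k):
        order = list(facts)
        rng.shuffle(order)
        confidences.append(model_confidence(order))
    q_bar = statistics.fmean(confidences)
    stderr = statistics.stdev(confidences) / math.sqrt(k) if k > 1 else float("inf")
    return q_bar, stderr

# Usage: q_bar, se = estimate_q_bar(evidence_facts, my_scoring_fn, k=16), then
# compare KL(Ber(target_p) || Ber(q_bar)) against the information budget as above.
```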

The Bottom Line

EDFL with permutation averaging transforms hallucination detection from hoping you got lucky with positioning to knowing you have sufficient information regardless of order. It's the difference between "this might work if the evidence stays in favourable positions" and "this will work because we have enough information in the evidence itself."
