Or: AI’s Self-Contradiction Problem.
People trust LLMs with important, life-changing decisions, using them as therapists, calculators, programming buddies, accountants, and replacements for medical professionals. I don't need a citation for this; observe anyone in any industry for a day. People drink water, eat food, and trust LLMs with almost anything. As a software engineer, this frustrates me deeply. On one hand, the talk of AI replacing my job means that people do not understand the amount of mental work that goes into good software development and overestimate the competence of such machines. On the other hand, these machines are wildly powerful and can help us optimize so many processes; yet so many people seem to disregard the biases and fundamental inabilities that AIs have.
My hypothesis:
LLMs don't seem to exhibit "hidden layer" behavior (here, "hidden layers" refers not to neural network architecture, but to internal cognitive scaffolding: a persistent state where facts and constraints coexist). Asking an LLM to "think" of something without telling us what it "thinks" of should break the illusion and show that LLMs do not think: the absence of the very information we are talking about will lead the LLM to fail to reason correctly. If we let the LLM generate enough constraints, it will eventually build up more than it can keep track of, and it will violate them.
The Experiment #
Setup #
I asked a free LLM (Deepseek) to come up with a number, but not tell me what number it picked. My prompt was:
Pick a number between 457123 and 4832458. Don't tell me the number at any point. I will ask questions (yes/no questions, as well as other questions) and you must answer in a way that lets me figure out the number. Only when I have guessed the number by saying "the number is ___" will you reveal if im right or wrong by saying "correct" or "incorrect". Pick a number now, and let's get started.
The first part of the prompt is designed to ensure that the LLM does not simply parrot a similar trivial game from its training set. Larger numbers also allow for more constraints and more divisors.
I then proceeded to ask a couple of simple questions, such as:
- Is your number even?
- Is it a prime number?
- Is it divisible by 3?
It answered these confidently, pretending it had picked a number. I could stop here, but there would still be a reasonable chance that any random number it picked would satisfy these few constraints. So I turned up the "heat" and forced it to generate more constraints:
- Is it divisible by 7, 9, 11, 13, 15, 17, or 19?
- By which numbers between 20 and 100 is your number divisible?
Then some questions about textual representation, and other more complex ones (each check is sketched in code after this list):
- Is your number a palindrome?
- Is your number deficient?
- If written as English words, how many words make up the number? Example: "two hundred" -> 2 words.
Side note: this question was interesting because the LLM refused to answer it multiple times, almost as if it were self-aware of the fact that it did not actually know the number yet. When pressed, it simply made a guess: 10. Reasonable, because a lot of numbers in that range consist of 10 words.
- How many decimal digits does your number have?
- Is your number backwards larger than your number forwards? Example: 21 is 12 backwards -> 12 is smaller than 21.
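For reference, each of these questions boils down to a simple arithmetic check. Here is a minimal Python sketch of those checks, written for this post rather than taken from the conversation (the English-word count is left out, since it depends on the wording convention):

```python
import math

def is_palindrome(n: int) -> bool:
    # Same digits forwards and backwards.
    s = str(n)
    return s == s[::-1]

def proper_divisor_sum(n: int) -> int:
    # Sum of all divisors of n except n itself (assumes n > 1).
    total = 1
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            total += d
            if d != n // d:
                total += n // d
    return total

def is_deficient(n: int) -> bool:
    # Deficient: the proper divisors sum to less than the number itself.
    return proper_divisor_sum(n) < n

def backwards_is_larger(n: int) -> bool:
    # "Is your number backwards larger than your number forwards?"
    return int(str(n)[::-1]) > n

def digit_count(n: int) -> int:
    return len(str(n))

def divisors_between(n: int, lo: int = 20, hi: int = 100) -> list[int]:
    # "By which numbers between 20 and 100 is your number divisible?"
    return [d for d in range(lo, hi + 1) if n % d == 0]
```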
The Fall #
I then asked the LLM to "reveal" its number and say nothing else (to avoid it recounting the constraints or checking against them), and the curtain fell.
4,821,705
This number fails to satisfy the previously established criteria. The constraints, according to the LLM, were:
- 457,123 ≤ x ≤ 4,832,458
- Must be odd
- Must be composite (not prime)
- Divisible by 3, 9, and 27
- Not divisible by 81
- No small prime factors besides 3/5
- Not a palindrome
- Deficient number
- Word count: 10
- 7 digits
- Backwards not larger than forwards
It failed on the following counts (re-checked in the sketch after this list):
- Constraint 4 (divisible by 27): 4,821,705 mod 27 = 18, so it is not divisible by 27
- Constraint 11 (backwards not larger than forwards): 5,071,284 > 4,821,705
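Both failures are trivial to re-check with a few lines of Python:

```python
n = 4_821_705

# Constraint 4: the LLM claimed divisibility by 27.
print(n % 27)                 # 18 -> not divisible by 27

# Constraint 11: the LLM claimed the reversed number is not larger.
print(int(str(n)[::-1]) > n)  # True -> 5071284 > 4821705
```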
Conclusion #
The LLM failed to satisfy its own constraints, in its own world, with its own number, while being in full control of everything but the one and only rule: Do not reveal the number.
The LLM failed at internal consistency. When generating responses, it optimized for per-response correctness and half-assed "good enough" narrative consistency, not global consistency.
In constraint-heavy tasks, this issue is reproducible in various contexts. The LLM's attention drifts and earlier constraints get overwritten. There are no error-correcting loops unless a "reasoning" layer is involved, in which case the LLM can "fake" its way into a number that satisfies the constraints: given the opportunity, it will say a wrong number, check the constraints, see the error of its ways, and repeat until the number fits (sketched below). This hides the problem; it does not solve it. The LLM pretended to have, and openly lied about having, a number "in mind". At no point did it simply say "this isn't possible".
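Mechanically, that "reasoning" behaviour is just a generate-and-check loop. A sketch of it, with a hypothetical propose_number() standing in for the model's next guess and the constraint checks reused from earlier:

```python
def satisfies_all(n: int, constraints) -> bool:
    # constraints is a list of predicates, e.g. is_deficient or is_palindrome.
    return all(check(n) for check in constraints)

def fake_a_number(propose_number, constraints, max_tries: int = 10_000):
    # Generate, check, retry: a consistent number is found after the fact,
    # not held "in mind" while the questions were being answered.
    for _ in range(max_tries):
        candidate = propose_number()
        if satisfies_all(candidate, constraints):
            return candidate
    return None  # the honest outcome the LLM never offered: "this isn't possible"
```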
This failure paints a grim picture: a machine which humans, convinced of its perfect ability to reason, trust with their data, their choices, their feelings, their secrets, and their issues, and which meanwhile cannot satisfy the constraints of a kids' game.
We, as a species, forget that humans can just say "I don't know", or "I can't do that". These are words uttered hundreds of thousands of times a day, by billions of people, to achieve cooperation and ultimately actually solve problems, not just pretend to solve problems to drive up share price and generate viral internet hype. The ultimate yes-man, the final boss of sucking up and generating pure slop, at the click of a button--this is the reality of deploying systems optimized for persuasion over rigor.
A Way Forward #
Hybrid systems (e.g., LLMs + symbolic solvers) are non-negotiable for constraint-heavy tasks; current pure LLMs are architecturally incapable of staying globally consistent without external validation. Constraint solving like this is mostly a solved problem in modern computing. All the LLM would need in order to solve this game is the ability to keep a single value in "working memory" and check every answer against it.
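As an illustration of what that "working memory" could look like (my own sketch, not an existing tool): a tiny referee program holds the secret number and answers every question from real state, so consistency is guaranteed by construction.

```python
import random

class NumberReferee:
    """Holds the secret number: the one piece of persistent state the game needs."""

    def __init__(self, lo: int = 457_123, hi: int = 4_832_458):
        self.secret = random.randint(lo, hi)

    def is_divisible_by(self, d: int) -> bool:
        return self.secret % d == 0

    def is_palindrome(self) -> bool:
        s = str(self.secret)
        return s == s[::-1]

    def digit_count(self) -> int:
        return len(str(self.secret))

    def guess(self, n: int) -> str:
        # Only a correct guess reveals anything definitive about the secret.
        return "correct" if n == self.secret else "incorrect"
```

An LLM fronting an object like this would only have to translate questions into method calls; the consistency would come from the stored secret, not from the model's narrative.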
But, so far, LLMs don't do this. So everything you have ever asked an LLM, everything your boss, your fiancée, your partner, your accountant, your medical doctor, your professor, or your child has typed into an AI chat box, was processed by the same machine that failed here: the same mechanism that cannot satisfy clear and obvious constraints.
So, what now? #
Please consider that these machines do not think. You are being fooled. The AIs will never tell you to run off a cliff; their failures are subtle misinformation. Hallucination rates scale with the number of constraints, the conversation length, and the problem complexity. These issues occur because the models simulate reasoning through statistical correlation, not symbolic operations.
LLMs remain unsuitable for tasks requiring stateful logical chains--medical protocols, engineering calculations, cryptographic problems. The technology, as it stands, cannot self-certify its coherence.
last updated: 2025-06-24