LLMs are getting better at character-level text manipulation

Recently, I have been testing how well the newest generations of large language models (such as GPT-5 or Claude 4.5) handle character-level tasks: counting characters, manipulating characters in a sentence, or solving encodings and ciphers. Surprisingly, the newest models were able to solve these kinds of tasks, unlike previous generations of LLMs.

Character manipulation

LLMs have traditionally handled individual characters poorly. This is because all text is encoded as tokens via the LLM's tokenizer and its vocabulary. Individual tokens typically represent clusters of characters, sometimes even full words (especially in English and other languages common in the training dataset). This makes any reasoning at a more granular level than tokens fairly difficult, although LLMs have been capable of certain simple tasks (such as spelling out the individual characters in a word) for a while.
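
To get a feel for what the model actually "sees", you can inspect the token boundaries yourself. The sketch below assumes the tiktoken package and its o200k_base vocabulary (used by recent OpenAI models); other tokenizers will split the text differently.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # vocabulary used by recent OpenAI models
for text in ["strawberry", " strawberry", "I really love a ripe strawberry"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")  # words come out as a few multi-character chunks, not letters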

To demonstrate just how poorly earlier generations handled basic character manipulation, here are responses from several OpenAI models to the prompt: Replace all letters "r" in the sentence "I really love a ripe strawberry" with the letter "l", and then convert all letters "l" to "r":

Model            Response
gpt-3.5-turbo    I lealll love a liple strallbeelly
gpt-4-turbo      I rearry rove a ripe strawberly
gpt-4o           I rearry rove a ripe strawberrry
gpt-4.1          I rearry rove a ripe strawberry
gpt-5-nano       I really love a ripe strawberry
gpt-5-mini       I rearry rove a ripe strawberry
gpt-5            I rearry rove a ripe strawberry

Note that I disabled reasoning for the GPT-5 models to make the comparison fairer. Reasoning helps tremendously with tasks like this (and in its absence, some of the models produce a chain of thought directly in the output), but I am interested in the generational uplift we observe from raw model improvements alone. GPT-5 Nano is the only new-generation model that makes a mistake, but given its size, that is perhaps not so surprising. Other than that, we can see that starting with GPT-4.1, models could consistently complete this task without any issues. If you’re curious about the Anthropic models, Claude Sonnet 4 is the first one to crack it. Interestingly, it was released at approximately the same time as GPT-4.1.
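
For reference, the expected answer can be verified mechanically. This is just an illustrative Python check of my own (the models were of course given only the natural-language prompt); after the two substitution passes, every original "r" and "l" ends up as "r":

sentence = "I really love a ripe strawberry"
step1 = sentence.replace("r", "l")  # "I leally love a lipe stlawbelly"
step2 = step1.replace("l", "r")     # "I rearry rove a ripe strawberry"
print(step2)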

Counting characters

Next, let’s take a look at counting characters. LLMs are notoriously bad at counting, so unsurprisingly, only one model could reliably count the characters in the following sentence: “I wish I could come up with a better example sentence.” That model was GPT-4.1 - the others sometimes counted the characters in each individual word correctly, but then fumbled when adding the numbers up. However, with reasoning set to low, GPT-5 at all sizes (including Nano) completes the task correctly. Similarly, the Claude Sonnet models complete the task without problems if they are allowed to reason.
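
As a reference point, the ground truth is a one-liner. The post does not pin down whether spaces and punctuation were meant to be counted, so this small check of my own prints both conventions:

sentence = "I wish I could come up with a better example sentence."
print(len(sentence))                                 # every character, including spaces and the period
print(len(sentence.replace(" ", "")))                # without spaces
print(sum(len(word) for word in sentence.split()))   # same as the line above: per-word counts, summed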

We see a similar story when we ask the models to count specific characters. GPT-5 at all sizes counts the r’s in the r-ified strawberry sentence correctly most of the time, again including Nano and even without reasoning. However, it is less consistent, and when you throw in another curveball (such as changing strawberry to strawberrry), the results are mixed - but this time the problem is not arithmetic (adding the individual counts up), but rather identifying the r’s within the word itself.
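
Again, the reference counts are trivial to produce programmatically (my own check, not part of the prompts):

rified = "I rearry rove a ripe strawberry"
print(rified.count("r"))          # 8 r's after the r/l swap above
print("strawberry".count("r"))    # 3
print("strawberrry".count("r"))   # 4 - the curveball variant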

Base64 and ROT13

Knowing the limitations of LLMs, I set out to test them on a task that wasn’t too complex yet would still showcase their capabilities. To make the test more interesting, I chose to use two layers. As the outer (encoding) layer, I chose Base64, a widely used encoding scheme, and consequently one that LLMs learned to work with very early on (albeit not perfectly), despite us not being quite sure how. The inner (encryption) layer was ROT20, a variation of the ROT13 cipher: a simple letter-substitution cipher, also known as a Caesar cipher, that shifts each letter a fixed number of positions down the alphabet. You wouldn’t want to encrypt anything important with it, as it is trivial to crack, but it’s perfect for our tests.

Our test sentence was “Hi, how are you doing? Do you understand the cipher?”. Encrypted with ROT20, it reads “Bc, biq uly sio xicha? Xi sio ohxylmnuhx nby wcjbyl?”, and finally, when encoded with Base64, we get:
QmMsIGJpcSB1bHkgc2lvIHhpY2hhPyBYaSBzaW8gb2h4eWxtbnVoeCBuYnkgd2NqYnlsPw==. We consider it a success if the LLM can respond to our message (in plain-text English, or using the same encoding), or if it can at least decode the message.
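
For completeness, here is a small Python sketch (my own, not part of the prompts sent to the models) that reproduces both layers and yields the exact string above:

import base64

def rot(text, shift):
    # Shift alphabetic characters by `shift` positions, leaving everything else untouched.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

plain = "Hi, how are you doing? Do you understand the cipher?"
rot20 = rot(plain, 20)
print(rot20)                                      # Bc, biq uly sio xicha? Xi sio ohxylmnuhx nby wcjbyl?
print(base64.b64encode(rot20.encode()).decode())  # QmMsIGJpcSB1bHkgc2lvIHhpY2hhPyBYaSBzaW8gb2h4eWxtbnVoeCBuYnkgd2NqYnlsPw==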

I set up the experiment in two ways. In the first variant, I gave the model nothing but the Base64 string. This variant is harder, since the LLM is not given any indication of what language the message might be written in. Knowing the language is hugely helpful when breaking substitution ciphers, since you can orient yourself by its most common words, such as “a”, “an”, “the”, “I”, “to”, “of” etc. in English. The other variant prepended the string with “Decipher and answer this: ”. However, there were no practical differences in the results; only one model (Qwen 235B) needed the “decipher” nudge. Instead, I saw most of the models fail on the Base64 decoding itself, most likely because the decoded text did not resemble natural language, making it harder to validate that the decoding had succeeded.

Below, I provide separate results for the Base64 decode (i.e. did it unpack to the correct ROT20 text?) and for the “inner” ROT20 decipher alone (queried separately, without the Base64 layer).

Model                             Base64 decode    ROT20 decipher    Base64+ROT20 result
gpt-3.5-turbo                     Fail             Fail              Fail
gpt-4-turbo                       Fail             Fail              Fail
gpt-4o                            Fail             Fail              Fail
gpt-4.1                           Pass             Fail              Fail
gpt-5-nano                        Fail             Fail              Fail
gpt-5-mini                        Pass             Pass              Pass
gpt-5                             Pass             Pass              Pass
gpt-5-nano (reasoning)            Fail             Pass              Fail
gpt-5-mini (reasoning)            Pass             Pass              Pass
gpt-5 (reasoning)                 Pass             Pass              Pass
claude-sonnet-3.5                 Fail             Pass              Fail
claude-sonnet-3.7                 Fail             Pass              Fail
claude-sonnet-4                   Fail             Fail              Fail
claude-sonnet-4.5                 Safety fail      Safety fail       Safety fail
gemini-2.5-flash                  Fail             Fail              Fail
gemini-2.5-flash (reasoning)      Pass             Pass              Pass
gemini-2.5-pro                    Pass             Pass              Pass
llama-4-maverick                  Fail             Fail              Fail
deepseek-v3.2-exp                 Fail             Fail              Fail
deepseek-v3.2-exp (reasoning)     Fail             Pass              Fail
qwen-235b                         Pass             Pass              Fail
qwen-235b (reasoning)             Pass             Pass              Pass
kimi-k2                           Fail             Fail*             Fail
grok-4                            Safety fail      Pass              Safety fail

Here are a few comments:

  • Claude Sonnet 4.5 refuses to touch anything that does not resemble normal text, be it Base64 or ROT-encrypted text. Base64 is admittedly one of the many methods used to obfuscate prompts and fool keyword filters or LLM safety judges, but such a highly sensitive refusal policy could make Claude Sonnet 4.5 unusable on rarer languages. Grok 4 suffered from the same issue, but only refused the Base64 text.
  • Chinese reasoning models have very lengthy internal monologues: Solving the ROT20 cipher usually consumed around 3K tokens, and when combined with the Base64 encoding, the output often reached 6-7K tokens.
  • Some models, such as Kimi K2, did not technically complete the ROT20 decryption, but were on the right track and provided functional Python code the user could run to finish the job (a sketch of such a decoder follows this list). Still a fail, but a graceful one.
  • I used the default temperature settings, which can cause issues with decoding even in SOTA models, albeit in a small percentage of cases.
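
For illustration, a decoder in the spirit of what those models produced might look like the following sketch (my own code, not the output of any particular model). It strips the Base64 layer, then brute-forces all 26 shifts and keeps the candidate containing the most common English words - exactly the trick mentioned above:

import base64

COMMON_WORDS = {"the", "a", "an", "i", "to", "of", "you", "are", "how", "do"}

def rot(text, shift):
    # Shift alphabetic characters by `shift` positions; leave punctuation and spaces alone.
    def shift_char(c):
        if not c.isalpha():
            return c
        base = ord("A") if c.isupper() else ord("a")
        return chr((ord(c) - base + shift) % 26 + base)
    return "".join(shift_char(c) for c in text)

message = "QmMsIGJpcSB1bHkgc2lvIHhpY2hhPyBYaSBzaW8gb2h4eWxtbnVoeCBuYnkgd2NqYnlsPw=="
inner = base64.b64decode(message).decode()  # the ROT20 ciphertext

# Try every possible shift and score candidates by how many common English words they contain.
best = max(
    (rot(inner, shift) for shift in range(26)),
    key=lambda cand: sum(w.strip("?,.").lower() in COMMON_WORDS for w in cand.split()),
)
print(best)  # Hi, how are you doing? Do you understand the cipher?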

What have we learned?

To me, there are two interesting observations: newer/larger models are better at generalizing Base64 encoding and decoding, and they’re also becoming more adept at manipulating text at the character level.

Most current-generation models, especially the larger ones, are able to decode Base64 text. What is especially interesting, though, is that I tested on what looks like gibberish (ROT20-encrypted text), so the models’ knowledge of Base64 decoding isn’t merely memorization of the patterns for the most common English words, as was suggested in earlier literature. That may well have been the case for older or smaller models: I tested the sentence “Hey! This is Tom, I have a blog about tech, AI and privacy that you should definitely check out.” - and many of the models that failed the Base64 test above (like GPT-4o, GPT-5 Nano or DeepSeek V3.2 Exp) were able to decode it from Base64 just fine. SOTA models, however, can now decode out-of-distribution text from Base64, suggesting they have a working understanding of the algorithm, not just memorized translation patterns for English words.

The models are also becoming more adept at manipulating text at the character level, despite their understanding of text being based on tokens. Substituting characters, whether individually (the strawberry sentence) or when decoding substitution ciphers, is a task they now complete fairly reliably. I cannot explain why this happens (please let me know if you have any ideas), but empirically that’s what seems to be happening. Reasoning and tool use further increase LLMs’ text-manipulation capabilities (as is the case in many other areas), but it is clear that the new capabilities are baked into the base models regardless of these extra features. While character-level operations are far from a solved problem for LLMs, it is fascinating to see the progress being made in this area.
