The Country That Broke Kotlin


Logic vs language: how a Turkish alphabet bug played a years-long game of hide-and-seek inside the Kotlin compiler


When Turkish software engineer Mehmet Nuri Öztürk posted a short message on the Kotlin discussion forum in March of 2016, he had no idea he was reporting a dangerous standard library bug that would take five years to find and fix. All he knew was that his build didn’t work.

Kotlin 1.0 had been released to the world only a month earlier, promising to breathe much-needed fresh life into the twin worlds of Java and Android development. But for Mehmet Nuri, the new programming language was a frustrating dead end. His code simply wouldn’t build, and the compiler’s output gave him nothing to go on.

He pasted the impenetrable error into his forum post:

Compilation completed with 2 errors and 0 warnings in 10s 126ms

Error:Kotlin: Unknown compiler message tag: INFO
Error:Kotlin: Unknown compiler message tag: LOGGING

The Kotlin team replied quickly, but they didn’t have much to go on either. “Did you see this error just once, or do you see it every time you compile your project?”

It was consistent, Mehmet Nuri replied, not just between builds but also across different machines and operating systems.

It was a full five months before the breakthrough discovery. Muhammed Demirbaş, another programmer working in Turkey, had been running into the same mysterious build failure message, and had started to do some investigation of his own.

“I suspect that the source of the error may be my locale or language,” wrote Muhammed, commenting on Mehmet Nuri’s post. Muhammed even pinpointed the exact line of code where he thought the problem might be. “Apparently this is a uppercase–lowercase Turkish I problem in the CompilerOutputParser.CATEGORIES map: I -> ı, İ -> i.”

This was proof that Mehmet Nuri’s problems weren’t confined to his particular project, but were a symptom of something more serious going on in the compiler itself. The Kotlin team, grateful for Muhammed’s new information, filed an issue report with a link to the forum post in their YouTrack bug tracker:

“Compilation fails on Turkish locale because of locale-sensitive uppercasing.” (KT-13631)

Muhammed Demirbaş couldn’t have been more spot on in his investigation and assessment of the compiler bug. Since Kotlin is open source, he was able to search the compiler’s code for the exact line of code where that “Unknown compiler message tag” string appears:

val qNameLowerCase = qName.toLowerCase()
var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]
if (category == null) {
    messageCollector.report(ERROR, "Unknown compiler message tag: $qName")
    category = INFO
}

So what does this code do, and why does it sometimes go wrong?

The code is part of a class named CompilerOutputParser, and is responsible for reading XML files containing messages from the Kotlin compiler. Those files look something like this:

<MESSAGES>
    <INFO path="src/main/Kotlin/Example.kt" line="1" column="1">
        This is a message from the compiler about a line of code.
    </INFO>
</MESSAGES>

At the time, the tags in this file were named in all-caps: <INFO/>, <ERROR/>, and so on (source: GitHub), like the HTML 1.0 webpages your grandpa used to write.

In the Kotlin code we just saw, qName is the name of an XML tag that we’re parsing from this file. If we’re looking at an <INFO/> tag, the qName is “INFO.”
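The name qName hints at a SAX-style XML handler, where each opening tag is reported to a callback along with its qualified name. Here’s a minimal sketch of that idea using the JDK’s built-in SAX parser. The collectTagNames() helper is hypothetical and is not taken from the real CompilerOutputParser; it only shows where a string like "INFO" would come from.

```kotlin
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper: returns the qName of every opening tag, in document order.
fun collectTagNames(xml: String): List<String> {
    val names = mutableListOf<String>()
    val handler = object : DefaultHandler() {
        override fun startElement(
            uri: String?, localName: String?, qName: String?, attributes: Attributes?
        ) {
            // For an <INFO ...> element, qName is the string "INFO".
            if (qName != null) names += qName
        }
    }
    SAXParserFactory.newInstance().newSAXParser()
        .parse(InputSource(StringReader(xml)), handler)
    return names
}

fun main() {
    val xml = """
        <MESSAGES>
            <INFO path="src/main/Kotlin/Example.kt" line="1" column="1">
                This is a message from the compiler about a line of code.
            </INFO>
        </MESSAGES>
    """.trimIndent()
    println(collectTagNames(xml)) // [MESSAGES, INFO]
}
```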

To determine what the message means, the CompilerOutputParser next looks up that string in its CATEGORIES map to find its corresponding CompilerMessageSeverity enum entry. But wait: the keys in the CATEGORIES map are lower case! (source: GitHub)

val categories = mapOf(
    "error" to CompilerMessageSeverity.ERROR,
    "info" to CompilerMessageSeverity.INFO,
    …
)

Instead of searching for “INFO,” we need to search for “info.” That’s why the code we looked at calls qName.toLowerCase() before looking it up in the CATEGORIES map. Here’s the code again, or at least the relevant lines:

val qNameLowerCase = qName.toLowerCase()
var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]

And that’s where the bug sneaks in.

If your computer is configured in English, "INFO".toLowerCase() is "info", just like we wanted.
But if your computer is configured in Turkish, "INFO".toLowerCase() turns out to be "ınfo".

Notice the difference? In the Turkish version, the lower case letter ‘ı’ has no dot above it.

The tiny discrepancy might be hard for a human to spot, but to a computer, these are two completely different strings. The dotless "ınfo" string isn’t one of the keys in the CATEGORIES map, so the code fails to find the correct CompilerMessageSeverity for our <INFO/> tag, and complains that “INFO” must be a completely unknown category of message.
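To see the failed lookup in isolation, here’s a small sketch. The Severity enum and the lookup() function are hypothetical stand-ins for the compiler’s own types, and the locale is passed explicitly with the modern lowercase(Locale) function so the result doesn’t depend on how your own machine is configured.

```kotlin
import java.util.Locale

// Hypothetical stand-in for CompilerMessageSeverity.
enum class Severity { ERROR, INFO }

// Lower case keys, as in the map described above.
val severityByTag = mapOf(
    "error" to Severity.ERROR,
    "info" to Severity.INFO,
)

// Lowercases the tag name with the given locale's rules, then looks it up.
fun lookup(qName: String, locale: Locale): Severity? =
    severityByTag[qName.lowercase(locale)]

fun main() {
    val turkish = Locale.forLanguageTag("tr-TR")
    println(lookup("INFO", Locale.ROOT)) // INFO
    println(lookup("INFO", turkish))     // null, because the key became "ınfo"
    println(lookup("ERROR", turkish))    // ERROR, because there is no 'I' to trip over
}
```

That last line matches the original forum post: the two tags that failed, INFO and LOGGING, both contain the letter ‘I’.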

So why does calling toLowerCase() on a Turkish computer produce this strange result?

Muhammed already provided part of the answer in his reply to Mehmet Nuri’s forum post. Turkic languages have two versions of the letter ‘i’:

an ‘i’ with a dot, as in the word insan (human),
and a separate ‘ı’ without a dot, as in the word ırmak (river).

What’s more, the dotted/dotless distinction is also preserved in the upper case letters:

capital ‘i’ is ‘İ’, as in insan → İnsan,
and capital ‘ı’ is ‘I’, as in ırmak → Irmak.

That uppercase dotless ‘I’ is the same one we use in English. As a result, the single Unicode character I (U+0049) has two different lower case forms: dotted i (U+0069) in English, and dotless ı (U+0131) in Turkish.

For Kotlin’s toLowerCase() function, that’s a problem! When toLowerCase() sees an I character, which lower case form should it use? The lower case form of the Turkish word IRMAK should be ırmak, with no dot. But the lower case form of the English word INFO, which starts with exactly the same character, should be info, with a dot.
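If you’d like to check the code points for yourself, the following sketch prints them for each conversion. The codePoints() helper is just for illustration; the case conversions use the standard library’s locale-aware overloads.

```kotlin
import java.util.Locale

// Illustrative helper: formats each character as a Unicode code point.
fun codePoints(s: String): List<String> =
    s.map { "U+%04X".format(Locale.ROOT, it.code) }

fun main() {
    val turkish = Locale.forLanguageTag("tr-TR")

    println(codePoints("I".lowercase(Locale.ROOT))) // [U+0069], dotted i
    println(codePoints("I".lowercase(turkish)))     // [U+0131], dotless ı
    println(codePoints("i".uppercase(Locale.ROOT))) // [U+0049], dotless I
    println(codePoints("i".uppercase(turkish)))     // [U+0130], dotted İ

    // Right for a Turkish word, wrong for an English keyword.
    println("IRMAK".lowercase(turkish)) // ırmak
    println("INFO".lowercase(turkish))  // ınfo
}
```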

When you ask your computer to convert text to lower case, you should technically also specify the alphabet rules to use—English, Turkish, or something else entirely. But that’s a lot of hard work, so if you don’t specify, many systems — including, in those days, Kotlin’s toLowerCase() function — will just use the language settings you chose when you set up your computer. That’s why "INFO".toLowerCase() is "ınfo" when you run it on a Turkish machine, and that’s why IntelliJ installations in Turkey couldn’t match the Kotlin compiler’s <INFO/> messages to the lowercase "info" string they were expecting to see.
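You don’t need a Turkish computer to reproduce this. On the JVM, the system language settings surface as Locale.getDefault(), and a program can change that default for itself. In the sketch below, legacyLowerCase() is a hypothetical function that imitates the old behaviour by explicitly asking for the default locale, and withDefaultLocale() is an illustrative helper that swaps the default temporarily.

```kotlin
import java.util.Locale

// Illustrative helper: runs the block with a different JVM default locale,
// then restores the previous one.
fun <T> withDefaultLocale(locale: Locale, block: () -> T): T {
    val saved = Locale.getDefault()
    Locale.setDefault(locale)
    try {
        return block()
    } finally {
        Locale.setDefault(saved)
    }
}

// Hypothetical imitation of the old toLowerCase(): no locale is chosen by the
// caller, so the result depends on whatever the machine's default happens to be.
fun legacyLowerCase(s: String): String = s.lowercase(Locale.getDefault())

fun main() {
    println(withDefaultLocale(Locale.US) { legacyLowerCase("INFO") })                     // info
    println(withDefaultLocale(Locale.forLanguageTag("tr-TR")) { legacyLowerCase("INFO") }) // ınfo
}
```

Same code, same input, different output. That’s what made the bug so hard to reproduce from outside Turkey.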

But in 2016, all of that was still just a bug ticket waiting to be worked on. Muhammed Demirbaş had identified the right place to start the search, but the YouTrack issue linked to his findings was just one of hundreds of tickets in the Kotlin project backlog. With only a tiny number of people reporting that they were affected by the bug, a more thorough investigation was never a priority.

That would all change with the release of coroutines two years later, when the unassuming little bug wormed its way even deeper into the foundations of the Kotlin compiler.


Photo by Igor Omilaev on Unsplash

October 2018 saw the release of Kotlin 1.3 — and with it, the first stable version of the new coroutines library, an innovative approach to asynchronous programming that promised to transform the Android app development experience. Coroutines had been in prerelease testing for over a year, and now that they were deemed ready for production use, Kotlin programmers of all kinds were ready to embrace them with enthusiasm.

To get the new tools, developers needed to upgrade their copy of the coroutines library from the prerelease 0.30.x version to the stable 1.0 release, at the same time as upgrading the Kotlin language and standard library to version 1.3.

Have you ever upgraded a dependency in a Kotlin or Java project? If so, you’ll know that making sure all your libraries remain compatible with one another can be a delicate juggling act. If your code references a function that’s been removed or changed in the new dependency, you’ll get a compilation error. But if the newly broken reference comes not from your own code but from another library, you won’t find out until you run the program. That’s because the code inside the library has already been compiled by its author, and isn’t being compiled or checked as part of your own build.

When one of your libraries tries to call a function that doesn’t exactly match what’s on offer in your freshly upgraded project’s new classpath, you’ll get a NoSuchMethodError. As a result, when you’re upgrading dependencies—and especially if you’re upgrading several dependencies at once—the occasional NoSuchMethodError is pretty much par for the course, until you figure out exactly which versions of each library are compatible with one another.

So when Kemal Atlı, an Android developer based in Turkey, ran into a NoSuchMethodError while upgrading his app to use the shiny new coroutines library, it looked for all the world like just another dependency version mismatch. Kemal wasn’t having any luck fixing this one, though. Unsure if it might be a bug with the coroutines library itself, he opened a GitHub issue, pasting the stack trace from his crashed app:

java.lang.NoSuchMethodError:
No static method boxİnt(I)Ljava/lang/Integer;
in class Lkotlin/coroutines/jvm/internal/Boxing;

This exception already contained the vital clue—a tiny dot above the upper case letter ‘İ’ in boxİnt()—but who’s going to spot that if they’re not looking for it? For now, nobody did.

“Does restarting your IDE and running a clean build resolve the issue?” responded the coroutines library maintainers, immediately suspecting a version conflict. Kemal had said that the issue only happened on one of his two machines, which suggested the problem might just be an old incompatible dependency version hanging around in his build cache after the upgrade.

A week later, another bug report, from another Turkish developer seeing their app crash with the same exception. By now, the coroutines library maintainers were certain the problem could only be caused by a dependency version mismatch—a conclusion which, in any other circumstances, might have been entirely valid. They had no luck reproducing the issue on their end, which just provided further evidence that the problem was specific to the way those two issue reporters had configured and built their projects.

It took a month for someone to spot the smoking gun.

“It has to be a locale problem,” wrote Erel Özçakırlar, another Turkish software engineer, commenting on Kemal’s issue report in late December 2018. Erel pointed out what everyone had missed so far: the real function should be called boxInt(), but the stack trace showed boxİnt(). It wasn’t a simple case of trying to call an older version of an existing function. Instead, Kotlin had seemingly used the Turkish alphabet to invent a function name that had never existed in the first place. What’s more, Erel found he could fix the problem by running the code on a computer that used English as its system locale.

“I think Erel might have a point regarding system language settings,” replied Kemal, saying he’d look into it further. But figuring out why Kotlin’s compiler internals are suddenly inventing imaginary functions using Turkish characters — well, where would you even start? There was little for Kemal to offer beyond the original bug report, so the issue kept its “waiting for clarification” label, and was closed — along with the second similar bug report — in early 2019.

Once again, the bug was back in hiding. But this time, it had sunk its claws deep into Kotlin’s compiler internals. The misspelled boxİnt() function wasn’t being called in Kemal’s own code, or even in a library he was using. Instead, the mistake was being added to his app by the compiler itself.

To understand why, we need to talk a little about how coroutines work.

Much of Kotlin’s coroutine magic takes place in the dedicated kotlinx.coroutines library. But there’s one core building block that’s more tightly integrated with the Kotlin language and its compiler: the suspend keyword.

When you label a function with the suspend keyword, the Kotlin compiler rewrites the function signature to make the function work asynchronously. For example, when you write a suspending function with two parameters, the corresponding output generated by the Kotlin compiler will actually include three parameters. The invisible third parameter is a Continuation, which both stores the state of the function and acts as a callback to receive the function’s asynchronous result.
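You can catch a glimpse of that invisible parameter with plain Java reflection. This is only a sketch: the Demo object and its add() function are made up for the example, and the output assumes the Kotlin/JVM compiler.

```kotlin
// Made-up example: a suspending function declared with two parameters.
object Demo {
    suspend fun add(a: Int, b: Int): Int = a + b
}

// Asks the JVM which parameter types the compiled add() method really has.
fun compiledParameterTypes(): List<String> =
    Demo::class.java.declaredMethods
        .first { it.name == "add" }
        .parameterTypes
        .map { it.name }

fun main() {
    println(compiledParameterTypes())
    // [int, int, kotlin.coroutines.Continuation]
}
```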

A Continuation stores and sends all kinds of values, depending on the code inside your suspending function—and that’s where the mysterious boxInt() function comes into play.

You might know that when you create an Int in Kotlin, it can be stored as one of two different underlying Java types: a primitive int, or an Integer object. Kotlin makes the choice for you automatically, depending on how the value is used.

If you use an Int in a coroutine, Kotlin will sometimes need to convert it from the primitive int storage mechanism to an Integer object, so that the value can pass through the generic coroutine continuation machinery. This conversion is called boxing. Java will happily perform the conversion automatically—but for the stable release of coroutines, the Kotlin team wanted to make sure the conversion was as efficient as possible.
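Here’s a quick way to see both representations on the JVM. The runtimeClassOf() helper is hypothetical; passing an Int through a parameter of type Any forces the value to be boxed, much as it would be when travelling through generic machinery.

```kotlin
// Hypothetical helper: reports the JVM class of the object holding the value.
// Because the parameter type is Any, an Int argument arrives here boxed.
fun runtimeClassOf(value: Any): String = value.javaClass.name

fun main() {
    println(Int::class.javaPrimitiveType) // int
    println(Int::class.javaObjectType)    // class java.lang.Integer

    val number: Int = 42
    println(runtimeClassOf(number))       // java.lang.Integer
}
```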

To help the JVM optimize its execution of suspending functions, Kotlin 1.3 added a set of new functions for the compiler to use in its generated coroutine code: boxBoolean(), boxByte(), boxShort(), boxInt(), and so on (source: GitHub). Since the suspend keyword is part of the core language, these functions must be available to all Kotlin programs, which is why they’re in the standard library, not the coroutines library—though they’re marked as internal, and aren’t available for you to call directly.

The functions themselves aren’t the problem: they’re spelled correctly, and their implementations are so trivial that there’s nowhere for anything to go wrong. No, the bug happens when the compiler generates the code that calls these functions.

To correctly box a value, Kotlin needs to map the Java primitive type to its corresponding box function. A boolean value must be passed to boxBoolean(); a byte value to boxByte(), and so on.

There’s an obvious pattern there: capitalize the first letter of the primitive type, and then add “box” to the start. And that’s exactly what Kotlin 1.3 did, using the standard library’s capitalize() function: (source: GitHub)

map[name] = "box${primitiveType.javaKeywordName.capitalize()}"

The capitalize() function modifies only the first letter of a string—so "boolean".capitalize() becomes "Boolean", "int".capitalize() becomes "Int", and so on.

Unless you’re in Turkey.

Once again, the behaviour of capitalize() can vary depending on your computer’s language settings. It’s that pesky letter ‘i’ again:

With Turkish language settings, the upper case form of i is İ,
whereas in English, the upper case form of i is I.

If you’re working in Turkish, the correct result for "int".capitalize() is "İnt". The capitalize() function has no way of knowing that “int” is a special programming keyword that needs to be treated as English text rather than Turkish. So when the Kotlin 1.3 compiler, running on a machine with Turkish language settings, needs to box up a primitive Java int inside a suspending function, it’s going to generate a call to a non-existent standard library function called boxİnt().

Oops!
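Since capitalize() is deprecated in modern Kotlin, the sketch below rebuilds its behaviour with replaceFirstChar and an explicit locale. Both capitalizeWith() and boxFunctionName() are hypothetical names made up for this example, not the compiler’s own code.

```kotlin
import java.util.Locale

// Hypothetical stand-in for the old capitalize(): title-cases the first
// character using the rules of the given locale.
fun capitalizeWith(s: String, locale: Locale): String =
    s.replaceFirstChar { if (it.isLowerCase()) it.titlecase(locale) else it.toString() }

// Builds the name of the box function for a Java primitive keyword.
fun boxFunctionName(keyword: String, locale: Locale): String =
    "box" + capitalizeWith(keyword, locale)

fun main() {
    val turkish = Locale.forLanguageTag("tr-TR")
    println(boxFunctionName("int", Locale.US))   // boxInt
    println(boxFunctionName("int", turkish))     // boxİnt
    println(boxFunctionName("boolean", turkish)) // boxBoolean
}
```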

It was Fatih Doğan, another Turkish programmer, who in September 2019 managed to put all the pieces together, filing the issue report that would finally lead to a fix. Fatih clearly pointed out the misplaced dot above the ‘İ’ in boxİnt(), and, crucially, set up a GitHub repository with instructions to reliably reproduce the issue.

Within a day of this new detailed issue report, the Kotlin team had found the line of code that was causing the issue. Less than a week later, they had a fix ready: (source: GitHub)

map[name] = "box${primitiveType.javaKeywordName.capitalize(Locale.US)}"

Passing a specific Locale to the capitalize() function means it will always use the same language rules, no matter what machine you run it on. It’s an easy change: like all the case conversion functions in the Kotlin standard library, capitalize() already accepted an optional Locale argument. It only fell back to the system’s default Locale if you didn’t specify your own.
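To get a feel for how narrow the bug was, here’s a sketch that generates the whole family of names twice, once with a fixed locale and once with Turkish rules, and prints only the ones that differ. The primitiveKeywords list and the boxFunctionNames() function are assumptions made for this example.

```kotlin
import java.util.Locale

// The Java primitive keywords, listed by hand for this example.
val primitiveKeywords =
    listOf("boolean", "char", "byte", "short", "int", "float", "long", "double")

// Generates one box function name per primitive, using the given locale's rules.
fun boxFunctionNames(locale: Locale): List<String> =
    primitiveKeywords.map { keyword ->
        "box" + keyword.replaceFirstChar { it.titlecase(locale) }
    }

fun main() {
    val fixed = boxFunctionNames(Locale.US)
    val turkish = boxFunctionNames(Locale.forLanguageTag("tr-TR"))

    println(fixed)
    // [boxBoolean, boxChar, boxByte, boxShort, boxInt, boxFloat, boxLong, boxDouble]

    println(fixed.zip(turkish).filter { (a, b) -> a != b })
    // [(boxInt, boxİnt)]
}
```

Only one primitive type starts with the letter ‘i’, which helps explain why the crash looked so random: a suspending function that never boxed an int would work perfectly.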

The fix was released as part of Kotlin 1.3.60, in November 2019, finally giving Turkish developers a stable way to use suspending functions.

But that’s not the end of this story—far from it. Coroutines might be working again, but they had proved just how easy it was to fall disastrously foul of locale-sensitive case conversions. And that original build error from the start of the story still wasn’t fixed…

It took one more bug to demonstrate the true severity of the problem.

In September 2020, nearly a year after the coroutines bug had been fixed and forgotten, Muhittin Kaplan was just starting to learn Kotlin. He wrote a simple program to check his understanding of arrays:

fun main() {
    println("Hello, world!!!")
    val nums = intArrayOf(1, 2, 3, 4, 5)
    println(nums[2])
}

But when he ran the program, he saw a baffling error:

java.lang.NoSuchMethodError:
'int[] kotlin.jvm.internal.Intrinsics$Kotlin.intArrayOf(int[])'

The intArrayOf() function is one of the most basic tools in the Kotlin standard library—and it has existed in every Kotlin version since 1.0 and before. Even if it didn’t exist, or if it was being called incorrectly, the error should happen at compile time, not at runtime.

Muhittin knew something fishy was going on, and he filed a YouTrack issue describing what he was seeing.

“Hi from Türkiye,” he began.


Photo by Markus Winkler on Unsplash

This time, the Kotlin team knew what to look for. It wasn’t long before they had tracked down the faulty line of code in the compiler: (source: GitHub)

StringsKt.decapitalize(type.getArrayTypeName().asString()) + "Of"

Although it’s written in Java, this code is calling the decapitalize() function from Kotlin’s own standard library. And once again, it’s relying on the system’s default language settings, instead of using a fixed Locale.

The code is part of a procedure that’s responsible for configuring intrinsics—functions that don’t really have an implementation in the Kotlin standard library, but are instead replaced directly by the compiler with the corresponding Java instructions or even JVM bytecode. When you write intArrayOf(1, 2, 3), Kotlin doesn’t really call an intArrayOf() function. Instead, it recognizes that intArrayOf() was registered as an intrinsic, and just outputs the bytecode to create and populate an array.

Much like the boxInt() function we saw before, intArrayOf() is part of a wider family of functions: one array-builder function for each primitive type. The call to type.getArrayTypeName() returns the name of the Kotlin class for each array—IntArray, BooleanArray, and so on. The corresponding function—intArrayOf(), booleanArrayOf(), and so on—should start with a lower case letter, so we need to call decapitalize().

And there’s our bug.

On a machine with Turkish language settings, "IntArray".decapitalize() (or StringsKt.decapitalize("IntArray"), as it appears in Java) returns "ıntArray", with that all-too-familiar dotless lowercase ‘ı’. Add the "Of" suffix, and we’ve just registered an intrinsic bytecode implementation for a function called ıntArrayOf()—not the same as the intArrayOf() function that the standard library is advertising!
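Here’s the same mistake as a standalone sketch. The decapitalizeWith() and arrayFactoryName() functions are hypothetical; they rebuild the deprecated decapitalize() with replaceFirstChar and an explicit locale, so you can try both sets of rules side by side.

```kotlin
import java.util.Locale

// Hypothetical stand-in for the old decapitalize(): lowercases the first
// character using the rules of the given locale.
fun decapitalizeWith(s: String, locale: Locale): String =
    s.replaceFirstChar { it.lowercase(locale) }

// Derives the array-builder function name from the array class name.
fun arrayFactoryName(arrayTypeName: String, locale: Locale): String =
    decapitalizeWith(arrayTypeName, locale) + "Of"

fun main() {
    val turkish = Locale.forLanguageTag("tr-TR")
    println(arrayFactoryName("IntArray", Locale.ROOT)) // intArrayOf
    println(arrayFactoryName("IntArray", turkish))     // ıntArrayOf
    println(arrayFactoryName("BooleanArray", turkish)) // booleanArrayOf
}
```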

When they came to fix this issue, the Kotlin team weren’t leaving anything to chance. They scoured the entire compiler codebase for case-conversion operations—calls to capitalize(), decapitalize(), toLowerCase(), and toUpperCase()—and replaced them with locale-invariant alternatives. 173 lines of code changed, across 53 files—including the compiler-output XML parser that had caused Mehmet Nuri Öztürk’s build to fail in Kotlin 1.0, all those years ago (source: GitHub).

The slew of fixes was released as part of a more general compiler upgrade project in Kotlin 1.5, in May 2021. After five years in the backlog, KT-13631 was finally closed.

Three different bugs—in compiler outputs, coroutines, and arrays—caused by three different functions—toLowerCase(), capitalize(), and decapitalize(). Without a more foundational solution, Kotlin’s case-conversion trap was just waiting to claim its next victim.

Even before Kotlin 1.5 was released, the Kotlin team were hard at work on a project to make sure locale-sensitive case conversions would never crash another Kotlin program.

In October of 2020, they published KEEP-223, “Locale-agnostic case conversions by default”—a proposal to replace Kotlin’s case-conversion functions with a new set of functions that would ignore your system’s language settings and simply default to a fixed locale. The new uppercase() and lowercase() functions were added to the standard library in Kotlin 1.5, and as of Kotlin 2.1, using the older toLowerCase() and toUpperCase() functions generates an error.
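A short sketch shows the difference. It switches the JVM’s default locale to Turkish, then compares the new no-argument functions with a call that explicitly opts in to the default locale. The underTurkishDefault() helper is made up for this example.

```kotlin
import java.util.Locale

// Made-up helper: runs the block as if the machine were configured in Turkish,
// then restores the previous default locale.
fun <T> underTurkishDefault(block: () -> T): T {
    val saved = Locale.getDefault()
    Locale.setDefault(Locale.forLanguageTag("tr-TR"))
    try {
        return block()
    } finally {
        Locale.setDefault(saved)
    }
}

fun main() {
    underTurkishDefault {
        // The new functions ignore the system settings.
        println("INFO".lowercase())                    // info
        println("int".uppercase())                     // INT
        // Locale-sensitive conversion is still available, but only on request.
        println("INFO".lowercase(Locale.getDefault())) // ınfo
    }
}
```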

What about capitalize()?

When KEEP-223 was being discussed, it gradually became clear that capitalize() had more problems than just locale sensitivity. The function name itself is surprisingly ambiguous. Look up capitalize in almost any English dictionary and you’ll find two competing definitions. Here’s one example from Collins:

6. to print or write (a word or words) in capital letters

7. to begin (a word) with a capital letter

Kotlin’s capitalize() function had always been designed to modify only the first letter of a string, and this was the perfect opportunity to clear up the confusion. If capitalize() was an ambiguous name, what should its replacement be called? Can you think of a name that describes the function’s behaviour more clearly?

In the end, the Kotlin team chose not to provide a replacement at all. If the function doesn’t exist, it can’t cause confusion or bugs! In modern Kotlin, when you want to modify the first character of a string, you provide a custom lambda to replaceFirstChar { … }.
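In practice, that looks something like the sketch below. The three function names are hypothetical; the point is that the lambda makes you spell out which conversion you want, and whether a locale is involved.

```kotlin
import java.util.Locale

// Locale-independent: suitable for identifiers, keywords, and protocol strings.
fun capitalizeIdentifier(s: String): String =
    s.replaceFirstChar { it.uppercaseChar() }

fun decapitalizeIdentifier(s: String): String =
    s.replaceFirstChar { it.lowercaseChar() }

// Locale-aware: suitable for text that will be shown to a human reader.
fun capitalizeForDisplay(s: String, locale: Locale): String =
    s.replaceFirstChar { it.titlecase(locale) }

fun main() {
    println(capitalizeIdentifier("int"))        // Int
    println(decapitalizeIdentifier("IntArray")) // intArray
    println(capitalizeForDisplay("insan", Locale.forLanguageTag("tr-TR"))) // İnsan
}
```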

Kotlin 2.1 was released in November 2024, drawing our story to a satisfying close.

What can we learn from it? I think the biggest lesson is just how much responsibility rests on a language’s standard library. It’s easy to think of the standard library as simply a starter pack of stock algorithms and data structures. But dig just a little deeper, and you’ll find that even the simplest string operations rely on a detailed digital model of the complexity and creativity of human culture.

Much of that digital model, by the way, is provided by the Unicode Common Locale Data Repository (CLDR), which documents language rules, date and time formats, measurement units, currencies, and much more. Unicode isn’t just for emojis!

Now, there’s just one question that’s still bugging me. Did Mehmet Nuri Öztürk ever get that app to build?

Thanks for reading!

I write books, too. If you want more Kotlin oddities and compiler quirks, check out my puzzle book, Kotlin Brain Teasers.

And if you enjoyed learning about suspending functions and coroutines, you might like Kotlin Coroutine Confidence.
