Oodle 2.9.14 and Intel 13th/14th gen CPUs

There’s a hardware problem affecting Intel 13th/14th gen CPUs, mostly desktop ones. This made the rounds through the press last year and has been on forums etc. for much longer than that. For months, we thought this was a rare bug in the decoder, but from stats in Epic’s crash reports for Fortnite (as well as stats for other Unreal Engine and Oodle licensees) it was fairly striking that this issue really seemed to affect only some CPUs.

Much has been written about this problem already. At RAD we have an official page about it, which includes a link to an older version (when we still weren’t sure what was causing it).

Intel’s statement on the issue confirms that it’s a hardware problem: physical degradation of the clock tree circuitry over time results in clock skew, which ultimately makes affected CPUs behave in increasingly glitchy ways.

This is a hardware issue, and Intel has since released multiple microcode updates to try to prevent, or at least limit, that degradation. The symptoms, however, are simply frequent crashes or errors in certain components. For games using Unreal Engine, the most common symptoms are:

  • Shader decompression failures (as mentioned on our official page)
  • Shader compilation errors in the graphics driver
  • Spurious “out of memory” errors reported by the graphics driver (even when there’s plenty of GPU and system memory available)

These shader (and sometimes other) decompression errors are where Oodle (the data compression library I work on) comes in. Specifically, we notice during decompression that the data we are currently decoding is corrupted because internal consistency checks trip and return an error code. Most commonly, these errors occur as a result of either corruption of on-disk data, or bad RAM. But in this particular instance, we were (and still are) seeing these errors on machines with RAM that passes memory tests just fine and where the on-disk data passes hash validation.

We just released Oodle 2.9.14 which includes an experimental workaround that seems to at least reduce the frequency of failed Oodle decompression calls on affected machines. We’re not yet sure whether it helps with overall crash rate on those machines or if crashes happen with more or less the same frequency, just elsewhere. We’ll have more data in a few months! But meanwhile, the background for that workaround is interesting, so I’m writing it down.

Background: spring 2023 to early 2024

I’ll keep this history section relatively brief, because this is not the main point of this post.

We first started hearing about Oodle crashes on the then-new Intel 13th-generation CPUs around March 2023, concurrently from three separate sources: Fortnite was seeing a big spike in crash reports, especially due to shader decompression failures (as mentioned above), with a smaller spike of errors in non-shader decompression; other UE licensees as well as Oodle-but-not-UE licensees were reporting something similar in their crash reports; and we were even getting a few mails from end users asking us about frequent Oodle crashes. (This is rare: RAD/Epic Games Tools is a B2B operation, so we don’t normally interact with end users directly.)

We did notice that all of the reports seemed to involve recent Intel CPUs. From there followed months of red herrings and dead ends, as well as trying (and failing) to reproduce these issues consistently. The condensed version:

  • Our first lead was that we had seen a lot of instability with a bunch of new demo machines at a recent event. Those problems occurred on a specific motherboard/CPU combination and ultimately turned out to be due to bad sound drivers (which corrupted the contents of vector registers in user-mode code, yikes). The first batch of reports we had was all from a single motherboard vendor, so we thought that might be it.
  • Then we had a report from a UE licensee, still on the same motherboard, but removing the faulty sound drivers didn’t seem to help. They later reported that their crashes went away after they disabled AVX support in the BIOS. Between this and the vector register corruption from the bad sound drivers, we started going over all SIMD code in Oodle Data with a fine-toothed comb.
  • By June, we still hadn’t found anything useful; we did have some experiments that disabled vector/SIMD usage in the decoder entirely but that didn’t seem to help, so back to square one.
  • Around July, some forum posts were making the rounds claiming that disabling the E-cores seemed to help. We also had reports from other motherboard vendors by that time, ruling out our initial theory that the problem was limited to a single vendor.

We still didn’t have any machines that reproduced this behavior locally (Epic uses mostly AMD CPUs in dev machines); we were in contact with Intel, since at that point we were reasonably sure it was a HW problem. All of the code in question had been shipping with very few changes since 2017 – Oodle changes over time are mostly to the compressors, plus ports to new platforms; the decoders don’t change that often. The spike in crashes was also still limited to certain newer CPUs.

By October, we had found an Epic employee experiencing this crash when running UE games on his private home machine. Having a mixture of graphics shader compilation on some threads and Oodle decompression calls on others seemed to be key (and that still seems to be the recipe today). We tried to find a reproducer less complicated than an entire game, but with very limited success. We did eventually manage to get an infuriatingly inconsistent Oodle-only reproducer that would usually, but not always, produce a crash within 10 minutes of completely hogging the machine. We learned a few things from that, including confirmation that SSE/AVX usage was a red herring – we could turn off the SSE/AVX loops entirely and we’d still get crashes at the same rate – but we didn’t make a lot of progress.

We did, however, notice that several of the stock BIOS settings, especially regarding CPU voltage, current limits and frequency scaling, were what I can only describe as completely insane: 500W power and 500A current limits on parts with a 253W TDP, BIOS defaults aggressively overclocking past the already aggressive default base/boost clocks, and so on. So for a while we thought this was mostly a problem with aggressive motherboard defaults driving the relevant parts well outside their intended operating range.

All the while the issue was picking up more and more steam; in late 2023, we started putting up our web page at RAD noting the aggressive motherboard defaults and relaying instructions from Intel on what they should be set to instead. More users were hitting this all the time, and the tech press was starting to notice. And Intel had recently released the 14th gen parts which were suffering from the same issue.

A few months later it was clear that changing the BIOS settings was, at best, a temporary fix; affected systems continued to be unstable even after the setting changes, and Intel started looking into their own voltage/frequency scaling logic, issuing new microcode updates every few months to try to work around the issue. By late summer of 2024, we had confirmation from Intel (linked above) that the symptoms were caused by physical degradation of the clock tree circuitry.

Not something we can do much about in SW. For that matter, once machines show these symptoms, their HW is physically damaged and needs replacement. The best we could hope for from microcode updates and less aggressive BIOS defaults was to prevent the replacement parts from developing the same problems within a matter of months.

So, that’s the background up to the middle of last year; but I mentioned we just released Oodle 2.9.14 with an experimental workaround. What does that have to do with anything, and what has changed since last year?

A semi-consistent repro, at last

In spring of 2025, we were contacted by a customer about some Oodle decode issues in their game that seemed random and looked like they might be a race or memory stomp.

I’ll spare you the details, but this does come up fairly regularly and it’s usually use-after-frees or similar on the app side, so we had a few weeks of back and forth to rule out the most likely causes.

At the end of that we had a smoking gun: a reasonably consistent test setup using their game where, on decompression failure, they were just retrying the exact same call with all the same input and output buffers again, and that would often work. For context: it is expected that, on a large enough sample of end user machines, decompression calls sometimes fail, be it because the data on permanent storage is actually corrupted or due to issues like bad RAM etc. In any sufficiently large sample, you’re going to get some of those. But, generally speaking, these errors happen some distance away from the CPU. Having one CPU core fail to decompress 256k of data with an error, and then frequently having the same core succeed trying to decompress those same 256k again (or having them fail again, but with a different error), rules out those causes, because on the second attempt on the same core, that data is usually coming from the L2 or at most L3 cache, which unlike consumer DRAM has fairly strong error correction everywhere these days.

It’s not impossible for a context switch or core migration to coincide with the retry (or close enough), but expecting these to line up frequently enough for “just retry it” to be an effective remedy is a stretch at best.
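
To make that concrete, here’s a minimal sketch of what the retry experiment amounts to. The function names and signature below are illustrative assumptions, not the customer’s actual code and not the real Oodle API; the point is simply that the second attempt reuses the exact same input and output buffers as the first.

    #include <cstdint>
    #include <cstddef>
    #include <cstdio>

    // Hypothetical stand-in for the real decompression call: returns true on
    // success, false on a decode error. (Illustrative only, not the Oodle API.)
    using DecodeFn = bool (*)(const uint8_t *comp, size_t compSize,
                              uint8_t *raw, size_t rawSize);

    bool DecompressWithRetry(DecodeFn decode,
                             const uint8_t *comp, size_t compSize,
                             uint8_t *raw, size_t rawSize) {
        if (decode(comp, compSize, raw, rawSize))
            return true; // normal case

        // Retry the exact same call with the exact same input and output
        // buffers. A deterministic failure (e.g. corrupt data on disk) should
        // fail the same way both times; on the affected machines, the retry
        // frequently succeeded instead.
        std::fprintf(stderr, "decode failed, retrying with identical buffers\n");
        return decode(comp, compSize, raw, rawSize);
    }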

More to the point, retries didn’t seem to help everywhere. They seemed to help only on one particular type of test machine – you guessed it: 13th/14th gen Intel CPUs.

The customer was getting crashes at a fairly consistent rate when starting up their game. Again, the mixture encountered during loading, with lots of shader compilation plus Oodle decompression, seemed to be key. And their reasonably consistent repro involved overclocking the CPU slightly (by about 5%) – to be fair, technically out of spec, but also a common thing for gamers to do.

Anyway, this was definitely a reason to do some more debugging. Rare decompression errors caused by what seems to be actual data corruption are not particularly actionable, but repeated decompression of the same chunk of data either failing in different ways on subsequent attempts, or sometimes succeeding or sometimes failing, was definitely something we needed to look into.

Especially since it seemed like that customer might have finally found the thing we’d been trying and failing to find since spring of 2023: a reproducer for what seemed to be the Intel 13th/14th gen bug, consistent enough to do some actual debugging with. Full disclosure: we do not know whether this issue is the same as the spurious decompression errors that users with affected machines have been seeing. But at the very least, the symptoms are very similar.

Evidence!

Oodle Data decompression needs a bit of workspace memory. By default we just allocate and free that ourselves, or you can allocate that slice up front and pass it into Oodle to avoid the allocation churn. As part of the earlier triage process, we had already asked the customer to switch to statically allocated workspace memory – the idea being that if the problem was caused by workspace memory corruption due to a use-after-free or similar, moving our workspace to statically allocated memory that was never freed would have likely fixed the issue. (And made it clear what issue in the calling code to look for.)

Anyway, passing in the workspace memory manually also means the caller knows where it lives, and has its contents even after a failed call. This is great, because most of the “interesting” decoder data is in that workspace memory and in the event of a decode error, we have a few hundred kilobytes of data that contain a memory image of all the larger decoder data structures at time of failure. We don’t get a stack trace, but we don’t generally need one because Oodle Data has very predictable control flow and also logs error messages; given the log message for the error, we know exactly where in the decode flow we were when the error was first detected.
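
In sketch form, that setup amounts to something like the following. The decode function signature is a made-up stand-in (not the actual Oodle API), and the dump-to-disk step is an assumption about how one might capture the state for offline analysis.

    #include <cstdint>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for a decompression call that takes caller-provided
    // workspace memory (illustrative only, not the real Oodle signature).
    using DecodeWithScratchFn = bool (*)(const uint8_t *comp, size_t compSize,
                                         uint8_t *raw, size_t rawSize,
                                         void *scratch, size_t scratchSize);

    // Decode using a caller-owned scratch buffer; on failure, write the scratch
    // contents to disk so the decoder state at the point of failure can be
    // inspected offline.
    bool DecodeAndDumpOnFailure(DecodeWithScratchFn decode,
                                const uint8_t *comp, size_t compSize,
                                uint8_t *raw, size_t rawSize,
                                std::vector<uint8_t> &scratch,
                                const char *dumpPath) {
        if (decode(comp, compSize, raw, rawSize, scratch.data(), scratch.size()))
            return true;

        if (FILE *f = std::fopen(dumpPath, "wb")) {
            std::fwrite(scratch.data(), 1, scratch.size(), f);
            std::fclose(f);
        }
        return false;
    }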

I’ve written prior blog posts on how Oodle Data works. The important part is that instead of one monolithic decoder, we run a sequence of small, specialized loops that are more modular, easier to optimize, and easy to mix and match as desired. The basic currency most of these building blocks work with is arrays of bytes, and those arrays of bytes go (among other things) into that workspace memory.

In short: all we needed to get some actionable information was a dump of the decoder workspace memory after some failed decodes. The customer was able to provide multiple examples of 10 failed decodes in a row, all failing in slightly different ways. This was fantastic – we could compare the workspace memory contents with known successful decode runs and hopefully get some more information about what was actually happening. This involved writing a small impromptu Python script to load workspace memory dumps and compare them against known-good reference decoder state.
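
The core of that comparison is simple. As a rough illustration (in C++ rather than the actual throwaway Python, with file names and dump layout as assumptions), it boils down to something like this:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Load a whole file into memory. (File names and dump layout here are
    // assumptions; this just shows the gist of the comparison.)
    static std::vector<uint8_t> ReadFile(const char *path) {
        std::vector<uint8_t> data;
        if (FILE *f = std::fopen(path, "rb")) {
            std::fseek(f, 0, SEEK_END);
            long size = std::ftell(f);
            std::fseek(f, 0, SEEK_SET);
            if (size > 0) {
                data.resize((size_t)size);
                if (std::fread(data.data(), 1, data.size(), f) != data.size())
                    data.clear();
            }
            std::fclose(f);
        }
        return data;
    }

    int main(int argc, char **argv) {
        if (argc != 3) {
            std::fprintf(stderr, "usage: %s <failed_dump> <reference_dump>\n", argv[0]);
            return 1;
        }
        std::vector<uint8_t> bad  = ReadFile(argv[1]);
        std::vector<uint8_t> good = ReadFile(argv[2]);
        size_t n = bad.size() < good.size() ? bad.size() : good.size();
        size_t lastDiff = 0, diffCount = 0;
        for (size_t i = 0; i < n; i++) {
            if (bad[i] != good[i]) {
                // Report offset, both values and the gap since the previous
                // mismatch; in the actual dumps, single bytes differed with
                // thousands of clean bytes in between.
                std::printf("offset %zu: got %3u, expected %3u (gap %zu)\n",
                            i, (unsigned)bad[i], (unsigned)good[i], i - lastDiff);
                lastDiff = i;
                diffCount++;
            }
        }
        std::printf("%zu differing bytes out of %zu compared\n", diffCount, n);
        return 0;
    }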

Even just the log messages were already surprising, though: none of the failed decode logs we were sent had any evidence of a bitstream desync. Normally, if you’re going to poke around in a compressed data blob then try and decompress it, that’s what’s most likely going to happen – at some point, there’s some difference that causes the decoder to get “out of phase” with the encoder, maybe read a few bits too few or too many, and from then on it’s a cascade failure where the rest of the bitstream (up to the next sync point, anyway) decodes to complete garbage and you get some “unexpected end of file” or “invalid header” or similar error.

This literally never happened in the logs of failed decode attempts we were sent. Not once. That is, by itself, incredibly suspicious. There were errors that caused the LZ portion of the decode to go out of sync – matches with the wrong offsets, or the wrong lengths, things like that, which then ultimately result in a desync at a later stage (producing the app-visible decode error). But none of the bitstream reading ever failed in that way. We always made it fully through bitstream decoding then noticed inconsistencies later.

More to the point: when we diffed the decoded bitstream contents against the expected values, there were never any big diffs either. Nor were there any “burst errors”. It was always single individual bytes being corrupted, and then the next 7000-10000 bytes would be completely fine (and correct) before we got another single-byte error.

This also made it unlikely that we had a memory stomp or similar. In principle these can look like anything, but in practice, with a memory stomp, you usually see at least 4 adjacent bytes overwritten with a value that clearly doesn’t belong – usually a 32-bit or 64-bit write, because those are the most common access sizes. It’s possible to get individual 1-byte values stomped with thousands of bytes in between them, it’s just not very likely; you would need awfully specific memory access patterns in the writer to get that kind of thing.

Finally, the corrupted bytes always seemed to contain small values – between 1 and 11, in fact.

Values between 1 and 11, you say?

After I made that last observation, I had a hunch that soon proved to be correct: all the arrays with incorrect decoded data were Huffman coded (itself not surprising, since that’s our most common entropy coder choice by far), and the corrupted bytes were not corrupted arbitrarily – instead of the symbol value assigned to a Huffman code word, they stored that code word’s length (in bits). I have this post describing how Huffman decoding in Oodle works, but the part you need to know is just this: we normally use an 11-bit length limit for codes, and decode using a 2048-entry table whose entries look like this:

    struct HuffTableEntry {
        uint8_t len; // length of code in bits
        uint8_t sym; // symbol value
    };

And I’ve already written yet another post on the details of the Huffman decoder, so I’ll skip to the end: on x86-64 CPUs with BMI2 support, the 4-instruction sequence we use to decode one byte is this:

    andn  rcx, rMask, rBits              ; peek (rMask = ~2047)
    movzx ecx, word [rTablePtr + rcx*2]  ; table fetch
    shrx  rBits, rBits, rcx              ; bit buffer update
    mov   [rDest + <index>], ch          ; emit byte

If you can’t read x86 pseudo-assembly: in C-like code, these 4 instructions correspond to

    HuffTableEntry e = table[bits & 2047]; // andn, movzx
    bits >>= e.len;                        // shrx
    dest[index] = e.sym;                   // mov

That’s all we do to decode a single byte. We do this for 6 bitstreams in parallel (to hide latency) and unroll it a bunch of times, and every few bytes decoded there are some other things that need to happen that don’t matter for this post, but that’s the key decode step.
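
As an illustration of the general shape, here is a sketch of that inner step in C++. This is not the shipping Oodle code; the bit-buffer refill is omitted and the per-stream output layout is an assumption.

    #include <cstdint>
    #include <cstddef>

    struct HuffTableEntry {
        uint8_t len; // length of code in bits
        uint8_t sym; // symbol value
    };

    struct BitReader {
        uint64_t bits; // current bit buffer contents; refill logic omitted
    };

    static inline uint8_t DecodeOne(BitReader &r, const HuffTableEntry *table) {
        HuffTableEntry e = table[r.bits & 2047]; // 11-bit lookup, 2048 entries
        r.bits >>= e.len;                        // consume the code
        return e.sym;                            // emit the symbol
    }

    // One unrolled step: decode one byte from each of the 6 streams. The six
    // lookup/shift chains are independent of each other, which is what hides
    // the latency of each individual chain.
    static inline void DecodeStep(BitReader rd[6], const HuffTableEntry *table,
                                  uint8_t *dest[6], size_t i) {
        for (int s = 0; s < 6; s++)
            dest[s][i] = DecodeOne(rd[s], table);
    }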

Well, it seems that on the affected machines, and in this particular test setup (which includes running that game and overclocking the machine a bit), some of the time (ballpark: once every 10000 bytes decoded), that store doesn’t store CH (bits [15:8] of RCX), but CL (bits [7:0] of RCX) instead.

Oof.

That was the theory, anyway. Not hard to check: instead of using the high-byte store, we can throw in an extra shift and then store from CL instead:

    andn  rcx, rMask, rBits              ; peek (rMask = ~2047)
    movzx ecx, word [rTablePtr + rcx*2]  ; table fetch
    shrx  rBits, rBits, rcx              ; bit buffer update
    shr   ecx, 8                         ; shift out len
    mov   [rDest + <index>], cl          ; emit byte

We sent a version of Oodle with that decoder to the customer and asked them to re-run the test, and yup, that seems to fix the transient decode errors for them, at least in their test setup.

So, that’s the workaround we’re shipping in Oodle 2.9.14: on the affected CPUs, throw in the extra shift instruction to avoid the stores from “byte high” registers. This is slightly slower, but with the emphasis on slightly – this particular loop was originally optimized for Haswell-generation Intel CPUs (2013). x86 CPUs have gotten wider since, supporting more instructions per clock cycle, and the main speed limiter in that loop on newer Intel CPUs is the latency along the dependency chain (see the linked article on the 6-stream Huffman decoders for analysis). Adding an extra instruction and an extra cycle of latency on the non-critical “store byte” edge is not the end of the world on these newer CPUs. It’s a minor decode speed regression – on typical Kraken data, about 0.5% slower – which is not that bad. (And the workaround is only necessary on those specific CPUs.)

Conclusion: what actually causes this?

In the hardware? Well, I don’t really know, so this is all speculation; we’re well into the weeds of uArch implementation details here.

I talked about this with some friends, and our best guess boils down to the following: the data we loaded is sitting in RCX, and the store needs to write either bits [7:0] of that register (a regular byte store) or bits [15:8] (one of the rarer 8-bit high-byte stores). The actual register number here almost certainly doesn’t matter – this is all renamed, after all – but an x86-64 CPU that supports the high-byte stores needs to be able to select which byte of a register a byte store actually writes to memory.

That means there’s gotta be a bunch of multiplexers somewhere, these multiplexers get some control signal telling them which byte to store, and that control signal seems to have very little timing slack – at least on some parts, not enough. If they’re overclocked (or at the high end of their “turbo boost” clock range), sometimes that control signal doesn’t make it in time, and what gets stored is the low byte of the register instead.
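
To spell that guess out in code form, purely as a conceptual model (this is speculation, not a description of any actual hardware implementation), the failure mode would amount to the byte-select control effectively reading as “low byte” even though the instruction asked for the high one:

    #include <cstdint>

    // Conceptual model only: a control signal selects which byte of the
    // (renamed) register a byte store writes to memory.
    static uint8_t ByteToStore(uint64_t regValue, bool selectHighByte) {
        // If the "select high byte" signal doesn't settle in time and is
        // effectively observed as 0, the low byte gets stored instead,
        // which matches the corruption pattern described above.
        return selectHighByte ? (uint8_t)(regValue >> 8)
                              : (uint8_t)(regValue & 0xff);
    }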

This is almost certainly not the only signal that is cutting it close, but this workaround seems to help for Oodle decoding, at least for this customer. Whether this counts as a real fix, I’m doubtful; the affected machines are already also crashing in other places (see list at the top of the post for other examples). I think best-case, this might reduce crash rate on the relevant machines somewhat, but probably not in a way that actually leads to a good user experience. It still seems worth trying.

For now, this Oodle release is brand new and hasn’t been rolled out in games (as of this writing). We’ll know in a few months whether it actually helps reduce crash rates.

Another thing we were wondering was whether the hand-written ASM code in Oodle was the problem. After all, this is using a pretty niche x86-64 feature, and the high-byte registers aren’t actually used much. But a quick test on godbolt.org shows that Clang and GCC end up using the exact same trick for a C++ version of the decode dependency chain. They do end up using 5 instructions per byte though, not 4, because unlike the code that ships in Oodle, they use a regular AND, not ANDN (which has a 3-operand form). Either way, the code that actually ships in Oodle is not exactly what any mainstream x86-64 compiler would produce, but using the 8-bit high registers (the esoteric feature in question) is not the distinguishing factor here; mainstream compilers use them too.
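
I haven’t reproduced the exact snippet here, but it was essentially a straight C++ rendition of the per-byte decode step along the following lines; whether a given compiler actually emits the high-byte store for it depends on compiler version and flags.

    #include <cstdint>
    #include <cstddef>

    struct HuffTableEntry {
        uint8_t len; // length of code in bits
        uint8_t sym; // symbol value
    };

    // A plain C++ version of the decode dependency chain, along the lines of
    // what went into the godbolt test described above: the serial dependence
    // on "bits" keeps the structure intact, and compilers can end up storing
    // e.sym via the high byte of the register holding the 16-bit table entry.
    void DecodeBytes(uint8_t *dest, size_t count,
                     const HuffTableEntry *table, uint64_t bits) {
        for (size_t i = 0; i < count; i++) {
            HuffTableEntry e = table[bits & 2047];
            bits >>= e.len;
            dest[i] = e.sym;
        }
    }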

This bug made zero sense to me for the longest time, and once you have strong evidence that a misbehaving CPU is involved, debugging it purely from the SW side is pretty miserable, since CPUs are complex, full of hidden internal state, and generally black boxes. So it was pretty exciting to get these data dumps, which showed that what was happening was not only comprehensible but in fact very suggestive of the underlying problem.

How useful this actually ends up being, and whether it actually helps users in the wild (or just this special and somewhat contrived test setup) remains to be seen, but I’m hopeful.
