By Jacob Strieb.
Published on November 4, 2025.
I had a friendly disagreement the other day. Imagine looking for ARM (Thumb mode) instructions in a stream of random bytes. A friend claimed that the random stream is more likely to contain DEFLATE-compressed Thumb instructions than it is to contain uncompressed Thumb instructions. I disagreed.
The gist of his argument was that random bytes have high Shannon entropy in expectation, and that compressed bytes have higher entropy than raw assembled instructions. Thus, compressed streams of assembled instructions should be more likely to occur in sequences of random bytes than raw assembled instructions. I don’t debate the premise, but the conclusion does not follow.
I argued that the joint probability of random bytes being a valid DEFLATE stream—and that stream inflating to valid instructions—is lower than the probability of random bytes being valid instructions. Having written a DEFLATE/INFLATE implementation from scratch, I know how many ways there are for compressed data to be invalid. I also know that ARM’s Thumb instruction set is designed for high code density, which directly implies that a large proportion of possible byte sequences disassemble as valid Thumb instructions. Just because the entropy of typical Thumb code in the wild is lower than that of typical compressed data doesn’t mean that the theoretical maximum entropy of valid Thumb is lower too. He was not convinced. I was nerd sniped.
Since he has a PhD in an applied field of computer science, I figured he is most likely to be compelled by strong experimental results in a poorly-written, LaTeX-typeset paper with lots of graphs that he can skim and pretend to have read fully. (I’m only mostly joking.)
Here is a link to the typeset PDF research paper version of this post.
AI Usage Disclosure
Large Language Models (LLMs) were not used for any part of this research. The code and prose are entirely human-written. I wrote a separate post detailing some opinions about LLMs that you should read if you want to know more.
We write Zig code to generate random byte sequences. Then we try to disassemble the bytes using Capstone. We also try to inflate and then disassemble the bytes. We record successes and failures to output statistics at the end.
We test code samples with Zig version 0.14.1. Zig changes fast, so it is likely that code samples will need to be modified to work on other versions.
The full code repository is available on GitHub.
Capstone and Zig
Zig’s build system makes it easy to automatically clone, compile, and link against Capstone. Assuming a build.zig file already exists in the working directory (use zig init to create an empty project if not), the following command updates the build.zig.zon file with Capstone as a dependency.
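The command looks something like the following; the upstream Capstone repository URL is shown as an example of what to pin, and the exact tag or commit to pin is up to you:

```
zig fetch --save git+https://github.com/capstone-engine/capstone
```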
Then we add the following to build.zig.
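A minimal sketch of the build.zig additions, assuming the dependency ended up named capstone in build.zig.zon and that its build script exposes a capstone library artifact (both names are assumptions):

```zig
// In build.zig, after the executable step `exe` has been created.
const capstone = b.dependency("capstone", .{
    .target = target,
    .optimize = optimize,
});
exe.linkLibrary(capstone.artifact("capstone"));
exe.linkLibC();
```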
In our main.zig file, we import and use Capstone. Our Capstone.disassemble method returns a usize that represents the percentage of disassembled instruction bytes (between 0 and 100, inclusive). We only care about the proportion of bytes that successfully disassemble; we don’t care about the disassembled instructions themselves.
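The wrapper in the repository is the authoritative version; a minimal sketch of how such a wrapper might look, using Capstone’s cs_disasm_iter to count how many leading bytes decode before the first failure:

```zig
const std = @import("std");
const c = @cImport(@cInclude("capstone/capstone.h"));

pub const Capstone = struct {
    handle: c.csh,

    pub fn init() !Capstone {
        var handle: c.csh = undefined;
        if (c.cs_open(c.CS_ARCH_ARM, c.CS_MODE_THUMB, &handle) != c.CS_ERR_OK) {
            return error.CapstoneOpenFailed;
        }
        return .{ .handle = handle };
    }

    pub fn deinit(self: *Capstone) void {
        _ = c.cs_close(&self.handle);
    }

    /// Percentage (0-100) of the buffer, starting from the first byte, that
    /// disassembles as valid Thumb instructions before the first failure.
    pub fn disassemble(self: Capstone, buffer: []const u8) usize {
        const insn = c.cs_malloc(self.handle);
        if (insn == null) return 0;
        defer c.cs_free(insn, 1);

        var code: [*c]const u8 = buffer.ptr;
        var size: usize = buffer.len;
        var address: u64 = 0;
        // cs_disasm_iter advances `code` and decrements `size` for every
        // instruction it successfully decodes.
        while (c.cs_disasm_iter(self.handle, &code, &size, &address, insn)) {}
        return (buffer.len - size) * 100 / buffer.len;
    }
};
```

Because cs_disasm_iter reuses a single cs_malloc-allocated instruction buffer, it avoids the per-call array allocations that cs_disasm would make in the hot loop.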
Generating Random Bytes
We generate random bytes using ChaCha seeded by cryptographic randomness. ChaCha is a fast CSPRNG: it produces high-quality random bytes without slowing down our hot loop.
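A sketch of the generator setup, using std.Random.DefaultCsprng (which is ChaCha-based in Zig 0.14) seeded from std.crypto.random:

```zig
const std = @import("std");

pub fn main() !void {
    // Seed the ChaCha-based CSPRNG once from OS-provided cryptographic
    // randomness, then draw every trial buffer from it.
    var seed: [std.Random.DefaultCsprng.secret_seed_length]u8 = undefined;
    std.crypto.random.bytes(&seed);
    var csprng = std.Random.DefaultCsprng.init(seed);
    const random = csprng.random();

    var buffer: [128]u8 = undefined;
    random.bytes(&buffer); // one random 128-byte trial buffer
}
```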
Inflating Bytes
In Zig 0.14.1, the standard library INFLATE implementation operates on reader and writer streams. We build those streams out of pre-allocated buffers, one of which contains the random input data.
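A sketch of that plumbing, with the function name and buffer handling chosen for illustration rather than copied from the repository:

```zig
const std = @import("std");

/// Try to inflate `compressed` as a raw DEFLATE stream into `out`,
/// returning the number of inflated bytes or the decompression error.
fn tryInflate(compressed: []const u8, out: []u8) !usize {
    var input = std.io.fixedBufferStream(compressed);
    var output = std.io.fixedBufferStream(out);
    // Malformed headers, Huffman trees, and matches all surface as errors
    // here; overflowing `out` surfaces as error.NoSpaceLeft.
    try std.compress.flate.decompress(input.reader(), output.writer());
    return output.getWritten().len;
}
```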
Parallelizing
To maximize throughput, we run our main loop across all CPU cores. We keep statistics local to each thread, only compiling global results from the thread-local results when the loop terminates. That way the hot loop is lock-free, and we avoid overhead from contention. The downside of this strategy is that we can’t print aggregated statistics while the trials are running.
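A sketch of that structure (the struct and function names are illustrative, not the repository’s exact code):

```zig
const std = @import("std");

// Per-thread tallies; each worker owns exactly one Stats slot, so the
// hot loop never touches shared state.
const Stats = struct {
    disassembled: u64 = 0,
    decompressed: u64 = 0,
    both: u64 = 0,
};

fn worker(iterations: u64, stats: *Stats) void {
    var local = Stats{};
    var i: u64 = 0;
    while (i < iterations) : (i += 1) {
        // ... generate a random buffer, try to disassemble and inflate it,
        // and update `local` accordingly ...
    }
    stats.* = local;
}

fn runTrials(allocator: std.mem.Allocator, total: u64) !Stats {
    const n = try std.Thread.getCpuCount();
    const threads = try allocator.alloc(std.Thread, n);
    defer allocator.free(threads);
    const results = try allocator.alloc(Stats, n);
    defer allocator.free(results);

    for (threads, results) |*thread, *slot| {
        slot.* = .{};
        thread.* = try std.Thread.spawn(.{}, worker, .{ total / n, slot });
    }

    // Aggregate only after every thread has finished.
    var merged = Stats{};
    for (threads, results) |thread, slot| {
        thread.join();
        merged.disassembled += slot.disassembled;
        merged.decompressed += slot.decompressed;
        merged.both += slot.both;
    }
    return merged;
}
```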
Collecting Decompression Errors
To collect decompression errors, we use Zig compile-time (comptime) metaprogramming. At comptime, we build an array of all possible errors returned from the decompression function. We use that array to initialize a corresponding array of counts, where we tally the number of occurrences of each error.
Notice that to reify an error from its name, we use @field(anyerror, name). This works because anyerror can be considered an enum that collects all errors reachable in the compilation unit. Hence, every possible Zig error is a field of that global enum.
Reifying errors from their names is important for inlining the mapping between returned errors and their corresponding index in the counts array. It is faster to check the equality of the received error and the expected error directly than to serialize the received error to a name and then do string comparison.
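A sketch of the pattern, parameterized over an arbitrary error set; the real code applies it to the error set of the decompression call:

```zig
const std = @import("std");

/// Build a tally type with one counter per error in `ErrorSet`, computed at
/// comptime.
fn ErrorCounts(comptime ErrorSet: type) type {
    const errors = @typeInfo(ErrorSet).error_set.?;
    return struct {
        counts: [errors.len]u64 = [_]u64{0} ** errors.len,

        pub fn record(self: *@This(), err: ErrorSet) void {
            inline for (errors, 0..) |e, i| {
                // Reify the expected error from its name and compare errors
                // directly instead of comparing error-name strings at runtime.
                if (err == @field(anyerror, e.name)) {
                    self.counts[i] += 1;
                    return;
                }
            }
        }

        pub fn dump(self: @This()) void {
            inline for (errors, 0..) |e, i| {
                std.debug.print("{s}: {d}\n", .{ e.name, self.counts[i] });
            }
        }
    };
}
```

Each thread can keep its own ErrorCounts value, call record in its catch branch, and have its counts array summed into the global tally once the threads join.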
Final Script
The full script that combines each of these parts is available in the GitHub repository.
The most important result is that successful disassembly is over 125x more common than successful decompression, and over 350x more common than decompression and disassembly.
| Result | Buffers |
| --- | --- |
| Total | 1,000,000,000 |
| Disassembled | 642,628,412 |
| Decompressed | 5,037,627 |
| Decompressed and disassembled | 1,810,170 |
Disassembly Results
Given that about 65% of 128-byte buffers consist of at least 95% valid Thumb instructions, there are two natural questions: since most Thumb instructions are two bytes long, what proportion of 2-byte sequences are valid Thumb instructions? And what percentage of 128-byte buffers consist entirely of valid Thumb instructions?
| Result | Buffers |
| --- | --- |
| Total | 1,000,000,000 |
| Disassembled | 892,800,538 |
According to our measurements, any 2-byte sequence has an 89.3% chance of disassembling as a valid Thumb instruction. Evidently, Thumb has very high code density.
Note that some Thumb instructions assemble to four bytes, so we cannot assume that every pair of bytes has an independent, identically distributed (Bernoulli) probability of being part of a valid instruction sequence. Our measurements indicate that 855,008,565 (or 85.5%) of 1,000,000,000 4-byte buffers disassembled completely as valid Thumb instructions, compared with the 0.8928² ≈ 79.7% rate we would expect if all Thumb instructions were two bytes long.
Though we can’t assume every pair of bytes is independent of every other pair, we may assume that every 4-byte sequence is independent and identically distributed in terms of the probability of successful disassembly. Using this assumption, we can approximately predict the proportion of fully disassembled 128-byte sequences.
0.855008565^(128/4) = 0.855008565^32 ≈ 0.006653572, or about 0.67%
Compare our prediction with the actual, measured incidence: 4.47% of 128-byte buffers disassembled completely, noticeably more than the independence assumption predicts.
| Result | Buffers |
| --- | --- |
| Total | 1,000,000,000 |
| Disassembled | 44,726,130 |
| Decompressed | 5,036,173 |
| Decompressed and disassembled | 1,577,559 |
Even when we require 100% of bytes to disassemble as valid Thumb instructions, there are still an order of magnitude more buffers that disassemble than decompress.
If we assume groups of four consecutive bytes have an independent and identically distributed probability of successful disassembly, we expect the likelihood of completely disassembling a buffer to decay exponentially relative to the size of the buffer. In other words, the probability of finding a sequence that does not disassemble should quickly approach 1 as the size of the buffer increases. The experimental results validate that theoretical prediction.
Also, unsurprisingly, as the minimum threshold for successful disassembly decreases, the proportion of successes increases for each buffer size.
Decompression Results
Based on our experiments, we know why disassembly is likely to succeed. But, independent of successful disassembly, why is decompression likely to fail?
The rate of successful decompression is fairly constant for buffers above 32 bytes in size.
This suggests that the majority of errors from unsuccessful decompression happen in the first 32 bytes of the buffer, which makes sense, since that is where the small DEFLATE header would be found for valid compressed streams. In other words, it’s unlikely that random bytes make a valid DEFLATE header. But if the header is valid, the rest is likely to successfully decompress, regardless of buffer size.
The errors returned when decompression fails corroborate the hypothesis that failures are primarily caused by invalid DEFLATE metadata.
| Outcome | Count |
| --- | --- |
| Total | 1,000,000,000 |
| Success | 5,030,965 |
| Wrong Stored Block Nlen | 251,255,219 |
| Invalid Block Type | 251,254,609 |
| Invalid Match | 214,744,118 |
| Oversubscribed Huffman Tree | 146,011,334 |
| Incomplete Huffman Tree | 74,024,330 |
| Invalid Dynamic Block Header | 31,141,638 |
| Invalid Code | 26,438,043 |
| Missing End of Block Code | 95,816 |
| End of Stream | 3,928 |
To understand these errors completely, we must explore the structure of the DEFLATE stream.
DEFLATE Structure
A DEFLATE stream consists of one or more blocks. The first bit of every block denotes whether it is the final block in the stream. If the first bit is 1, the decompressor terminates when it sees an “end of block” token. On the other hand, if the first bit is 0, the decompressor will look for another valid DEFLATE header after an end of block token in the stream. Thus, in that case, there is an additional opportunity for any of the errors that could have caused decompression to fail on the first block header.
As a result, we can reduce the failure rate for most classes of error by fixing the first bit of the stream to be 1, so that only one valid header is required, instead of two or more.
| Outcome | First bit 0 | First bit 1 |
| --- | --- | --- |
| Total | 1,000,000,000 | 1,000,000,000 |
| Success | 56,372 | 10,010,648 |
| Wrong Stored Block Nlen | 252,530,381 | 250,005,035 |
| Invalid Block Type | 252,495,113 | 249,999,800 |
| Invalid Match | 215,816,449 | 213,698,259 |
| Oversubscribed Huffman Tree | 146,737,762 | 145,281,110 |
| Incomplete Huffman Tree | 74,398,039 | 73,635,423 |
| Invalid Dynamic Block Header | 31,290,580 | 30,987,133 |
| Invalid Code | 26,574,767 | 26,283,441 |
| Missing End of Block Code | 96,424 | 95,216 |
| End of Stream | 4,113 | 3,935 |
The second and third bits of the DEFLATE block header denote how the data is encoded. There are four possible options for these two bits, and three of the four options are considered valid.
A value of 11 for the second and third bits is considered invalid, and the decompressor will return an “invalid block type” error in that case. Because this pattern occurs for one in four random streams (in expectation), it accounts for approximately 25% of the attempts that end in decompression failure. Setting the second and third bits to 11 makes 100% of decompression attempts fail.
| Outcome | Count |
| --- | --- |
| Total | 1,000,000,000 |
| Success | 0 |
| Invalid Block Type | 1,000,000,000 |
If the second and third bits are 00, the data is included “raw” – it is not encoded or compressed before being included in the stream. In this case, the second and third bytes of the stream represent the length of the raw data, and the two bytes after that are their ones’ complement (logical negation). According to the DEFLATE specification, the data after the first byte in a raw DEFLATE stream looks like this:
```
  2   3   4   5   6...
+---+---+---+---+=================================+
|  LEN  | NLEN  |... LEN bytes of literal data ...|
+---+---+---+---+=================================+
```

The likelihood of the fourth and fifth bytes being the exact ones’ complement of the second and third bytes is:

1 / 2^16 ≈ 0.0015258789%
The extremely low probability of success for the raw block type explains another ~25% of decompression attempts that end in failure with the “wrong stored block nlen” error. Setting the second and third bits to 00 makes almost all of the decompression attempts fail.
| Outcome | Count |
| --- | --- |
| Total | 1,000,000,000 |
| Success | 24 |
| Wrong Stored Block Nlen | 999,984,702 |
| End of Stream | 15,274 |
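For reference, the check that produces this error boils down to something like the following sketch (Zig’s actual std.compress.flate internals are structured differently):

```zig
/// LEN and NLEN are the two little-endian u16 fields following the header
/// byte of a stored block. NLEN must be the ones' complement of LEN, which
/// random bytes satisfy with probability 1 / 2^16.
fn checkStoredBlockLength(len: u16, nlen: u16) !u16 {
    if (nlen != ~len) return error.WrongStoredBlockNlen;
    return len;
}
```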
In the other two block modes, raw data is first LZ77-encoded, then Huffman-encoded to compress it. In Huffman coding, an efficient bit representation of each symbol is generated from the frequency distribution of symbols in the uncompressed data. Symbols that appear more frequently will be represented using fewer bits. The rules for decoding symbols from bits are stored as a “Huffman tree” data structure.
If the second and third bits of the DEFLATE stream are 01, the data is decoded using a hard-coded (“static”) Huffman tree whose structure is given in the DEFLATE specification. If the second and third bits are 10 instead, the Huffman tree was built dynamically, and is included in the stream itself. Using the hard-coded Huffman tree typically leads to worse compression ratios than using a dynamic Huffman tree, but the static tree is more useful for our purpose of decompressing random bytes, since it presents fewer opportunities for erroneous header data.
| Outcome | Dynamic Huffman Code | Static Huffman Code |
| --- | --- | --- |
| Total | 1,000,000,000 | 1,000,000,000 |
| Success | 0 | 40,055,267 |
| End of Stream | 233 | 570 |
| Invalid Code | 15,254 | 105,125,696 |
| Oversubscribed Huffman Tree | 581,086,654 | 0 |
| Incomplete Huffman Tree | 294,586,455 | 0 |
| Missing End of Block Code | 380,197 | 0 |
| Invalid Match | 0 | 854,818,467 |
| Invalid Dynamic Block Header | 123,931,207 | 0 |
In order to refute the strongest version of the claim that decompressing and then disassembling random bytes is more likely to succeed than just disassembling random bytes, we re-run our initial test after fixing the first three bits of the random buffer to 110. That way the otherwise random bytes are always considered the final block in the stream, and always use static Huffman codes to avoid header errors.
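Because DEFLATE packs header fields starting from the least-significant bit, fixing the first three stream bits to 1, 1, 0 means forcing the low three bits of the first byte to 0b011, which matches the --first-bits 0b011 value passed to the decompression-only script shown later. A sketch of the masking:

```zig
/// Force BFINAL = 1 and BTYPE = 01 (static Huffman codes) in a random buffer.
/// The first three stream bits 1, 1, 0 occupy the low three bits of byte 0.
fn forceStaticFinalBlock(buffer: []u8) void {
    buffer[0] = (buffer[0] & ~@as(u8, 0b111)) | 0b011;
}
```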
| Result | Buffers |
| --- | --- |
| Total | 1,000,000,000 |
| Disassembled | 44,724,849 |
| Decompressed | 40,056,259 |
| Decompressed and disassembled | 12,543,383 |
Though the rates of success are much closer, even in this most generous case, disassembly is more likely than decompression, and once again much more likely than decompression and disassembly.
Other Architectures
Besides ARM in Thumb mode, Capstone supports disassembling a wide variety of other architectures and modes. Using C macro reflection in Zig, we can re-run our randomized disassembly across all available architectures and modes.
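One way this might look is to reflect at comptime over the declarations that @cImport translates (enum values and macros alike) and collect everything prefixed with CS_ARCH_; handling of the per-architecture modes is omitted here, and the repository’s approach may differ:

```zig
const std = @import("std");
const c = @cImport(@cInclude("capstone/capstone.h"));

/// Names of every CS_ARCH_* constant exposed by the Capstone header,
/// collected at comptime (this also picks up CS_ARCH_MAX and CS_ARCH_ALL,
/// which a real harness would skip).
const arch_names = blk: {
    var names: []const [:0]const u8 = &.{};
    for (@typeInfo(c).@"struct".decls) |decl| {
        if (std.mem.startsWith(u8, decl.name, "CS_ARCH_")) {
            names = names ++ &[_][:0]const u8{decl.name};
        }
    }
    break :blk names;
};

pub fn main() void {
    // Each name can be turned back into its numeric value with @field(c, name)
    // inside an inline for, and handed to cs_open.
    inline for (arch_names) |name| {
        std.debug.print("{s} = {d}\n", .{ name, @field(c, name) });
    }
}
```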
In many architectures, instruction disassembly is stateful and conditioned on previous instructions (especially when instructions have variable length, as in x86). As a result, many other architectures do not admit the assumption that aligned, fixed-size buffer subsections have independent probabilities of successful disassembly.
Performance
Overall, the Monte Carlo code that powers these “experiments” performs well. Zig is a fast language, and even without any deliberate optimization, the code can disassemble and decompress one billion 128-byte buffers in about half an hour on the 10-year-old, 48-core server we used for testing.
```
jacob@jacob:random-instructions$ time ./zig-out/bin/random_instructions --total-iterations 1_000_000_000 --disassembly-threshold 100
[ ... ]

real    30m35.158s
user    1396m34.236s
sys     1m0.426s
```

Most of the computation time was spent in Capstone. Without profiling in detail, we suspect that overhead from dynamic memory allocation performed inside of Capstone during disassembly accounts for a significant portion of that time. The code is fast enough for testing that it wasn’t worth experimenting with different allocator implementations in Capstone. It likewise wasn’t worth doing any detailed profiling.
The program’s running times scaled approximately linearly with the size of the input buffer. There was no evidence of the disassembly threshold impacting the running time, nor was there an impact on decompression time resulting from fixing particular starting bits.
Adding evidence to the theory that Capstone was the slowest part of the code, the throughput of a different script that only decompressed without attempting to disassemble was much higher. In a randomly chosen run, that script decompressed one billion 128-byte buffers in 5 minutes and 46 seconds.
```
jacob@jacob:random-instructions$ time ./zig-out/bin/random_inflate --total-iterations 1_000_000_000
[ ... ]

real    5m46.619s
user    261m22.503s
sys     1m6.833s

jacob@jacob:random-instructions$ time ./zig-out/bin/random_inflate --total-iterations 1_000_000_000 --first-bits 0b011 --num-bits 3
[ ... ]

real    5m47.705s
user    262m18.896s
sys     1m6.702s
```

In other words, without any particular optimizations, that program had a decompression throughput upper-bounded by:
(128 × 1,000,000,000 bytes) / (5 × 60 + 46 seconds) ≈ 369,942,197 bytes per second ≈ 370 MB/s
The code and raw data used to generate the graphs are on GitHub.
Our theoretical and experimental results point clearly to one conclusion: Thumb instructions have very high code density, and are much more likely to occur in sequences of random bytes than compressed data (especially more likely than compressed data containing valid Thumb instructions).
Even when we artificially modify the random data to make it as likely as possible to decompress, it is still more likely to disassemble than to decompress.
Zig makes it fun to write fast code, and was pretty awesome to use for this exploratory analysis. If you’ve not heard of it, you should definitely check it out.
Other ISAs supported by Capstone show varying code density, and it would be worth researching better candidates than Thumb for maximizing the likelihood of successful disassembly of random byte sequences. For example, does M680X actually have absurdly high code density, or was there a methodological error or code bug artificially inflating its disassembly success rate?
We have done no work to assess whether the instruction sequences appearing in our random buffers are of particular interest. It would be unsurprising if an instruction set with relatively low code density (and therefore a low rate of random disassembly success) had a higher likelihood of useful instruction sequences appearing in random byte streams than a dense instruction set with a high likelihood of disassembly.
There might be other ways to cook up the header of a DEFLATE stream to maximize the decompression likelihood of an otherwise random stream. For example, is it possible to craft a “dynamic” Huffman tree (that we insert at the beginning of every random sequence) that maximizes the likelihood of successful decompression?
If anyone continues this research, please email me or use my contact form to notify me of your results.
Nice try, but this isn’t actually a real research paper. I hope you weren’t hoping for APA-style citations or any such nonsense. The important references are linked inline.
Thanks to Logan Snow and Amy Liu for reading early versions of this post.
Shout out to my anonymous friend whose incorrect—but strongly-held—convictions inspired this work in the first place.