Evaluating Image Compression Tools


2025-06-13

Tags: compression, tinyavif

[Update 2025-06-24: I found a significant issue in the encoder comparison script around threading. This has now been fixed, and the graphs recalculated.]

When I wrote the last part of my AVIF encoder series, I only did a very minimal comparison between tinyavif and other, more widely used, image and video encoders. The plan was to come back and write a better comparison tool later, along with a blog post explaining the basics.

However, shortly after I finished the AVIF series, Gianni Rosato wrote a great blog post covering a lot of what I originally planned to say. So instead of retreading the same ground, I decided it would be good to look at some more advanced topics, building on what he wrote.

So today we'll explore: How still-image compression differs from video compression, how to gather good data for the comparisons, and Netflix-style multi-resolution encoding.

And, last thing before we start: All of the graphs in this post were generated using version 2.1 of my image encoder comparison tool. It's almost a complete rewrite from the version used in the AVIF series, based on experience from writing the first version. Currently this tool only handles image compression, not video compression, so if you want to compare video encoders I would recommend looking at the tool Gianni used in his post, written by the authors of the SVT-AV1-PSY fork of SVT-AV1.

Before we can compare anything, we need to decide what data we're going to collect and how. This is something where every bit of effort pays off significantly, because it shapes so many downstream decisions. This is especially true for comparisons used as part of codec development: a huge chunk of the progress in video and image compression in the last 30+ years has been the accumulation of many small improvements. And to do that, we need to be able to reliably identify small differences to help guide development decisions.

The main things we need to decide on are: Which input files to encode in the first place, how to evaluate the quality of the resulting files, and how to measure runtime. We'll tackle each of those in turn.

Selecting input files

Something I glossed over last time is the importance of testing on a wide variety of different images and videos, not just a single file. Ideally, unless you're writing some kind of extremely special-purpose encoder, the test set should include a relatively even sampling of different genres and styles, to make the results as widely relevant as possible.

Unfortunately finding good test files can be quite difficult, because so much of the material we might want to test on is copyrighted. Fortunately this situation has improved in recent years thanks to several organizations making uncompressed files available for research purposes. Xiph.org have helpfully collected together many of these datasets here.

At the same time, we need to think carefully about how many files we include in our test set, and how many encodes we run for each file. Thinking about it from a statistics perspective, each encode we run gives us a noisy sample of one of our chosen encoders. Taking more samples - either by using more input files, or running more quality points per file - helps, because we can average them out to get a less noisy estimate of the "true" performance of each encoder. But at the same time, each encode we run takes more compute time, so there's a balancing act here.

One thing I would like to do is work through the statistics to calculate some proper error bars, so that we can make an informed decision about how many inputs to run. Unfortunately I haven't seen anything about this elsewhere, so for now I'm just going to present plain averages. But please do let me know if you've seen anything about this that I've missed! Otherwise I'll have to have a go at it myself in a future post.
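In the meantime, as a rough illustration of one possible approach, a bootstrap over the per-file results would give approximate confidence intervals without much statistical machinery. This is only a sketch - the per-file BD-rate numbers below are made up for illustration, and this isn't something my comparison tool currently does:

```python
import random
import statistics

# Hypothetical per-file BD-rate differences (%) between two encoders,
# one value per input file in the test set. Negative = encoder A is smaller.
bd_rate_per_file = [-3.1, -2.4, -5.0, -1.8, -4.2, -2.9, -3.7, -0.9,
                    -2.2, -3.5, -4.8, -1.5, -2.7, -3.3, -2.0, -4.1]

def bootstrap_mean_ci(samples, n_resamples=10_000, alpha=0.05):
    """Estimate a confidence interval for the mean by resampling files
    with replacement, treating each input file as one noisy sample."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(samples, k=len(samples))
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(samples), (lo, hi)

mean, (lo, hi) = bootstrap_mean_ci(bd_rate_per_file)
print(f"mean BD-rate: {mean:.2f}%  (95% CI: {lo:.2f}% .. {hi:.2f}%)")
```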

With all that in mind, I wanted to pick a set of inputs with consistent resolutions (to simplify the multi-resolution encoding we'll talk about later). So I picked out the first frame from 19 videos provided by the NTIA in the US, Taurus Media Technik in Germany, and Sveriges Television AB in Sweden - for which I would like to thank all three organizations.

I had to reject 3 of those 19 videos because some encoders couldn't manage the full quality range I wanted, even pushing to the absolute highest and lowest qualities the encoders would allow. That left a set of 16 images, which will be referred to throughout this post as the "Mixed test set".

To demonstrate the importance of using a varied test set, I decided to try comparing different speed settings of libaom, first on the single source file I used last time (a frame taken from the short film Big Buck Bunny), and then on the full 16-file test set. These turned out to give very different size vs. quality graphs:

We can see that libaom speed 8 is much worse than the surrounding presets on the specific frame I picked out from Big Buck Bunny, but this difference disappears when averaging over multiple files. This shows how testing on a single file can be quite misleading. Since I did exactly that in my tinyavif series, I've revisited that comparison toward the end of this article to see if the results hold up to more thorough testing.
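For anyone who wants to reproduce this kind of sweep, the overall shape of it is roughly as follows. This is only a sketch: it drives aomenc directly on single-frame .y4m inputs, and the file names, quality levels, and speed range are placeholders rather than the exact settings my comparison tool uses.

```python
import subprocess

# Placeholder inputs: single-frame .y4m files extracted from the test videos.
inputs = ["big_buck_bunny_frame.y4m"]          # hypothetical file name
speeds = range(4, 10)                          # libaom --cpu-used presets to compare
cq_levels = [12, 20, 28, 36, 44, 52, 60]       # quality sweep for each preset

for src in inputs:
    for speed in speeds:
        for cq in cq_levels:
            out = f"{src}.s{speed}.q{cq}.ivf"
            subprocess.run(
                ["aomenc", "--ivf", "--limit=1",   # encode just one frame
                 f"--cpu-used={speed}",
                 "--end-usage=q", f"--cq-level={cq}",
                 "-o", out, src],
                check=True,
            )
```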

Evaluating image quality

Once we have our compressed image files, we need to compare each one to the uncompressed original and assign it a numerical quality score. The gold standard for this is of course human evaluation, because at the end of the day what we're really trying to do is minimize the file size while still looking good to humans.

Human tests are done by having many people look at original/compressed image pairs, often in a controlled environment to minimize the influence of unintended variations in lighting, screen settings, and so on. These scores are then averaged, and recorded as a "Mean Opinion Score" (MOS) for each compressed image.

However, these evaluations are relatively expensive to run. So we often use a variety of proxy metrics to try to more cheaply approximate human ratings. These run the whole gamut from easy-to-calculate metrics like PSNR to complex models of human perception like the recent SSIMULACRA2 and Butteraugli metrics.

Generally speaking, the more complex metrics correlate better with human ratings than simpler metrics do, especially when comparing across encoders. When comparing incremental improvements to the same encoder, the received wisdom is that PSNR is often good enough - though I must admit I'm not entirely convinced of that.

In any case, I'm running a comparison between multiple different codecs, so for these tests I'm using SSIMULACRA2, like I did in the tinyavif series.
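For reference, computing the metric boils down to running the reference tool on each original/decoded pair. Here's a minimal sketch, assuming the `ssimulacra2` reference binary is on your PATH, takes two image paths, and prints a single numeric score; the file names are placeholders:

```python
import subprocess

def ssimulacra2_score(original_png: str, compressed_png: str) -> float:
    """Run the SSIMULACRA2 reference binary and parse the score it prints.
    Assumes the binary is called `ssimulacra2`, takes two image paths,
    and writes just the numeric score to stdout."""
    result = subprocess.run(
        ["ssimulacra2", original_png, compressed_png],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

if __name__ == "__main__":
    # Placeholder file names for illustration.
    print(ssimulacra2_score("original.png", "decoded.png"))
```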

Measuring runtime

Trying to measure how long it takes to encode a video runs into all of the usual pitfalls that affect any benchmark - there are a whole lot of noise sources which need to be controlled to get high-quality results. I want to talk about benchmarking more generally in a future post, so for now I'll just mention one trick I'm quite fond of:

For context, how long an encode takes can be quite strongly affected by whether it's the only thing running, or if there are other encodes running on the same system. A lot of this is because of competition for memory bandwidth - which, as Alex Yee observed in his review of AMD's Zen 5 microarchitecture, is a major bottleneck on most modern systems. As a result, one single-threaded encode might not run into memory bandwidth issues, but (say) 16 threads' worth of encodes in parallel definitely will.

Now, running one encode thread per (logical) CPU will still maximize overall performance, so that's what any serious encoding box will do as much as possible. Which means that's what we want to test, because we want to model real-world usage as closely as possible.

Thing is, we can't perfectly schedule our encodes to all finish at the same moment, so we'll always have some trailing encodes which don't have to compete quite as hard for memory bandwidth as the rest - resulting in those encodes being disproportionately faster. This is very hard to completely avoid without serious shenanigans, but we can reduce the impact by being clever about the order in which we run our encodes:

By starting the longest encodes in a batch (generally the highest-quality, highest-resolution ones) first, and the shortest encodes last, the short encodes can effectively fill in the gaps formed when the longer encodes finish. This helps in two ways: It shortens the "tail" period where we just have a few stragglers still running, and it biases that tail toward lower-resolution encodes which tend to demand less memory bandwidth in the first place and so are less sensitive to interference from other threads. As a nice bonus, shortening the tail period also means that the encode batch as a whole finishes a little quicker.
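Here's a minimal sketch of that ordering trick, assuming each encode is a single-threaded external command. The job fields and the pixel-count-times-quality cost heuristic are placeholders of my own, not exactly what my tool does:

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class EncodeJob:
    cmd: list[str]      # full encoder command line
    width: int
    height: int
    quality: int        # higher = slower, as a rough proxy

def estimated_cost(job: EncodeJob) -> float:
    # Crude heuristic: more pixels and higher quality => longer encode.
    return job.width * job.height * (1 + job.quality / 100)

def run_batch(jobs: list[EncodeJob]) -> None:
    # Longest-first ordering shortens the "tail" of straggler encodes.
    ordered = sorted(jobs, key=estimated_cost, reverse=True)
    workers = os.cpu_count() or 1   # one single-threaded encode per logical CPU
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for job in ordered:
            pool.submit(subprocess.run, job.cmd, check=True)
```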

By and large, image compression tends to be harder to improve than video compression, so the differences between codecs/encoders (and between different settings of a single encoder) tend to be smaller for image compression than for video compression. As a very rough ballpark, each new generation of codecs tends to reduce the size of a typical image file by 20% and of a typical video by 30-40% at the same quality. This will mainly manifest as smaller BDRATE differences in the graphs below than you might be used to if you're coming from the video compression world.

Another difference, which I hadn't expected when I started this project, is that image and video compression programs (not necessarily the codecs themselves, but the front-ends) expect different input formats. In the video world, the YUV4MPEG2 format is ubiquitous for describing uncompressed video. Unfortunately, neither of the image compression programs I wanted to test (JPEGli and JPEG-XL) supported this, instead mainly expecting PNG-format input.

This turned out to cause problems when trying to set up a truly fair comparison, as all of my test inputs are from videos originally, so are in YUV-style colour spaces with 4:2:0 subsampling. All of the AV1-based image compression programs accepted this, but for the others I had to first convert the images to PNGs, which (as far as I can tell) only support non-subsampled sRGB. This was true even for JPEGli, which does support 4:2:0 subsampling internally - there's just no way to get lossless 4:2:0 format images into it in the first place!

This creates a bit of a bind, because every format conversion causes a certain amount of degradation of the image quality, which will artificially reduce the quality scores of the resulting compressed files. So, in the interest of maximum fairness, I made sure that all of my compression and evaluation pipelines did exactly one 4:2:0 -> 4:4:4 conversion, regardless of codec. For the JPEG-like codecs, this meant converting first and then compressing, while for the AVIF encoders the files were compressed first and then converted.

I'm still a little unsatisfied with that approach - it's still not perfectly fair, but it's the fairest of the easy options. Ideally the command-line tools for JPEGli/JPEG-XL would add support for a lossless 4:2:0 format like YUV4MPEG2, as that would allow a fully fair comparison.
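For reference, the single conversion step in the JPEG-side pipeline is just an ffmpeg invocation along these lines. This is a sketch assuming ffmpeg is installed; the exact colour-space handling my scripts use may differ:

```python
import subprocess

def y4m_frame_to_png(y4m_path: str, png_path: str) -> None:
    """Convert the first frame of a 4:2:0 YUV4MPEG2 file to a non-subsampled PNG.
    This is the one chroma upsampling step in the JPEG-side pipeline.
    Assumes ffmpeg is on PATH; colour-space flags may need tuning to
    match your source material."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", y4m_path,
         "-frames:v", "1",          # just the first frame
         "-pix_fmt", "rgb24",       # forces the 4:2:0 -> 4:4:4 (RGB) conversion
         png_path],
        check=True,
    )
```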

Multi-resolution encoding is an idea I originally learnt about from Netflix. The basic idea is that, generally speaking, video compression tends to be more effective at high quality settings than at lower quality settings. There are a few reasons for this, but the upshot is that, if aiming for low qualities / low bitrates, it can often be more efficient to downscale the image before encoding.

The exact cutoff point where it makes sense to encode at (say) 720p resolution instead of 1080p resolution varies a lot between different images/videos. So we need to make this decision per input file, before we average the statistics from multiple files together. And of course we need to make the decisions separately per codec, before comparing the averaged quality curves.

As one example, let's look again at the example I've been using from Big Buck Bunny. We'll try encoding at four resolutions: 1080p, 720p, 480p, and 360p. Note that each of these sizes has roughly half the total number of pixels of the next larger size, while still sticking to fairly-standard sizes. After compression, each file was rescaled up to 1080p and compared to the original input, giving a set of effective size:quality curves:

We can then think about what we want to achieve from two different angles: If we're targeting a specific quality, we can use whichever resolution gives the smallest file size for that quality (ie, whichever curve is leftmost at that vertical position). Or if we're targeting a specific bitrate, we can use whichever resolution gives the best quality (ie, whichever curve is highest at that horizontal position). These both generate the same "multi-resolution" curve, which looks like this:

Here the dotted line is the tradeoff curve for encoding at 1080p only, while the solid line uses the best of the available sizes at each quality point. You can see that it makes a big difference at the lower quality end.
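Conceptually, building that solid curve is just a Pareto-front computation over all the (size, score) points from every resolution, with scores measured after upscaling back to 1080p. Here's a minimal sketch; the data layout is a placeholder rather than my tool's actual format:

```python
from dataclasses import dataclass

@dataclass
class EncodePoint:
    size_bytes: int       # compressed file size
    score: float          # SSIMULACRA2 after upscaling back to 1080p
    resolution: str       # which encode resolution produced this point

def multi_resolution_curve(points: list[EncodePoint]) -> list[EncodePoint]:
    """Keep only points that are not dominated by any other point,
    i.e. nothing else is both smaller and at least as high quality."""
    best: list[EncodePoint] = []
    for p in sorted(points, key=lambda p: p.size_bytes):
        if not best or p.score > best[-1].score:
            best.append(p)
    return best
```

Targeting a quality then means picking the smallest point in the resulting list whose score is at or above the target, while targeting a size means picking the highest-scoring point at or below the target.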

So let's try this with a more capable encoder - libaom at speed 6:

Now the difference is much smaller. Which makes sense: libaom is already reasonably well optimized for a wide range of quality settings, so there's less room for this trick to improve things. Still, it provides some benefit even for a mature encoder, so it's worth keeping this trick up our sleeve.

As an aside, AV1 actually has a couple of ways of doing this internally: individual frames can be encoded at different sizes, and there's a "super-resolution" (or superres for short) feature which lets the encoder apply some filtering after upscaling. But both of these are disabled by default, and so don't factor into our comparison here.

Putting everything together, we can redo our comparison from last time, only this time with higher-quality data:

This tells a slightly different story to last time: While libaom still manages to beat out the others (though using much more CPU time) in most cases, occasionally jpegli takes the lead. Meanwhile, tinyavif doesn't quite manage to beat out jpegli at the high end - though it's not too far off by the time we get to a score of 90 (representing visually-lossless compression).

We can also look at how this is affected by multi-resolution encoding. Once again, the dotted lines are for encoding at 1080p only, solid lines pick the best resolution for each quality point:

So, as we might expect from our earlier results, multi-resolution encoding tends to narrow the gap between the encoders, in both size and runtime. One consequence of this is that, if developing an encoder, we might want to focus only on the mid-to-high quality range, and let the low-quality end be covered by encoding at a reduced resolution with a higher relative quality.

Speed settings and other encoders

Finally, I wanted to bring a couple of other factors into this comparison:

  1. libaom has a whole bunch of speed settings; so far we've only looked at the default, which is speed 6

  2. There are a whole raft of other encoders we could test. In the end I decided to compare all three major AV1 encoders (libaom, SVT-AV1, and rav1e), as well as JPEGli and JPEG-XL - and, of course, my own tinyavif. I also wanted to try WebP, but surprisingly the reference encoder couldn't reach an SSIMU2 score of 90 on many of the test images, so I left that out as it wouldn't fit nicely into this comparison.

Of course, this leads to an explosion in the number of encodes we need to run, and there's no way we could plot all those curves on a single graph. But fortunately, there's another standard way to slice the data: If we pick a quality range and average the size and runtime numbers over that range, we can calculate a sort of "representative size" and "representative runtime" for each encoder + speed setting combination.
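One simple way to compute such a representative size is sketched below: interpolate log(size) at evenly spaced SSIMU2 scores over the chosen range and take the geometric mean, and likewise for runtime. This is my own illustration in the same spirit as BD-rate-style averaging; the exact formula my tool uses may differ.

```python
import math

def representative_size(scores: list[float], sizes: list[int],
                        lo: float = 30.0, hi: float = 90.0,
                        steps: int = 61) -> float:
    """Average a size:quality curve over a quality range.

    `scores` must be sorted ascending and span [lo, hi]; `sizes` are the
    corresponding file sizes. We interpolate log(size) at evenly spaced
    quality points and return the geometric mean, so halving the size
    everywhere halves the result."""
    def log_size_at(q: float) -> float:
        # Piecewise-linear interpolation in (score, log size) space.
        for (s0, b0), (s1, b1) in zip(zip(scores, sizes),
                                      zip(scores[1:], sizes[1:])):
            if s0 <= q <= s1:
                t = (q - s0) / (s1 - s0)
                return (1 - t) * math.log(b0) + t * math.log(b1)
        raise ValueError(f"quality {q} outside the measured range")

    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return math.exp(sum(log_size_at(q) for q in grid) / steps)
```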

Averaging over the full range recommended by the SSIMU2 reference implementation (low quality at SSIMU2 = 30, up to visually lossless at SSIMU2 = 90), and making use of multi-resolution encoding, we get:


In this graph, down and left is better. It tells us that, for still images (we can't generalize to video from this!), libaom is the best encoder to use for AVIF files and is certainly the most versatile encoder (having both the fastest preset and the best-compression preset), but JPEG-XL beats it handily at more typical settings.

We also see that, even with the score compression that multi-resolution encoding gives us, tinyavif is effectively about 1-2 generations behind any other AV1 encoder. That's honestly not bad considering how simple tinyavif is!

And for completeness, here's the graph without multi-resolution encoding, so we only encode at the full resolution (1080p):

As we saw before, tinyavif really suffers at the low-quality end when forced to encode at full size, and that skews the graph quite a lot. But what's particularly interesting is that the difference between JPEG-XL, libaom, and rav1e is much less clear-cut than it was when using multi-res encoding. That must be driven by the low-quality end, because at the high-quality end multi-res encoding just falls back to encoding at the full size anyway.

I'm fairly happy with this testing infrastructure now! Of course, there's a near-infinite amount of things we could do to further improve the results, especially in terms of reducing noise in the runtime measurements - and there's a good chance I'll make some of those improvements in the future. But it's accomplished what I wanted for the moment.

It's especially interesting to me just how much difference multi-resolution encoding makes to the relative positioning of different encoders. To repeat an observation from earlier, one thing this suggests is that an encoder could feasibly get away with only focussing on the mid-to-high-quality range (SSIMU2 scores 50-90, maybe), and rely on encoding at lower resolutions to fill in the low-quality end for users who want that.

Finally, these results have me really excited for JPEG-XL - hopefully more web browsers will support it in the not-too-distant future so I can use it on this blog!
