04 Jun, 2025
This summer, I've been at the Recurse Center intensively trying to catch up to the current state of the machine learning world. I don't have any prior background in ML, so I've been taking some classes and reading a lot of papers.
Two weeks in, I now have some basic working knowledge and wanted to get my hands dirty. After reading the Deep Double Descent paper, I wanted to see if I understood enough to reproduce the results. In a previous post, I went over some notes about doing the training for this on a rental GPU, but I figured I'd go into details about the project itself.
Please note that the understanding here is still that of a student - if you spot something wrong, please send me a message!
For a long time, the ML community thought that models could only be so big before they started degrading in accuracy. Around the start of the GPT era, folks realized that you could get better test-time results from a model just by training it for much, much longer.
In 2019, folks at OpenAI and Harvard wrote a paper that tries to formalize this effect and also goes into how model size can impact results, i.e. model-wise double descent where bigger models are eventually better.
The phrase double descent refers to this behavior where error improves at first, then spikes to a much worse peak, then eventually comes back down again.
Models are trained on a training set and then evaluated against a separate test set. When a model is really good at the training set but really bad at the test set, we say it generalized poorly. Imagine: the model memorized the multiple-choice answers on the homework, but that doesn't help with taking the final exam. It's not super clear why this happens, but here's the rough intuition I came away with.
With smaller models, the model can do its best to approximate the right behavior for test time but just doesn't have enough "brain cells" (parameters) to fit the whole problem in its head. Up to a certain point, giving small models more brain cells leads to better performance. This region is called underparameterized.
At around the model size where the model is just barely large enough to match the problem (called the interpolation threshold), the model can memorize "the" solution that lets it ace the training set, but that makes it prone to doing really poorly on the test set.
And finally, at larger model sizes, the model has plenty of brain cells (overparameterization) and can fit enough underlying features to classify well without overfitting to just the training set. This causes the second descent where the model converges to low error again.
There's also a factor here of how well the model can inherently learn the data. The paper discusses that introducing noise (purposefully adding incorrect labels) can be a proxy for this effect to highlight double descent.
To be honest, I'm not sure how much I trust my intuition so far about why double descent happens. It feels like if I squint at this formulation it kind of makes sense to me.
I set out to repro a narrow portion of the paper: just the parts where they trained and compared varying size resnet18s from circa 2015.
To challenge myself, I wanted to start with a blank .py file and not look at existing repro work (though, as we'll see later this did not work out well for me). The final code is here on github.
The paper has quite a bit of detail for the experimental setup. Some helpful excerpts:
We follow the Preactivation ResNet18 architecture of He et al. (2016), using 4 ResNet blocks, each consisting of two BatchNorm-ReLU-Convolution layers. The layer widths for the 4 blocks are [k,2k,4k,8k] for varying k and the strides are [1, 2, 2, 2].
Their experiment swept k from 1 to 64. I ended up deciding to cut down to "only" 7 sizes (k=1, 2, 4, 8, 16, 32, 64) to save on training time.
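To make the scheme concrete, here's a tiny sketch (not code from my repo) of the widths each k produces:

```python
# Per-block convolution widths follow the paper's [k, 2k, 4k, 8k] scheme.
# I'm only sweeping these 7 values of k rather than the paper's full 1..64 range.
for k in [1, 2, 4, 8, 16, 32, 64]:
    widths = [k, 2 * k, 4 * k, 8 * k]
    print(f"k={k:>2}: block widths = {widths}")
```

Note that k=64 recovers the standard ResNet18 widths of [64, 128, 256, 512].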
For ResNets and CNNs, we train with cross-entropy loss, and the following optimizers: (1) Adam with learning-rate 0.0001 for 4K epochs
Adam is a modern gradient descent variant that works better™ and we run it for 4,000 epochs. For reference, the original resnet paper trained for roughly 100 epochs, so we're going for ~40x longer.
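In PyTorch terms, the optimizer side is about as simple as it sounds. A minimal sketch of the setup and training loop, assuming a `model` and `train_loader` along the lines of what's sketched further down:

```python
import torch

# Adam with the paper's learning rate of 1e-4, no schedule, run for 4,000 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(4000):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```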
[...] we apply RandomCrop(32, padding=4) and RandomHorizontalFlip. In experiments with added label noise, the label for all augmentations of a given training sample are given the same label.
This is data augmentation: we introduce some variation into the training images (random crops and horizontal flips) as a common trick to improve generalization.
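Concretely, that's just the usual torchvision transform pipeline (a sketch; the paper only names the two augmentations):

```python
from torchvision import transforms

# The two augmentations from the paper: random 32x32 crops with 4px padding,
# plus random horizontal flips. ToTensor converts the PIL image to a float tensor.
# (Normalizing by CIFAR-10 channel statistics is a common extra step, but the
# paper doesn't call for it, so I'm leaving it out of this sketch.)
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# The test set gets no augmentation, just the tensor conversion.
test_transform = transforms.Compose([transforms.ToTensor()])
```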
Batch size: All experiments use a batchsize of 128.
One of the figures also shows that they trained the models with 0%, 5%, 10%, 15%, and 20% label noise (purposefully mislabeling some items). I opted for doing just 0%, 10% and 20% in the interest of cost savings.
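Here's roughly how the label noise and dataloader can be wired up (a sketch rather than my exact code, reusing the `train_transform` from above; my reading of label noise is that a random fraction of training samples get relabeled with a different, random class, while test labels stay clean):

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

def apply_label_noise(dataset, noise_frac, num_classes=10, seed=0):
    """Relabel a random noise_frac of the training samples with a wrong class.
    The corrupted label is fixed per sample, so every augmented view of that
    sample gets the same (wrong) label, as the paper describes."""
    g = torch.Generator().manual_seed(seed)
    n = len(dataset.targets)
    noisy_idx = torch.randperm(n, generator=g)[: int(noise_frac * n)]
    for i in noisy_idx.tolist():
        old = dataset.targets[i]
        # Draw uniformly from the 9 incorrect classes.
        new = torch.randint(0, num_classes - 1, (1,), generator=g).item()
        dataset.targets[i] = new if new < old else new + 1
    return dataset

train_set = CIFAR10(root="data", train=True, download=True, transform=train_transform)
train_set = apply_label_noise(train_set, noise_frac=0.2)  # 0.0, 0.1, or 0.2

# Batch size 128, per the paper.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
```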
configuring the model
At first, I tried to use the off-the-shelf resnet18 available in the torchvision package but I quickly ran into issues. The biggest problem was that the paper wants us to vary model size using the k parameter for the convolution widths. resnet18 has a width_per_group setting that seems to correspond, but unfortunately...
It's not supported! Digging into the code, width_per_group was meant for the "bottleneck" variants of resnet, which we're not using. The ResNet class hardcodes the width for each layer, so we'll either have to subclass and dig around or fork the code.
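For the curious, the dead end looks roughly like this (based on my reading of the torchvision source; the exact behavior and error message may vary by version):

```python
import torchvision

# width_per_group gets forwarded to the ResNet constructor, but resnet18 is
# built from BasicBlock, which rejects any base_width other than 64 - the
# setting is only honored by the Bottleneck blocks used in resnet50 and up.
model = torchvision.models.resnet18(width_per_group=32)
# -> ValueError: BasicBlock only supports groups=1 and base_width=64
```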
complications
The original resnet18 was designed to run on ImageNet, a large corpus of 224x224 images spanning 1,000 categories (in the classification benchmark it was trained on). We're going to run it on the much smaller CIFAR-10 dataset, which only has 10 categories and 32x32 images. This meant I had to modify the model a little, replacing the final output layer to give one of 10 results, not 1,000.
I figured I could monkeypatch the model (model.fc = t.nn.Linear(...)), but at this point I also realized that the paper uses an old variation of resnets where the order of the convolution, activation and batch norm is different from the way pytorch implements it. I figured I'd fork the code to make things easier.
Stripping away some extra parameters I didn't need, I ended up with a tidy ~100 line model.
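To show what that ordering difference looks like, here's a rough sketch of a preactivation basic block (not the exact code from my repo): BatchNorm and ReLU come before each convolution, instead of after it like in torchvision's blocks.

```python
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Preactivation basic block: BatchNorm -> ReLU -> Conv, twice, plus a skip."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        # 1x1 projection on the skip path when the shape changes.
        self.shortcut = (
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
            if stride != 1 or in_ch != out_ch
            else None
        )

    def forward(self, x):
        out = F.relu(self.bn1(x))
        # Identity skip when shapes match; otherwise project the pre-activated input.
        skip = x if self.shortcut is None else self.shortcut(out)
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        return out + skip
```

The full model stacks two of these blocks at each of the widths [k, 2k, 4k, 8k] with strides [1, 2, 2, 2], then finishes with a global average pool and a 10-way linear layer.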
training a run on my macbook
While developing the model, my macbook was fast enough to handle test runs of the model at around 10s per epoch. However, the paper calls for 4,000 epochs across 64 model sizes and 5 noise configurations. 10s x 4,000 x 64 x 5 ≈ 12,800,000 seconds, or nearly 5 months of training. Yikes.
As mentioned earlier, I cut this down to 3 noise configs and only 7 model sizes, bringing the estimate down to about 10 days. That's kind of a viable time to wait for results, but it would have meant keeping my macbook stationary and running at full blast the whole time. I ended up renting GPU time instead to cut the training run down to ~20 hours.
debugging hell
At this point, I was two days in, feeling pretty good about myself for having a working model and making good time. I didn't know that it would end up taking me 3 days of on-and-off debugging to figure out a few pieces I was missing:
I was calculating label noise but not applying it correctly, which took me longer than I'd like to admit to realize.
I didn't think about the fact that resnet18 is designed for 224x224 ImageNet-sized images, so it does an aggressive downsampling as a first step. Unfortunately, this really kills the model's ability to see what's going on in the much smaller 32x32 CIFAR images. It's not mentioned in the paper, but I finally gave in, looked at the reference implementation, and realized I needed to adjust the model (sketched below).
And finally, I did not realize that the paper uses the words test error to mean classification error (the % of test examples labeled incorrectly, i.e. 1 - accuracy), not cross-entropy loss. This had me going crazy because for the life of me I could not get my test loss to converge to a reasonable number. After a few hours of re-reading the paper and my code, I finally noticed that all of their graphs for error are bounded to [0, 1] and that some other graphs are specifically labeled "loss" instead. I couldn't find a definitive statement, but I'm pretty sure that means we're talking about the fraction of wrong predictions, not a loss value. I am still vaguely annoyed by this oversight.
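For reference, the model adjustment I mentioned above is the usual CIFAR-style stem swap. A sketch, under the assumption that this matches what the reference implementation intends:

```python
import torch.nn as nn

width = 64  # first block width; this is k in the width-swept models

# ImageNet-style stem (what I started with): a 7x7 stride-2 conv followed by a
# stride-2 max pool, which shrinks a 32x32 CIFAR image to 8x8 before any
# residual block ever sees it.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, width, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# CIFAR-style stem (the fix): a single 3x3 stride-1 conv and no max pool, so the
# residual blocks get the full 32x32 resolution.
cifar_stem = nn.Conv2d(3, width, kernel_size=3, stride=1, padding=1, bias=False)
```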
the results
After several training re-runs, the final results came out well! I didn't draw out the same model size x epoch chart that the paper had, but both double descents are legible.
No label noise:
10% label noise:
20% label noise:
some observations
With no label noise, there is no double descent. This matches the paper's findings as well. My interpretation is that the task itself is relatively clean and small (only 10 output categories and 32x32 images), so there's more leeway for the model to learn good features.
At 10% and 20% label noise, we see both model-wise and epoch-wise double descent!
Visually scanning up/down the 10% noise graph at the later epochs, we see that with model size the performance from worst to best is k=1, 2, 8, 16, 4, 32, then k=64. The larger k=8 and k=16 models perform worse than the smaller k=4! The middle-sized models are too busy memorizing the training set to learn the material.
On the same 10% graph, scanning left/right, the larger models (starting with k=16) also exhibit epoch-wise double descent. Right around the 100-300 epoch mark, the bigger models' error gets worse, but they eventually recover and even dip below their earlier minimum from before the bump.
When we look at the 20% graph, the effects are even more pronounced. The k=8 model can't even outperform the k=2 model. It's also interesting to see that the larger models never get back below the minima they hit before the bump. My guess here is that adding too much noise makes the model unable to learn well enough, but maybe we'd see it converge lower if we gave it more epochs to run?
takeaways
It was pretty exciting to see that I could actually repro the paper results! I did end up having to take a peek at the reference implementation near the end, but I'm pretty proud of myself for learning enough to go from scratch to a working model and training loop.