Since 2002, I have been keeping track of all the tricky bugs I have come across. Nine years ago, I wrote a blog post with the lessons learned from the bugs up until then. Now I have reviewed all the bugs I have tracked since that post, to see whether I have learned the lessons I listed in the first review, and what kinds of bugs I have encountered since. Like before, I have divided the lessons into the categories of coding, testing and debugging:

Coding
1. Empty cases. Five bugs had to do with empty lines, empty files, spaces, or values of zero. For example, lines with one space (not zero) should have been skipped as empty, but were not. In another case, empty headers in CSV files caused problems. In a recent example, reminder mails were sent out even though there were zero missing mappings that needed to be fixed. I noted in my previous post that I failed to consider cases of zero and null. Evidently, I need to be even more vigilant in spotting these kinds of errors (see the sketches after this list).
2. Days. Four bugs had to do with days in one way or another. For example, logic that looks at the previous day needs to consider what should happen if the previous day falls on a weekend. If you make assumptions about how many holidays there can be in a row, remember the Golden Week in Japan. Also, checking that the end date is after today is not enough to see if an agreement is active – the start date may also be after today (a sketch of both checks appears after this list).
3. Old data formats. Upgrading the logic to use a changed data format is always tricky. You have to consider that old data in the database may have to be converted to the new format. Also, there can be transient cases where ongoing operations still use the old format, even though the new logic has been deployed. There was also a case where we stopped agreement names from ending in whitespace, but the 4-eye approval logic failed on old names.
4. Aliased dicts/HashMaps. More than once, I accidentally created a second dict that was just an alias to an already existing dict. This meant that a change in one of them also showed up in the other, which led to very confusing effects when running the code (see the last sketch after this list).
5. Local changes. Sometimes, I had local changes that I forgot to push, so what was tested locally was not what was deployed. Ideally it should have been caught in CI tests, but there were no tests for these specific cases. A related case: while working locally, I commented out some code, made some changes, then uncommented the code again. But now some other logic had changed (while the code was commented out), leading to bugs.
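To make point 1 concrete, here is a minimal sketch (in Python, not the actual code from any of these systems) of treating whitespace-only lines as empty when reading a file:

```python
def non_empty_lines(path):
    """Yield only lines with actual content.

    A line consisting of a single space is just as empty as a line of
    length zero, so both are skipped.
    """
    with open(path) as f:
        for line in f:
            if line.strip():  # "" and " " are both falsy after strip()
                yield line.rstrip("\n")
```

The bug described above is essentially the difference between checking `if line:` (which lets a lone space through) and `if line.strip():`.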
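For point 2, a hedged sketch of the two date checks mentioned above, with made-up function names and assuming plain `datetime.date` values:

```python
from datetime import date, timedelta

def is_active(start_date, end_date, today=None):
    """An agreement is active only if today lies inside the whole interval.

    Checking the end date alone is not enough: an agreement that starts
    in the future also has an end date in the future.
    """
    today = today or date.today()
    return start_date <= today <= end_date

def previous_business_day(day):
    """Step back one day, skipping Saturdays and Sundays."""
    previous = day - timedelta(days=1)
    while previous.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        previous -= timedelta(days=1)
    return previous
```

(Holidays such as Golden Week would need a real holiday calendar on top of the weekend check.)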
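And for point 4, the aliasing mistake in Python looks roughly like this (a toy example):

```python
defaults = {"retries": 3, "timeout": 30}

# Bug: this creates an alias, not a copy - both names refer to the same dict
settings = defaults
settings["timeout"] = 60
assert defaults["timeout"] == 60  # the "defaults" changed too

# Fix: make an explicit copy (copy.deepcopy() if the values are nested)
defaults = {"retries": 3, "timeout": 30}
settings = dict(defaults)
settings["timeout"] = 60
assert defaults["timeout"] == 30  # defaults is untouched
```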
Testing
6. Exploratory testing. I caught many bugs when doing some exploratory testing before I finished a feature. Often it was related to feature interactions, where various features happened to be turned on or off, which revealed bugs. In another case, I thought the customer was using a feature in a specific way. But when that didn’t work, I asked them, and they told me they used the feature in a completely different way. Also, some things become obvious when you look at them in a GUI. For example, one change I made accidentally added “hasApiKey=false” to all records displayed, but the idea was to hide anything set to false.
7. Smaller config in test. Usually, the test system is smaller than the prod system in many ways. For example, the test system may only have one event handler, but the prod system has two. This led to a bug where two events that should have been handled in sequence were handled in parallel in prod. The events went to two different event handlers, but in test (with only one event handler), they were always handled sequentially. These kinds of bugs are naturally very hard (or impossible) to discover in test.
8. Access rights. Sometimes I tested features with a user with too much access. This made it seem like the feature worked, when in fact it only worked if the user had certain access rights enabled.
Debugging
9. Good logging. For many of the bugs, the key to solving them was looking at the logs to figure out what had happened. For example, when one of three (supposedly identical) calendar services gave the wrong answer, I could see in the logs that the faulty one had received only a fraction of the data at start-up (with no error indication). Reading logs and error messages carefully is also important – often I would assume I knew what had happened, without checking carefully in the logs. The timestamps in them are also very helpful. For the “How found” section of several of the bugs, I wrote something along the lines of: “Then I searched in Kibana around the minute the dead letter happened”.
10. Discussing with colleagues. As before, discussing with a colleague is an incredibly effective way of solving difficult bugs. In one recent case, we were all in the office together when we were troubleshooting. Normally we work remotely three days a week, but being physically close makes cooperating even more effective.
11. Alerting. Some errors would not have been noticed at all, or not early enough, if it wasn’t for alerting. Setting up good alarms really pays off.
12. Reproducing with the smallest case. In many cases, I had a working case and a failing case (maybe in the main branch and in a feature branch). Commenting out code (or otherwise reducing the functionality) was key to finding the cause of the problem.
Reflections
Going through the notes of all these bugs was quite fun. Some of the bugs I would have remembered even without the notes. Many of them I remembered when I read the notes, and some I had no memory of, even after reading the notes. It was quite nostalgic to remember the colleagues I used to work with, and the systems we worked on together (in different programming languages). What really struck me was the amount of detail each system is made up of. It made me think (again) about how much of software engineering is actually learning about the domain.
Looking back at my post from nine years ago, have I avoided the problems I highlighted there? For the most part, I have. But I have still failed many times to handle cases with empty, zero or null. This is something I have to pay even more attention to. There was also one potentially really bad bug caused by a faulty if-statement. Luckily, I caught it when doing some exploratory testing, and noticed something weird in the logs. As for reading the logs, I should follow my old advice of “pay close attention” more often. But on the whole, I have managed to avoid many of the types of bugs I used to cause in the past.
Analysis
The diagram below shows how many bugs I have recorded each year since the start. For the past nine years, I have encountered one tricky bug every two months on average.

Not every bug was caused by me. Sometimes bugs caused by other people are so interesting that I include them too. For the past nine years, around 70% of the bugs were caused by me. I also keep notes on how much time was spent on fixing the bug. This includes troubleshooting it, fixing it and testing the fix. Below is a diagram of how long it took. Note that anything over 8 hours means multiple days. So 24 hours is 3 days, not 24 hours nonstop.

Conclusion
Many errors seem quite inexplicable, until you figure out what the problem is. For example, there was an SQL error that none of us could explain. In the end, it turned out that one node (that did the database queries) had not been restarted, so it ran an old version of the software. Other times, the cause is not hard to figure out once you see it, but interesting nevertheless. Several years ago there was an overflow in Cassandra. The variable in question was an int in both Python and Cassandra, but in Python integers can be arbitrarily large, whereas in Cassandra, an int is 32 bits.
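As a rough sketch of that mismatch (the helper name is made up, not from the actual system): a Python int only overflows when it reaches the 32-bit column, so a guard before the write makes the failure explicit:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # range of a 32-bit Cassandra int

def check_fits_in_int32(value):
    """Python ints grow without bound, so nothing overflows until the value
    hits the 32-bit column; failing before the write gives a clearer error."""
    if not (INT32_MIN <= value <= INT32_MAX):
        raise ValueError(f"{value} does not fit in a 32-bit int column")
    return value

check_fits_in_int32(2**31 - 1)  # fine: the largest value the column can hold
try:
    check_fits_in_int32(2**31)  # one too many: raises ValueError
except ValueError as error:
    print(error)
```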
Whatever the cause, it is always satisfying to figure out what happened. Bugs are great sources for learning, and by tracking the trickiest ones, I am trying to learn as much as possible from each one of them.