The best worst hack that saved our bacon

2 hours ago 1

No-one really likes engineering war stories, but this one's relevant because there's a moral to it. I've talked before about defining technical debt as technical decisions that provide immediate value, but with long-term negative impact if they aren't cleaned up. Sometimes introducing technical debt is necessary and you do it consciously to avoid a disaster. As long as you provide yourself enough room to clean it up, it's just part of the regular course of business when millions of people count on your software to get through their days.

Twelve years of calendar appointments on our platform, and the data model was starting to show some wear and tear. Specifically, though, our occurrence table was created with a plain integer primary key, and we were approaching two billion occurrences on the calendar. Well, specifically, the primary key was rapidly approaching 2,147,483,647 – the magic number that is the maximum value for a signed 32-bit integer.

We had actually known about this for some time, and we had done most of the work to fix it already. Our backend code was upgraded to bigints and the actual column itself had a migration set to upgrade it to a big integer. The plan had been in the works for a month and a half, and we almost ran with it. But then, roughly a week before we were going to deploy it (and maybe only a month before the keys ran out), someone, maybe me, I don't recall, noticed that these integer keys were exposed in one of our public APIs.

You can count on one thing in SaaS software. If you provide an integration API to your customers or vendors and it exposes an attribute, that attribute is crucial to someone, somewhere. And in our case the people using the integrations often had to rely on their university's IT department to do the integration itself. Those backlogs are counted in months, and so we couldn't deploy something that would potentially break customer integrations. What to do?

Well, Postgres integer primary keys are signed. So there's this WHOLE other half of the 32-bit word that you're not using if you're just auto-incrementing keys. My simple (read stupid) solution, which absolutely worked was to set the sequence on that primary key to -2,147,483,648 and let it continue to auto-increment, taking up the other half of that integer space. It was so dumb that I think we met like three times together with SRE to say things like, "Is it really this simple? Is this really likely to work? Are we really doing something this dumb?" and the conclusion was yes, and that it would buy us up to 3 years of time to migrate, but we would do it within 6-8 months so all IT departments can make alternative arrangements for their API integrations.

The long term solution was the BigInt, yes, but it was also to expose all keys as opaque handles rather than integers to avoid dictionary attacks and so that we could use any time we needed to on the backend without API users having to account for it. It was also to work through the Customer Success team and make sure no-one counted on the integer-ness (integrality?) of the keys or better that no-one was using the occurrence IDs at all.

In the end we had a smooth transition because of quick thinking and willingness to apply a baldfaced hack to our production (and staging) database. We had a fixed timeline we all acknowledged where the tech debt had to be addressed, and we'd firmly scoped out the negative consequences of not addressing it. It wasn't hard, but it meant that no matter who was in charge or what team changes were made, the cleanup would get done in time and correctly. It was the right thing to do. A few customers had been counting on those IDs and we were able to advise their IT departments about how to change their code and to show them what the new API response would look like long before they actually were forced to use it. In the meantime, everything just worked. Do I advise that you use negative primary keys to save room on your database? No. Was it the right choice of technical debt for the time? Absolutely.

Read Entire Article