Leaving the Cloud Isn't for Everyone


When DHH's Math Makes Sense — and When It Doesn't

Tim O'Brien

David Heinemeier Hansson announced that 37signals moved Basecamp and HEY off the public cloud two years ago, and now he's publishing the numbers to prove it. He also showed up on Lex Fridman's podcast, talking about how amazing modern hardware is. His math is solid, and his position is 100% valid: there absolutely are workloads where owning your infrastructure makes economic sense.

But I want to add a note of caution for people listening: not everyone is in Basecamp's position. Moving a workload from the public to the private cloud is a decision that should be made after rigorous analysis. Private cloud is right for specific applications, and for organizations that have the technical and financial capability to execute, but there are some things to think about and model before you jump on this bandwagon.

Let's start with some background on David and 37signals, the company Jason Fried and DHH run in Chicago. For their business — mature applications, predictable loads, strong operational talent — the move to private infrastructure has paid off.

I interviewed DHH in Chicago 17 years ago, and we literally ran from the cops when someone from Chicago Parks called the police about my unpermitted filming of that interview. I was interviewing him about Rails, and we got about halfway through when an official interrupted me to ask for my filming permit. I responded that I didn't have one. The official asked David and me to wait while the police were called so I could be given a ticket.

David and I looked at each other and started walking briskly. Someone yelled at us to come back, and we started sprinting. David has kept busy since then, racing cars among other things, but I'd imagine he still has some recollection of that day. It was the strangest ending to a software interview I can recall.

What I've always seen from DHH is that he doesn't back down from a position he believes in. Some dismiss him as a contrarian, but I've never seen him argue from fiction — he argues from conviction, backed by evidence. And I also like that he doesn't avoid controversial topics. I can't say I agree with all of his opinions, but he's someone I pay attention to because he's frequently and unapologetically correct. You can also tell that he enjoys debate.

In this particular debate—public versus private cloud—he's picked a battle worth having. His story, the story for Basecamp, is that private cloud is more efficient. Mine is that there are factors you should analyze and model before copy-pasting his playbook.

The Daily Cloud Bill is a Feature, Not a Bug

The pattern is familiar: a business looks at its public cloud bill, watches it grow unmanageable, and decides to repatriate. I get it — I've moved workloads from public to private when it made sense. But here's what I learned: the punishing cloud bill is actually a feature. When you move back to private, you lose it.

When you first move to the public cloud, that invoice feels like a punch in the face. But it wakes you up. Cost becomes a live, daily signal that forces immediate architecture decisions. You start asking hard questions about why you're storing years of debug logs, or why that dev environment runs 24/7 when nobody touches it on weekends. You discover cron jobs that hit the database 10,000 times a day for no good reason, and realize you have twelve copies of the same service running in regions where you have no customers.

The bill forces you to delete stale data, right-size instances, slow down noisy processes, and consolidate duplicative systems. It's a cold shower that keeps you honest about waste, and in my experience it drives greater efficiency precisely because the cost is immediate. It isn't a cost amortized over five years; it's a cost you can change tomorrow.

On-prem, that pressure evaporates. Infrastructure spend becomes sunk CapEx — it feels "free" until the next hardware refresh cycle. The discipline fades. Waste accumulates quietly, and when you lose track of efficiency in a private cloud, you don’t run out of money first — you run out of capacity.

Key points here: if you are thinking about a private cloud, consider capacity constraints, and try to preserve a cost signal by translating your CapEx back into an OpEx-style number. Even if that number is a fiction, measure your daily spending rate.
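What that fictional signal could look like in practice: below is a minimal Python sketch, with invented figures and field names, that turns an assumed hardware purchase plus fixed colo and staffing costs into a synthetic daily invoice you can publish internally.

```python
from dataclasses import dataclass

@dataclass
class PrivateCloudSpend:
    """Hypothetical inputs for a synthetic daily cost signal (showback)."""
    hardware_capex: float        # total purchase price of servers, switches, etc.
    amortization_years: float    # how long finance spreads the purchase over
    annual_colo_fees: float      # rack space, power, bandwidth, remote hands
    annual_ops_salaries: float   # headcount attributed to this footprint

    def daily_rate(self) -> float:
        """CapEx amortized per day plus ongoing OpEx per day."""
        capex_per_day = self.hardware_capex / (self.amortization_years * 365)
        opex_per_day = (self.annual_colo_fees + self.annual_ops_salaries) / 365
        return capex_per_day + opex_per_day

# Example numbers are made up; the point is to publish this figure daily,
# the way a public cloud bill would arrive, so the pressure doesn't evaporate.
spend = PrivateCloudSpend(
    hardware_capex=600_000,
    amortization_years=5,
    annual_colo_fees=120_000,
    annual_ops_salaries=300_000,
)
print(f"Synthetic daily invoice: ${spend.daily_rate():,.2f}")
```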

Private is a Different Kind of Expenditure

On public cloud, you can quickly run out of budget; on private cloud, without that focused pressure to reduce consumption, you can quickly run out of capacity. And a capacity crisis feels different, especially at scale with unpredictable workloads. You can't just spin up new instances. You have to procure hardware (sometimes a lot of it), and that means supply chain delays, racking, and stacking.

There are process and budgeting issues, and you often have to forecast demand well in advance because you can't just walk into a store and buy racks of computers. If you repatriate to private infrastructure and don't recreate the cost signal internally — through showback, chargeback, or rigorous capacity reviews — the same waste that drove up your cloud bill will creep back in. Instead of a large public cloud bill, you'll be dealing with unanticipated CapEx, and private cloud spend feels different.

In public cloud, you might receive a scary monthly invoice for more than you expected, but you have a chance to reduce it over time by paying attention to architecture and efficiency signals. In private, a surprise procurement of new hardware is the equivalent of 60 monthly invoices if you amortize general compute hardware over five years. (And there's more complexity: you also need to model migration time, racking and stacking, and end-of-life refresh.)
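To make that arithmetic concrete, here is a back-of-the-envelope sketch with invented numbers that converts a surprise procurement, plus a crude allowance for migration overlap, into the monthly terms a cloud bill trained you to think in.

```python
# Back-of-the-envelope: express a surprise CapEx purchase as the number of
# "monthly invoices" it represents. All figures below are illustrative.

surprise_procurement = 450_000      # unexpected hardware purchase ($)
amortization_months = 60            # general compute amortized over 5 years
migration_months = 3                # double bubble: old and new capacity both running
monthly_overlap_cost = 15_000       # cloud + colo spend carried during migration

effective_total = surprise_procurement + migration_months * monthly_overlap_cost
equivalent_monthly_invoice = effective_total / amortization_months

print(f"Equivalent monthly invoice: ${equivalent_monthly_invoice:,.0f} "
      f"for the next {amortization_months} months")
```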

David's in a unique position because he owns his company alongside Jason, so he can decide for himself how to trade CapEx against OpEx. A lot of people who read his posts and agree with his conclusions don't have that kind of control over their own organizations. Before you start running toward private, talk to your partners in Finance and understand the company's approach to operational versus capital expenses. What would the short-term impact of the migration be? Is there a double bubble between public and private? How does that affect the overall budget? Does your finance partner view OpEx and CapEx as 'interchangeable'? If you reduce or increase one, how does it affect the other?

If you work at a smaller company, this might not be a problem. At a larger company, there may be tighter constraints on what can change in this space, and I would also start modeling what a surprise would look like: what happens when a new application becomes incredibly popular and requires a large amount of CapEx because of the change in cloud strategy?

When Private Wins — and When It Doesn't

Let me be specific about when a private cloud makes sense. If your workload is static — running 24/7/365 with no meaningful scale-up or scale-down — and you're not using the provider's platform-as-a-service features, you're essentially just running VMs you manage yourself. In that case, there's really no point in paying the public cloud tax. Basecamp and HEY fit this pattern perfectly: mature applications with predictable loads.
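One way to sanity-check whether you are in that bucket is a crude break-even comparison between renting always-on compute at an effective (discounted) rate and owning equivalent hardware run hot. The sketch below uses placeholder prices and deliberately ignores the operational and security costs discussed later.

```python
# Crude break-even check for a steady-state, always-on workload.
# All prices are placeholders; plug in your own effective rates.

HOURS_PER_YEAR = 24 * 365

def annual_cloud_cost(instances: int, effective_hourly_rate: float) -> float:
    """Always-on instances at a discounted (not list) hourly price."""
    return instances * effective_hourly_rate * HOURS_PER_YEAR

def annual_owned_cost(servers: int, server_price: float, amortization_years: float,
                      colo_per_server: float, ops_per_server: float) -> float:
    """Amortized hardware plus per-server colo and operational allocation."""
    return servers * (server_price / amortization_years + colo_per_server + ops_per_server)

cloud = annual_cloud_cost(instances=400, effective_hourly_rate=0.038)
owned = annual_owned_cost(servers=12, server_price=15_000, amortization_years=5,
                          colo_per_server=2_500, ops_per_server=4_000)

print(f"Cloud (always-on, discounted): ${cloud:,.0f}/yr")
print(f"Owned (amortized + colo + ops): ${owned:,.0f}/yr")
```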

But this math breaks down quickly for other workload types, starting with storage.

You're never going to achieve the same level of scale or efficiency in storage as you get with S3 or Google Cloud Storage. Sure, someone will claim their Ceph cluster costs one-third as much. Maybe. But storage isn't just a matter of terabytes. It's about backups across regions, intelligent tiering that moves data between hot, cold, and archive storage based on access patterns, and durability guarantees that reach 11 nines for S3. It's the compliance certifications that let you handle regulated data without years of audit prep, and global edge locations that put content milliseconds away from users worldwide.
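For a sense of what that tiering looks like when the provider handles it, here is roughly what a lifecycle rule looks like through boto3; the bucket name, prefix, and day thresholds are invented. Replicating this behavior, and the durability behind it, on a self-hosted object store is work you would own.

```python
import boto3

# Hypothetical bucket and thresholds; the point is that tiering, expiry, and the
# durability behind these storage classes are provider features you would
# otherwise have to build and operate yourself.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-app-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm -> infrequent access
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold -> archive
                ],
                "Expiration": {"Days": 365},                      # delete after a year
            }
        ]
    },
)
```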

Another area that is difficult to compete in, unless you have scale, is GPUs and AI workloads.

Building GPU clusters requires specialized expertise that goes beyond typical operations. You need sophisticated thermal management to prevent literal meltdowns when thousands of watts concentrate in a single rack. (Or millions of watts: the industry is on track for a 1 MW rack.) The high-bandwidth networking configurations required for distributed training aren't something an average network engineer has touched. Hardware cycles evolve rapidly, with new chips arriving every six to twelve months, making yesterday's investment look antiquated. Most colocation facilities can't support the power density these systems demand — we're talking racks that draw more power than entire floors of traditional servers.

For smaller companies, the CapEx alone for GPUs can be prohibitive. But the real cost is expertise — finding people who understand both the hardware intricacies and the software stack? Those folks are in overwhelming demand right now. Good luck.

One last area that will cause problems for private is variable workloads.

If you need 8x your current capacity for three days a year — think Super Bowl, tax season, or viral moments — private cloud forces you into impossible choices. You could overprovision year-round, essentially paying for servers that sit idle 362 days a year. You could turn customers away during peaks, watching revenue evaporate while your competitors handle the load. Or you could build complex hybrid systems that require maintaining expertise across multiple platforms, creating an operational nightmare that defeats the purpose of simplification.
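A toy model with invented numbers shows why the choice hurts: owning enough hardware for the three-day peak year-round versus owning the baseline and renting the burst (which, to be fair, brings the hybrid complexity described above).

```python
# Toy model of a spiky workload: baseline capacity all year, 8x for 3 days.
# All costs are illustrative placeholders.

baseline_servers = 10
peak_multiplier = 8
peak_days = 3

owned_cost_per_server_year = 9_500   # amortized hardware + colo + ops
burst_instance_hourly = 0.10         # on-demand public cloud price per equivalent unit
instances_per_server_equiv = 20      # rough mapping of one server to cloud instances

# Option A: own enough hardware for the peak, let most of it idle 362 days a year.
own_for_peak = baseline_servers * peak_multiplier * owned_cost_per_server_year

# Option B: own the baseline, rent the extra 7x only during the peak window.
burst_units = baseline_servers * (peak_multiplier - 1) * instances_per_server_equiv
rent_the_burst = (baseline_servers * owned_cost_per_server_year
                  + burst_units * burst_instance_hourly * 24 * peak_days)

print(f"Own for peak: ${own_for_peak:,.0f}/yr")
print(f"Hybrid burst: ${rent_the_burst:,.0f}/yr")
```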

The sweet spot for private cloud is DHH's exact scenario: a predictable, steady-state workload where you can buy precisely what you need and run it hot.

Don’t Hand-Wave Security and Operations

I’ve seen comments like: “We run MySQL ourselves — it’s easy once it’s set up.” I’ve run production databases at scale. “Easy” is not the word I would use. Disks fail. Replication drifts. Backups rot. Schema migrations land at 2 a.m. Hardware fails whenever it senses that you are worrying about it.

But forget hardware failure — let's talk about security, because that is where the conversation gets serious.

The Thousand Quiet Things

When you use Azure, Google Cloud, or AWS, they’re doing a thousand quiet things you never see. They’re applying microcode and firmware patches without downtime, live-migrating your VMs off suspect hosts before the hardware fails. They’re performing hardware attestation to ensure your compute hasn’t been compromised at the silicon level. Background processes constantly scrub data to prevent bit rot from corrupting your storage. They maintain compliance audits that you inherit for free, saving you months of certification work and tens of thousands in audit fees.

The TEE.fail Wake-Up Call

With vulnerabilities like TEE.fail affecting trusted execution environments, physical security isn’t theoretical anymore. Someone with three minutes of physical access can install a $1,000 device between your memory and motherboard, compromising everything. When you work with a hyperscaler, you’re leveraging their security team of thousands, which understands the whole attack surface.

Moving to your own racks in a colo facility? You’re now responsible for physical security that goes well beyond locked cages. You need supply-chain verification for all hardware, because the cheap memory you bought might already be compromised. You need incident response capabilities that can mobilize at 3 a.m. when anomalies appear. Vulnerability management now spans your entire stack, from BIOS to the application layer. You should be running red-team/blue-team exercises to identify weaknesses before attackers do. And you need forensics capabilities ready to deploy when — not if — something happens.

“We’re patching our VMs” is about 5% of what security actually means in 2025.

The Real Cost of Security

If your comparison model is “hardware vs. EC2 list price,” you’ve dramatically undercounted. That security team — incident response, vulnerability management, compliance audits, penetration testing — costs real money. Large enterprises allocate 10–15% of their IT spend to security. Did you factor that into your private cloud calculations?

DHH probably has some of the industry's most intelligent people working for him. He's in the best position to handle this complexity. But for most organizations, the security landscape gets more complex every day. When you're responsible for your customers' data in your own facility, you'd better have world-class people ready to respond. (Especially if you run an email service for customers.)

The Risk Premium You’re Paying Cloud to Carry

Let me tell you a story. In 1997, I had a business building a search engine in Charlottesville. Lightning struck our building. We had backups, but the strike took out both the backup and our primary system. No cloud, no real recovery strategy. For me, it was the dramatic end of the company. I ended up going back to school. Lightning literally changed my career path.

Since then, in every role, I’ve witnessed the unpredictable. I’ve seen multiple disk drives fail simultaneously in ways that are statistically impossible, turning what should be a minor incident into a data recovery nightmare. Winter power outages have taken down facilities in states with supposedly reliable grids. Backhoes have cut fiber lines nobody knew existed, severing redundant connections that turned out to share the same trench. Supply chain shocks have made SSDs unavailable for months, forcing emergency purchases at 3x normal prices. I’ve received rack shipments where the metal was so weak it created OSHA violations, delaying deployments by weeks.

The Chelyabinsk Reminder

When I explain to people that data centers are subject to unpredictable, catastrophic events that no amount of planning can fully mitigate, I use this example. On February 15, 2013, a meteor exploded at high altitude over Chelyabinsk, Russia. This wasn’t science fiction — it was a real event that damaged 7,200 buildings, shattered thousands of windows across the city, injured 1,500 people, and knocked out power and communications infrastructure across the region. The shockwave was powerful enough to blow people off their feet miles from the epicenter.

Now imagine if that meteor had exploded over Ashburn, Virginia, or San Jose, California — two of the world’s largest data center hubs. We’d be discussing massive, simultaneous outages across dozens of facilities. No amount of N+1 redundancy or diesel generators helps when the building itself is structurally compromised. These are the events you can’t model in your TCO spreadsheet, the black swans that make that cloud premium look like sensible insurance.

Have You Priced in Black Swans?

Most “we saved a million dollars” blog posts are written in year one or two of private cloud migration. But Black Swan events operate on multi-decade timelines. All those savings evaporate with one catastrophic event. When your city goes offline for a month, hyperscalers automatically fail over to other regions. Your five racks in a colo facility don’t have the same ability to react.

If your answer to extended downtime can't be "we'll ship some servers to a new city once we can get in touch with our provider," then pure private cloud is a bet you need to underwrite with serious investments. You need geographic redundancy, which means doubling your infrastructure costs across multiple locations. You need spare capacity — what we call "double bubble" during migrations — because you can't migrate services without somewhere to put them. Multi-site failover capabilities require sophisticated networking and automation that most teams underestimate. Those 24/7 remote hands contracts get expensive fast, especially when you need skilled technicians who can diagnose hardware issues, not just power cycle servers. And you need people — your own people — who can fix the weird stuff at 3 a.m. when the vendor's support line puts you on hold.

Public cloud handles all of this invisibly. That's the premium you’re paying. It’s insurance against the unthinkable.

Do the Boring Math First

DHH’s team had already done serious FinOps work in the cloud before leaving. Many companies haven’t. Before you buy pallets of servers, actually do the math:

First, Optimize Your Cloud Spend

Start with the low-hanging fruit that most companies ignore. Reserved Instances and Savings Plans often deliver 40–60% off list price — if you actually commit to using them. Right-sizing reveals that most instances are overprovisioned by 30–50%, burning money on CPU cycles that never get used. Hunt down zombie resources like that dev environment running 24/7 when developers only work 40 hours a week. Implement proper storage tiering to move old data to lower-cost storage tiers rather than keeping everything in premium tiers forever.
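As one example of zombie hunting, here is a rough boto3 sketch that flags running EC2 instances with very low average CPU over the last two weeks. The 5% threshold and 14-day window are arbitrary assumptions, and real right-sizing also needs memory, network, and business context.

```python
import datetime
import boto3

# Flag likely zombie instances: running, but averaging under an (assumed) 5% CPU
# over the last 14 days. Credentials and region come from your environment.
ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=14)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=start,
            EndTime=end,
            Period=86400,          # one datapoint per day
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if not datapoints:
            continue
        avg_cpu = sum(d["Average"] for d in datapoints) / len(datapoints)
        if avg_cpu < 5.0:
            print(f"{instance_id}: avg CPU {avg_cpu:.1f}% over 14 days -- candidate zombie")
```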

Model effective cloud rates, not list prices. AWS list price for m5.large might be $0.096/hour, but with a 3-year savings plan, it’s $0.038/hour.
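Here is a minimal sketch of what modeling the effective rate means, reusing the m5.large figures above and assuming a commitment coverage ratio, which is the invented part.

```python
# Effective rate = blend of committed (discounted) and on-demand (list) usage.
# List and savings-plan rates are from the example above; coverage is assumed.

list_rate = 0.096          # m5.large on-demand, $/hour
committed_rate = 0.038     # with a 3-year savings plan, $/hour
coverage = 0.80            # fraction of usage actually covered by the commitment

effective_rate = coverage * committed_rate + (1 - coverage) * list_rate
savings_vs_list = 1 - effective_rate / list_rate

print(f"Effective rate: ${effective_rate:.3f}/hr ({savings_vs_list:.0%} below list)")
```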

Then, Calculate True Private Costs

Hardware & Infrastructure: Server amortization over 5–7 years is just the start. You need 20–30% spare capacity for failures because hardware doesn't wait for convenient times to die. During refresh cycles, you'll need overlap periods where you're paying for both old and new infrastructure. Colo fees cover more than rack space: they include power, bandwidth, and those expensive remote hands calls when something needs physical attention. Don't forget growth headroom, because unlike cloud, you can't just spin up more capacity when you need it tomorrow.

Operational Overhead: This is where budgets explode. You’ll need additional headcount or managed services contracts to handle the increased operational burden. Security tooling and audits that cloud providers include become line items in your budget. The compliance certifications you inherited for free now cost tens of thousands of dollars annually. A proper DR site with regular testing isn’t optional — it’s table stakes. And don’t forget on-call compensation for the team that now owns every layer of the stack.

Hidden Costs: Procurement delays will affect your time-to-market when you can’t get servers for three months due to supply chain issues. There’s a massive opportunity cost when your ops team spends time racking servers instead of building features. And there’s a risk premium for handling your own incidents — what’s the cost when your senior engineers spend a weekend rebuilding a failed storage array instead of shipping product?
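Pulling the three buckets above into one comparison, here is a skeleton of the math I'm arguing for. Every figure is a placeholder; the structure, hardware plus operations plus the hidden items people forget, is the point.

```python
# Skeleton TCO comparison. All figures are placeholders; adjust for your own
# workloads, discounts, and staffing before drawing any conclusion.

def annual_private_tco(hw_capex, amort_years, spare_fraction, colo, headcount_cost,
                       security_and_compliance, dr_site, hidden_risk_allowance):
    # Hardware amortized per year, including 20-30% spare capacity for failures.
    hardware = (hw_capex * (1 + spare_fraction)) / amort_years
    return (hardware + colo + headcount_cost + security_and_compliance
            + dr_site + hidden_risk_allowance)

def annual_optimized_cloud(effective_hourly_rate, instance_hours, storage_and_egress):
    # Use effective (discounted) rates, not list prices.
    return effective_hourly_rate * instance_hours + storage_and_egress

private = annual_private_tco(
    hw_capex=900_000, amort_years=5, spare_fraction=0.25,
    colo=150_000, headcount_cost=450_000,
    security_and_compliance=120_000, dr_site=200_000,
    hidden_risk_allowance=75_000,
)
cloud = annual_optimized_cloud(
    effective_hourly_rate=0.05, instance_hours=20_000_000, storage_and_egress=180_000,
)

print(f"Private (all-in):  ${private:,.0f}/yr")
print(f"Cloud (optimized): ${cloud:,.0f}/yr")
```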

Only after modeling both sides honestly can you make an informed decision.

Conclusion: It’s Not Religion, It’s Math

DHH’s cloud exit is a success story — for his specific context. The math worked because they have predictable, steady-state workloads that run 24/7 without dramatic scaling needs. They employ world-class engineers who can handle the complexity of running bare-metal systems, from kernel tuning to handling hardware failures. They’d already done the hard work of optimizing their cloud spend before leaving, so they knew exactly what they were paying for and why. And they’re philosophically aligned with owning their stack, preferring deep understanding over convenient abstraction.

But copying their playbook without their context is like copying their code without understanding their architecture. Before you cancel your cloud contracts, start with honest math: compare optimized cloud costs — not list prices — against actual private costs that include all operational overhead, not just hardware. Assess your absolute risk tolerance for handling infrastructure failures and security incidents when there’s no hyperscaler to call. Evaluate whether your workload patterns are genuinely steady state or if you’re fooling yourself about that Black Friday spike. Consider your team’s actual capabilities and appetite for operational complexity — wanting to run your own infrastructure and being able to are different things. Most importantly, think in timelines longer than quarterly reports, because infrastructure decisions compound over decades, and that cheap colo might look expensive after your first major outage.

The fundamental insight isn’t “cloud bad, private good” or vice versa. It’s that infrastructure decisions must align with your business reality, not someone else’s blog post.

Also: I noticed many hardware vendors enthusiastically sharing DHH’s posts. When you read these stories, always consider who benefits from the narrative.

I upvoted DHH’s post because he’s sharing real data from a real migration. But I’m writing this piece so people understand it’s not a simple calculation. Like most infrastructure decisions, it’s complicated. And in 2025, it’s getting more complicated every day.
