At Sliplane, we run managed Docker hosting. One of our core infrastructure needs is to run isolated, fast, repeatable Docker builds for customer deployments. Initially, we used Fly.io to power these builds. It worked, until it didn’t.
Here’s what broke, how we replaced it, and why our new setup is better.
When we started, we needed:
- Fully isolated VMs per build
- Fast boot times
- Persistent volumes for Docker layer caching
- Auto-suspend and resume behavior to save costs
Fly.io promised all of that:
- Firecracker under the hood
- Easy per-app VM isolation
- Wake-on-request semantics
- Persistent volumes that could auto-scale up to 500 GB
So we launched a dedicated Fly app per customer. Builds would trigger the app via HTTP, and Fly would spin up the VM. Caching worked. Boot times were decent. It felt clean.
What Broke (Repeatedly)
Once we had real usage:
- VMs failed to boot with “out of capacity” errors
- Suspended apps would not reliably wake
- Some VMs just died with no logs and no recovery
- We hit undocumented quotas like the maximum number of apps
Eventually, about 10 percent of all builds failed for reasons unrelated to customer code. We built retries, workarounds, and logging, but we could not fix Fly’s issues.
Why It Was Not a Fit
Fly is optimized for small, stateless web apps. Our builds are nothing like that:
- 16 to 32 GB RAM per VM
- Persistent volumes used across sessions
- Heavy reliance on suspend and resume
Fly’s internal resource management was not made for workloads like this. Even though they now pitch AI agent workloads that sound similar, our experience says to be cautious.
What About Just Buying This as a Service?
There are platforms that specialize in fast, isolated Docker builds, like Depot. For many teams, this is a great option.
But for us, it did not work.
Depot charges 4 cents per build minute and 20 cents per GB per month for storage. That is four to five times more than what we pay by running it ourselves. And our business model does not work with metered pricing.
- We charge customers per server, not per build
- If we paid Depot rates, we would lose money on build-heavy users
- Charging extra for build minutes would add friction and complexity
We want builds to feel free and our billing to be uncomplicated. That only works if we control the cost.
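To make the math concrete, here is a rough sketch of how metered pricing stacks up for one build-heavy customer. Only the 4-cents-per-minute and 20-cents-per-GB-month rates come from Depot's pricing as described above; the usage numbers are illustrative assumptions, not our real figures.

```go
package main

import "fmt"

func main() {
	// Depot's metered rates (USD), as quoted above.
	const perBuildMinute = 0.04
	const perGBMonth = 0.20

	// Illustrative assumptions for one build-heavy customer.
	buildMinutes := 1500.0 // roughly 50 minutes of builds per day
	cacheGB := 50.0        // persistent layer cache

	metered := buildMinutes*perBuildMinute + cacheGB*perGBMonth
	fmt.Printf("metered cost: $%.2f/month\n", metered) // prints "metered cost: $70.00/month"

	// With flat per-server pricing, that whole amount has to fit
	// inside the fixed server fee, no matter how much the customer builds.
}
```

Under metered pricing, one heavy user can quietly invert the margin on a flat-fee plan, which is why controlling the cost ourselves mattered more than the convenience.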
So we built it ourselves.
What We Run Now
We rebuilt everything on top of Firecracker, using bare metal hardware:
- Dedicated MicroVM per customer
- NVMe-backed volumes for fast I/O
- RAM and CPU overcommit at the hardware level
- Our own minimal orchestrator written in Go, about 4000 lines
We run a small number of concurrent builds, usually just a few per server. Builds are bursty by nature: they spike CPU for a short time, then wait on I/O. This makes them perfect for resource sharing. Even at peak load, our servers sit at around 20 percent utilization. It is simple, predictable, and better than any autoscaler we have used.
Our orchestrator does only what we need:
- Boots Firecracker microVMs
- Mounts fast persistent volumes
- Schedules and manages builds
- Cleans up automatically after use
Because it is purpose-built, we skip everything we do not need. No service discovery. No pod networking. No long-running VMs. Just start the VM, run the build, and remove it.
Advantages and What We Gained
- No capacity errors, because we "own" the hardware (we rent bare metal servers, but we are not sharing them)
- No hidden limits, because we wrote the scheduler
- Faster builds, with better I/O and less cold start time
- Full observability
- Predictable runtime behavior
- No third-party surprises
- Lower cost per build, about 20-30 percent of what we paid on Fly
We gave up global networking and PaaS convenience in exchange for control and reliability. Our customers care more about builds working than about which edge location runs them.
Tradeoffs
What we lost:
- Global routing
- Built-in deployment tooling
- Zero-config infrastructure
Should You Do the Same?
Fly.io is a solid choice for:
- MVPs and side projects
- Small, ephemeral apps
- Stateless workloads
- Apps that need global routing
It might not work well for:
- CI or build systems
- Anything with large RAM or volume usage
- Infrastructure with state and tight performance constraints
Try Fly first. But benchmark it with real usage. Do not assume it will scale just because it feels easy at the beginning. Yes, this is entirely our fault for not testing harder upfront :D
Final Thoughts
We did not set out to replace Fly. It just stopped working for our needs.
So we built our own infrastructure on bare metal using Firecracker.
It was more work, but the result is simple. Our builds no longer fail unless the customer's code does.
Jonas, Co-Founder, Sliplane.io
One note: we do run a competing service. But I still like parts of Fly and use it for other internal infrastructure. This post is just about one use case that did not work out, which is totally normal :)