At Sliplane, we run managed Docker hosting. One of our core infrastructure needs is to run isolated, fast, repeatable Docker builds for customer deployments. Initially, we used Fly.io to power these builds. It worked, until it didn’t.
Here’s what broke, how we replaced it, and why our new setup is better.
When we started, we needed:
- Fully isolated VMs per build
- Fast boot times
- Persistent volumes for Docker layer caching
- Auto-suspend and resume behavior to save costs
Fly.io promised all of that:
- Firecracker under the hood
- Easy per-app VM isolation
- Wake-on-request semantics
- Persistent volumes that could auto-scale up to 500 GB
So we launched a dedicated Fly app per customer. Builds would trigger the app via HTTP, and Fly would spin up the VM. Caching worked. Boot times were decent. It felt clean.
What Broke (Repeatedly)
Once we had real usage:
- VMs failed to boot with “out of capacity” errors
- Suspended apps would not reliably wake
- Some VMs just died with no logs and no recovery
- We hit undocumented quotas like the maximum number of apps
Eventually, about 10 percent of all builds failed for reasons unrelated to customer code. We built retries, workarounds, and logging, but we could not fix Fly’s issues.
Why It Was Not a Fit
Fly is optimized for small, stateless web apps. Our builds are nothing like that:
- 16 to 32 GB RAM per VM
- Persistent volumes used across sessions
- Heavy reliance on suspend and resume
Fly’s internal resource management was not made for workloads like this. Even though they now pitch AI agent workloads that sound similar, our experience says to be cautious.
What About Just Buying This as a Service?
There are platforms that specialize in fast, isolated Docker builds, like Depot. For many teams, this is a great option.
But for us, it did not work.
Depot charges 4 cents per build minute and 20 cents per GB per month for storage. That is four to five times more than what we pay by running it ourselves. And our business model does not work with metered pricing.
- We charge customers per server, not per build
- If we paid Depot rates, we would lose money on build-heavy users
- Charging extra for build minutes would add friction and complexity
We want builds to feel free and our billing to be uncomplicated. That only works if we control the cost.
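To make the math concrete, here is a rough sketch of how metered pricing stacks up for one build-heavy customer. Only the 4-cents-per-minute and 20-cents-per-GB-month rates come from Depot's pricing as described above; the usage numbers are illustrative assumptions, not our real figures.

```go
package main

import "fmt"

func main() {
	// Depot's metered rates (USD), as quoted above.
	const perBuildMinute = 0.04
	const perGBMonth = 0.20

	// Illustrative assumptions for one build-heavy customer.
	buildMinutes := 1500.0 // roughly 50 minutes of builds per day
	cacheGB := 50.0        // persistent layer cache

	metered := buildMinutes*perBuildMinute + cacheGB*perGBMonth
	fmt.Printf("metered cost: $%.2f/month\n", metered) // prints "metered cost: $70.00/month"

	// With flat per-server pricing, that whole amount has to fit
	// inside the fixed server fee, no matter how much the customer builds.
}
```

Under metered pricing, one heavy user can quietly invert the margin on a flat-fee plan, which is why controlling the cost ourselves mattered more than the convenience.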
So we built it ourselves.
What We Run Now
We rebuilt everything on top of Firecracker, using bare metal hardware:
- Dedicated MicroVM per customer
- NVMe-backed volumes for fast I/O
- RAM and CPU overcommit at the hardware level
- Our own minimal orchestrator written in Go, about 4000 lines
We run a small number of concurrent builds, usually just a few per server. Builds are bursty by nature: they spike CPU for a short time, then wait on I/O. This makes them perfect for resource sharing. Even at peak load, our servers sit at around 20 percent utilization. It is simple, predictable, and better than any autoscaler we have used.
Our orchestrator does only what we need:
- Boots Firecracker microVMs
- Mounts fast persistent volumes
- Schedules and manages builds
- Cleans up automatically after use
Because it is purpose-built, we skip everything we do not need. No service discovery. No pod networking. No long-running VMs. Just start the VM, run the build, and remove it.
Advantages and What We Gained
- No capacity errors, because we "own" the hardware (we rent bare metal servers, but we are not sharing them)
- No hidden limits, because we wrote the scheduler
- Faster builds, with better I/O and less cold start time
- Full observability
- Predictable runtime behavior
- No third-party surprises
- Lower cost per build, about 20-30 percent of what we paid on Fly
We gave up global networking and PaaS convenience in exchange for control and reliability. Our customers care more about builds working than about which edge location runs them.
Tradeoffs
What we lost:
- Global routing
- Built-in deployment tooling
- Zero-config infrastructure
Should You Do the Same?
Fly.io is a solid choice for:
- MVPs and side projects
- Small, ephemeral apps
- Stateless workloads
- Apps that need global routing
It might not work well for:
- CI or build systems
- Anything with large RAM or volume usage
- Infrastructure with state and tight performance constraints
Try Fly first. But benchmark it with real usage. Do not assume it will scale just because it feels easy at the beginning. Yes, this is entirely our fault for not testing harder upfront :D
Final Thoughts
We did not set out to replace Fly. It just stopped working for our needs.
So we built our own infrastructure on bare metal using Firecracker.
It was more work, but the result is simple. Our builds no longer fail unless the customer's code does.
Jonas, Co-Founder, Sliplane.io
One note: we do run a competing service. But I still like parts of Fly and use it for other internal infrastructure. This post is just about one use case that did not work out, which is totally normal :)