Yes, this is a rant. I can’t hold it in anymore. It’s getting to the point of total nonsense.
Every day there’s a new “AI (insert specialisation) engineer” promising rainbows and unicorns, a 10x productivity increase, and the ability for one engineer to do what used to require a hundred.
Really???
How many of them actually work?
Has anyone seen one - just one - of those tools even remotely resembling something useful??
Don’t get me wrong, we are fortunate to have this new technology to play with. LLMs are truly magical. They make things possible that weren’t possible before. For certain problems, there’s no going back - there’s no point clicking through dozens of ad-infested links anymore to find the answer to a basic question, just like there’s no point scaffolding a trivial, isolated piece of code by hand.
But replacing a profession? Are y’all high on something or what?!!
The core problem with these toys is arrogance. There’s this cool new technology. VCs are excited, as they should be about once-in-a-generation tech. But then founders raise tons of money from those VCs and assume that millions in the bank automatically give them the right to dismantle the old ways and replace them with shiny new, better ones. Those newer ways are still being built - a bit like a truck being assembled while en route - but never mind. You just gotta trust that it’ll all work out fine in the end.
It doesn’t work this way! You can’t just will a thing into existence and assume that people will change the way they’ve always done things overnight! Consumers are the easiest to persuade - it’s just the person and the product, no organisational inertia to overcome - but even the most iconic consumer products (e.g. the iPhone) took a while to gain mainstream adoption.
And then there’s also the elephant in the room.
As infra people, what do we care about most?
Is it being able to spend thirty seconds less writing a piece of Terraform code?
Or maybe it’s to crank out as much sloppy YAML as we possibly can in a day?
“Move fast and break things” right?
Of course not! The primary purpose of our job - in fact, the very reason it’s a separate job - is to ensure that things don’t break. That’s it, that’s the job. This is why it’s called infrastructure - it’s supposed to be reliable, so that developers can break things; and when they do, they know it’s their code because infrastructure always works. That’s the whole point of it being separate!
So maybe builders of all those “AI DevOps Engineers” should take a step back and try to understand why we have DevOps / SRE / Platform engineering as distinct specialties. It’s naive to assume that the only reason for specialisation is knowledge of tools. It’s like assuming that banks and insurers are different kinds of businesses only because they use different types of paper.
We learned this the hard way. Not so long ago we built a “chat to your AWS account” tool and called it “vibe-ops”. With the benefit of hindsight, it’s obvious why it got so much hate: “vibe coding” is the opposite of what infra is about!
Infra is about risk.
Infra is about reliability.
It’s about security.
It’s definitely NOT about “vibe-coding”.
So does this mean that there is no place for AI in infra?
Not quite.
It’d be odd if infra stayed on the sidelines while everyone else rushed ahead, benefitting from the new tooling made possible by LLMs. It’s just that a different kind of tooling is needed here.
What kind of tooling?
Well, if our job is about reducing risk, then perhaps - some kind of tooling that helps reduce risk better? How’s that for a start?
And where does the risk in infra come from? Well, that stays the same, with or without AI:
- People making changes that break things that weren’t supposed to be affected
- Systems behaving poorly under load / specific conditions
- Security breaches
Could AI help here? Probably, but how exactly?
One way to think about it is to observe what we actually do without any novel tools, and where exactly the risk gets introduced. Say an engineer unintentionally re-creates a database instance that holds production data by renaming it, and the data is lost. Who would catch and flag it, and how?
There are two possible points in time at which the risk can be reduced:
- At the time of renaming: one engineer submits a PR that renames the instance; another engineer reviews it and flags the issue (sketched in Terraform below)
- At the time of creation: again, one engineer submits a PR that creates the DB; another engineer reviews it and points out that it doesn’t have automated backups configured
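To make the rename case concrete, here’s a minimal Terraform sketch - the resource names and attribute values are made up for illustration, and the config is trimmed to what matters here:

```hcl
# Before the PR, the instance was declared as:
#   resource "aws_db_instance" "orders" { ... }

# After an innocent-looking rename of the resource label:
resource "aws_db_instance" "orders_db" {
  identifier        = "orders-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
}

# Without the block below, Terraform reads the rename as
# "destroy aws_db_instance.orders, create aws_db_instance.orders_db" -
# and the production data leaves with the old instance.
moved {
  from = aws_db_instance.orders
  to   = aws_db_instance.orders_db
}
```

The `moved` block (Terraform 1.1+) is exactly the kind of thing a reviewer is expected to ask for - and exactly the kind of thing that’s easy to miss at 5pm on a Friday.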
In both cases, the place where the issue is caught is the pull request. But pointing out trivial issues over and over again gets tiresome. How are we solving for that - again, in the absence of any novel tools, just the good old ways?
We write policies - in OPA or Sentinel - that are supposed to catch such issues.
But are we, really?
We’re supposed to, but if we’re being honest, we rarely get to it. The situation with policy coverage in most organisations is far worse than with test coverage. Test coverage is at least sometimes mandated as a metric by management, resulting in a somewhat reasonable balance. But policies are often left behind - not least because OPA is far from being the most intuitive tool.
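For a taste of why, here’s roughly what catching the missing-backups issue from the example above looks like in Rego (OPA’s language) - a sketch, assuming the plan is exported with `terraform show -json` and handed to OPA as input:

```rego
package terraform.rds

import rego.v1

# Flag any RDS instance that ends up in the plan with
# automated backups disabled (retention period of 0 days).
deny contains msg if {
	some rc in input.resource_changes
	rc.type == "aws_db_instance"
	rc.change.after.backup_retention_period == 0
	msg := sprintf("%s has automated backups disabled", [rc.address])
}
```

And that’s the easy case. Real policies also need exemption lists, environment scoping and so on - multiply by every rule worth enforcing, and it’s clear why coverage lags.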
So - back to AI - could AI somehow catch issues that are supposed to be caught by policies?
Oookay, now we’re getting somewhere.
We’re supposed to write policies but aren’t writing enough of them.
LLMs are good with text.
Policies are text. So is the code that the policies check.
What if, instead of having to write oddly specific policies in a confusing language for every possible issue in existence, you could just say something like “don’t allow public S3 buckets in production; except for my-img-bucket - it needs to be public because images are served from it”? An LLM could then scan the code using this “policy” as guidance and flag issues. Writing such policies would take only a fraction of the effort required to write the equivalent OPA, and they would be self-documenting.
We’ve built an early prototype of Infrabase based on the core ideas described above.
It’s a GitHub app that reviews infrastructure PRs and flags potential risks. It’s tailored specifically for infrastructure and stays silent on PRs that don’t touch infra.
If you connect a repo named “infrabase-rules” to Infrabase, it will be treated as the source of policies / rules for reviews. You can write them in natural language; here’s an example repo.
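To be clear about the format, a rules file is just plain English - the file below is purely illustrative, not taken from the example repo:

```markdown
<!-- storage.md - a hypothetical rule file in an "infrabase-rules" repo -->
- Don't allow public S3 buckets in production; except for my-img-bucket -
  it needs to be public because images are served from it.
- Every database instance that holds production data must have automated
  backups enabled, with a retention period of at least 7 days.
```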
Could something like this be useful?
Does it need to exist at all?
Or perhaps we are getting it wrong again?
Let us know your thoughts!