How to Automate Software Engineering

Ege Erdil, Matthew Barnett, Tamay Besiroglu
May 30, 2025

With every passing month, AI models get better at most tasks that a software engineer does in their job. Yet for all these gains, today’s models only assist human engineers, falling far short of automating their job completely. What will it take to build AIs that can fully replace software engineers, and why aren’t we there yet?

Current AIs present something of a paradox. Their performance on narrow coding tasks already exceeds that of most human software engineers. However, any engineer who has worked with them quickly notices the need to keep AI agents such as Claude Code on a very short leash. Despite good benchmark scores and impressive demos, there are clearly core capabilities that human engineers have that our current systems are missing.

We’ve previously highlighted some of these shortcomings: lack of reliability, poor long-context performance, and overly narrow agentic capabilities, among others. But why are these capabilities missing in AI systems to begin with? We train them on more compute and data than a human has access to in an entire lifetime, and we can run tens of millions of parallel copies of them, yet it’s still not enough.

On some level, the answer has to be that our learning algorithms have been and remain much less efficient than the human brain. Deep learning skeptics often point to this and say that it’s a sign the entire paradigm is doomed.

We draw a different conclusion. The bitter lesson of the past decades of AI research is that handcrafted algorithms perform poorly, and the best algorithms are the ones that are discovered by applying massive amounts of compute for search and learning. This is the principle that drove the pretraining revolution, where scaling up training on massive text datasets allowed models to spontaneously develop powerful meta-learning abilities.

For the past decade of scaling, we’ve been spoiled by the enormous amount of internet data that was freely available for us to use. This was enough for cracking natural language processing, but not for getting models to become reliable, competent agents. Imagine trying to train GPT-4 on all the text data available in 1980—the data would be nowhere near enough, even if we had the necessary compute. In 2025, our situation when it comes to automating software engineering is no different.

The key question now is: what data do we need, exactly?

How software engineering will be automated

Two powerful tools have driven AI capabilities in the deep learning era: training on large corpora of human data and reinforcement learning from various reward signals. Often, combining these two methods produces results that neither method could achieve alone. Neither pure training on human data nor pure reinforcement learning from a random initialization would have been enough to build models as capable as OpenAI’s o3, Anthropic’s Claude Opus 4, or DeepSeek’s R1.

We expect the automation of valuable occupations such as software engineering to look no different. The roadmap to success will most likely start with training or fine-tuning on data from human professionals performing the task, and proceed with reinforcement learning in custom environments designed to capture more of the complexity of what people do in their jobs. The initial human data will ensure that models are able to start getting useful reward signals during RL training instead of always failing to perform tasks, and the subsequent RL will allow us to turn compute spent on training directly into better performance on the job tasks we care about.
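As a rough illustration of why the initial human data matters, consider a toy model with assumed numbers (it is not a measurement of any real system). Suppose a task earns a reward only if the agent gets every one of its steps right, and each step succeeds independently with a fixed probability. The function names and the specific values below are hypothetical.

```python
def success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance of completing a task that needs every step to be correct.

    Assumes steps succeed independently with the same probability,
    which is a simplification made only for this illustration.
    """
    return per_step_accuracy ** num_steps


def expected_rewarded_episodes(
    per_step_accuracy: float, num_steps: int, num_episodes: int
) -> float:
    """Expected number of episodes that earn any reward at all."""
    return num_episodes * success_probability(per_step_accuracy, num_steps)


if __name__ == "__main__":
    NUM_STEPS = 20           # assumed task length
    NUM_EPISODES = 1_000_000  # assumed RL budget

    # Assumed per-step accuracies: a policy with no human data versus
    # one warm-started on demonstrations from human professionals.
    for label, accuracy in [("no human data", 0.5), ("warm-started", 0.95)]:
        rewarded = expected_rewarded_episodes(accuracy, NUM_STEPS, NUM_EPISODES)
        print(f"{label}: about {rewarded:,.2f} rewarded episodes")
```

Under these assumed numbers, the policy without human data expects fewer than one rewarded episode in a million attempts, so RL has almost nothing to learn from. The warm-started policy is rewarded in roughly a third of its episodes, which gives the optimizer a usable signal.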

Today, reinforcement learning tends to produce models that are very competent at the narrow tasks they were trained to perform but don’t generalize well out of distribution. We think this is essentially a data problem, not an algorithms problem. Just as we saw with pretraining, as our RL environments become richer, more detailed, and more diverse, our RL optimizers will begin to find models that have more general agentic capabilities instead of narrowly overfitting to the few tasks we’re giving them.

If we do this well, AI models will become capable of the same kind of online learning that humans can do: instead of having to work inside bespoke RL environments with custom graders, we will be able to deploy them in the real world for them to learn from their successes and failures. The most plausible way for models to reach this level of meta-learning skill goes through RL, which will require environments of much greater volume and quality than the ones that are available today.

Unfortunately, today’s RL environments are rudimentary and offer only a limited set of tasks and tools. To visualize how limited they are, imagine you had to learn how to be a software engineer without internet access, virtual machines, or Docker containers; without critical features of industry-standard software tools (e.g., the Slack MCP server does not support search or notifications!); and without the ability to collaborate with more than two people at once (most current RL environments don’t support multi-agent orchestration).

These are just some of the ways that models are constrained right now during post-training. Another hurdle is that designing tasks for RL requires figuring out how to grade model performance automatically. This is easy if all you’re doing is checking whether a pull request by an AI agent passes a suite of existing, comprehensive tests. It is far more difficult to judge whether an AI agent is good at following open-ended instructions from customers who don’t have a full technical specification in mind, whether its code is maintainable and avoids creating technical debt, or whether it successfully avoids trapdoor decisions (choices that are costly or impossible to reverse later) during development. Without being able to grade these parts of the AI’s work, we can’t know whether an AI can act as a fully independent engineer or will just be a tool that saves human engineers time.
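The easy case can be sketched in a few lines. The Python below is a minimal illustration, not a production grader: the names `grade_submission`, `TESTS`, `good_patch`, and `bad_patch` are hypothetical, the "test suite" is just a list of input/output pairs, and the submitted code runs through `exec` with no sandboxing, which a real environment would need.

```python
from typing import Any, Sequence

# Each test case is (positional arguments, expected return value).
TestCase = tuple[tuple[Any, ...], Any]


def grade_submission(
    source: str, entry_point: str, tests: Sequence[TestCase]
) -> float:
    """Return 1.0 if the submitted code passes every test, else 0.0.

    Any failure (syntax error, missing function, exception raised by the
    submission, or a wrong answer) yields a reward of 0.0.
    """
    namespace: dict[str, Any] = {}
    try:
        # WARNING: exec runs untrusted code in-process. This is acceptable
        # only for a sketch; a real grader would isolate the submission.
        exec(source, namespace)
        func = namespace[entry_point]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        return 0.0
    return 1.0


# Hypothetical task: implement add(a, b).
TESTS: list[TestCase] = [((2, 3), 5), ((-1, 1), 0)]
good_patch = "def add(a, b):\n    return a + b\n"
bad_patch = "def add(a, b):\n    return a - b\n"

if __name__ == "__main__":
    print(grade_submission(good_patch, "add", TESTS))
    print(grade_submission(bad_patch, "add", TESTS))
```

A reward of this kind is binary and fully automatic, which is what makes it verifiable. Nothing in it measures maintainability, technical debt, or whether the code matches what a customer actually wanted, and that is the harder grading problem.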

Until a few months ago, having such constrained environments made sense because AI agents were simply not competent enough to deal with anything resembling the complexity of real-world work settings. However, this is changing, and the new reinforcement learning from verifiable rewards (RLVR) paradigm will soon be severely bottlenecked by a shortage of realistic RL environments. At Mechanize, our immediate goal is to remove this bottleneck and accelerate progress toward a fully automated economy.

The future of software engineering

AIs will soon be writing the vast majority of lines of code in software projects, but this doesn’t mean most software engineering jobs will immediately disappear. Consider that today, humans write only a tiny fraction of all assembly and machine code; nearly all of it is generated automatically by compilers. Yet this automation hasn’t come close to eliminating all software engineering jobs.

Or take a more modern example: a web developer in the year 2000 would have had to hand-code complex features—like an infinite scrolling feed—using large amounts of custom JavaScript and HTML. In 2025, however, libraries and frameworks allow developers to implement the same functionality with just a few lines of code, often little more than a single import statement. Despite this massive reduction in effort, employment levels for software engineers grew over the last 25 years.

AI code generation continues the long-running trend of automating software development—just as compilers, high-level languages, and libraries did before. In the short term, this means that AI will not eliminate the need for software engineers but will instead change the focus of their work. Time spent writing code may increasingly shift to tasks that are more difficult to automate, such as defining the scope of applications, planning features, testing, and coordinating across teams.

However, we’ll eventually reach a point when AIs can perform the full range of activities involved in software engineering. Once this occurs, many software engineers could perhaps transition into adjacent positions that rely on similar expertise but are significantly harder to automate, such as software engineering management, product management, or executive leadership within software companies. In these roles, their responsibilities would shift from writing code and debugging to higher-level oversight, decision-making, and strategic planning—until these responsibilities can be automated too.

This highlights an important point: fully automating software engineering—meaning completely eliminating the need for people with software engineering expertise at tech companies altogether—is a far more ambitious goal than simply building AI that can write code. We’ll only truly know we’ve succeeded once we’ve created AI systems capable of taking on nearly every responsibility a human could carry out at a computer. Ultimately, this will require a “drop-in remote worker” that can fully and flexibly substitute for humans in remote jobs.

Therefore, while at some point the software engineering profession will become fully automated, this milestone may only occur at a surprisingly late point in time—likely after AIs have already taken over a large share of white-collar jobs throughout the broader economy.

Although software engineering presents a tractable target for automation in the near term, we think this may prove true only for some tasks within the profession rather than for the profession as a whole. As a result, software engineering may be, paradoxically, one of the first, yet also one of the last, white-collar jobs to be automated.

Until AI fully automates software engineering, Mechanize still needs human engineers. Interested in helping us build that future? We’re hiring. For inquiries, contact us at [email protected].
