Develop, Deploy, Operate

May 22, 2025
Volume 23, issue 2


A holistic model for understanding the costs and value of software development

Titus Winters, Leah Rivers, and Salim Virji

How much time and money should organizations invest in developer tools? What is the business value of improvements to a development platform? How do companies quantify the business impact of optimizing an application's CPU utilization? What tests should run during development versus integration? Despite the ubiquity of commercial software and the accumulated wisdom of industry, open source, and academia, it's still hard to answer these basic questions about commercial software development.

For companies to be sustainable in the long run, they must balance the production cost of software development with the value of that software, and they must account for risks. Software value is created by product success, developer productivity, hardware resource efficiency, innovation, and risk reduction. These factors affect one another in both predictable and unexpected ways. Some factors have a precise dollar value (salaries, hardware), whereas the dollar value of other factors is hard to estimate precisely in the short term.

This article presents a holistic model for understanding the costs and commercial value of software development. It places software development in the context of business goals, provides insights about software development, and offers a shared vocabulary for use across product teams, functions, and management layers. Then it proposes a model of commercial software-development workflow, with four types of observable impact. The article closes with a discussion of the indirect impact of software infrastructure and architecture on the development workflow.

Software Development and Business Goals

When a business invests in the software-development process, as distinct from the direct development of a specific software product or feature, its goal is to maximize business value with existing resources—which is to say, to reach an acceptable quality bar with sustainable costs.

Software has value when it is used. Software development is an ongoing process of tradeoffs among critical resources, especially people, hardware, strategic capabilities, and time. While software development workflows vary, the delivery of commercial software value generally requires three distinct processes: development, deployment, and operation (figure 1). These three processes are interdependent; changes to one can have effects across the system. Each process has a separate goal, which changes the cost-benefit analysis in critical resource tradeoffs.

Figure 1. Development, deployment, and operation

Why does this matter? In contrast to highly simplified aggregate views, a holistic view of software development accounts for the different needs and goals of each of the three processes. To maximize benefit and minimize waste, companies should look at how resources are being used across all processes. Viewing hardware capacity in isolation may delay releases or reduce quality: for example, running fewer tests saves on hardware, but the resulting quality issues may cause real losses. Framing tradeoffs by process, and giving costs and benefits in terms of all three processes, can help decision-makers make more effective choices.

Insights

This section lists a number of non-obvious but generally non-contentious points to provide a foundation for the ideas that follow.

Software has product value when it is used

Software provides commercial value only when it is being used (i.e., released or deployed). As in manufacturing, a buildup of work in progress is a common value trap: if you've built a new feature or fixed a bug without deploying it, you've spent time (money) for no benefit. An important corollary: If part of a software process adds too much latency, it delays or prevents the realization of value. High process latency also adds strategic risk: When (not if) substantial bugs arise in deployed software, swift responses are more valuable than slow ones.

Software Value Creation Processes

Each process in software development has a different calculation for tradeoffs between critical resources (figure 2).

Figure 2. Different calculations for tradeoffs between resources

Development Is A Creative Process

One of the great unsolved problems for software companies is the inability to directly measure productivity for software development or demonstrate the value of sharing knowledge. What would the units of such a measurement be? We've known since Dijkstra that it isn't lines of code. The number of proposed or submitted changes at best indicates activity, not outcomes. We aspire to measure "feature count" or "feature velocity," but in practice, that is still difficult, except in large aggregates. (What is the exact definition of a feature? How do we know if it was the right feature to implement?)

The difficulty is that every software change is unique, and it is rare for two features to be so similar that they can be directly compared. Software development involves creative, thoughtful human judgment and skills, including design, discovery, craft, and deep understanding of existing systems. The industry is founded on knowledge and creativity. In the development phase of the workflow, human creativity and innovation are paramount. Everything in the software workflow, up to the point of submitting a change, is fundamentally a human (creative) process, expressing human intent. As Mary Shaw has been saying for 30-plus years, "Writing software should map to designing the products, not producing them." While business-level aggregate data can provide useful insights, statistics at the individual developer level are insufficient proxies for productivity.

The SPACE paper is a reminder that, while aggregates and proxies can guide the questions to be asked and identify levers of change in software systems, these proxies should be used in a larger constellation of metrics, preferably metrics in tension with one another. (This article does not attempt to define an appropriate collection of productivity metrics.) Even then, those metrics can rarely define and predict costs in such a way that those costs could be weighed against business revenue.

If the development process is seen as involving design, creation, and expression of intent, it's easier to target investments in productivity. In automotive design, investments would be made in design standards, common components, modeling tools, detailed designs for mechanical, electrical, and hydraulic systems, and so on. Only after the creative development phase would the new prototype go into factory production.

Software development is no different. First, we collaborate and figure out what to build. Then we decide how to build it. Finally, we build the software and release it to production. As in automotive design, it's critical to have time to prototype, experiment, and iterate on designs. At this stage, reusable components, higher-level tools, and better holistic design and understanding are key.

Deployment is a Factory Process

In contrast, after changes are submitted, there is no fundamental reason for humans to be involved in software deployment. The deployment process involves the evaluation of tests and production signals to determine whether the current build, release candidate, or configuration is value-positive. These tasks can be handled by machines, which are better at signal processing than humans. Reliability engineers, who are responsible for business-wide deployment capabilities and for fixing outages and complex failures, will always be needed. Setting them aside, if we involve humans in the routine parts of software deployment, we have failed at automation, failed to generate the necessary signals, or both.

Routine deployment processes that require human involvement are highly likely to be human-unfriendly toil. The need for human oversight and involvement will never be zero, but value is created by reducing both the practical and theoretical involvement of humans in build configuration, integration testing, release, and production reliability processes.

DORA metrics (see recent publications such as the State of DevOps reports or the older-but-formative book, Accelerate) such as time "from commit to deploy" implicitly recognize this fact: Commit-to-deploy times can be measured and compared across teams because—at root—every release/deploy action can be mechanized.

Operational Processes

Unlike development and deployment, the operation of software in production optimizes for the unexpected.

Deployments to Production: Sooner is Almost Always Better

The goals for most commercial software projects are similar, stable, and worth repeating: Deliver as much value in software as possible per unit of time, at a given quality level. This alone justifies the industry focus on CD (continuous deployment). For example, consider two teams that are capable of producing the same amount of valuable change per quarter. One team's workflow and release process allow one release per quarter. The other team's workflow and release process allow one release per day. Because value is cumulative over time, although both teams deliver the same value in total work product, the team with daily releases yields more aggregate value. There is as much as a 50 percent increase in the integral of deployed value over time, purely from improving release cadence.
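One back-of-the-envelope way to arrive at a figure like that, under simplifying assumptions the article does not spell out (value produced at a constant rate v across a quarter of length Q, with the integral taken over the development quarter plus one further quarter of use):

\[
\underbrace{\int_0^{Q} v\,t\,dt \;+\; vQ \cdot Q}_{\text{daily releases}} \;=\; \tfrac{3}{2}\,vQ^2
\qquad\text{versus}\qquad
\underbrace{0 \;+\; vQ \cdot Q}_{\text{one quarterly release}} \;=\; vQ^2,
\]

a 50 percent larger integral of deployed value, purely from release cadence.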

In reality, that evaluation is more complicated:

• Some fraction of changes will fail in production, especially experimental changes in product design. A change may take multiple releases to become valuable, especially product-fit or UX (user experience) changes. This need for iteration is a strong secondary reason to increase release cadence.

• The release process has overhead costs. Every step of deployment has overhead costs for compute. A lack of good, cheap automation will slow the release cadence. As anyone who has had to wait for a phone to update and reboot knows, release processes also cost the end user.

By and large, however, these details are of lesser magnitude than the value derived from shipping more changes earlier.

Qualitative properties of software systems

Software products are a combination of features and QAs (quality attributes), which refer to nonfunctional or extra-functional requirements. While functionality is local, meaning that you can point to the part of the code implementing a feature such as sign-in, QAs tend to be diffuse and harder to isolate, having to do with the interactions, dependencies, behaviors, and attributes of the system as a whole. QAs are often in tension with one another and with cost. Without providing a taxonomy or framework for QAs, the following covers some key insights.

Fixed Costs Are Not The Same as Losses

Although both affect the bottom line, the cost to produce software is distinct from the loss of value (e.g., lost revenue, reduced customer satisfaction, damaged brand reputation) caused by defects manifesting in production. Bugs that reach users can be severe. Even if you will never achieve perfection (see error budgets), you should aim to be close enough that most users don't notice. When you fail at quality, the losses are real and unpredictable.

By contrast, especially over short timeframes like quarters, the cost of both humans and computers is fixed: You can reasonably predict pay for engineers and the total cost of cloud or other infrastructure to deliver products. Secondary costs include the detection and repair of defects naturally introduced during development. This detection and remediation consumes some of the capacity to produce new software but is necessary to hit quality targets. Similarly, missed optimizations and wasted compute reduce available capacity in hardware resources, but the total cost of running the codebase over that timeframe is fixed.

In other words, a defect caught as part of the development process is regrettable but natural. It doesn't change the bottom line, apart from adding latency to the deployment of valuable software changes. By contrast, a defect caught after release poses a risk and, in many cases, causes a genuine impact on the bottom line. In this case, the "cost" is not so much that of a fix or a rollback, but a loss in reputation and perceived reliability. The value in quickly and automatically releasing software must be balanced against the risk of releasing software with too many defects.

It is important to account for defects that reach production separately from defects remediated internally as part of the development process.

Outage Reduction Isn't (Usually) A Good Impact Measure

The total number of outages can be useful as an overall proxy for software process reliability, but evaluating the impact of a particular tool or process change in terms of outage reduction rarely yields statistically significant results. An organization might therefore conclude that tools and processes have no impact or that software quality is immaterial, when in fact a different mechanism is needed to characterize impact. The evidence for impact lies in the relationship between team-level results and organizational performance: better team-level DORA metrics predict better organizational performance. That mechanism is presented in the next section.

Most Counterfactuals Can't Be Measured

To measure an outcome against a counterfactual (an outcome that didn't happen) requires both a clearly identified baseline measurement and a well-understood divergence as the direct result of a defect or intervention. If there is no graph of the outcome over time showing a dip or spike correlated to the event in question, the outcome can't be measured against the counterfactual.

Since there is no single measure of engineering productivity, nor an agreed-upon index metric, only very large aggregate proxy metrics are indicative of productivity, and only events that affect the aggregate population are potentially measurable. Direct measurement of the effect of a new tool or process on individual or small-group productivity is generally infeasible.

Defect Cost and Test Fidelity are in Tension

Testing software in production provides the only high-fidelity evidence of its value but risks substantial losses. Defects that reach production are costly, so lower-fidelity tests that can be implemented earlier are preferable to high-fidelity, high-cost tests in production. The earlier in the process that the detection of defects can be shifted, the less it will cost to fix them.

From a testing perspective, earlier phases of software development can serve as proxies for later phases. Development unit tests are a proxy for integration tests. Integration tests are a proxy for release qualification. Release qualification is a proxy for canary analysis. The further along a process, the higher the fidelity of the product-quality signal and the higher the capacity loss of a defect. Consider how much harder it is to root-cause a defect identified just before release, how much reengineering that may trigger, the communication costs, the need to produce a new release candidate, etc.

Evaluating Impact

We suggest four forms of impact, three of which are quantitative. Specifically, considering the points made in the previous section, Fixed Costs Are Not the Same as Losses, there are three fundamental forms of measurable impact for a software organization: product success, hardware resource efficiency, and engineering capacity. Strategic capabilities are a fourth, qualitative factor worth considering.

Product success

Is the product successful? This is the form of impact that matters most in commercial software development. The exact definition of success will differ from company to company and to some extent from product to product, but it often consists of some combination of revenue, reach, adoption, user trust, and customer satisfaction. It increases both when more value is deployed and when the rate of bugs and outages is reduced.

Hardware resource efficiency

Are production hardware/cloud resources being used efficiently? This is a production-adjacent and highly measurable form of impact. Codebase optimization and compiler optimization work can be extraordinarily effective in this space. Efficiency improvements are visible in two similar but distinct areas:

• Customer-facing resource consumption, or how much compute is needed to provide the features shipped to customers.

• The resource consumption of internal software processes, or how much compute is spent to produce a novel change or evaluate a potential release.

For customer-facing resource consumption, efficiency gains are either value-neutral, in the case of pure optimization, or follow from customer input about product requirements and product success, in which case, they are positive.

For engineer-facing resource consumption, efficiency gains involve nuanced tradeoffs between developer time and potential reductions in hardware/cloud usage. Some efficiency problems are worth investing developer time in, given savings over time, and some are not.

Engineering capacity

Are human resources being used effectively? Although this is challenging to measure in general, two insights can reduce the need for precise measurements of improved engineering capacity.

First, as discussed earlier in this article (Deployment is a Factory Process), nearly all human involvement in deployment processes is theoretically unnecessary overhead. DORA categorizes this involvement as "deployment pain." If a company can hold product success outcomes stable while reducing engineering involvement in the testing, deployment, and release process, that reduction in human toil is valuable. Reducing human involvement in deployment processes is a clear gain and often provides a higher-leverage ROI (return on investment) than additional hiring.

Second, as stated earlier (Most Counterfactuals Can't Be Measured), measuring improvement against avoided outcomes is usually impossible. There are, however, at least two ways of measuring effects on capacity: You can measure the delta of an intervention (against the counterfactual, which is not usually possible), or you can estimate the capacity usage (cost) of the current process and look for aggregate cost reductions. The former asks: "How much did you save by acting now?" The latter asks: "Can you try X to make that wasteful process more efficient?"

When you ask about the value of preventing a specific defect at a specific stage, you attempt to measure against a counterfactual: Would that defect have made it to production and caused a loss, rather than a capacity cost? If it was caught by the workflow, which later phase(s) of the workflow would also have caught it? Over the probability distribution of detection chances for the rest of the workflow, what is the expected overall capacity cost for detection and remediation? This is nightmarishly complex to evaluate, especially in a defect-by-defect analysis; problems like this illustrate the difficulty in quantifying and categorizing the impacts of workflow and tooling improvements.

Instead, we suggest a consensus estimation function for DDR (defect detection and resolution). The cost of DDR scales linearly with both the number of engineers exposed to the defect and the time since introduction. This means that tools or process changes that diagnose a class of bugs earlier in the workflow are more valuable than those that do so later—for example, autoformatting and autofixing whitespace (i.e., tabs versus spaces, trailing spaces) issues inside the IDE (integrated development environment), a point-in-time intervention that addresses the defect at the earliest possible point. Only the engineer generating the change is exposed to the issue. Without this intervention, whitespace "defects" may not be detected until several minutes after introduction or as late as integration, hours or days after introduction.

With in-IDE autoformatting, the persistence of an undetected defect drops to seconds or minutes. Changing the DDR period for a class of bugs from hours to minutes is valuable. That investment, however, should be made only if the problem comes up often enough to warrant the cost of that change. Similarly, if the cost of fixing a class of problems late in the workflow is incommensurate with the value of that fix (e.g., blocking a release because of whitespace issues), those issues should be ignored.
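As a concrete (and deliberately simplistic) sketch of such an estimation function, the following Python snippet assumes the linear form described above; the names and constants are ours, for illustration only:

from dataclasses import dataclass

@dataclass
class Defect:
    engineers_exposed: int         # engineers who saw, hit, or worked around the defect
    hours_since_introduced: float  # wall-clock time from introduction to detection

def ddr_cost(defect: Defect, unit_cost: float = 1.0) -> float:
    """Consensus DDR estimate: linear in exposure and in time since introduction."""
    return unit_cost * defect.engineers_exposed * defect.hours_since_introduced

# A whitespace issue auto-fixed in the IDE seconds after it is typed is nearly free;
# the same issue surviving until integration, visible to a whole team, is not.
in_ide      = Defect(engineers_exposed=1,  hours_since_introduced=0.01)
integration = Defect(engineers_exposed=10, hours_since_introduced=48.0)

print(ddr_cost(in_ide))       # 0.01
print(ddr_cost(integration))  # 480.0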

Given this rough DDR estimate and remembering that human involvement in deployment should be minimal, we observe that engineering capacity can be improved either by reducing toil in deployment and integration while keeping DDR stable or by reducing the estimated aggregate cost of an individual defect or class of defects.

Strategic capabilities

New strategic capability is an important but not quantitative fourth form of impact. The impact of being able to perform a task that was previously impossible or so inefficient as to be impractical or irrelevant can't be quantified in the same way as product success, hardware resource efficiency, or engineering capacity. Similarly, there is strategic value in being capable of providing good information to decision-makers, even if that information is used rarely: For example, telemetry that is used once a quarter may still create massive leverage.

The ability of AI to aggregate and summarize information is one such strategic capability that can support human decision-making. From the introduction of search engines in the 1990s onward, productivity has leaped simply from having information accessible in a usable format (i.e., aggregated and summarized rather than raw and requiring human processing). Another example of strategic capability is a flame graph of resource usage by function call, which helps developers decide where and how to optimize performance.

Strategic capabilities may include long-term or experimental investments in fundamentally shifting how a business operates: for example, by declaring a new domain a strategic priority. These shifts often involve long lead times and uncertainty. Teams operating in experimental domains benefit from a principled approach to investment, supported by executive leadership and including a thoughtful approach to performance reviews and promotion. This form of impact should be a bounded part of an overall understanding of impact in an infrastructure investment portfolio.

Tradeoffs

Improvements to one of these areas are likely to affect the other two workflow areas (table 1).

Table 1. Improvements to one area affect the others

The role of infrastructure and platform engineering

Infrastructure and platform-engineering teams provide capacity and strategic impact and indirectly affect product-success metrics. While it is up to the product-development teams to build the right features for the products, infrastructure can accelerate their engineering work and provide quality assurance. Organizations that provide central software capabilities and infrastructure create impact in these ways:

Engineering capacity. Can we reduce edit-build-test cycles and defect-generation rate, and increase developer satisfaction? Can we reduce rework and toil in deployment processes?

Hardware resource efficiency. Can we reduce the consumption of hardware resources without degrading the other forms of impact? Or can we analyze the tradeoffs between hardware resource efficiency gains and one of the other forms of impact?

Strategic capability. Can we provide new forms of infrastructure, improved telemetry, etc.? Specializations within infrastructure have a different mix of effects. Some examples:

  • Effective education improves engineering capacity and should be measured accordingly. Educational efforts can reduce defect generation, rework, and edit-build-test cycles. If possible, these should be primary impact metrics for investments in technical educational programs.
  • Developer tooling can improve engineering capacity through more efficient development, hardware resource efficiency through more efficient use of hardware resources, or strategic capability through telemetry, new platforms, etc.
  • Improved codebase efficiency significantly increases hardware resource efficiency.

Top-level business reports include aggregate impact measures for product portfolios: revenue, costs, and other indicators of business performance. Aggregate impact measures for infrastructure and central developer teams are hardware/cloud resource utilization and DORA or similar metrics. DORA metrics ask: Across teams using the provided infrastructure, how does the infrastructure affect overall engineering (not product) performance and capacity? Infrastructure and development-platform teams setting their KPIs (key performance indicators) might ask: How do customer teams' DORA metrics change after they adopt the service?

We have suggested reducing deployment toil and increasing release cadence. Subtle but important additions to that approach include measuring the right things at the right phases of the workflow, deploying effective defect-detection mechanisms at the right step of the workflow, and properly balancing latency, resource costs, and toil.

A Model for Software Development

Combining these insights and defect-cost estimates, we suggest a potential model that focuses on the business side of software development and permits reasoning about efficiency and process optimization based on non-counterfactual measurements and estimates.

These are the values necessary for a stochastic simulation of the process (a minimal simulation sketch follows the list):

Defect-generation rate (developer). How skilled is the average developer for a given team? A developer's defect-generation rate decreases over time with practice and learning. Education can be a more targeted, effective, and expensive intervention.

Latency (phase, project). How much wall clock latency does each phase of the workflow of a team/project add to the software-development process? As discussed earlier, deployment is a factory process, so reducing deployment latency is especially valuable.

Hardware capacity cost (phase, project). For each phase of the workflow of a team/project, how much machine capacity is consumed to run necessary tests and computation?

Engineering capacity cost (phase, project). How many people are working during each phase of each project's workflow? During deployment, developer time is often unnecessary. It can also be valuable to reduce human involvement in development, if and when it's reasonable to do so.

Defect detection (phase, project). What fraction of defects that reach each phase of each project's workflow are filtered out at that point?

Defect false negatives (phase, project). How often does each phase of each project's workflow miss defective changes? A miss delays detection to a later phase, increasing defect duration and the resources spent on detection. It may also increase the number of developers affected, especially when defects reach deployment.

Defect false positives (phase, project). How often does each phase of each project's workflow report failures that would not be defects in production?
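A minimal stochastic simulation built from these values might look like the following sketch. Every phase name, rate, and cost below is an invented placeholder, not a recommendation; the point is only to show how the components combine.

import math
import random

random.seed(0)

# (phase, latency_hours, hardware_cost, engineering_cost, p_detect, p_false_positive)
PHASES = [
    ("presubmit",             0.2,  1, 0.5, 0.60, 0.02),
    ("integration",           2.0,  5, 0.2, 0.70, 0.05),
    ("release qualification", 8.0, 20, 0.1, 0.80, 0.03),
    ("canary",               24.0, 10, 0.1, 0.90, 0.01),
]
DEFECTS_PER_CHANGE = 0.8            # mean defect-generation rate per submitted change
ENGINEERS_EXPOSED = [1, 5, 20, 50]  # exposure grows as a defect travels downstream

def poisson(lam: float) -> int:
    """Knuth's algorithm; avoids pulling in numpy for a sketch."""
    k, p, threshold = 0, 1.0, math.exp(-lam)
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def simulate(n_changes: int = 1000) -> dict:
    totals = {"latency": 0.0, "hardware": 0.0, "engineering": 0.0,
              "ddr": 0.0, "false_alarms": 0, "escaped": 0}
    for _ in range(n_changes):
        defect_ages = [0.0] * poisson(DEFECTS_PER_CHANGE)
        for i, (_, latency, hw, eng, p_detect, p_fp) in enumerate(PHASES):
            totals["latency"] += latency
            totals["hardware"] += hw
            totals["engineering"] += eng
            if random.random() < p_fp:
                totals["false_alarms"] += 1      # wasted investigation, no real defect
            surviving = []
            for age in defect_ages:
                age += latency
                if random.random() < p_detect:
                    # linear DDR estimate: engineers exposed x time since introduction
                    totals["ddr"] += ENGINEERS_EXPOSED[i] * age
                else:
                    surviving.append(age)        # false negative: escapes to the next phase
            defect_ages = surviving
        totals["escaped"] += len(defect_ages)    # defects that reach production
    return totals

print(simulate())

Varying the per-phase detection rates and latencies in a model like this is one way to explore the tradeoff described in Defect Cost and Test Fidelity Are in Tension without measuring counterfactuals defect by defect.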

The repetition of "each phase of each project's workflow" in these components emphasizes the organizational value in workflow consistency. The more tools and infrastructure shared across teams, the easier it is to achieve economies of scale and apply global optimizations. That said, if a given project's needs are substantially distinct from others, the standard workflow may be insufficient for their needs and require local divergence.

The last three components recall the earlier section, Defect Cost and Test Fidelity Are in Tension. Smaller, cheaper, faster tests and defect-detection mechanisms are essential but not always representative of quality/fitness signals in production. As a change approaches release, defect-detection signals have higher fidelity, but capacity costs are higher as well. Shifting everything into development testing would not work, because hardware resource efficiency and engineering capacity costs, as well as latency for individual changes, would skyrocket. Neither would shifting everything into release qualification tests, because detection and repair at that late point, with a much larger number of affected developers, is extremely expensive.

Thus, the goal should be to build a set of workflow phases that filter out defects to reach an acceptable level of quality with minimal developer latency while minimizing the sum of defect-resolution cost estimates.
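Stated as an optimization problem, in notation of our own rather than the article's:

\[
\min_{\text{workflow}} \;\; \alpha \sum_{p} \text{latency}_p \;+\; \beta \sum_{p} \bigl(\text{hardware}_p + \text{engineering}_p\bigr) \;+\; \mathbb{E}\Bigl[\sum_{d} C_{\mathrm{DDR}}(d)\Bigr]
\quad\text{subject to}\quad
\Pr[\text{defect escapes to production}] \le \varepsilon,
\]

where the sums run over workflow phases p and detected defects d, the weights alpha and beta convert latency and capacity into comparable units, and epsilon encodes the acceptable quality bar.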

The logic underpinning standard DORA metrics can be helpful here:

Change failure rate. Filtering defects properly should lower the rate of outages and bugs that appear in production.

Failed deployment recovery time. Well-tuned workflows and sufficient release automation can result in two types of positive outcomes:

  • Good monitoring systems. Progress between late-stage workflow phases (release qualification, canary release monitoring) often hinges on metrics and monitoring used by SREs (site reliability engineers) to detect outages. Improving defect-detection capability early in the workflow also improves production monitoring in general. Defects requiring a rollback to a previous version can be caught in monitoring.
  • Faster release cadence. If non-catastrophic defects, which do not merit a full rollback, reach production, the faster a fixed release can be cut, validated, and deployed, and the lower the failed deployment recovery time.

From commit to deploy. DORA focuses on time "from commit to deploy" specifically, because it permits comparison across changes. This metric should be driven downward for most if not all teams.

Release cadence. If there is sufficient automation and monitoring in place after workflow optimization, pushing more (smaller) releases minimizes work in progress and maximizes the duration in which changes are generating value.

These standard DORA metrics are downstream from submit. They are good metrics because they characterize team/project/product success at a mechanical level: Can you detect enough of the bugs? Can you deploy reliably? Can you meet your SLOs/SLAs (service-level objectives/service-level agreements)? Improvements to project-level technical systems (continuous integration or CI, release, canary, monitoring) improve these system-level metrics, and individual teams can adopt and optimize such systems to do so. It is also important to measure, and seek to reduce, the total amount of human involvement in a given period.
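To illustrate how mechanical these metrics are once deployments are instrumented, here is a small sketch; the record format is hypothetical and not prescribed by DORA.

# Hypothetical records: (commit_time_h, deploy_time_h, failed, recovered_time_h)
# Times are hours from an arbitrary epoch; recovered_time_h is None when nothing failed.
DEPLOYS = [
    (0.0,   4.0, False, None),
    (10.0, 12.5, True,  14.0),
    (20.0, 22.0, False, None),
    (30.0, 31.0, False, None),
]

def change_failure_rate(deploys):
    return sum(1 for _, _, failed, _ in deploys if failed) / len(deploys)

def mean_commit_to_deploy(deploys):
    return sum(deploy - commit for commit, deploy, _, _ in deploys) / len(deploys)

def mean_recovery_time(deploys):
    gaps = [recovered - deploy for _, deploy, failed, recovered in deploys if failed]
    return sum(gaps) / len(gaps) if gaps else 0.0

def release_cadence(deploys, window_h):
    return len(deploys) / window_h            # releases per hour over the window

print(change_failure_rate(DEPLOYS))           # 0.25
print(mean_commit_to_deploy(DEPLOYS))         # 2.375 hours
print(mean_recovery_time(DEPLOYS))            # 1.5 hours
print(release_cadence(DEPLOYS, window_h=40))  # 0.1 releases/hour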

By contrast, metrics on the human side of the workflow (development) should be tailored to make individual engineers more effective in the creation process. Since individual changes are heterogeneous, usage-of-tools metrics and aggregates can apply to populations far larger than individual teams. The major levers are reducing the number of edit-build-test cycles, increasing usage of IDE features (automation, understanding), and reducing developer friction. The presubmit part of the workflow is affected through improved capabilities, improved documentation, improved education, better design process, better IDEs, improved diagnostics, reduced false-positive rates for presubmits, reduced latency for presubmits, and code review.

Developers can give good qualitative signals here. Given that we're speaking about the human/creative/design aspect of the software process, asking humans whether they are productive is a good, if limited, proxy. In DORA, productivity is considered a well-being metric for this reason. Likewise, issues or bottlenecks in the creative process can often appear as frustrations that impact well-being.

Conclusions

By taking a holistic view of the commercial software-development process, we have identified tensions between various factors and where changes in one phase, or to infrastructure, affect other phases. We have distinguished four distinct forms of impact, warned against measuring against unknown counterfactuals, and suggested a consensus mechanism for estimating DDR (defect detection and resolution) costs. Our approach balances product outcomes and the strategic need for change with both the human and machine costs of producing valuable software. With this model, the process of commercial software development could become more comprehensible across roles and levels and therefore more easily improved within an organization.

Titus Winters is a senior principal scientist at Adobe, focusing on developer experience. He has served on the C++ standards committee, chairing the working group for the design and evolution of the C++ standard library. He also served on the ACM/IEEE/AAAI CS2023 steering committee, helping set curriculum requirements for computer science undergraduate degrees, focusing on the requirements for software engineering. As a thought leader at Google for many years, he focused on C++, software engineering practice, technical debt, and culture. He is the lead author of the book Software Engineering at Google (O'Reilly, 2020).

Leah Rivers is the director of product management for Google's software foundations. She is a software engineering leader with decades of experience focusing on developers and the platforms and ecosystems crucial to their success. Her background includes working as an engineer and as an executive across a range of organizations including startups, SaaS (software as a service) companies, and high-tech companies including AWS and Google. She cares about the power and potential of harmonizing technology with the individual and social aspects of software development to drive innovation, create meaningful change, and deliver valuable software.

Salim Virji develops reliable engineering practices and processes for Google's SRE organization, and has built consensus and storage services for Google infrastructure. Salim's interests include distributed systems and machine learning. He has contributed to several books on SRE, including The Site Reliability Workbook (O'Reilly, 2018) and Implementing Service Level Objectives (O'Reilly, 2020).

Copyright © 2025 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 23, no. 2




