August 18th. A new MIT report on generative AI comes along. Bottom line: 95% of generative AI pilots fail. Everyone shares it. “Generative AI is doomed,” says one. “I knew it! ChatGPT doesn’t even know how to add numbers reliably,” says another. “Humanity always wins in the end,” says a third.
October 28th. A new Wharton report on generative AI comes along. Bottom line: 75% of enterprises are already seeing positive ROI on generative AI. Everyone shares it. “This is a pretty big deal,” says one. “And it’s just the beginning!” says another. “If you’re not using AI, you’re not gonna make it…” says a third.
I read both reports. Both seem fine at first glance.
What’s going on?
Before I go into those two reports, you should keep in mind three things:
Beware the man of one study: You’d be scared to know how many research studies from mature disciplines (e.g., medicine or social psychology) pass peer review, get published in prestigious journals, and even infiltrate popular culture (e.g., Milgram’s obedience experiment and the marshmallow test on delayed gratification), only to turn out not to be replicable or to have heavily disputed findings. Or worse: studies that originally found the opposite of what’s actually true (that’s called a reversal).
AI is in wartime, which means narrative > truth: Every single thing you hear about AI at this time (including, let’s not be hypocrites, this very blog) is biased to some degree by the fact that AI is popular, there’s a lot of money at stake, and there’s a lot of uncertainty about how it will turn out. I try to be honest in my approach—I have zero money at stake besides what I make by writing, which is contingent on being honest!—but there’s no denying biases affect us all.
Extremes lead the conversation: Corollary of the above. If you are moderate in your takes about AI, a more daring (or shameless) person will eat your lunch simply by adding a bit of hyperbole. Then another will add a bit more and eat their lunch. And so on. The ultimate state of affairs (which you will realize is already happening) is that the loudest, most shameless people gather at the extremes of the distribution of takes: extreme anti-AI and extreme pro-AI.
Let me explain what that means for those two AI reports I mentioned above: truth is likely to be somewhere in between “AI doesn’t work at all, let’s kill it” (too pessimistic) and “AI works so well, it will kill companies that wait” (too optimistic). But the reason truth probably lies in between is not “bothsidesism”; truth doesn’t try to please both sides a little; it isn’t “centrist” out of a need for false balance (for starters, truth is not teleological!).
The reason is that, during wartime, any one study will tend to fall toward either extreme of the spectrum, irrespective of truth. If the results were meh, no one would share them! (That’s also why journals rarely publish experiments that didn’t find the intended effect, creating a publication bias.) So, naturally, studies distort their own position and results (and people distort them further when re-sharing them online, and so on), leaving too much space in the middle for truth to hide.
That said, let’s analyze each report in the most unflattering light possible; to find the truth—something that I contend everyone wants—we have to apply a force of equal magnitude but opposite valence to test the rigor and resistance of any story. For the MIT report, I’ll put myself in the shoes of an AI booster, an AI lover. For the Wharton report, in the shoes of an AI doomer, an AI hater.
Let’s see what comes out.
I dislike AI hype and AI anti-hype pretty much equally (they’re cut from the same emotional cloth: “this is super super super awesome” or “this is super super super terrible”). But I feel a strong, unyielding devotion to truth. So, wherever it is, I belong on its side.
Let’s take a straightforward passage from the Fortune article that originally covered the report:
Despite the rush to integrate powerful new models, about 5% of AI pilot programs achieve rapid revenue acceleration; the vast majority stall, delivering little to no measurable impact on P&L. The research—based on 150 interviews with leaders, a survey of 350 employees, and an analysis of 300 public AI deployments—paints a clear divide between success stories and stalled projects. . . . The core issue? Not the quality of the AI models, but the “learning gap” for both tools and organizations. While executives often blame regulation or model performance, MIT’s research points to flawed enterprise integration.
This already considerably qualifies the headline: they’re measuring P&L (profit and loss) on pilots, based partly on an analysis of public deployments and partly on interviews, and they note that the problem is not the technology itself but the integration.
What I read here: The MIT report was done too early to be meaningful. Measuring success by short-term P&L outcomes in the first year of adoption, when integration is not mature, is not even wrong! It doesn’t tell you whether the technology works; it tells you whether people are having trouble integrating generative AI into existing workflows, and so forth (which can be a problem, but one that yields a different conclusion than “generative AI is not working”).
The concept of a pilot is precisely to try something and see if it works. It’s inherently experimental, an early-stage prototype no one expects to yield measurable profit! I am the first to admit that generative AI is not yet being reflected in productivity charts, and that’s a bad sign—but it’s not so much a sign that the technology doesn’t work as a sign that we haven’t yet figured out how to implement it across the whole economy.
There’s a corollary to this: if you only measure P&L on pilots (a binary between success and failure), you’re not measuring the things pilots can actually do, which are fuzzy and nebulous to interpret but fundamental nonetheless. How much time was saved? Did error rates decrease or increase? Have workflows improved qualitatively, if not quantitatively?
There’s a second problem with the report: a visibility bias. The report draws from projects public enough to make it into its dataset, which mostly comprises “look what we’re doing” PR experiments rather than the boring, unflashy automation that is rarely pursued at this stage (maybe it’s pursued to some degree and just doesn’t work, but the report doesn’t say anything about that either way!).
The 95% failure rate of AI pilots might just be telling us that what can be faithfully described as “corporate theater” fails 95% of the time. Well, ok, we already knew that.
Is the MIT report useless? Far from it! It’s just measuring something other than what people took it to be measuring. It’s a warning about how companies measure impact (P&L on pilots, really?) and about how hard it might be to integrate generative AI into the world at the micro scale. That’s it. If I were uncharitable, I’d simply say that it’s the perfect measure of corporate impatience.
The Wharton report is trickier to disentangle than the MIT one, and it’s also more important that we do so. For one simple reason: contrary to the MIT report, the Wharton report suggests a positive finding—it works! So let’s put on our AI-skeptic hat and do what we just did, with extra care.
