This is the first part of a four-part series called “Protecting Digital Content in the AI Age: A Lawyer’s Guide.” Part 2 will discuss the advantages of automated payments for AI companies and the public. Part 3 will examine AI crawlers, robots.txt, the technology behind the web, and new tools for digital content owners. Part 4 will analyze US and EU laws related to text and data mining opt-outs and technologies that may help digital content owners sustain their businesses.
A. The Web That Could Have Been and the Web We Got
So often I find a writer I want to follow, but I don’t want to add another monthly subscription to my budget for a newspaper or magazine that I’m otherwise not interested in. Sometimes I read an incredibly influential piece that got millions of views, but the writer got no compensation at all because they don’t put out enough content to warrant a subscription. Every now and then I’ll stumble upon a hobbyist site, clearly run by someone who drinks way too much coffee, that fulfills a one-time need for very useful information, like one that dutifully cataloged specs for an untold number of computer monitors. What if it were as easy to compensate them as throwing coins into a street musician’s hat? Visit a site, automatically toss them some money.
The idea of web micropayments is an old one.[1] For techno-optimists, it sits in the same wistful part of the brain as federated social media platforms and ubiquitous mesh networks (and bringing Google Reader back from the dead, if we’re going to be honest). Micropayments never took off because credit card companies weren’t interested, and unlike the record companies and the movie studios, they never faced a forcing function like Napster or Netflix to drag them kicking and screaming into the next millennium. There was simply nowhere else to go if someone wanted digital payments. For context, the first website was launched in 1991 and PayPal (for example) didn’t launch its services until 2000.
In the meantime, digital content providers increasingly relied on ad revenue, whether their content was copyrightable or not. Google Search launched in 1998 and began putting ads on the search results page with Google AdWords in 2000. The web caused a near extinction event for a number of businesses, most notably local journalism.[2] The remaining digital content owners and Google settled into a happy symbiosis: content owners acquiesced[3] to their content being copied and indexed for listing on Google Search in exchange for Google sending viewers to their sites. In 2003 that symbiosis was enriched by Google AdSense and what would eventually become the Google Display Network, which allowed content owners to sign up for Google to serve ads on their sites for a share of the ad revenue. Google eventually drove so much business on the web that they pried content providers out from behind portals like Yahoo! and largely decimated the closed web. In 2007 the New York Times actually scrapped its subscription service because they realized they could get more money just from ads, and they didn’t return to the concept for another four years.

Google Search and the ad-supported open web did not just move analog content online. Nearly every organization with a web presence became a content creator, and many began to offer deep libraries of free, expert content to lure visitors to their sites and build trust in their brands. For example, law firms like Quinn Emanuel publish several blog posts a day about recent cases and new laws; hobby retailers like Goulet Pens have a giant trove of articles and videos about their goods; and software security firms publish vital security-related info, like Dark Visitors’ widely cited list of all known bots on the internet. It’s not just that the world saw an explosion of content; it’s that the type of content being produced was new, too: previously either not available at all, limited to customers of the firm, or only published in a book at least a year after it would have been useful.
But the ad-based web has eaten itself. Websites are now optimized for engagement and views, not quality, so outrage and conspiracies have become the coin of the realm, and the engagement (addiction) strategies of social media algorithms make the casino playbooks look absolutely Stone Age. Hucksters have learned to jump their way up the Google Search results so effectively that nearly every search, no matter how specific, yields links of the “10 very basic facts about the domain you were asking about” variety – quotation marks in your search be damned! Things have gotten so bad that even the Wall Street Journal noticed, back in 2023, that people were adding the word “reddit” to their searches to find genuine information, because Google now struggles to surface things like personal blogs and other non-commercial material.
B. The Rise of AI and the Decline of Human Traffic and Ad Revenue on the Web
A major factor in AI’s popularity is the decline of Google Search. Casual AI users appreciate getting a quick, straight answer without having to wade through the muck of Google Search results, and people doing a deeper dive are rejoicing because some of the models are capable of finding relevant sources that Google Search simply can’t anymore. Unfortunately, most queries are of the casual variety, and the AI answers often eliminate the need to visit the sources at all, even if they’re listed. That translates into less ad revenue for site owners who host content and fewer transactions for e-commerce sites.
Data from SimilarWeb shows that 44 of the top 50 news websites in the US saw declining traffic in the last year. Search referrals to top US travel and tourism sites tumbled 20% year over year last month, 9% for e-commerce sites, and 17% for news and media sites. Cloudflare’s data indicates that with OpenAI it’s 750 times more difficult to get traffic (referrals) than it was with the Google of old (before its answer box and its AI Overviews), and with Anthropic it’s 30,000 times more difficult.[4]
Perplexity AI has become a recurring character in the AI v. digital content provider saga. While the practices of many AI companies have been the subject of a healthy debate, some of Perplexity’s behavior constitutes incontrovertible AI-powered plagiarism. Perplexity is valued at about $20 billion, and it primarily runs an AI search engine that summarizes the sources it recommends. During the summer of 2024, they were excoriated for publishing an AI-generated article that essentially recycled investigative journalism from Forbes, lifting sections of the original article and even reproducing some of its images, without even mentioning Forbes or the authors of the original piece. They even created an AI-generated podcast based on the piece, which irritated Forbes to no end because it outranked all other Forbes content in a Google search on the topic.
That same summer, WIRED used all the firepower it could muster in calling Perplexity out for producing output that claimed WIRED reported things it didn’t, output that closely paraphrased WIRED articles without attribution, and reproductions of WIRED photographs that showed attribution only if the image was clicked – all while ignoring WIRED’s robots.txt. WIRED published an article entitled “Perplexity Is a Bullshit Machine,” among others. That article was shortly followed by another, entitled “Perplexity Plagiarized Our Story About How Perplexity Is a Bullshit Machine.”
In a September 2025 interview with Stratechery, Cloudflare’s CEO (another recurring character discussed further in Part 3) described similar problematic behavior:
“[I]f they’re blocked from getting the content of an article, they’ll actually, they’ll query against services like Trade Desk, which is an ad serving service and Trade Desk will provide them the headline of the article and they’ll provide them a rough description of what the article is about. They will take those two things and they will then make up the content of the article and publish it as if it was fact…”
Cloudflare also exposed that Perplexity crawlers were bypassing website permissions by disguising their identity. Perplexity really seems to have gotten Cloudflare’s goat, and driven some of Cloudflare’s innovation, or at the very least its rhetoric. Cloudflare and Perplexity have since engaged in several public brawls that have focused discourse on AI crawling – all while avoiding punching the actual AI heavyweights in the room.
It may be tempting to argue that declining viewership is caused by just a few bad apples rather than the fundamental nature of the technology. After all, some of those referral numbers are significantly better than others, and historically there have been many cycles of technological disruption, copyright holder backlash, and then gradual realization by copyright holders that the technology can actually improve their businesses. However, even when AI companies appear to be following all the best practices and creating output that properly references sources, websites are still losing traffic. Google’s AI Overviews (powered by Gemini), which follows a lot of good practices and has the strongest incentive to send people to actual webpages because those pages contain Google ads, appears to be choking off web traffic by about 40%:

C. AI Crawlers Are Overwhelming Web Infrastructure and Driving Up the Costs of Maintaining a Website
Apart from the question of data governance, AI crawlers are taxing site infrastructure, significantly degrading site access for other users, and regularly taking down sites altogether. An estimated 30% to 39% of global web traffic now comes from bots. Unlike older search crawlers, AI crawlers ignore website permissions (robots.txt), crawl delay instructions, and bandwidth-saving guidelines, causing traffic spikes that can be 10-20x the normal level[5] within just a few minutes. Many sysadmins report crawlers presenting random user-agent strings from tens of thousands of residential IP addresses, each one making just one HTTP request and therefore masquerading as a normal user. This activity amounts to a distributed denial-of-service (DDoS) attack, and it can’t be thwarted by mechanisms like IP blocking, device fingerprinting, or even CAPTCHAs.[6] The crawler activity affects not just the websites being crawled, but also other websites hosted on the same shared server.[7] Even the largest sites experience performance issues when crawled by AI-related crawlers.
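To make concrete what “website permissions” and “crawl delay instructions” mean here, below is a minimal sketch, using only Python’s standard urllib.robotparser, of the checks a well-behaved crawler is supposed to run before fetching a page. The robots.txt contents, site URLs, and bot names are purely illustrative assumptions, not taken from any real site; the point of the reporting above is that many AI crawlers skip these checks entirely.

```python
# Minimal sketch (Python standard library only) of how a compliant crawler
# would consult robots.txt before fetching pages. The robots.txt text, URLs,
# and bot names below are illustrative assumptions, not from a real site.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A crawler that honestly identifies itself as GPTBot would see that this
# (hypothetical) site bars it entirely.
print(parser.can_fetch("GPTBot", "https://example.com/articles/some-post"))        # False

# Any other well-behaved crawler may fetch public pages, but is asked to wait
# Crawl-delay seconds between requests instead of hammering the server.
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/some-post"))  # True
print(parser.crawl_delay("SomeOtherBot"))                                          # 10
```

Honoring these rules is entirely voluntary: robots.txt is a polite request, not an enforcement mechanism, which is part of why the blocking tools discussed in Part 3 have become necessary.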
Sites hosting open source software seem to be particularly juicy targets. Anthropic was in the headlines last summer for visiting certain sites more than a million times a day. A GNOME sysadmin estimated that in a 2.5-hour sample, 97% of attempted visitors were crawlers. Both Fedora and LWN, a Linux/FOSS news site, have reported that only a small portion of their traffic now consists of humans and that they’re struggling just to keep their sites up – Fedora has been down for weeks at a time. These do not appear to be mere crawler bugs – some sites report a regular pattern of being scraped every six hours.[8]
Other kinds of websites are also being attacked. An April 2025 survey by the Confederation of Open Access Repositories, which represents libraries, universities, research institutions, and similar organizations around the world, indicated that 80% of surveyed members had encountered service disruptions as a result of aggressive bots; a staggering one-fifth reported a service outage that lasted several days. Wikimedia has seen a 50% increase in multimedia downloads on Wikipedia. The explosive growth in the number of crawlers out there and the scale of their activities drove Wikimedia to the point that it started creating structured Wikipedia datasets for AI companies to download just to keep them off the live site. Even small, niche sites are in distress. A website hosting pictures of board games reported that it was crawled 50,000 times by OpenAI’s crawler in a single month, consuming 30 terabytes of bandwidth.
The kicker? Websites with slow-loading pages are ranked lower in Google Search results!
The amount of pressure even the more scrupulous AI companies are putting on site infrastructure is vastly disproportionate to the amount of human traffic they send to those same sites:

One group of academics, policy-makers, and advocates has suggested that the digital commons is currently subsidizing AI development by bearing these additional infrastructure costs and involuntarily contributing to the environmental footprint associated with AI. Indeed, although the EU AI Act requires companies to disclose the energy used in training and inference, they are not required to disclose an estimate of the energy used by third parties in responding to their crawlers or the energy used to block them.[9]
D. Conclusion
The question of paying content providers is fundamentally about preserving the open web, not necessarily punishing AI companies for doing something wrong. I might even be persuaded that certain activities or services are fully within the bounds of the law. But even perfectly legal, well-intentioned activities can create negative externalities that should nevertheless be addressed. The current state of affairs is not sustainable. People won’t keep posting freely available content, at increasing expense, just to sate the AIs if users don’t even associate them with their work, never mind compensate them. They’ll move their content behind paywalls, join walled gardens, or simply stop creating content. Individuals will be further isolated in their information bubbles. Per a recent study, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” 20-33% of all tokens from the highest-quality and most frequently used websites for training data became restricted in 2024, up from 3% the previous year. Another compensation model is needed, and it looks like the technology to power it might be right around the corner.
[1] The 1997 HTTP spec optimistically stubbed out a micropayments status code: 402 (Payment Required).
[2] https://en.wikipedia.org/wiki/Decline_of_newspapers
[3] Not without throwing a legal tantrum first, of course. Google faced numerous lawsuits related to indexing websites as well as creating thumbnail images for image search. See Perfect 10, Inc. v. Amazon.com, Inc. & Google Inc. (2007) and Field v. Google, Inc. (2006).
[4] Cloudflare has made a lot of data available about the activity of AI crawlers. An explanation of their metrics and links to live dashboards are here.
[5] https://www.theregister.com/2025/08/29/ai_web_crawlers_are_destroying/
[6] Headless browsers, discussed in Part 3, allow AIs to interact with websites like a human would.
[7] https://www.inmotionhosting.com/blog/ai-crawlers-slowing-down-your-website/
[8] https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
[9] See the Model Documentation Form published alongside the Transparency chapter of the EU AI Act Code of Practice.
