Content and Community


One of the oldest and most fruitful topics on Stratechery has been the evolution of the content industry, for two reasons: first, it undergirded the very existence of Stratechery itself, which I’ve long viewed not simply as a publication but also as a proof of concept for a (then) new content business model.

Second, I have long thought that what happened to content was a harbinger for what would happen to industries of all types. Content was trivially digitized, which means the forces of digital — particularly zero marginal cost reproduction and distribution — manifested in content industries first, but were by no means limited to them. That meant that if you could understand how the Internet impacted publishing — newspapers, books, magazines, music, movies, etc. — you might have a template for what would happen to other industries as they themselves digitized.

AI is the apotheosis of this story and, in retrospect, it’s a story whose development stretches back not just to the creation of the Internet, but hundreds of years prior, to the invention of the printing press. Or, if you really want to get crazy, to the evolution of humanity itself.

The AI Unbundling and Content Commoditization

In September 2022, two months before the release of ChatGPT, I wrote about The AI Unbundling, and traced the history of communication to those ancient times:

As much as newspapers may rue the Internet, their own business model — and my paper delivery job — were based on an invention that I believe is the only rival for the Internet’s ultimate impact: the printing press. Those two inventions, though, are only two pieces of the idea propagation value chain. That value chain has five parts:

 creation, substantiation, duplication, distribution, consumption

The evolution of human communication has been about removing whatever bottleneck is in this value chain. Before humans could write, information could only be conveyed orally; that meant that the creation, vocalization, delivery, and consumption of an idea were all one-and-the-same. Writing, though, unbundled consumption, increasing the number of people who could consume an idea.

Writing unbundled consumption from the rest of the value chain

Now the new bottleneck was duplication: to reach more people whatever was written had to be painstakingly duplicated by hand, which dramatically limited what ideas were recorded and preserved. The printing press removed this bottleneck, dramatically increasing the number of ideas that could be economically distributed:

The new bottleneck was distribution, which is to say this was the new place to make money; thus the aforementioned profitability of newspapers. That bottleneck, though, was removed by the Internet, which made distribution free and available to anyone.

The Internet unbundled distribution from duplication

What remains is one final bundle: the creation and substantiation of an idea. To use myself as an example, I have plenty of ideas, and thanks to the Internet, the ability to distribute them around the globe; however, I still need to write them down, just as an artist needs to create an image, or a musician needs to write a song. What is becoming increasingly clear, though, is that this too is a bottleneck that is on the verge of being removed.

It’s a testament to how rapidly AI has evolved that this observation already feels trite; while I have no idea how to verify these numbers, it seems likely that AI has substantiated more content in the last three years than was substantiated by all of humanity in all of history previously. We have, in other words, reached total content commoditization: the chatbot of your choice will substantiate any content you want on command.

Copyright and Transformation

Many publishers are, as you might expect, up in arms about this reality, and have pinned their hopes for survival on the courts and copyright law. After all, the foundation for all of that new content is the content that came before — content that was created by humans.

The fundamental problem for publishers, however, is that all of this new content is, at least in terms of a textual examination of output, new; in other words, AI companies are soundly winning the first factor of the fair use test, which is whether or not their output is transformative. Judge William Alsup wrote in a lawsuit against Anthropic:

The purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use. The first factor favors fair use for the training copies.

Judge Vince Chhabria wrote a day later in a lawsuit against Meta:

There is no serious question that Meta’s use of the plaintiffs’ books had a “further purpose” and “different character” than the books — that it was highly transformative. The purpose of Meta’s copying was to train its LLMs, which are innovative tools that can be used to generate diverse text and perform a wide range of functions. Users can ask Llama to edit an email they have written, translate an excerpt from or into a foreign language, write a skit based on a hypothetical scenario, or do any number of other tasks. The purpose of the plaintiffs’ books, by contrast, is to be read for entertainment or education.

The two judges differed in their view of the fourth factor — the impact that LLMs would have on the market for the copyright holders — but ultimately came to the same conclusion: Judge Alsup said that the purpose of copyright law wasn’t to protect authors from competition for new content, while Judge Chhabria said that the authors hadn’t produced evidence of harm.

In fact, I think that both are making the same point (see my earlier analysis here and here): Judge Chhabria clearly wished that he could rule in favor of the authors, but to do so would require proving a negative — sales that didn’t happen because would-be customers used LLMs instead. That’s something that seems impossible to ascertain, which gives credence to Judge Alsup’s simpler analogy of an LLM to a human author who learned from the books they read. Yes, AI is of such a different scale as to be another category entirely, but given the un-traceability of sales that didn’t happen, the analogy holds for legal purposes.

Publishing’s Three Eras

Still, just because it is impossible to trace specific harm doesn’t mean harm doesn’t exist. Look no further than the aforementioned history of publishing. To briefly compress hundreds of years of history into three periods:

Printing Presses and Nation States

From The Internet and the Third Estate:

In the Middle Ages the principal organizing entity for Europe was the Catholic Church. Relatedly, the Catholic Church also held a de facto monopoly on the distribution of information: most books were in Latin, copied laboriously by hand by monks. There was some degree of ethnic affinity between various members of the nobility and the commoners on their lands, but underneath the umbrella of the Catholic Church were primarily independent city-states.

The printing press changed all of this. Suddenly Martin Luther, whose critique of the Catholic Church was strikingly similar to Jan Hus 100 years earlier, was not limited to spreading his beliefs to his local area (Prague in the case of Hus), but could rather see those beliefs spread throughout Europe; the nobility seized the opportunity to interpret the Bible in a way that suited their local interests, gradually shaking off the control of the Catholic Church.

Meanwhile, the economics of printing books was fundamentally different from the economics of copying by hand. The latter was purely an operational expense: output was strictly determined by the input of labor. The former, though, was mostly a capital expense: first, to construct the printing press, and second, to set the type for a book. The best way to pay for these significant up-front expenses was to produce as many copies of a particular book as could be sold.

How, then, to maximize the number of copies that could be sold? The answer was to print using the most widely used dialect of a particular language, which in turn incentivized people to adopt that dialect, standardizing language across Europe. That, by extension, deepened the affinities between city-states with shared languages, particularly over decades as a shared culture developed around books and later newspapers. This consolidation occurred at varying rates — England and France several hundred years before Germany and Italy — but in nearly every case the First Estate became not the clergy of the Catholic Church but a national monarch, even as the monarch gave up power to a new kind of meritocratic nobility epitomized by Burke.

The printing press created culture, which itself became the common substrate for nation-states.

Copyright and Franchises

It was nation-states, meanwhile, that made publishing into an incredible money-maker. The most important event in common-law countries was The Statute of Anne in 1710. For the first time the Parliament of Great Britain established the concept of copyright, vested in authors for a limited period of time (14 years, with the possibility of a 14-year renewal); the goal, clearly stated in the preamble, was to incentivize creation:

Whereas printers, booksellers, and other persons have of late frequently taken the liberty of printing, reprinting, and publishing, or causing to be printed, reprinted, and published, books and other writings, without the consent of the authors or proprietors of such books and writings, to their very great detriment, and too often to the ruin of them and their families: for preventing therefore such practices for the future, and for the encouragement of learned men to compose and write useful books; may it please your Majesty, that it may be enacted, and be it enacted by the Queen’s most excellent majesty, by and with the advice and consent of the lords spiritual and temporal, and commons, in this present parliament assembled, and by the authority of the same…

Three-quarters of a century later America’s founding fathers would, for similar motivations, and in line with the English tradition that undergirded the United States, put copyright in the Constitution:

[The Congress shall have power] To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.

These are noble goals; at the same time, it’s important to keep in mind that copyright is an economic distortion, because it is a government-granted monopoly. That, by extension, meant that there was a lot of money to be made in publishing if you could leverage these monopoly rights to your advantage. To focus on the U.S.:

  • The mid-1800s, led by Benjamin Day and James Gordon Bennett Sr., saw the rise of advertising as a funding source of newspapers, which dramatically decreased the price of an individual copy, expanding reach, which attracted more advertisers.
  • The turn of the century brought nationwide scale to bear, as entrepreneurs like Joseph Pulitzer and William Randolph Hearst built out nation-wide publishing empires with scaled advertising and reporting.
  • In the mid-20th century Henry Luce and Condé Montrose Nast created and perfected the magazine model, which combined scale on the back-end with segmentation and targeting on the front-end.

The success of these publishing empires was, in contrast to the first era of publishing, downstream from the existence of nation-states: the fact that the U.S. was a massive market created the conditions for publishing’s golden era, and companies that were franchises. That’s Warren Buffett’s term, from a 1991 letter to shareholders:

An economic franchise arises from a product or service that: (1) is needed or desired; (2) is thought by its customers to have no close substitute and; (3) is not subject to price regulation. The existence of all three conditions will be demonstrated by a company’s ability to regularly price its product or service aggressively and thereby to earn high rates of return on capital. Moreover, franchises can tolerate mis-management. Inept managers may diminish a franchise’s profitability, but they cannot inflict mortal damage.

In contrast, “a business” earns exceptional profits only if it is the low-cost operator or if supply of its product or service is tight. Tightness in supply usually does not last long. With superior management, a company may maintain its status as a low-cost operator for a much longer time, but even then unceasingly faces the possibility of competitive attack. And a business, unlike a franchise, can be killed by poor management.

Until recently, media properties possessed the three characteristics of a franchise and consequently could both price aggressively and be managed loosely. Now, however, consumers looking for information and entertainment (their primary interest being the latter) enjoy greatly broadened choices as to where to find them. Unfortunately, demand can’t expand in response to this new supply: 500 million American eyeballs and a 24-hour day are all that’s available. The result is that competition has intensified, markets have fragmented, and the media industry has lost some — though far from all — of its franchise strength.

Given that Buffett wrote this in 1991, he was far more prescient than he probably realized, because the Internet was about to destroy the whole model.

The Internet and Aggregators

The great revelation of the Internet is that copyright wasn’t the only monopoly that mattered to publishers: newspapers in particular benefited from being de facto geographic monopolies as well. The largest newspaper in a particular geographic area attracted the most advertisers, which gave it the most resources to have the best content, further cementing its advantages and the leverage it had on its fixed costs (printing presses, delivery, and reporters). I explained what happened next in 2014’s Economic Power in the Age of Abundance:

One of the great paradoxes for newspapers today is that their financial prospects are inversely correlated to their addressable market. Even as advertising revenues have fallen off a cliff — adjusted for inflation, ad revenues are at the same level as the 1950s — newspapers are able to reach audiences not just in their hometowns but literally all over the world.

The Internet has created unlimited reach

The problem for publishers, though, is that the free distribution provided by the Internet is not an exclusive. It’s available to every other newspaper as well. Moreover, it’s also available to publishers of any type, even bloggers like myself.

A city view of Stratechery's readers in 2014

To be clear, this is absolutely a boon, particularly for readers, but also for any writer looking to have a broad impact. For your typical newspaper, though, the competitive environment is diametrically opposed to what they are used to: instead of there being a scarce amount of published material, there is an overwhelming abundance. More importantly, this shift in the competitive environment has fundamentally changed just who has economic power.

In a world defined by scarcity, those who control the scarce resources have the power to set the price for access to those resources. In the case of newspapers, the scarce resource was readers’ attention, and the purchasers were advertisers…The Internet, though, is a world of abundance, and there is a new power that matters: the ability to make sense of that abundance, to index it, to find needles in the proverbial haystack. And that power is held by Google.

Google was an Aggregator, and publishers — at least those that users visited via a search results page — were a commodity; it was inevitable that money from advertisers in particular would increasingly flow to the former at the expense of the latter.

There were copyright cases against Google, most notably 2006’s Field v. Google, which held that Google’s usage of snippets of the plaintiff’s content was fair use, and furthermore, that Blake Field, the author, had implicitly given Google a license to cache his content by not specifying that Google not crawl his website.

The crucial point to make about this case, however, and Google’s role on the Internet generally, is that Google posting a snippet of content was good for publishers, at least compared to the AI alternative.

Cloudflare and the AI Content Market

Go back to the two copyright rulings I referenced above: both judges emphasized that the LLMs in question (Claude and Llama) were not reproducing the copyrighted content they were accused of infringing; rather, they were generating novel content by predicting tokens. Here’s Judge Alsup on how Anthropic used copyrighted work:

Each cleaned copy was translated into a “tokenized” copy. Some words were “stemmed” or “lemmatized” into simpler forms (e.g., “studying” to “study”). And, all characters were grouped into short sequences and translated into corresponding number sequences or “tokens” according to an Anthropic-made dictionary. The resulting tokenized copies were then copied repeatedly during training. By one account, this process involved the iterative, trial-and-error discovery of contingent statistical relationships between each word fragment and all other word fragments both within any work and across trillions of word fragments from other copied books, copied websites, and the like.

Judge Chhabria explained how these tokens contribute to the final output:

LLMs learn to understand language by analyzing relationships among words and punctuation marks in their training data. The units of text — words and punctuation marks — on which LLMs are trained are often referred to as “tokens.” LLMs are trained on an immense amount of text and thereby learn an immense amount about the statistical relationships among words. Based on what they learned from their training data, LLMs can create new text by predicting what words are most likely to come next in sequences. This allows them to generate text responses to basically any user prompt.
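To make the mechanics concrete, the pipeline both judges describe — stem the words, map text to numeric tokens, learn which tokens tend to follow which, then generate by prediction — can be sketched in miniature. This is an illustrative toy, not Anthropic’s or Meta’s actual process; the crude stemming rule, the growing dictionary, and the bigram statistics are stand-ins for systems that operate over trillions of tokens:

```python
from collections import Counter, defaultdict

def tokenize(text, vocab):
    """Map each word to a numeric token id, crudely 'stemming' by
    lowercasing and stripping an -ing suffix (e.g. 'studying' -> 'study')."""
    tokens = []
    for word in text.lower().split():
        stem = word[:-3] if word.endswith("ing") and len(word) > 5 else word
        if stem not in vocab:
            vocab[stem] = len(vocab)  # grow the dictionary as we go
        tokens.append(vocab[stem])
    return tokens

def train_bigrams(token_stream):
    """Count how often each token follows each other token — a toy version
    of the 'contingent statistical relationships' in the ruling."""
    follows = defaultdict(Counter)
    for a, b in zip(token_stream, token_stream[1:]):
        follows[a][b] += 1
    return follows

def predict_next(follows, token):
    """Generate by picking the statistically most likely next token."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

vocab = {}
stream = tokenize("the cat sat on the mat the cat sat on the hat", vocab)
model = train_bigrams(stream)
# In this toy corpus, the most frequent successor of "the" is "cat"
assert predict_next(model, vocab["the"]) == vocab["cat"]
```

A real LLM replaces the bigram counts with a neural network of staggering scale, but the legal point survives the simplification: what is stored and exercised are statistical relationships among tokens, not copies of the books themselves.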

This isn’t just commoditization: it’s deconstruction. To put it another way, publishers were better off when an entity like Google was copying their text; Google summarizing information — which is what happens with LLM-powered AI Overviews in Search — is much worse, even if it’s even less of a copyright violation.

This was a point made to me by Cloudflare CEO Matthew Prince in a conversation we had after I wrote last week about the company’s audacious decision to block AI crawlers on Cloudflare-protected sites by default. What the company is proposing to build is a new model of monetization for publishers; Prince wrote in a blog post:

We’ll work on a marketplace where content creators and AI companies, large and small, can come together. Traffic was always a poor proxy for value. We think we can do better. Let me explain. Imagine an AI engine like a block of swiss cheese. New, original content that fills one of the holes in the AI engine’s block of cheese is more valuable than repetitive, low-value content that unfortunately dominates much of the web today. We believe that if we can begin to score and value content not on how much traffic it generates, but on how much it furthers knowledge — measured by how much it fills the current holes in AI engines’ “swiss cheese” — we not only will help AI engines get better faster, but also potentially facilitate a new golden age of high-value content creation. We don’t know all the answers yet, but we’re working with some of the leading economists and computer scientists to figure them out.
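One toy way to operationalize that “swiss cheese” scoring — valuing content by how much it adds, not how much traffic it draws — is to measure the share of a document’s word sequences that an existing corpus doesn’t already contain. This is my own illustrative sketch, not Cloudflare’s actual methodology:

```python
def ngrams(text, n=3):
    """All n-word sequences in a text, normalized to lowercase."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def novelty_score(document, corpus, n=3):
    """Fraction of the document's n-grams absent from the corpus:
    1.0 fills a brand-new hole, 0.0 is pure repetition."""
    doc_grams = ngrams(document, n)
    if not doc_grams:
        return 0.0
    seen = set()
    for existing in corpus:
        seen |= ngrams(existing, n)
    return len(doc_grams - seen) / len(doc_grams)

corpus = ["the quick brown fox jumps over the lazy dog"]
# Verbatim repetition fills no holes; original content fills many
assert novelty_score("the quick brown fox jumps over the lazy dog", corpus) == 0.0
assert novelty_score("an entirely original sentence about swiss cheese", corpus) == 1.0
```

Any real scoring system would need to value semantic novelty, not just surface novelty — which is presumably where the economists and computer scientists come in — but the shape of the incentive is the same: fill holes, get paid.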

Cloudflare is calling its initial idea pay per crawl:

Pay per crawl, in private beta, is our first experiment in this area. Pay per crawl integrates with existing web infrastructure, leveraging HTTP status codes and established authentication mechanisms to create a framework for paid content access. Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing. Cloudflare acts as the Merchant of Record for pay per crawl and also provides the underlying technical infrastructure…

At its core, pay per crawl begins a technical shift in how content is controlled online. By providing creators with a robust, programmatic mechanism for valuing and controlling their digital assets, we empower them to continue creating the rich, diverse content that makes the Internet invaluable…The true potential of pay per crawl may emerge in an agentic world. What if an agentic paywall could operate entirely programmatically? Imagine asking your favorite deep research program to help you synthesize the latest cancer research or a legal brief, or just help you find the best restaurant in Soho — and then giving that agent a budget to spend to acquire the best and most relevant content. By anchoring our first solution on HTTP response code 402, we enable a future where intelligent agents can programmatically negotiate access to digital resources.
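The handshake described above — present payment intent and get a 200, or receive a 402 Payment Required with pricing — can be modeled as a simple server-side decision. The header names and price format below are hypothetical placeholders; Cloudflare’s private beta defines its own:

```python
from dataclasses import dataclass

PRICE_USD_PER_CRAWL = "0.01"  # hypothetical per-request price

@dataclass
class Response:
    status: int
    headers: dict
    body: str

def handle_crawl(request_headers: dict) -> Response:
    """Decide a crawler request the way the blog post sketches:
    payment intent in the request headers -> 200 with content;
    otherwise -> 402 Payment Required, quoting a price."""
    intent = request_headers.get("crawler-payment-intent")  # hypothetical header
    if intent == PRICE_USD_PER_CRAWL:
        return Response(200, {}, "<html>the article text</html>")
    return Response(
        402,
        {"crawler-price": PRICE_USD_PER_CRAWL},  # hypothetical header
        "Payment Required",
    )

# An agent's first request discovers the price...
quote = handle_crawl({})
assert quote.status == 402 and quote.headers["crawler-price"] == "0.01"
# ...and a retry that commits to that price gets the content.
paid = handle_crawl({"crawler-payment-intent": quote.headers["crawler-price"]})
assert paid.status == 200
```

The appeal of anchoring on an HTTP status code is that any agent that speaks HTTP can participate: a 402 is machine-readable, so the quote-then-retry loop needs no human in it.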

I think there is value in Cloudflare’s efforts, which are very much in line with what I proposed in May’s The Agentic Web and Original Sin:

What is possible — not probable, but at least possible — is to in the long run build an entirely new marketplace for content that results in a new win-win-win equilibrium.

A new AI content marketplace

First, the protocol layer should have a mechanism for payments via digital currency, i.e. stablecoins. Second, AI providers like ChatGPT should build an auction mechanism that pays out content sources based on the frequency with which they are cited in AI answers. The result would be a new universe of creators who will be incentivized to produce high quality content that is more likely to be useful to AI, competing in a marketplace a la the open web; indeed, this would be the new open web, but one that operates at even greater scale than the current web given the fact that human attention is a scarce resource, while the number of potential agents is infinite.
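The payout half of that proposed auction reduces, at its simplest, to splitting a pool of answer revenue pro rata over citation counts; the sites and numbers below are purely illustrative:

```python
def payout(budget_cents: int, citation_counts: dict) -> dict:
    """Split a fixed budget across sources in proportion to how often
    each was cited, handing out whole cents and giving any rounding
    remainder to the most-cited source so the pool is fully distributed."""
    total = sum(citation_counts.values())
    if total == 0:
        return {source: 0 for source in citation_counts}
    shares = {
        source: budget_cents * count // total
        for source, count in citation_counts.items()
    }
    remainder = budget_cents - sum(shares.values())
    top = max(citation_counts, key=citation_counts.get)
    shares[top] += remainder
    return shares

citations = {"siteA.example": 6, "siteB.example": 3, "siteC.example": 1}
split = payout(1000, citations)  # a $10.00 answer-revenue pool, in cents
assert split == {"siteA.example": 600, "siteB.example": 300, "siteC.example": 100}
assert sum(split.values()) == 1000
```

A production marketplace would layer on bidding, quality weights, and fraud detection, but the core incentive — produce content that gets cited, and revenue follows — is exactly this simple.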

I do think that there is a market to be made in producing content for AI; it seems likely to me, however, that this market will not save existing publishers. Rather, just as Google created an entirely new class of content sites, Amazon and Meta an entirely new class of e-commerce merchants, and Apple and Google an entirely new class of app builders, AI will create an entirely new class of token creators who explicitly produce content for LLMs. Existing publishers will participate in this market, but won’t be central to it.

Consider Meta’s market-making as an example. From 2020’s Apple and Facebook:

This explains why the news about large CPG companies boycotting Facebook is, from a financial perspective, simply not a big deal. Unilever’s $11.8 million in U.S. ad spend, to take one example, is replaced with the same automated efficiency that Facebook’s timeline ensures you never run out of content. Moreover, while Facebook loses some top-line revenue — in an auction-based system, less demand corresponds to lower prices — the companies that are the most likely to take advantage of those lower prices are those that would not exist without Facebook, like the direct-to-consumer companies trying to steal customers from massive conglomerates like Unilever.

In this way Facebook has a degree of anti-fragility that even Google lacks: so much of its business comes from the long tail of Internet-native companies that are built around Facebook from first principles, that any disruption to traditional advertisers — like the coronavirus crisis or the current boycotts — actually serves to strengthen the Facebook ecosystem at the expense of the TV-centric ecosystem of which these CPG companies are a part.

In short, Meta advertising made Meta advertisers; along those lines, the extent to which Cloudflare or anyone else manages to create a market for AI content is the extent to which I expect new companies to dominate that market; existing publishers will be too encumbered by their existing audience and business model — decrepit though it may be — to effectively compete with these new entrants.

Content-Based Communities

So, are existing publishers doomed?

Well, by and large, yes, but that’s because they have been doomed for a long time. People using AI instead of Google — or Google using AI to provide answers above links — make the long-term outlook for advertising-based publishers worse, but that’s an acceleration of a demise that has long been in motion.

To that end, the answer for publishers in the age of AI is no different than it was in the age of Aggregators: build a direct connection with readers. This, by extension, means business models that maximize revenue per user, which is to say subscriptions (the business model that undergirds this site, and an increasing number of others).

What I think is intriguing, however, is the possibility to go back to the future. Once upon a time publishing made countries; the new opportunity for publishing is to make communities. This is something that AI, particularly as it manifests today, is fundamentally unsuited to: all of that content generated by LLMs is individualized; what you ask, and what the AI answers, is distinct from what I ask, and what answers I receive. This is great for getting things done, but it’s useless for creating common ground.

Stratechery, on the other hand, along with a host of other successful publications, has the potential to be a totem pole around which communities can form. Here is how Wikipedia defines a totem pole:

The word totem derives from the Algonquian word odoodem [oˈtuːtɛm] meaning “(his) kinship group”. The carvings may symbolize or commemorate ancestors, cultural beliefs that recount familiar legends, clan lineages, or notable events. The poles may also serve as functional architectural features, welcome signs for village visitors, mortuary vessels for the remains of deceased ancestors, or as a means to publicly ridicule someone. They may embody a historical narrative of significance to the people carving and installing the pole. Given the complexity and symbolic meanings of these various carvings, their placement and importance lies in the observer’s knowledge and connection to the meanings of the figures and the culture in which they are embedded. Contrary to common misconception, they are not worshipped or the subject of spiritual practice.

The digital environment, thanks in part to the economics of targeted advertising, the drive for engagement, and most recently, the mechanisms of token prediction, is customized to the individual; as LLMs consume everything, including advertising-based media — which, by definition, is meant to be mass market — the hunger for something shared is going to increase.

We already have a great example of this sort of shared experience in sports. Sports, for most people, is itself a form of content: I don’t play football or baseball or basketball or drive an F1 car, but I relish the fact that people around me watch the same games and races that I do, and that that shared experience gives me a reason to congregate and commune with others, and is an ongoing topic of discussion.

Indeed, this desire for a communal topic of interest is probably a factor in the inescapable reach of politics, particularly what happens in Washington D.C.: of course policies matter, but there is an aspect of politics’ prominence that I suspect is downstream of politics as entertainment, and a sorting mechanism for community.

In short, there is a need for community, and I think content, whether it be an essay, a podcast, or a video, can be artifacts around which communities can form and sustain themselves, ultimately to the economic benefit of the content creator. There is, admittedly, a lot to figure out in terms of that last piece, but when you remember that content made countries, the potential upside is likely quite large indeed.
