Methodology
CommonCrawl
Common Crawl maintains one of the largest publicly available web archives. It provides billions of URLs and is used by researchers and developers, and is a key data source for training large language models.
Selection of Articles
We need a representative sample of English-language articles on the web. To do so, we randomly select 65k URLs from CommonCrawl, and confirm that each is in English, has an article schema markup, is at least 100 words, has a publish date between January 2020 and May 2025, and is an article or listicle as classified by the Graphite page type classifier.
AI Detection Algorithm
Accurate detection of AI-generated content is required to make claims about the prevalence of AI-generated articles on the web. There is a considerable disagreement about the accuracy of AI detection algorithms, and many argue that detecting AI is impossible, or at best, highly inaccurate. Many companies offer AI detection algorithms, including Originality.ai, GPTZero, Grammarly, and Surfer.
To compute the percentage of AI-generated content in an article, we use the same algorithm described in our 2024 whitepaper, but classify each chunk using Surfer’s AI detector with a chunk size of 500 words. We classify an article as AI-generated if the algorithm predicts that more than 50% of the content is AI-generated, and human-written otherwise.
Before classifying the articles in our data set, we evaluate the accuracy of Surfer’s AI detection algorithm.
Evaluation of False Positive Rate
To evaluate the false positive rate (the percentage of human-written articles classified as AI-generated), we need a dataset of human-written articles. Since the large-scale adoption of AI tools began with ChatGPT, we argue that, with high probability, articles published before the release of ChatGPT were written by humans. Therefore, we run Surfer’s AI detector on the 15,894 articles in our CommonCrawl dataset that were published between January 2020 and November 2022. SurferSEO’s AI detection tool classifies 4.2% of these articles as primarily AI-generated, suggesting a 4.2% false positive rate.
Evaluation of False Negative Rate
To evaluate the false negative rate (percentage of AI-generated articles classified as human-written), we use OpenAI’s GPT-4o to generate 6,009 articles on a wide range of topics from projects at Graphite, including commerce, finance, consumer, and B2B enterprise.
We use the OpenAI API to generate the articles using the system prompt:
You are an expert content writer. Your task is to generate clear, engaging, and informative content about the topic provided by the user.
- Write in a professional yet friendly tone.
- The target audience is people searching on the web for key terms related to the topic provided by the user.
- The user will provide a word count for the prompt. Ensure that the generated content adheres to the specified word count, allowing for a variance of plus or minus 10 percent.
- Avoid jargon unless explained.
- Do not include any disclaimers or meta-commentary.
and prompt “Write an article on the topic '{topic}' with approximately {word_count} words.”, with word_count set to the number of words in a reference human-written article on the same topic.
SurferSEO’s AI detection algorithm correctly classifies 99.4% of the AI-generated articles as AI-generated, suggesting a 0.6% false negative rate for GPT-4o.
The raw data for this evaluation is available here.
Quantifying AI-Generated Articles on the Web
Finally, we classify all 65k articles in the dataset to evaluate the percentage of articles being published on the web that are AI-generated.
The raw data with classifications is available here. Note that we do not include the URLs to avoid identifying specific companies that may be publishing AI-generated articles.
.png)
