Early in March 2025, I noticed that a web crawler with a user agent string of
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) was hitting my blog's machine at an unreasonable rate.
I followed the URL and discovered this is what Meta uses to gather premium, human-generated content to train its LLMs. I found the rate of requests to be annoying.
I already have a PHP program, bork.php, that creates the illusion of an infinite website. I decided to answer any HTTP request that had "meta-externalagent" in its user agent string with bork.php's generated content.
I run the Apache web server. To feed bork.php generated content to Meta, I turned on mod_rewrite, and put this in the relevant config file:
```
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} meta-externalagent
RewriteRule ^.*(\?.*)*$ /bork.php [L]
```
That's as specific as I'm willing to be, given how widely Apache configs vary. bork.php is a PHP program, and has a few pre-reqs for running correctly.
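For the curious, the shape of such a program is simple. What follows is a minimal sketch, not the real bork.php; the titles, slugs and delay are placeholders:

```php
<?php
// Minimal sketch of an infinite-website responder. Not the real
// bork.php, just the shape of the idea.
sleep(random_int(1, 27));          // the real program's mean delay is ~14s

header('Content-Type: text/html; charset=utf-8');

$title = bin2hex(random_bytes(4)); // random page title
echo "<html><head><title>$title</title></head><body>\n<h1>$title</h1>\n";

// A handful of random links back to this site, so the crawler
// never runs out of fresh URLs to request.
for ($i = 0; $i < 10; $i++) {
    $slug = bin2hex(random_bytes(5));
    echo "<p><a href=\"/$slug.html\">$slug</a></p>\n";
}
echo "</body></html>\n";
```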
This worked brilliantly. Meta ramped up to requesting 270,000 URLs on May 30 and 31, 2025.

Meta's crawler requests completely swamped any other traffic my blog/website got. I was essentially only feeding Meta's AI. After about 3 months, I got scared that Meta's insatiable consumption of Super Great Pages about condiments, underwear and circa-2010 C-list celebs would start costing me money. So I switched to giving "meta-externalagent" a 404 status code. I decided to see how long it would take one of the highest-valued companies in the world to decide to go away. The answer is 5 months.
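The switch amounted to one changed line in the same mod_rewrite stanza. A sketch, with the same caveat about how widely Apache configs vary:

```
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} meta-externalagent
RewriteRule ^ - [R=404,L]
```

The "-" substitution leaves the URL untouched, and R=404 makes Apache answer with a 404 instead of serving anything. Widening it later to the rest of Meta's crawlers was just a matter of turning the RewriteCond pattern into an alternation like (meta-externalagent|facebookexternalhit|meta-externalfetcher).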
Timeline
Just to get this entirely out of the way:
- 2025-03-08 - approximate start date
- 2025-03-13 - well underway, first “combined” format Apache log file I saved
- 2025-06-17 - started giving 404s to "meta-externalagent/1.1"
- 2025-10-23 - started giving 404s to “facebookexternalhit”, “meta-externalagent”, “facebookcatalog”, “meta-externalads”, “meta-externalfetcher”, “meta-webindexer” crawlers, too, completely out of spite.
- 2025-11-10 - called an end to the experiment, started writing this post.
Results
From 2025-03-08 to 2025-11-10:
- 8,898,445 200 OK responses, issued from about 2025-03-08 until 2025-06-17
- 6,225,348 404 Not Found responses, issued from 2025-06-17 to 2025-11-10
Requested URLs
My program bork.php generates HTML with links. 25% of these links are in <img> tags, split evenly among .png, .gif and .jpg suffixes. About 20% of the <a> (anchor) tags point to external sites with randomly chosen hostnames that almost certainly do not exist in DNS. The other 80% of <a> tags have no hostname at all, so they're links back to my site.
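To make that concrete, here's roughly how such link generation might look. This is a sketch, not the real bork.php; the weights are reverse-engineered from the percentages in this post, not taken from the program itself:

```php
<?php
// Sketch of bork.php-style link generation; not the real program.
// .html takes 4 of 14 suffix slots (~28.5%), every other suffix
// takes 1 slot (~7.1%), matching the "Expected %" column below.
$suffixes = [
    'html', 'html', 'html', 'html',
    'htm', 'shtml', 'cfm', 'jsp', 'mp3', 'torrent', 'tar.gz',
    'png', 'gif', 'jpg',
];

function random_slug(): string
{
    return bin2hex(random_bytes(5));
}

function random_link(array $suffixes): string
{
    $slug = random_slug();
    if (random_int(1, 4) === 1) {
        // 25% of links are <img> tags, split evenly among image suffixes.
        $img = ['png', 'gif', 'jpg'][random_int(0, 2)];
        return "<img src=\"/$slug.$img\" alt=\"$slug\">";
    }
    if (random_int(1, 5) === 1) {
        // ~20% of anchors point at a made-up external hostname,
        // which almost certainly doesn't exist in DNS.
        return "<a href=\"http://" . random_slug() . ".example/\">external</a>";
    }
    // The other 80% are relative links back to this site,
    // with a weighted suffix.
    $suffix = $suffixes[random_int(0, count($suffixes) - 1)];
    return "<a href=\"/$slug.$suffix\">$slug</a>";
}
```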
Of the links back to my site, the randomly generated URLs have 11 suffixes (so-called "file types"), weighted towards .html. Here's what the meta-externalagent crawler asked for:
| Suffix | Expected % | Actual % |
|---|---|---|
| cfm | 7.1 | 12 |
| gif | 7.1 | 0 |
| htm | 7.1 | 10 |
| html | 28.5 | 40 |
| jpg | 7.1 | 0.4 |
| jsp | 7.1 | 12 |
| mp3 | 7.1 | 10 |
| png | 7.1 | 0.4 |
| shtml | 7.1 | 13 |
| tar.gz | 7.1 | 0 |
| torrent | 7.1 | none |
I'm not sure I'm calculating the expected percentages correctly. There are more ways for it to choose to make a .gif URL than a .cfm URL, for example, but the "Expected %" column is roughly correct. The actual proportions are far more interesting than any actual-to-expected correlation would be.
The crawler asked for exactly 0 (zero) .torrent URLs, but did ask for .mp3 URLs. It heavily favors (87% of requests) URLs that indicate they contain text: .html, .htm and the like. It heavily disfavors URLs ending in suffixes that indicate an image.
Since Meta is crawling the web to train its Large Language Model, it’s not too surprising the crawler favors retrieving text.
If I were paranoid, I'd say that asking for .mp3 URLs way out of proportion to other non-text URLs indicates that Meta is looking for copyright violations. But why not retrieve .torrent URLs in that case? In any case, bork.php doesn't return appropriate content for non-text, non-image URLs, just a few randomly chosen bytes, so I personally have nothing to fear.
Note that the HTML and image file content is randomly generated. I believe I have no copyright on it at all. What Meta does with randomly generated content is on them, but my conscience is clear.
Requesting IP Addresses
Meta made requests from both IPv4 (230 addresses) and IPv6 (580 addresses). The IPv6 addresses were all in the 2a03:2880::/29 block. The IPv4 addresses were in 173.252.64.0/18, 57.141.0.0/24, 66.220.144.0/20, 69.171.224.0/19 and 69.63.176.0/20. Meta's IPv6 addresses made 15,115,906 requests, while its IPv4 addresses made 7,887 requests. Meta distinctly prefers IPv6.
The whois data on the addresses showed a variety of Facebook-related corporate entities as “owners” of the address ranges. Meta plays the “we’re an Irish corporation” game like all the other tech giants.
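For what it's worth, counts like those fall out of the Apache logs easily. A sketch in PHP, assuming standard combined-format logs where the client address is the first field (the log path is hypothetical):

```php
<?php
// Count the crawler's requests per IP version from an Apache
// combined-format log. Adjust the path for your own setup.
$v4 = $v6 = 0;
$addrs = [];
$fh = fopen('/var/log/apache2/access.log', 'r');
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'meta-externalagent') === false) {
        continue;                       // only count the crawler's requests
    }
    $addr = strtok($line, ' ');         // first field: client address
    $addrs[$addr] = true;
    if (strpos($addr, ':') !== false) { // IPv6 addresses contain colons
        $v6++;
    } else {
        $v4++;
    }
}
fclose($fh);
printf("%d distinct addresses, %d IPv4 requests, %d IPv6 requests\n",
    count($addrs), $v4, $v6);
```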
Disturbing gap
The 2025-08-19 to 2025-08-23 interval of very low Meta HTTP requests was not Meta's crawler stopping. I think something strange was going on at the data center hosting my VPS. Apache on that VPS got very few requests from anywhere during those 5 days. The systemd command journalctl doesn't show anything suspicious, and the VPS doesn't show a reboot during that period.
I have set up Smokeping to track connectivity to my VPS. It shows nothing amiss for the period.

Confessions
This is, of course, completely unscientific. I forgot to write down the start date. I arbitrarily quit giving out fabricated, randomly generated HTML in favor of 404s. My infinite-website program doesn't give out URLs that I could use for tracking, so I can't tell if and when some crawler asks for a URL my Apache server gave out previously. And I didn't construct my "infinite website" program so that I could determine what percentage of each kind of URL it gave out.
Exhortation
This effort does show what a simple Linux guy with a $6-a-month VPS can do. If a lot of people ran scraper traps or junkyards, Facebook would have to behave properly, or behave a good deal more lawlessly. If the latter, the fig leaf of being a law-abiding citizen of The Internet would be removed.
Meta is a terrible company. They aren't being at all mannerly in scraping everything. At the very least, the effect of copyright law on their use of human-written material is arguable. I feel that we should all give fake content to Meta's AI scraper, or something similar. I believe that every time someone implements a scraper junkyard, it should be individual, highly customized and idiosyncratic, in order to give the people at Meta, Google, OpenAI and others problems. I quit goofing on Meta because I was worried about the costs of ridiculously high traffic to my $6-a-month VPS. I should probably have written my infinite-website program with some kind of rate limiting, a fixed number of requests per day perhaps, giving out 503s the rest of the day. bork.php already waits a randomly chosen delay with a mean of about 14 seconds on each request.
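That rate limiting would be easy to bolt on. A sketch of the idea, with a made-up counter file and an arbitrary daily cap:

```php
<?php
// Hypothetical daily rate limit for bork.php: after $cap requests in a
// calendar day, answer 503 for the rest of the day. The cap and the
// counter file are made up; the read-modify-write isn't atomic, but
// it's close enough for this purpose.
$cap  = 50000;
$file = '/tmp/bork-count-' . date('Y-m-d');

$count = (int) @file_get_contents($file);
if ($count >= $cap) {
    http_response_code(503);
    header('Retry-After: 3600');
    exit;
}
file_put_contents($file, $count + 1, LOCK_EX);

// ... fall through to the usual page generation ...
```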