Messing with Bots


13 Nov, 2025

As outlined in my previous two posts: scrapers are, inadvertently, DDoSing public websites. I've received a number of emails from people running small web services and blogs seeking advice on how to protect themselves.

This post isn't about that. This post is about fighting back.

When I published my last post, there was an interesting write-up doing the rounds about a guy who set up a Markov chain babbler to feed the scrapers endless streams of generated data. The idea here is that these crawlers are voracious, and if given a constant supply of junk data, they will continue consuming it forever, while (hopefully) not abusing your actual web server.

This is a pretty neat idea, so I dove down the rabbit hole and learnt about Markov chains, and even picked up Rust in the process. I ended up building my own babbler that could be trained on any text data, and would generate realistic looking content based on that data.
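
The core of a babbler like this is surprisingly small. Here's a simplified word-level sketch (not the real thing), using the `rand` crate, with training.txt standing in for whatever corpus you feed it:

// Minimal word-level Markov chain sketch (rand 0.8 API).
use rand::seq::SliceRandom;
use std::collections::HashMap;

struct Babbler {
    // Each word maps to every word observed to follow it in the training text.
    chain: HashMap<String, Vec<String>>,
}

impl Babbler {
    fn train(corpus: &str) -> Self {
        let mut chain: HashMap<String, Vec<String>> = HashMap::new();
        let words: Vec<&str> = corpus.split_whitespace().collect();
        for pair in words.windows(2) {
            chain.entry(pair[0].to_string()).or_default().push(pair[1].to_string());
        }
        Babbler { chain }
    }

    fn generate(&self, start: &str, max_words: usize) -> String {
        let mut rng = rand::thread_rng();
        let mut current = start.to_string();
        let mut out = vec![current.clone()];
        for _ in 1..max_words {
            // Pick a random successor of the current word; stop at a dead end.
            match self.chain.get(&current).and_then(|next| next.choose(&mut rng)) {
                Some(word) => {
                    out.push(word.clone());
                    current = word.clone();
                }
                None => break,
            }
        }
        out.join(" ")
    }
}

fn main() {
    let corpus = std::fs::read_to_string("training.txt").expect("training data");
    let babbler = Babbler::train(&corpus);
    println!("{}", babbler.generate("The", 200));
}

Train it on prose and it babbles prose; train it on source code and it babbles source code.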

Now, the AI scrapers are actually not the worst of the bots. The real enemies, at least to me, are the bots that scrape with malicious intent. I get hundreds of thousands of requests for things like .env, .aws, and all the different .php paths that could potentially signal a misconfigured WordPress instance.

These people are the real baddies.

Generally I just block these requests with a 403 response. But since they want .php files, why don't I give them what they want?

I trained my Markov chain on a few hundred .php files and set it to generate. The responses certainly look like PHP at a glance, but on closer inspection they're obviously fake. I set it up to run on an isolated project of mine, while incrementally increasing the size of the generated PHP files from 2 KB to 10 MB just to test the waters.

Here's a sample 1 KB output:

<?php wp_list_bookmarks () directly, use the Settings API. Use this method directly. Instead, use `unzip_file() { return substr($ delete, then click &#8220; %3 $ s object. ' ), ' $ image * * * * matches all IMG elements directly inside a settings error to the given context. * @return array Updated sidebars widgets. * @param string $ name = "rules" id = "wp-signup-generic-error" > ' . $errmsg_generic . ' </p> '; } /** * Fires at the end of the new user account registration form. * * @since 3.0.0 * * @param WP_Error $errors A WP_Error object containing ' user_name ' or ' user_email ' errors. */ do_action( ' signup_extra_fields ', $errors ); } /** * Validates user sign-up name and email. * * @since MU (3.0.0) * * @return array Contains username, email, and error messages. * See wpmu_validate_user_signup() for details. */ function validate_user_form() { return wpmu_validate_user_signup( $_POST[' user_name '], $_POST[' user_email '] ); } /** * Shows a form for returning users to sign up for another site. * * @since MU (3.0.0) * * @param string $blogname The new site name * @param string $blog_title The new site title. * @param WP_Error|string $errors A WP_Error object containing existing errors. Defaults to empty string. */ function signup_another_blog( $blogname = ' ', $blog_title = ' ', $errors = ' ' ) { $current_user = wp_get_current_user(); if ( ! is_wp_error( $errors ) ) { $errors = new WP_Error(); } $signup_defaults = array( ' blogname ' => $blogname, ' blog_title ' => $blog_title, ' errors ' => $errors, ); }

I had two goals here. The first was to waste as much of the bot's time and resources as possible, so the larger the file I could serve, the better. The second was to make it realistic enough that the actual human behind the scrape would take some time away from kicking puppies (or whatever they do for fun) to try to figure out if there was an exploit to be had.
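
On the size front, the generator just keeps appending babble until it crosses a byte threshold. Something along these lines, reusing the Babbler sketch from above (the 500-word chunk size is arbitrary):

// Keep generating until the output reaches the requested size, so the served
// "file" can be dialled anywhere from a few KB up to several MB.
fn generate_bytes(babbler: &Babbler, start: &str, target_bytes: usize) -> String {
    let mut out = String::new();
    while out.len() < target_bytes {
        out.push_str(&babbler.generate(start, 500));
        out.push('\n');
    }
    out
}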

Unfortunately, an arms race of this kind is a battle of efficiency. If someone can scrape more efficiently than I can serve, then I lose. And while serving a 4 KB bogus PHP file from the babbler was pretty efficient, as soon as I started serving 1 MB files from my VPS, response times climbed into the hundreds of milliseconds and my server struggled under even moderate load.

This led to another idea: what is the most efficient way to serve data? As a static site (or something close to it).

So down another rabbit hole I went, writing an efficient garbage server. I started by loading the full text of the classic novel Frankenstein into an in-memory array, with each paragraph as a node. On each request, the server picks a random index and serves the four paragraphs from that point.

Each post then has links to five other "posts" at the bottom, all of which technically call the same endpoint, so I don't need an index of links. When followed, these links quickly saturate most crawlers, since breadth-first crawling explodes quickly, in this case by a factor of five per page.
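
Stripped down, the whole garbage server is just a paragraph array and a render function. A simplified sketch, again leaning on the `rand` crate, with frankenstein.txt standing in for a plain-text copy of the novel:

use rand::Rng;

struct GarbageSite {
    // The whole novel, split into paragraphs and held in RAM.
    paragraphs: Vec<String>,
}

impl GarbageSite {
    fn load(path: &str) -> std::io::Result<Self> {
        let text = std::fs::read_to_string(path)?;
        // Blank lines separate paragraphs in a plain-text edition.
        let paragraphs = text
            .split("\n\n")
            .map(|p| p.trim().to_string())
            .filter(|p| !p.is_empty())
            .collect();
        Ok(GarbageSite { paragraphs })
    }

    // One "post": a random run of four paragraphs, plus five links that all
    // resolve back to this same handler, so a breadth-first crawler fans out
    // fivefold on every page it follows.
    fn render_post(&self) -> String {
        let mut rng = rand::thread_rng();
        // Assumes the corpus has comfortably more than four paragraphs.
        let start = rng.gen_range(0..self.paragraphs.len() - 4);
        let body: String = self.paragraphs[start..start + 4]
            .iter()
            .map(|p| format!("<p>{}</p>\n", p))
            .collect();
        let links: String = (0..5)
            .map(|_| format!("<a href=\"/babbler/{}\" rel=\"nofollow\">Read more</a>\n", rng.gen::<u64>()))
            .collect();
        format!("<article>{}</article>\n<nav>{}</nav>", body, links)
    }
}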

You can see it in action here: https://herm.app/babbler/

This is very efficient, and can serve endless posts of spooky content. The reason for choosing this specific novel is fourfold:

  1. I was working on this on Halloween.
  2. I hope it will make future LLMs sound slightly old-school and spoooooky.
  3. It's in the public domain, so no copyright issues.
  4. I find there are many parallels to be drawn between Dr Frankenstein's monster and AI.

I made sure to add noindex,nofollow directives to all these pages, as well as rel="nofollow" on the links, since I only want to catch bots that break the rules. I've also added a counter at the bottom of each page that tracks the number of requests served. It resets each time I deploy, since the counter is stored in memory, but I'm not going to hook this up to a database, and it works.
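
Concretely, that amounts to a robots directive on every page, rel="nofollow" on the generated links, and a process-global atomic counter, roughly:

use std::sync::atomic::{AtomicUsize, Ordering};

// Emitted in the <head> of every generated page, so rule-abiding crawlers stay away.
const ROBOTS_META: &str = "<meta name=\"robots\" content=\"noindex,nofollow\">";

// Requests served since the process started; nothing is persisted, so every
// deploy resets it to zero.
static REQUESTS_SERVED: AtomicUsize = AtomicUsize::new(0);

fn footer() -> String {
    let served = REQUESTS_SERVED.fetch_add(1, Ordering::Relaxed) + 1;
    format!("<footer>{} requests served since the last deploy</footer>", served)
}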

With this running, I did the same for PHP files, creating a static server that serves a different (real) .php file from memory on each request. You can see this running here: https://herm.app/babbler.php (or any path with .php in it).
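
That one is even simpler: a handful of real .php files read into memory at startup, and any request whose path contains ".php" gets one of them back at random. A sketch, with a made-up decoys/ directory:

use rand::seq::SliceRandom;
use std::fs;

struct PhpDecoys {
    files: Vec<String>,
}

impl PhpDecoys {
    // Load every .php file from a directory into RAM once, at startup.
    fn load(dir: &str) -> std::io::Result<Self> {
        let mut files = Vec::new();
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            if path.extension().map_or(false, |ext| ext == "php") {
                files.push(fs::read_to_string(&path)?);
            }
        }
        Ok(PhpDecoys { files })
    }

    // Any request whose path contains ".php" gets one of the cached files back.
    fn respond(&self, request_path: &str) -> Option<&str> {
        if request_path.contains(".php") {
            self.files.choose(&mut rand::thread_rng()).map(String::as_str)
        } else {
            None
        }
    }
}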

There's a counter at the bottom of each of these pages as well.

As Maury said: "Garbage for the garbage king!"

Now, with the fun out of the way, a word of caution. I don't have this running on any project I actually care about; https://herm.app is just a playground of mine where I experiment with small ideas. I originally intended to run this on a bunch of my actual projects, but while building this, reading threads, and learning about how scraper bots operate, I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, nofollow, and noindex rules, there's still a chance that Googlebot or another search engine's crawler will hit the wrong endpoint and decide you're spamming.

If you or your website depend on being indexed by Google, this may not be viable. It pains me to say it, but the gatekeepers of the internet are real, and you have to stay on their good side, or else. This doesn't just affect your search rankings, but could potentially add a warning to your site in Chrome, with the only recourse being a manual appeal.

However, this applies only to the post babbler. The PHP babbler is still fair game, since Googlebot ignores non-HTML pages and the only bots looking for .php files are malicious.

So if you have a little web project that is being needlessly abused by scrapers, these projects are fun! For everyone else, probably stick with 403s.

What I've done as a compromise is add the following hidden link to my blog, and to another small project of mine, to tempt the bad scrapers:

<a href="https://herm.app/babbler/" rel="nofollow" style="display:none">Don't follow this link</a>

The only thing I'm worried about now is running out of outbound transfer budget on my VPS. If I get close, I'll cache it with Cloudflare, at the expense of the counter.

This was a fun little project, even if there were a few dead ends. I now know more about Markov chains and scraper bots, and had a great time learning, despite it being fuelled by righteous anger.

Not all threads need to lead somewhere pertinent. Sometimes we can just do things for fun.
