A couple of days ago, the small server hosting this website was temporarily knocked out by scraping bots. This wasn’t the first time, nor is it the first time I’m seriously considering employing more aggressive countermeasures such as Anubis (see for example the June 2025 summary post). But every time something like this happens, a portion of the software hobbyist in me dies. We should add this to the list of things AI scrapers destroy next to our environment, the creative enthusiasm of the individuals who made things that are being scraped, and our critical thinking skills.
When I tried accessing Brain Baking, I was met with an unusual delay that prompted me to log in and see what was going on. A simple top revealed both Gitea and the Fail2ban server gobbling up almost all CPU resources. Uh oh. Quickly killing Gitea didn’t reduce the work of Fail2ban as the Nginx access logs were being flooded with entries such as:
47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.151 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/rss/commit/5911666cf0b30236cdc7590abb4e171534faf972/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.32 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/7b46fd682f36af81d4852b8ee2ee9970c638cac6/layouts HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.157 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/404.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.205 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/content/about.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.95 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/25674d6de08a667926aab89362fa7bb585cd35c5/content/links.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.191 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/themes HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.116 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/b4eac0fb71b056cb44fe062b8f2c0949dbb08af6/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"

I have enough fail-safe systems in place to block bad bots, but the user agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 isn’t immediately recognized as “bad”: it’s ridiculously easy to spoof that HTTP header. Most user agent checkers I throw this string at claim this agent isn’t a bot. That means we shouldn’t rely on this information alone.
Also, I temporarily block isolated IPs that keep on poking around (e.g. IPs that trip the Nginx rate limiting get pulled into the ban list), but of course these scrapers never come from a single source. Yet the attacking IPs all shared the same prefix: 47.79. The website ipinfo.io can help in identifying the threat: AS45102 Alibaba (US) Technology Co., Ltd. Huh?
Apparently, Alibaba provides hosting from Singapore that is frequently abused by attackers. Many others who host forum software such as phpBB experienced the same problems, and although the AbuseIPDB doesn’t report recent issues on the IPs from the above logs, I went ahead and blocked the entire range.
Fail2ban was struggling to keep up: it ingests the Nginx access.log file to apply its rules but if the files keep on exploding… Piping cat access.log | grep /commit/ | cut -d " " -f 1 to instant-ban everyone trying to access Git’s commit logs simply wasn’t fast enough. The only thing that had immediate effect was sudo iptables -I INPUT -s 47.79.0.0/16 -j DROP.
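To decide which range to drop, it helps to see how the /commit/ requests cluster per network. Below is a minimal Python sketch of that idea, not what was actually run on the server. It assumes the default Nginx "combined" log format; the function name `commit_scrapers_by_network` and the `/16` grouping are illustrative choices. The first two sample lines are taken from the log above with the user agent shortened, and the third is made up, using an address from a documentation range.

```python
import ipaddress
import re
from collections import Counter

# Captures the client IP and request path of an Nginx "combined" log line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (?P<path>\S+)')


def commit_scrapers_by_network(lines, prefix_len=16):
    """Count requests to /commit/ paths, grouped by IPv4 network."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or "/commit/" not in match.group("path"):
            continue
        try:
            address = ipaddress.ip_address(match.group("ip"))
        except ValueError:
            continue  # first field was not an IP address
        if address.version != 4:
            continue  # this sketch only groups IPv4 clients
        # strict=False masks the host bits, e.g. 47.79.216.157 -> 47.79.0.0/16
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Sample input: user agents shortened, last line is invented.
SAMPLE = [
    '47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "Mozilla/5.0"',
    '47.79.217.151 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/rss/commit/5911666cf0b30236cdc7590abb4e171534faf972/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [27/Oct/2025:13:05:36 +0100] "GET /post/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for network, hits in commit_scrapers_by_network(SAMPLE).most_common():
    print(network, hits)
```

Reading the real access.log line by line (for example `open("access.log")`, path assumed) instead of `SAMPLE` would give the same tally for live traffic. The output is only a hint about where the requests come from; whether a whole range deserves an iptables rule remains a manual decision.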
In case that wasn’t yet clear: I hate having to deal with this. It’s a waste of time, doesn’t hold back the next attack coming from another range, and intervening always happens too late. But worst of all, semi-random firefighting is just one big mood killer. I just know this won’t be enough. Having a robust anti-attacker system in place might increase the odds, but that means either resorting to hand cannons like Anubis or moving the entire hosting to Cloudflare, which will do it for me. But I don’t want to fiddle with even more moving components and configuration, nor do I want to route my visitors through tracking-enabled USA servers.
That Gitea instance should be moved off-site, or better yet, I should move the migration to Codeberg to the top of my TODO list. Yet it’s sad to see that people who like fiddling with their own little servers are increasingly punished for doing so, pushing many to a centralized solution, making things worse in the long term. The internet is no longer a safe haven for software hobbyists. I could link to dozens of other bloggers who reported similar issues to further solidify my point.
Another thing I’ve noticed is increased traffic with Referer headers pointing to strange websites such as bioware.com, mcdonalds.com, and microsoft.com. It’s not like any of these giants are going to link to an article on this site. I don’t understand what the purpose of spoofing that header is, besides upping the hit count.
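For anyone who wants to check their own logs for the same pattern, here is a small Python sketch that tallies Referer hosts. It assumes the Nginx "combined" log format, where the referer is the first quoted field after the status code and response size. The function name `referer_hosts` and all sample lines are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# In the "combined" format the referer follows: "<request>" <status> <size> "<referer>"
REFERER = re.compile(r'" \d{3} \S+ "(?P<referer>[^"]*)"')


def referer_hosts(lines):
    """Count how often each Referer host appears; "-" and malformed values are skipped."""
    counts = Counter()
    for line in lines:
        match = REFERER.search(line)
        if match is None:
            continue
        # hostname is None for "-" or for values without a scheme
        host = urlsplit(match.group("referer")).hostname
        if host:
            counts[host] += 1
    return counts


# Invented sample lines using addresses from documentation ranges.
SAMPLE_REFERERS = [
    '198.51.100.4 - - [27/Oct/2025:13:10:01 +0100] "GET /post/ HTTP/1.1" 200 1024 "https://mcdonalds.com/" "Mozilla/5.0"',
    '198.51.100.9 - - [27/Oct/2025:13:10:02 +0100] "GET /post/ HTTP/1.1" 200 1024 "https://mcdonalds.com/menu" "Mozilla/5.0"',
    '198.51.100.23 - - [27/Oct/2025:13:10:03 +0100] "GET /post/ HTTP/1.1" 200 1024 "https://bioware.com/" "Mozilla/5.0"',
    '203.0.113.7 - - [27/Oct/2025:13:10:04 +0100] "GET /post/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for host, hits in referer_hosts(SAMPLE_REFERERS).most_common():
    print(host, hits)
```

A tally like this only shows which hosts turn up; it cannot tell a spoofed Referer from a genuine one.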
However much worse things might get, I refuse to give in.
It’s just like 50 Cent said: Get Hostin’ Or Die Tryin’.
