Using lots of little tools to aggressively reject the bots


Or using lots of little tools to aggressively reject the bots · October 18th, 2024

U G H.

For some reason my quaint little piece of the internet has suddenly been inundated with unwanted guests. Now, normally speaking, I would be overjoyed to have more guests in this tiny part of the internet. Come inside the cozy little server room, we have podcasts to drown out the noise of the fans, and plenty to read. Probably at some point in the future there will be photography things too, and certainly plenty of company. But no, nobody can have such nice things. Instead the door to the server room was kicked down and in came a horde of robots hell-bent on scraping every bit of data they possibly could from the site.

Now, for the longest time, I've had no real issue with this. Archive.org is welcome to swing by any time, index the entire site, and stash it away for posterity. Keep up the good work folks! But instead of respectful netizens like that, I have the likes of Amazon, Facebook, and OpenAI, along with a gaggle of random friends, knocking on my door. These big corporations 1) do not need my content and 2) are only accessing it for entirely self-serving ends.

Let's not even pretend it's anything else, because we know it isn't. These large companies scrape data broadly and with little regard for the effect it has on the infrastructure serving whatever it is they're pulling from. With the brain slug that is "AI" now openly encouraging the mass consumption of data from the internet at large to train models on, it was really only a matter of time before the scraping became more severe. This is the hype cycle at work. OpenAI needs to scrape to train, Facebook does too because they have a competing model. Amazon and Google and Microsoft all have their own reasons related to search and advertising, bending the traffic to flow through their platforms. The point is, these are not "consumers" of Lambdacreate. You, the human reading this, are! Thanks for reading.

To the bots. Roboti ite domum!

Hyperbole aside, what's our problem exactly?

Fortunately, I am well versed in systems administration, and have a whole toolkit at my disposal to analyze the issue. Let's put some numbers against all of the above hyperbole.

My initial sign that something was up came in from my Zabbix instance. I call the little server that runs my Zabbix & Loki instances Vignere, after the creator of the Vigenère cipher, hence the funky photo in Discord. Anyways, Vignere complained about my server using up its entire disc for all of my containers. Frustrating, but not a big deal since I'm using LXD under the hood.

A Zabbix alert in a Discord channel displaying disc exhaustion for multiple containers.

Fine, I'll take my lumps. I took down all of my sites briefly, expanded the underlying ZFS sparse file, and brought the world back up. No harm no foul, just growing pains. But of course, that really wasn't the issue. I was inundated with more alerts. Suddenly I was seeing my Gitea instance grow to consume the entire disc, easily generating 20-30G of data each day. Super frustrating, and enough information on the internet says that Gitea just does this and doesn't enable repo archive cleanup by default, so that must be it. I happily went and set up some aggressive cleanup tasks, thinking my problems were over. Maybe I shouldn't have set up a self-hosted git forge and just stuck with Gitlab or Github.
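If you want to do the same, Gitea exposes this through a cron task in app.ini. Something along these lines should do it, though the values here are illustrative rather than my exact config, so check the config cheat sheet for your Gitea version:

[cron.archive_cleanup]
; periodically delete generated repo archives (the tarballs/zips bots love to request)
ENABLED = true
RUN_AT_START = true
SCHEDULE = @hourly
; anything older than this gets cleaned up
OLDER_THAN = 6h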

But no, not at all, this thin veneer of a fix rapidly crumbled under the sudden and aggressive uptick in web traffic I started seeing. Suddenly it wasn't just disc usage; I was getting inundated with CPU and memory alerts from my poor server. I couldn't git pull or push to my Gitea. Hell, my weechat client couldn't even stay connected. Everything ground to a halt for a bit. But by the time I could get away from work, or the kids, and pull out my computer to dig into it, the problem had stopped. I could access everything. Sysstat and Zabbix told me that the resource utilization issues were real, but I couldn't exactly tell why from just that.

A Zabbix alert in a Discord channel displaying extremely high cpu utilization.

This is, however, why I keep an out of band monitoring system in the first place. I need to be able to look at historic metrics to see what "normal" looks like. Otherwise it's all just guesswork. And boy did Zabbix have a story to tell me. To get a clear understanding of what I mean, let's take a quick look at the full dashboard from when I redid my Zabbix server after it failed earlier this year. Pew pew flashy graphs right? The important one here is the nginx requests and network throughput chart in the bottom left hand corner of the dashboard. Note that that's what "normal" traffic looks like for my tiny part of the internet.

An aggregate Zabbix graph that shows Nginx requests per second overlaid with in/out bound network traffic data.

And this, dear reader, is what the same graph looks like after LC was laid siege to. Massive difference right? And not a fun one either. On average I was seeing 8 requests per second come into nginx across a one month period. It's not a lot, but once again, this is just a tiny server hosting a tiny part of the internet. I'm not trying to dump hyper scale resources into my personal blog, it just isn't necessary.

The same graph, only scary.

At its worst Zabbix shows that for a period I was getting hit with 20+ requests per second. Once again, not a lot of traffic, but it is 10x what my site usually gets, and that makes a big difference!

So why the spike in traffic? Why specifically from my Gitea instance? Why are there CPU and disc alerts mixed into all of this? It's not like 20+ requests a second is a lot for nginx to handle by any means. To understand that, we need to dig into the logs on the server.

Looking under the hood

But before I could even start to do that I needed a way to keep the server online long enough to actually review the logs. This is where out of band logging like a syslog or Loki server would be extremely helpful. But instead I had the joy of simply turning off all of my containers and disabling the nginx server for a little bit. After that I dug two great tools out of my toolkit to perform the analysis, lnav & goaccess.
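If I ever get around to fixing that, shipping the nginx logs over to the Loki instance on Vignere is probably the path of least resistance. A minimal promtail sketch, where the Loki URL and file paths are assumptions rather than my actual setup:

# promtail config sketch: tail the nginx logs and push them to Loki
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  # assumed Loki endpoint on the monitoring box
  - url: http://vignere:3100/loki/api/v1/push

scrape_configs:
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log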

lnav is this really great log analysis tool. It provides you with a little TUI that color codes your log files and lets you skim through them like any other pager. That in and of itself is cool, but it also provides an abstraction layer on top of common logging formats and lets you query the data inside the log with SQL. That, for me, is a killer feature. I'm certainly not scared to grep, sed, and awk my way through a complex log file, but SQL queries are way simpler to grasp.

Here's the default view, it's the equivalent of a select * from access_log.

The default rendering for an nginx access log, there's a ton of colors, it's log files made pretty!

Digging through this log ended up being incredibly easy and immediately informative. I won't bore anyone with random data, but these are the various queries I ran against my access.log to try and understand what was happening.

# How many different visitors are there total?
select count(distinct(c_ip)) from access_log;

# Okay that's a big number, what do these IPs look like, is there a pattern?
select distinct(c_ip) from access_log;

# Are these addresses coming from somewhere specific (ie: has LC been posted to Reddit/Hackernews and hugged to death?)
select distinct(cs_referer) from access_log;

# Are these IPs identified by a specific agent?
select distinct(cs_user_agent) from access_log;

# There's a lot of agents and IPs, which agents are associated with which addresses?
select c_ip, cs_user_agent from access_log;

After a quick review of the log it was obvious that the traffic wasn't originating from the same referrer, ie: no hug of death. Would've been neat though right? Instead there were entire blocks of IP addresses hitting www.lambdacreate.com and krei.lambdacreate.com and scraping every single URL. Some of these IPs were kind enough to use actual agent names like Amazonbot, OpenAI, Applebot, and Facebook, but there were plenty of obviously spoofed user agents in the mix. Since this influx of traffic was denying my own access to the services I host (specifically my Gitea instance) I figured the easiest and most effective solution was just to slam the door in everyone's face. Sorry, this is MY corner of the internet, if you can't play nice you aren't welcome.

So anyways, I just started banning.

Roboti ite infernum

Nginx is frankly an excellent web server. I've managed lots of Apache in my time, but Nginx is just slick. Really, my preference probably just comes down to the fact that Openresty + Lapis brings you this wonderful blog; I'm positive that what I'm about to describe is entirely doable in Apache as well. Right, anyways, the easiest way to immediately change the situation is to outright reject anyone who reports their user agent and is causing any sort of disruption.

My hamfisted solution to that is to just build up a list of all of the offending agents. Sort of like this, only way longer.

map $http_user_agent $badagent {
    default 0;
    ~*AdsBot-Google 1;
    ~*Amazonbot 1;
    ~*Amazonbot/0.1 1;
}

Then in the primary nginx configuration, source the user agent list and additionally set up a rate limit. Layering the defenses here allows me to outright block what I know is a problem, and slow down anything that I haven't accounted for while I make adjustments.

# Filter bots to return a 403 instead of content.
include /etc/nginx/snippets/useragent.rules;

# Define a rate limit of 5 requests per second, tracked per client IP in a 10MB zone
limit_req_zone $binary_remote_addr zone=krei:10m rate=5r/s;

Then in the virtual host configuration we configure both the rate limit and a 403 rejection statement.

limit_req zone=krei burst=20 nodelay;

if ($badagent) {
    return 403;
}
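A quick way to sanity check the whole thing after reloading nginx is to spoof one of the listed agents with curl and make sure the door gets slammed. Something like this, using the hostname from earlier:

# pretend to be Amazonbot and only fetch the response headers
curl -I -A "Amazonbot" https://www.lambdacreate.com/
# expect a 403 Forbidden back instead of the page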

It really is that hamfisted and easy. If you're on the list, 403. If you're not and you start to scrape, you get the door slammed in your face! But of course this only half helps; while issuing 403s prevents access to the content of the site, my server still needs to process each of those HTTP requests just to reject them. That's less resource intensive than processing something on the backend, but it's still enough that the server bogs down when it's getting tons of simultaneous scraping requests.

Now with 403 rejections in place we can start to prod the nginx access log with lnav. How about checking to see all of the unique IPs that our problems originate from?

select distinct(c_ip) from access_log where sc_status = 403;

126 distinct IPs displayed in lnav
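And if you're curious which agents are eating the most 403s, or want candidates to feed back into the block list, grouping by agent in that same lnav SQL prompt works nicely. A sketch:

# Which agents are racking up the 403s, and how many times each?
select cs_user_agent, count(*) as hits from access_log where sc_status = 403 group by cs_user_agent order by hits desc;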

Or better yet, we can use goaccess to analyze in detail all of our logs, historic and current, and see how many requests have hit the server and which endpoints they're targeting the most.

zcat -f access.log-*.gz | goaccess --log-format=COMBINED access.log - -o scrapers.html

The Goaccess dashboard displaying the broad total statistics. The Goaccess graphs displaying the total IP and agent type graphs.

Either of these is enough to indicate that there are hundreds of unique IPs, and to fetch lists of user agents to block. But to actually protect the server we need to go deeper: we need firewall rules and some kind of automation. What we need is Fail2Ban.

Since we're 403 rejecting traffic based off of known bad agents, our fail2ban rule can be wicked simple. And because I just don't care anymore we're handing out 24 hour bans for anyone breaking the rules. That means adding this little snippet to our fail2ban configuration.

[nginx-forbidden]
enabled = true
port = http,https
logpath = /var/log/nginx/access.log
bantime = 86400

And then creating a custom filter with a regex that watches for requests being 403'd.

[INCLUDES]
before = nginx-error-common.conf

[Definition]
failregex = ^<HOST> .* "(GET|POST) [^"]+" 403
ignoreregex =
datepattern = {^LN-BEG}
journalmatch = _SYSTEMD_UNIT=nginx.service + _COMM=nginx
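Before turning the jail loose, it's worth dry-running the filter against the live log. Assuming the filter above is saved as /etc/fail2ban/filter.d/nginx-forbidden.conf (fail2ban pairs filters with jails by name by default), fail2ban-regex will report how many lines it would have matched without banning anything:

# test the failregex against the real access log, no bans issued
fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-forbidden.conf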

And the result? Boom! A massive ban list! 735 bans at the time of writing this. Freaking ridiculous.

~|>> fail2ban-client status nginx-forbidden
Status for the jail: nginx-forbidden
|- Filter
|  |- Currently failed: 13
|  |- Total failed: 57135
|  `- File list: /var/log/nginx/access.log
`- Actions
   |- Currently banned: 38
   |- Total banned: 735
   `- Banned IP list: 85.208.96.210 66.249.64.70 136.243.220.209 85.208.96.207 185.191.171.18 85.208.96.204 185.191.171.15 85.208.96.205 85.208.96.201 185.191.171.8 85.208.96.200 185.191.171.4 185.191.171.11 185.191.171.1 85.208.96.202 185.191.171.5 185.191.171.6 85.208.96.209 185.191.171.10 85.208.96.203 85.208.96.195 85.208.96.206 185.191.171.16 185.191.171.7 85.208.96.208 185.191.171.17 185.191.171.2 85.208.96.199 85.208.96.212 185.191.171.13 66.249.64.71 66.249.64.72 185.191.171.3 85.208.96.197 85.208.96.193 85.208.96.196 185.191.171.12 85.208.96.194

So what?

The end result is that you're able to enjoy this blog post, and have had access to all the other great lambdacreate things for several months now. Because it is incredibly difficult to write blog posts when you have to fend off the robotic horde. None of this was even scraping my blog; it was all targeted at generating tarballs of every single commit of every single publicly listed git repo in my Gitea instance. Disgusting and unnecessary. But I'm leaving the rule set in place; take a quick glance at the resource charts from Zabbix and you'll readily understand why.

The change displayed as a Zabbix graph, everything is looking way better now.

Long term, I'll probably want to figure out a way to extend this list, or make exceptions for legitimate services such as archive.org. And I don't want the content here to be delisted from search engines necessarily, but at the same time this isn't here to fuel the AI enshittification of the internet either. So allez vous faire foutre, scrapers.
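As for those exceptions, when I get to them they can probably live right in the existing $badagent map. Nginx checks a map's regex entries in the order they're written, so an explicit 0 for the good crawlers before the block entries should do it. The agent strings below are assumptions, so verify them against the real crawlers before trusting this sketch:

map $http_user_agent $badagent {
    default 0;
    # allow-listed crawlers first, so they never fall through to a block rule
    ~*archive\.org_bot 0;
    # ...the usual wall of offenders after that
    ~*AdsBot-Google 1;
    ~*Amazonbot 1;
}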
