
Unless you happen to be one of the few who are still running a website on a server in your physical possession, you likely do not understand the magnitude of the problem presented by web-scraping robots sucking up the entire contents of the Internet for AI training data. And this is in addition to the robots we have been dealing with for years. This is a really bad situation. Robots surpassed human traffic on the Internet for the first time last year, effectively doubling the cost of running the average website if blocking is not employed. I personally feel that if I did not block robots, they would be more than 80% of the traffic to this website. Most people running websites have either given up or moved them to the servers of web-hosting companies where the robots are someone else's problem and they can resign themselves to just paying higher web-hosting bills in order to keep their sanity. So far, I have chosen the insane approach of continuing to host from home.
As noted in previous cheapskatesguide articles, I have been blocking hundreds of millions of IP addresses, most of which are owned by commercial web-hosting companies that allow their customers to host the web-scraping robots that are harassing all of our websites. For several months I have been seeing indications that my Raspberry Pi 3 may have insufficient processing power to handle all of that blocking. The problem seems to be less that the Raspberry Pi is hitting its CPU's limitations and more that Nginx does not always allocate enough time to search through the entire IP address black list and white list when more than about 1.5 articles per second are being requested. This has become a nearly daily problem for my server when dozens or hundreds of Lemmy or Mastodon servers suddenly request a web page at nearly the same time. When that happens, some IP addresses that should be blocked occasionally get through and some IP addresses on my white list are occasionally blocked. I reached out for advice on this problem to the Nginx forum, but to little avail. The only useful help I received was a suggestion to put the larger blocks of IP addresses first in my black list, which I had already been doing.
I have been looking for a way of moving my robot blocking to a physically separate firewall in the hope that throwing more processing power at Nginx might cause it to do a better job. Three or four previous cheapskatesguide articles tell the sad tale of my failed attempts to build a firewall using the standard Linux iptables approach. I simply could not get iptables to forward ports to my web server! Believe me, I spent weeks trying. Recently, I began a new approach that, while perhaps not letting me run the low-level DDoS-mitigation code I had hoped to run, might allow me to block web-scraping robots more effectively.
My new approach, which I will detail in this article, was to create what is known as a Web Application Firewall (WAF). In my case, that is just an Nginx reverse proxy that also blocks webpage requests from designated IP addresses and user agents. My WAF uses Nginx to forward Internet traffic coming from my LAN's router on ports 80 and 443 to my Raspberry Pi 3 web server. As the traffic passes through my WAF (a Lenovo ThinkCentre m600 with a Pentium J3710 processor), web page requests from blocked IP addresses and blocked user agents are filtered out. My m600 only has about four times the processing capability of the Raspberry Pi 3, but I can scale that up to a faster machine any time I want. I am choosing to leave the web server software on the Raspberry Pi because that is where I designed the code for my social media site, Blue Dwarf, to run, and that is where I want it to continue to run.
In addition to its use as a firewall, an Nginx reverse proxy can serve in other roles. A reverse proxy can be used to provide more privacy to the owner of a website. In this scenario, the "A records" of the domain name registrar's DNS servers point to a rented server at some web-hosting company that hosts the WAF, and the WAF forwards the traffic it receives to the real server, at, say, the owner's home. Many people (okay, many nerds) do this to improve the privacy and security of their home Internet connections. Many people prefer to host multiple websites by configuring Nginx for multiple virtual hosts on a single powerful server, but for websites that receive more traffic or require more processing power, reverse proxies can be used to pass only the traffic destined for a particular website to a designated server. So, each website can be hosted on its own server, and all of those servers can be connected to the same Internet connection. A reverse proxy can also be used as a load balancer to spread the work of running a single website across multiple servers as the work load increases with the website's popularity.
The downside of interposing a WAF between a LAN's router and a web server is significant for an individual running his own website. First, this increases the complexity of the hosting hardware and of the operation of the website. Increased complexity is never a good thing, because it means more of an individual's time must be devoted to "web mastering". Next, a WAF draws additional power that the UPS must supply in the event of a power outage. The boot-up sequence when the power is restored is also more complicated. Last, this is another machine that can fail and thereby prevent the serving of a website. Yes, these are significant negatives, but I decided to try a WAF for a while anyway just to see if it would improve my ability to block robots.
Planning for the possibility of reversing this perhaps unwise expansion of my LAN, I wanted to make as few changes as possible to my web server. That way, if I decide the WAF just isn't worth the effort, I will only have to move the white list, black list, and TLS certificate files back to the web server and carry on as I have for the past seven years with all of my robot blocking being performed on the web server. I will explain my approach to minimally changing my web server below as I go through the whole process of configuring the WAF and connecting it to my LAN.
To summarize, here are the steps I followed to create my WAF:
- Install and configure Debian from a Netinstall ISO on the computer that will be the WAF.
- Install Nginx, Let's Encrypt, and any additional desired firewall software on the WAF, like maybe UFW and/or Fail2ban. As with multiple layers of security, multiple layers of blocking don't usually hurt.
- Configure Nginx on the WAF as a reverse proxy.
- Move the web server's Nginx blocking files (black and white lists) and any other firewall files (UFW, Fail2ban, etc.) to the WAF.
- Move TLS certificates from the web server to the WAF.
- Make any necessary configuration changes to Nginx on the web server.
- Forward ports 80 and 443 from the LAN's router to the WAF.
Below are the details. I will assume most readers are at least moderately familiar with Linux, so rather than giving every detail, I will reference other websites that contain some of the Linux commands.
Installing and Configuring Debian Netinstall on the WAF
I chose version 12.11.0 of Debian Netinstall because Netinstall is a minimal version of Debian without a GUI that I hope will provide reasonably good security. Yes, I am aware of Debian's security problems with its supply chain. Yes, they worry me, but I have no reason to believe any other Linux distributions have better security. The installation procedure is basically the same as most other Linux distributions, and you can find all the installation details and many more than you will need here.
After the operating system has been installed on your new WAF, begin the OS configuration process by creating a new account with a new user name (e.g. "joe") and adding it to the www-data and sudo groups to make working with Nginx and the WAF's OS easier.
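Something like this should do it (run as root the first time, since "joe" is not yet in the sudo group):

    # The Debian installer normally creates this account during installation
    adduser joe
    # Add joe to the groups that make administering Nginx and the OS easier
    usermod -aG www-data,sudo joe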
Actually, the creation of a non-sudoing user account is taken care of by the installation procedure for Debian Netinstall, but I have included it here for completeness. Now, reboot the WAF to make the change effective, and then SSH into it as user joe from your laptop--which is attached to an Ethernet port on the back of your router, as are your WAF and web server.
First, improve the security of your WAF's operating system. Unfortunately, Debian 12 has switched from the old and simple method of changing values of variables in some of its configuration files to the use of complicated-looking commands. Instead of changing the values of PasswordAuthentication and PermitRootLogin in the /etc/ssh/sshd_config file, you may now have to use confusing commands like:
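One form that matches Debian 12's drop-in configuration style is something like this (the file name 10-hardening.conf is just an example):

    # Debian 12's default sshd_config includes every file in /etc/ssh/sshd_config.d/,
    # so settings placed there override the defaults
    echo "PermitRootLogin no" | sudo tee /etc/ssh/sshd_config.d/10-hardening.conf
    echo "PasswordAuthentication yes" | sudo tee -a /etc/ssh/sshd_config.d/10-hardening.conf
    sudo systemctl restart ssh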
I wonder what idiots forced this change? For increased security, prevent root access with "PermitRootLogin no". If you prefer the use of passwords instead of keys, set PasswordAuthentication to "yes". Many articles on the Internet say to use keys instead of passwords, but a strong password is for all practical purposes just as effective as a key. In fact, a key is nothing but a strong password that you don't know.
Modify the /etc/sysctl.conf file for increased security in the same way as you probably already have on your web server. For example, you may want to add these statements and others (see https://www.cyberciti.biz/tips/linux-unix-bsd-nginx-webserver-security.html):
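The exact settings are up to you; these are a few common hardening examples, not a definitive list:

    # Ignore ICMP broadcasts and bogus error responses
    net.ipv4.icmp_echo_ignore_broadcasts = 1
    net.ipv4.icmp_ignore_bogus_error_responses = 1
    # Enable source-address verification and SYN cookies
    net.ipv4.conf.all.rp_filter = 1
    net.ipv4.tcp_syncookies = 1
    # Refuse source-routed packets and ICMP redirects, and log martians
    net.ipv4.conf.all.accept_source_route = 0
    net.ipv4.conf.all.accept_redirects = 0
    net.ipv4.conf.all.send_redirects = 0
    net.ipv4.conf.all.log_martians = 1

Apply the new settings without rebooting with "sudo sysctl -p".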
Next, set up networking as explained in the "Assign Static IP Address on Minimal Installed Debian 12" section of this web page. In order to forward the Internet traffic on ports 80 and 443 from your LAN's router to your WAF, the WAF must have a static IP address. For instructional purposes, I will assume the web server has a static IP address on the LAN of 192.168.1.50 and the WAF is being assigned a static IP address on the LAN of 192.168.1.51. So, the WAF's /etc/network/interfaces file should contain:
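Assuming your Ethernet interface is named enp1s0 (check yours with "ip link"), something like:

    auto enp1s0
    iface enp1s0 inet static
        address 192.168.1.51
        netmask 255.255.255.0
        gateway 192.168.1.1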
Here, 192.168.1.1 has been assumed to be the IP address of your LAN's router.
To finish the configuration of Debian Netinstall, use RAM for temporary files to reduce the wear on your SSD by adding the following statements to the /etc/fstab file. You may want to use your own values for the capacities of the temporary file systems.
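Something along these lines (the sizes are only examples, and if you put /var/log in RAM, remember that services like Nginx need their log subdirectories re-created at boot):

    tmpfs   /tmp       tmpfs   defaults,noatime,mode=1777,size=200M   0   0
    tmpfs   /var/tmp   tmpfs   defaults,noatime,mode=1777,size=50M    0   0
    tmpfs   /var/log   tmpfs   defaults,noatime,mode=0755,size=100M   0   0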
Reboot and test with "df -h". The output should include something like this:
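Mixed in with your other file systems, you should see lines similar to these (the sizes will match whatever you put in /etc/fstab):

    tmpfs           200M     0  200M   0% /tmp
    tmpfs            50M     0   50M   0% /var/tmp
    tmpfs           100M  1.1M   99M   2% /var/log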
Installing Nginx, Let's Encrypt, and Any Additional Firewall Software
As with layers of security on your web server, more layers of robot blocking can provide additional protection. The beauty of creating your own firewall is that you can run virtually whatever you like on it. Remember, however, that each piece of software running on your firewall adds load to its CPU, so you may want to be prudent in your selection of blocking software. All of that software must run not only when your firewall is idling but also when one of your blog articles is at the top of Hacker News and receiving thousands of page requests per hour.
Install Nginx and enable it to run automatically at boot:
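On Debian, that is simply:

    sudo apt update
    sudo apt install nginx
    sudo systemctl enable --now nginx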
I will discuss configuring Nginx to act as a reverse proxy in the next section of this article.
If you would like to greatly reduce the ability of hackers to effectively conduct brute-force attacks against your SSH server on your firewall or web server or against a log-in page on your website, or help protect against web-scraping robots, Fail2ban is an option that many people use. Fail2ban is complicated and hard to learn to use, but its power makes up for its learning curve. If you decide to use it, install and enable it with these Linux commands:
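On Debian:

    sudo apt install fail2ban
    sudo systemctl enable --now fail2ban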
You will also have to create a log file in the /var/log directory before Fail2ban will run:
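On a minimal Debian install without rsyslog, the default sshd jail has no /var/log/auth.log to read, so creating an empty one (an assumption about which file your jails expect) gets Fail2ban started:

    sudo touch /var/log/auth.log
    sudo systemctl restart fail2ban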
If you already have Fail2ban running on your web server, you will want to copy over the jail.conf or local.conf file (whichever file contains your configuration changes) and any filters you have added to the /etc/fail2ban/filter.d directory. Configuring Fail2ban is beyond the scope of this article, but you may want to read about it online. Unfortunately, I have not found one website that explains it comprehensively, so you may have to spend some time searching.
A basic Linux firewall that is easy to install and use is UFW (Uncomplicated Firewall). UFW allows you to block ports on your computer as well as block traffic to and from specific IP addresses on the Internet. If you choose to install and enable UFW, do so as follows. WARNING: enabling UFW without first opening port 22 blocks SSH access, so you need to open port 22 first with a UFW command like "sudo ufw allow ssh" or the command shown below that restricts SSH access to only your LAN.
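For example (the 192.168.1.0/24 network is assumed from the addresses used earlier in this article):

    sudo apt install ufw
    # Allow SSH only from machines on the LAN before enabling the firewall
    sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp
    sudo ufw enable
    sudo ufw status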
At this point, the response from the above UFW status command should look something like this:
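For the rules added above, that is roughly:

    Status: active

    To                         Action      From
    --                         ------      ----
    22/tcp                     ALLOW       192.168.1.0/24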
You should also block all ports on your firewall that you do not need open before exposing it to the Internet. Here is a list to start with, but if you have not already, you should develop your own over time:
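A hedged starting point, assuming you opened SSH from your LAN above; the deny ranges are illustrative, not a recommendation for every setup:

    # Open the ports the WAF actually serves
    sudo ufw allow 80/tcp
    sudo ufw allow 443/tcp
    # Then explicitly deny everything else (first match wins, so the allows above
    # and the earlier SSH rule still apply)
    sudo ufw deny 1:21/tcp
    sudo ufw deny 23:79/tcp
    sudo ufw deny 81:442/tcp
    sudo ufw deny 444:65535/tcp
    sudo ufw deny 1:65535/udp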
Note that the first occurrence of a port in the above list takes precedence, so the deny rules lower in the list do not block the ports you have already opened higher in the list. UFW's ICMP handling lives in /etc/ufw/before.rules; if you drop echo requests there, script kiddies will not be able to find your server with a simple ping of your IP address. A basic guide for using UFW can be found here.
In addition to installing UFW, you may also want to drop invalid incoming packets, which will help protect your web server from some "light" DDoS attacks. I use the word "light" to distinguish these attacks from DDoS attacks that are powerful enough to saturate your entire download bandwidth with web page requests. If these rules are not already present in the firewall's /etc/ufw/before.rules file, add them:
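These are the sort of lines I mean (a sketch; they go in the *filter section, above the COMMIT line):

    # Drop packets that do not belong to any known connection
    -A ufw-before-input -m conntrack --ctstate INVALID -j DROP
    # Drop new TCP connections that do not start with a SYN packet
    -A ufw-before-input -p tcp ! --syn -m conntrack --ctstate NEW -j DROP

Reload UFW with "sudo ufw reload" after editing the file.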
Nginx can also be tuned to help resist DDoS attacks, but that is outside the scope of this article.
Something I will mention without going into details is that if you are passing requests between your firewall and your web server with the HTTP scheme (and why wouldn't you?), you will have to put your TLS certificates on the firewall. This means, for example, if your website uses Let's Encrypt certificates, you will need to FIRST install the Let's Encrypt Certbot on your WAF with a command like:
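On Debian, something like:

    sudo apt install certbot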
And THEN tar everything in your web server's /etc/letsencrypt directory and untar it into the same directory on your firewall. This will preserve all of your file permissions. You may also have to re-create on your firewall the .well-known and acme-challenge subdirectories for each of your websites that uses Let's Encrypt TLS certificates. I will discuss some issues associated with that near the end of this article.
If you plan to run PHP code on your firewall, you will have to install PHP:
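Something like this (following the 7.4 example used below; note that stock Debian 12 actually packages PHP 8.2, so use whatever version is available to you):

    sudo apt install php7.4-fpm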
Replace "7.4" in the above command with whatever version of PHP you choose to use. After installing PHP, make the same changes to your PHP configuration files that you have probably already made on your web server, including if necessary those required to run PHP inside HTML files.
Configuring Nginx on the WAF as a Reverse Proxy
This is the most important part of this article, and probably the part you have been waiting for. Configuring Nginx as a reverse proxy should have been an easy process, but for me it turned out to be a nightmarish weeks-long ordeal that made me question my basic intelligence. This was all thanks to a lack of good, easy-to-understand documentation on the Internet. At least, if good documentation exists, I could not find it. I will make this as simple as I can with enough explanation for you to hopefully implement it much faster than I did.
The "server" part of the nginx.conf file for a reverse proxy that accepts incoming traffic from the Internet on ports 80 and 443 and sends it to the web server looks something like this:
The above is just the basic Nginx "server" section of nginx.conf for a reverse proxy. To avoid confusion, I will not discuss hosting additional pages and subdirectories on the firewall that are not on the web server. Of course, you will have one "server" section in the firewall's nginx.conf file for each website on your web server. You may also choose to add statements to your nginx.conf file to prevent cross-site scripting and image hotlinking and to handle errors with custom error pages. Note that having a content-security policy on both your WAF and your web server will produce duplicate headers, but you may want to keep both anyway. And of course, you will include your white list and black list files that block web-scraping robots. Aside from the "server" sections, I can't think of a reason the "http" section should be any different on your firewall than on your web server.
If you run PHP code on your firewall, you will have to enable PHP in Nginx on the firewall. The statements for enabling PHP are slightly different for each Linux distribution. These statements worked for me for enabling PHP with version 12.11.0 of Debian Netinstall:
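A sketch, assuming Debian's packaged PHP-FPM listening on its default socket; substitute your PHP version for 7.4:

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/run/php/php7.4-fpm.sock;
    }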
Be sure to substitute the version of PHP you are using in place of 7.4 above.
Another thing that is required for the above "proxy_pass http://example1.com/;" statement to work is that the firewall must know the IP address of example1.com. One easy way of giving it that information is to put the local IP addresses of the websites on your web server into the firewall's /etc/hosts file. For example:
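Assuming the web server at 192.168.1.50 hosts the example sites used in this article:

    192.168.1.50    example1.com www.example1.com
    192.168.1.50    example2.com www.example2.com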
One mistake I saw repeatedly in Internet articles was this statement in the server section of the reverse proxy's nginx.conf file:
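It is some variation of a proxy_pass built from Nginx's $host variable, along these lines:

    proxy_pass http://$host;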
No! First, $host will be the firewall's host name. You don't want that. You want the firewall to be merely passing HTTP requests and responses between the Internet visitor's browser and your web server. That is what these statements do:
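A sketch, using the example1.com backend from earlier; the proxy_set_header lines pass the visitor's real address and protocol along to the web server:

    proxy_pass http://example1.com/;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;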
If you don't use them, your firewall will not work! And many of the articles on the Internet about creating a reverse proxy don't have them!
That single "proxy_pass http://example1.com/;" statement takes care of everything having to do with HTTP and HTTPS pages. You don't need another proxy_pass statement to pass HTTPS pages to your web server. You redirect all requests that arrive on ports 80 and 443 at the firewall to the web server as HTTP requests, and that one statement makes sure responses arrive at the Internet user's browser as either HTTP or HTTPS pages. This was highly confusing to me because I could find no article that explained it as clearly as I just did.
You are not likely to get your WAF's Nginx configuration right on the first attempt, or for that matter on the 20th attempt, so some debugging skills will probably be required. Looking at the Nginx log files on both the WAF and the web server will help you see what is being passed between the two. You may also need additional help to know exactly what data is being passed. The Linux tcpdump command can help with that, but it is hard to use and not very well documented on the Internet, unless perhaps you have a degree in computer science that allows you to decipher the hieroglyphics found in the manual pages of your computer or on the Internet. I don't have a CS degree, so examples help me far more than the hieroglyphics. Below are some tcpdump commands that I found to be useful when I was debugging the WAF.
I initially saw a 403 error on the web server when it redirected my request from bluedwarf.top/index.html to bluedwarf.top/cackle/index.php. Debugging that problem from my web server with the tcpdump command:
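    # A hedged reconstruction: print, in ASCII, the HTTP traffic exchanged with
    # the WAF (assumed at 192.168.1.51) on the web server's eth0 interface;
    # install tcpdump first with "sudo apt install tcpdump" if it is missing
    sudo tcpdump -i eth0 -A -s 0 host 192.168.1.51 and tcp port 80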
worked wonderfully.
You may also want to try these commands from your web server (substituting in your web server's network interface for eth0, your firewall's IP address for 192.168.1.51, and your website's relevant subdirectories):
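    # Hedged examples, not exact recipes.
    # Watch the raw packet flow between the firewall and this web server:
    sudo tcpdump -i eth0 -nn host 192.168.1.51 and tcp port 80

    # Print the requests in ASCII and pull out the request lines and Host headers:
    sudo tcpdump -i eth0 -A -s 0 -l host 192.168.1.51 and tcp port 80 | grep -E "GET|POST|Host:"

    # Follow a single test request from my laptop: the laptop's LAN address shows
    # up in the X-Forwarded-For header added by the WAF, so grep the payload for it:
    sudo tcpdump -i eth0 -A -s 0 -l host 192.168.1.51 and tcp port 80 | grep 192.168.1.83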
My laptop's LAN address was 192.168.1.83 in the last command above. The above example commands and their outputs take some time to understand. Their full explanations are outside the scope of this article, but reading about tcpdump on the Internet may help, eventually.
Testing that your firewall can see a particular file on your web server is also a helpful part of the debugging process:
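    # Run this from the firewall (install curl with "sudo apt install curl" if
    # necessary); it asks the web server, via the /etc/hosts entry, for one
    # specific file and shows only the response headers
    curl -I http://example1.com/filename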
Substitute the name of your file for "filename" and the domain name of your website for "example1.com". Of course, the above command must be executed from your firewall.
Move Nginx Blocking Files, Etc. from Your Web Server to Your WAF
Transfer all the Nginx configuration files (other than nginx.conf) from your web server to your WAF. Also transfer from the web server to the WAF any configuration files for UFW, Fail2ban, and PHP (e.g. php.ini and www.conf) that you have modified.
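One hedged way to do that, run from the web server (the file names and the WAF's address are placeholders from this article's examples):

    scp /etc/nginx/whitelist.conf /etc/nginx/blacklist.conf joe@192.168.1.51:/tmp/

Then move the files into /etc/nginx on the WAF with sudo.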
I moved my Nginx white list and black list files, UFW files, and the files associated with my special whitelisting procedure (my custom 403 error page, get_access.html, and add2whitelist.php) to my WAF. The white and black lists must both be on the WAF. Putting the black list on the WAF while leaving the white list on the web server would just mean that the white list would be ignored, because, like iptables and UFW, Nginx acts on the first statement that applies to a particular IP address or user agent and ignores all subsequent ones. For the same reason, the "include" statement for your white list in your Nginx configuration file on your WAF must come before the include statement for your black list file.
As mentioned earlier, if you are running Fail2ban, you will copy from your web server to your WAF your /etc/fail2ban/jail.conf or /etc/fail2ban/local.conf (whichever file contains your changes to your Fail2ban configuration) and any filters you have created in your /etc/fail2ban/filter.d directory.
Move TLS Certificates from Your Web Server to Your WAF
Transfer the web server's entire /etc/letsencrypt directory as a tar file and untar it into the WAF's /etc directory. My WAF passes web page requests to my web server as HTTP requests (without encryption), so my WAF must verify the authenticity of my websites to the browsers of everyone visiting them. This also offloads the TLS handshaking for visitors' connections from the web server to the WAF.
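A sketch of the transfer, assuming the user "joe" and the WAF at 192.168.1.51; doing the tar and untar as root preserves the ownership and permissions of the certificate files:

    # On the web server:
    sudo tar -czf /tmp/letsencrypt.tar.gz -C /etc letsencrypt
    scp /tmp/letsencrypt.tar.gz joe@192.168.1.51:/tmp/

    # On the WAF:
    sudo tar -xzf /tmp/letsencrypt.tar.gz -C /etc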
Unfortunately, thanks to the location in which the .well-known directory must be placed on the web server and the way I have configured Nginx on the WAF, renewing Let's Encrypt TLS certificates becomes complicated. When Certbot is run manually on the WAF, it will not see the keys it generates for verification if you place them in the .well-known/acme-challenge subdirectory on the WAF. The easiest way to deal with this seems to be to run Certbot on the WAF, and then write the string of characters generated by Certbot into the required file in the .well-known/acme-challenge subdirectory on the web server before hitting the "Enter" key to complete the certification process. I assume this will break automatic renewal of TLS certificates with Certbot, and that will be unacceptable when Let's Encrypt begins forcing TLS certificate renewal every six days. I have not yet found an easy solution for this that does not involve re-arranging website directories on the web server or resorting to self-signed TLS certificates. Certbot may have a command that allows for this problem, but I have not yet investigated that possibility.
Make Any Necessary Configuration Changes to Nginx on the Web Server
As mentioned, I specifically designed my firewall to require a minimal number of changes to my web server. As a result, the only changes I made were commenting out the include statements for the black list and white list in the server sections of my web server's Nginx configuration file and adding some statements for logging. You can also disable UFW and Fail2ban if you are running them on your web server. Here are the statements I added for logging to each server section:
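A sketch using Nginx's realip module; the WAF's address and the X-Real-IP header here are the ones assumed earlier in this article:

    # Trust the WAF to report each visitor's real address
    set_real_ip_from 192.168.1.51;
    real_ip_header  X-Real-IP;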
If you don't insert these, the IP address of the firewall will appear in your Nginx log files instead of the IP addresses of the visitors to your website. You don't want that. This is something else that the other articles failed to mention.
WARNING: Do not relax the security you have put in place in your web server's Nginx configuration file (aside from removing white list "allow" and black list "deny" statements and user agent filtering). This is critical. Your web server is still preventing hackers from modifying and running your PHP files to take over your website and from downloading your private data. So, you still want to prevent access to the same directories on your websites (via "deny all" statements), prevent scripts from running in other directories, prevent image hotlinking, keep your content security policies that prevent cross-site scripting, etc.
Forward Ports 80 and 443 from the Router to the WAF
The last step in configuring your LAN for your new firewall is forwarding ports. Ports 80 and 443 were originally forwarded from the LAN's router to the web server. Now, they must be forwarded to the WAF instead. Each router does that differently, but if you have been running a web server, you already know how to do that for your particular router.
Final Words
If adding a Web Application Firewall in front of your web server to provide additional protection for your website looks intimidating, it is because it is. At least, it was for me. This is a process that will likely require many hours to complete and some (at least in my mind) moderate network debugging skills. My advice is that if you have never done this before and you undertake it, be prepared for repeated disappointments as things you try just don't work. Keep trying, and hopefully you will eventually succeed.
My goals for this firewall were mostly to provide better robot blocking and perhaps some more powerful DDoS protection than my Raspberry Pi 3 web server is capable of delivering. I still have to do some testing before I will know if my new firewall actually provides either of those, but at least I now have the additional ability to run multiple physical web servers on my LAN. Exploring that should be fun, and fun is a very important component of running a home web server.
If you have found this article worthwhile, please share it on your favorite social media. You will find sharing links at the top of the page.
Related Articles:
Installing and Configuring Nginx on a Linux Home Web Server
How to have Your Own Website for $2 a Year
How to Serve Over 100K Web Pages a Day on a Slower Home Internet Connection
Running a Small Website without Commercial Software or Hosting Services: Lessons Learned
A Page Load Time Comparison of Raspberry Pi 3 & 4 Web Servers
Moving Up from a Raspberry Pi Web Server to a Low-Cost, Low-Power x86 Web Server
Predicted Performance of a Raspberry Pi 3 Web Server Running a Text-Only Social Media Network