11 Jul 2024
A few months ago, one of NetSoc’s committee members built an r/place clone just for NetSoc’s members, and we figured it would be nice to have a Discord bot to periodically take screenshots of the pixel art website and post updates to our server. When I volunteered to make it (because why not?), I never thought I’d spend about 5 hours wrestling with Docker while trying to deploy the app to our server, because the Dockerised app would only run on my computer.
Obviously, this goes against the very principle of Docker: as the homepage so boldly proclaims, Docker lets your apps “run anywhere”. And indeed, I quickly realised that Docker wasn’t the problem at all. I put a temporary fix in place and forgot about it until a couple of days ago, when I resolved (this is a pun you’ll get soon) to figure out what had happened and fix it properly.
Spoiler: it was a DNS misconfiguration from 4 years ago hiding behind some systemd caching magic that I still don’t fully understand.
6 months ago
Some additional context: the Discord bot uses Selenium to launch a headless browser which loads the website and takes the screenshot. On my laptop, this worked perfectly, and the bot subsequently sent the pictures to my test Discord channel. However, when I started the container on our server, nothing would happen. Running docker exec -it <container_id> bash to open an interactive shell into the container and read the log files, I came across this error message:
ERROR:root:Message: unknown error: net::ERR_NAME_NOT_RESOLVED

(Do you get the pun now?)
ERR_NAME_NOT_RESOLVED clearly meant there was something going wrong with DNS inside the container. I wanted to test the DNS manually, so I updated the Dockerfile to install dnsutils and ran a couple of tests with nslookup. nslookup worked for domains like google.com, but returned NXDOMAIN (non-existent domain) for pixel-art.netsoc.com. Huh? Meanwhile, outside the container, nslookup for pixel-art.netsoc.com worked perfectly. Double huh?
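For the curious, those tests looked roughly like this (the results are condensed into comments):

# inside the container, after adding dnsutils to the image
$ nslookup google.com               # resolves fine
$ nslookup pixel-art.netsoc.com     # ** server can't find pixel-art.netsoc.com: NXDOMAIN

# outside the container, on the server itself
$ nslookup pixel-art.netsoc.com     # resolves fine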
After a bit of Googling and StackOverflow-ing, I checked out /etc/resolv.conf inside the container to find out what nameservers it was using. The first nameserver listed was a local IP address, 192.168.122.100, and the other two were IP addresses for nameservers on the UCD network. Querying this local nameserver directly from outside the container quickly proved that it was indeed the problem. For some reason, it wasn’t resolving *.netsoc.com domains correctly.
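Concretely, the check looked something like this (UCD’s nameserver addresses are placeholders, and the output is trimmed):

$ cat /etc/resolv.conf                            # inside the container
nameserver 192.168.122.100
nameserver <UCD nameserver 1>
nameserver <UCD nameserver 2>

$ nslookup pixel-art.netsoc.com 192.168.122.100   # query the local nameserver directly from the host
** server can't find pixel-art.netsoc.com: NXDOMAIN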
A Quick Fix
I tried to edit the resolv.conf file manually from within the container, but apparently this is a special file that Docker manages by bind-mounting a file from the host machine over it. Essentially, it’s uneditable from inside the container.
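You can actually see the bind mount from inside the container (the device name and mount options will differ from machine to machine):

$ mount | grep resolv.conf
/dev/sda1 on /etc/resolv.conf type ext4 (rw,relatime)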
Thankfully, a bit more Googling revealed the --dns flag that lets you specify a nameserver for the container when you run it.
$ sudo docker run -d --dns 8.8.8.8 pixel-art-bot

After telling the container to query Google’s Public DNS at 8.8.8.8, the bot worked! I have to admit, I felt a bit disappointed because, after staying up till 4 AM troubleshooting, an extra 13 characters in the run command felt like an underwhelming solution. Regardless, it worked and I left it at that. Until…
Present day
Out of the blue, I suddenly remembered this issue a few days ago. Seeing as I’m currently on summer break, I decided there wasn’t a better time to finish what I started and figure out exactly what was going on with the DNS on our server.
/etc/resolv.conf on the server contained only one nameserver entry: 127.0.0.53. This is a loopback address that systemd-resolved, systemd’s name resolution daemon, listens on for DNS queries. To find out which nameservers systemd-resolved actually uses upstream, I ran systemd-resolve --status. Unsurprisingly, I got the same list I’d found inside the container at the beginning of the year: one local IP and two of UCD’s nameservers.
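Roughly what that looked like, with the status output heavily trimmed and UCD’s nameservers replaced with placeholders:

$ cat /etc/resolv.conf              # on the host
nameserver 127.0.0.53

$ systemd-resolve --status          # 'resolvectl status' on newer systems
         DNS Servers: 192.168.122.100
                      <UCD nameserver 1>
                      <UCD nameserver 2>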
Just like last time, the local nameserver simply would not resolve any *.netsoc.com domain correctly. However, systemd-resolved worked like a charm.
I spent a good while trying to figure out exactly what was going on at that address. I’m completely new to networking, so I missed a number of obvious signs and wandered down a couple of different paths. One of them was running host 192.168.122.100 to find its hostname, which gave me ipa.netsoc.com. Navigating to this site in my browser led me to a CentOS Identity Management login page. Asking around, I learned that a previous team of sysadmins had tried to set up FreeIPA for some reason or other.
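That reverse lookup, for reference:

$ host 192.168.122.100
100.122.168.192.in-addr.arpa domain name pointer ipa.netsoc.com.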
Then, I ran arp -a to list the addresses in the server’s ARP cache, i.e. the machines on its local networks it had recently talked to. 192.168.122.100 was one of them, as expected, but here’s the interesting part: the interface listed for it was virbr0, the default bridge libvirt creates for its virtual machines, which suggested it belonged to a VM running on the server. At long last, I was able to piece everything together: the mystery nameserver was the CentOS VM my predecessors had started setting up FreeIPA on.
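The tell-tale entry looked roughly like this (MAC address redacted):

$ arp -a
ipa.netsoc.com (192.168.122.100) at <mac address> [ether] on virbr0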
Now that I knew the mystery nameserver was basically useless, I could safely remove it from the system. Any custom nameservers that systemd-resolved actually uses under the hood are configured in /etc/systemd/resolved.conf. I simply deleted this file and restarted systemd-resolved to reset the list of nameservers back to the default (i.e. UCD’s nameservers from DHCP).
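The cleanup itself boiled down to something like this:

$ sudo rm /etc/systemd/resolved.conf          # drop the old nameserver overrides
$ sudo systemctl restart systemd-resolved     # fall back to the DHCP-provided defaults
$ systemd-resolve --status                    # confirm 192.168.122.100 is gone from the list

I re-ran the Discord bot, this time without the --dns flag, and it worked! I’d successfully found out what had gone wrong with the DNS. Except…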
One last loose thread
Why did DNS resolution fail inside the container and work on the host machine if the container used the same configuration file as the server? The answer comes down to systemd-resolved.
systemd-resolved is active on the server, but not inside Docker containers. As a result, if 127.0.0.53 is the only nameserver in /etc/resolv.conf, Docker automatically bind-mounts a different file instead: /run/systemd/resolve/resolv.conf. This contains the actual list of nameservers that systemd-resolved queries.
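A quick way to see this for yourself (the ubuntu image here is just an example, and UCD’s nameservers are placeholders):

$ cat /etc/resolv.conf                               # on the host: only the stub resolver
nameserver 127.0.0.53

$ cat /run/systemd/resolve/resolv.conf               # the real upstream list systemd-resolved uses
nameserver 192.168.122.100
nameserver <UCD nameserver 1>
nameserver <UCD nameserver 2>

$ sudo docker run --rm ubuntu cat /etc/resolv.conf   # what a container is actually given
nameserver 192.168.122.100
nameserver <UCD nameserver 1>
nameserver <UCD nameserver 2>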
Any programs inside the container would query these nameservers directly for DNS resolution. Since the first nameserver on the list was 192.168.122.100, they would receive an NXDOMAIN response and proceed to throw errors like ERR_NAME_NOT_RESOLVED. They wouldn’t even try the other two nameservers, because NXDOMAIN is a valid response.
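You can see the difference by querying the nameservers individually (again, UCD’s addresses are placeholders):

$ nslookup pixel-art.netsoc.com 192.168.122.100     # first on the list: NXDOMAIN, so resolution stops here
$ nslookup pixel-art.netsoc.com <UCD nameserver 1>  # would have resolved fine, but never gets asked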
However, outside the container, programs would go through systemd-resolved. This is the part I still don’t fully understand, because systemd-resolved somehow managed to resolve all *.netsoc.com subdomains without ever actually hitting a nameserver. I’m not entirely sure how, because some of these subdomains weren’t even in its cache before I looked them up. However, because it never queried the faulty nameserver, it never received any NX_DOMAIN responses, and nothing appeared to be amiss.
Unfortunately, I was unable to replicate this behaviour after clearing the systemd-resolved cache. Afterwards, listing 192.168.122.100 as a nameserver for systemd-resolved led to NXDOMAIN every time for *.netsoc.com subdomains. I spent a bit of time trying to find out why, but to no avail. Maybe I’ll pick it up when I’m bored again in a couple of months’ time ;)
Although this entire experience was horribly frustrating at times, I still had fun systematically hunting down the misconfiguration, and it was pretty satisfying when I finally found out what was happening (mostly).