I write this blog. Does anyone read it? How could I tell?
In the old days of the web, your web server recorded a log entry when a page was requested, and various tools analyzed those logs to tell you about your visitors. Today those logs are mostly useless for looking at human traffic, because the majority of requests come from bots, especially now that the AI companies are running their own web crawls.
Some bots, like Googlebot, label themselves with the User-Agent header or a published IP range, but there are many other bots, including ones that identify themselves as a browser. (My decades-out-of-date recollection is that Google also had a mechanism to fetch pages in a way that didn't appear to be a bot.)
Instead, today's web access logging uses JavaScript: a script on the page gathers information about the visitor and POSTs it to some logging endpoint. This is how, for example, Google Analytics works. Some random site claims it is used on more than half of all websites, which means Google Analytics gives Google yet another hook into where ~everyone on the internet is browsing.
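The pattern is simple enough to sketch. Here's a minimal, hypothetical version; the field names and the /log endpoint are assumptions for illustration, not what Google Analytics actually sends:

```javascript
// Build a small JSON payload describing the visit.
// (Hypothetical fields; real telemetry gathers far more.)
function buildVisitPayload(page, referrer) {
  return JSON.stringify({ page, referrer, at: Date.now() });
}

// In the browser, a script on the page would gather the values and POST them:
// fetch('/log', {
//   method: 'POST',
//   body: buildVisitPayload(location.pathname, document.referrer),
// });
```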
"Telemetry" script is yucky, what could you do otherwise? Here's a trick that doesn't require JavaScript, but also doesn't work.
Embed an invisible image in every page:
```html
<img src=/log width=1 height=1>
```

Bots that only fetch HTML and traverse links won't hit this logging endpoint. The Referer header sent with each hit to the /log path tells you which page the `<img>` tag was on.
Unfortunately, bots these days are interested in images too.
What if you do use JavaScript? It turns out the fancy bots run JS too. What is something a bot won't do?
One idea I had is that a bot is unlikely to linger on any given page; it has other places to go. I tried a script that used setTimeout to record the page load as a visit only if the browser hung around for three seconds.
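The dwell-time trick can be sketched in a few lines; the /log endpoint here is an assumption:

```javascript
const DWELL_MS = 3000;

// Schedule the visit report to fire only after the visitor has
// lingered. A bot that grabs the page and moves on never reaches
// the callback; returns the timer id so a caller could cancel it.
function scheduleVisitBeacon(send, dwellMs = DWELL_MS) {
  return setTimeout(send, dwellMs);
}

// In the browser:
// scheduleVisitBeacon(() => navigator.sendBeacon('/log', location.pathname));
```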
It appears to work better than the other things I've tried, but within a day I spotted the Baidu bot fetching my homepage and then three seconds later fetching the logging endpoint. Is it possible they're actually running the page script and waiting? Maybe I need a longer timer?
This blog is also published as a feed. Feed readers fetch its content and resyndicate it within their own UI. I haven't tried, but I doubt they'd run my script.
Some feed readers, when they fetch the feed, report how many subscribers they are acting on behalf of. Is that a count of human readers? I don't think so. When I used a feed reader in the past, I had subscriptions I sometimes didn't read.
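These readers report the count in the User-Agent header of the feed fetch. The exact format varies by service, but the general shape looks something like this (the fetcher name here is made up):

```
User-Agent: ExampleFeedFetcher/1.0 (+https://example.com/fetcher; 23 subscribers)
```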
On the one extreme, increasingly savvy bots will get ever closer to appearing like human traffic in logs. On the other, humans read via feed readers or without JavaScript and aren't logged anyway. Heaven forbid someone prints my posts and reads them on paper; they're impossible to track!
This problem is an instance of a bigger pattern you might encounter in engineering: sometimes when you get down to implementing a measure, you find an endless maze of increasingly confusing corner cases. What if someone loads the page, but they only distractedly skim it? That's not really a reader, is it? What if someone loads the page and finds what they were looking for immediately, before the logging beacon runs?
What these kinds of problems can indicate is that you need to take a step back and reconsider what your real objective is. What is my objective? I think I write this blog for two reasons.
- Taking the time to serialize my thoughts, chasing down all the holes my inner critic spots, is a way for me to consolidate and archive my knowledge. For that purpose I don't need anyone to read it but me.
- I write for an imagined audience of another me, someone with my interests and skill level who didn't yet know the thing I learned. Sometimes I write the post that is exactly the thing someone else needed, and they end up reaching out. For this purpose I need an email address, not an access log.