For this blog, I mean. Which used to have a little search window in the corner of the banner just above that looked up blog fragments using Google. No longer; I wired in Pagefind; click on the magnifying glass up there on the right to try it out. Go ahead, do it now, I’ll wait before getting into the details.
The problem · Well, I mean, Google is definitely Part Of The Problem in advertising, surveillance, and Internet search. But the problem I’m talking about is that it just couldn’t find my pages, even when I knew they were here and knew words that should find them.
Either it dropped the entries from the index or dropped a bunch of search terms. Don’t know and don’t care now. ongoing is my outboard memory and I search it all the freaking time. This failure mode was making me crazy.
Pagefind · Tl;dr: I downloaded it and installed it and it Just Worked out of the box. I’d describe the look and feel but that’d be a waste of time since you just tried it out. It’s fast enough and doesn’t seem to miss anything and has a decent user interface.
How it works · They advertise “fully static search library”, which I assumed meant it’s designed to work against sites like this one composed of static files. And it is, but there’s more to it than that; read on.
First, you point a Node program at the root of your static-files tree and stand back. My tree has a bit over 5,000 files containing about 2½ million words, adding up to a bit over 20M of text. By default, it assumes you’re indexing HTML and includes all the text inside each page’s <body> element.
You have to provide a glob argument to match the files you want to index; in most cases, something like root/**/*.html would do the trick. Working this out was the hardest part for me because, among other things, my articles don’t end with .html; maybe it’ll be helpful for some to note that what worked for ongoing was:
When/???x/????/??/??/[a-zA-Z0-9][\-a-zA-Z0-9]*
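If you want to try this yourself, the invocation looks something like the sketch below. The paths are hypothetical stand-ins; --site and --glob are the Pagefind CLI’s flags. The glob mirrors ongoing’s URL layout, where articles have no .html suffix:

```shell
# Sketch, with hypothetical paths: run the Pagefind indexer over a static tree.
GLOB='When/???x/????/??/??/[a-zA-Z0-9]*'
# npx pagefind --site /srv/ongoing --glob "$GLOB"   # the real indexing run

# Sanity-check that the glob matches an ongoing-style article path:
mkdir -p /tmp/glob-demo/When/200x/2003/02/14
touch /tmp/glob-demo/When/200x/2003/02/14/Sample-Entry
cd /tmp/glob-demo
N=$(set -- $GLOB; echo $#)   # count files the glob matches
echo "$N"   # → 1
```

The sanity check is a cheap way to debug a glob before burning a minute of indexing CPU on it.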
This produced an index organized into about 11K files adding up to about 78M. It includes a directory with one file per HTML page being searched.
I’d assumed I’d have to wire this up to my Web server somehow, but no: It’s all done in the client by fetching little bits and pieces of the index using ordinary HTTP GETs. For example, I ran a search for the word “minimal”, which resulted in my browser fetching a total of seven files totaling about 140K. That’s what they mean by “static”; not just the data, but the index too.
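You can watch this happening with nothing more than curl. The hostname below is a placeholder and the hashed filenames vary per build, but as I read the output layout, the shape is roughly:

```shell
# Illustrative only; example.com is a placeholder, hashes vary per build.
# The client starts from a small entry file describing the index...
# curl -s https://example.com/pagefind/pagefind-entry.json
# ...then fetches just the index chunks and result fragments it needs:
#   https://example.com/pagefind/index/<hash>.pf_index
#   https://example.com/pagefind/fragment/<hash>.pf_fragment
```

No server-side search process anywhere; it’s GETs all the way down.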
Finally, I noticed a couple of WASM files, so I had to check out the source code and, sure enough, this is basically a Rust app. Again I’m impressed. I hope that slick modern Rust/WASM code isn’t offended by me rubbing it up against this blog’s messy old Perl/Ruby/raw-JS/XML tangle.
Scalable? · Interesting question. For the purposes of this blog, Pagefind is ideal. But, indexing my 2½ million words burned a solid minute of CPU on the feeble VPS that hosts ongoing. I wonder if the elapsed time is linear in the data size, but it wouldn’t surprise me if it were worse. Furthermore, the index occupies more storage than the underlying data, which might be a problem for some.
Also, what happens when I do a search while the indexing is in progress? Just to be sure, I think I’ll wire it up to build the index in a new directory and switch indices as atomically as possible.
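Here’s a sketch of the swap I have in mind, with hypothetical paths and the actual indexing step commented out; it assumes GNU mv for the -T flag. The trick is that renaming a symlink over another symlink is atomic, so a search in flight sees either the old index or the new one, never a half-built mix:

```shell
# Sketch of an atomic index swap. The web server would serve /pagefind
# via the "current" symlink.
set -e
SITE="${SITE:-/tmp/ongoing-demo}"
NEW="$SITE/index.$$"            # fresh directory for this build
mkdir -p "$NEW"
# npx pagefind --site "$SITE" --output-path "$NEW"   # the real indexing step
ln -sfn "$NEW" "$SITE/current.tmp"      # stage a symlink to the new build
mv -T "$SITE/current.tmp" "$SITE/current"   # atomic rename over the old link
```

Old index directories would still need occasional garbage collection, but that can happen lazily.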
Finally, I think that if you wanted to sustain a lot of searches per second, you’d really want to get behind a CDN, which would make all that static index fetching really fly.
Configuring · The default look-and-feel was mostly OK by me, but the changes I had to make did involve quality time with the inspector, figuring out the class and ID complexities and then iterating the CSS.
The one thing that in the rear-view seems unnecessary is that I had to add a data-pagefind-meta attribute to the element at the bottom of each page where the date appears, so the date shows up in the result list. There should be a way to do this without custom markup; John Siracusa filed a related bug.
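For the record, the markup involved looks roughly like this; the element and class are stand-ins for whatever wraps the date on your pages, while data-pagefind-meta is the real attribute:

```html
<!-- Capture this element's text as "date" metadata for the result list. -->
<p class="byline" data-pagefind-meta="date">2003-02-14</p>
```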
Deployment · There’s hardly any work. I’ll re-run the indexer every day with a crontab entry, and it looks like it should just take care of itself.
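The crontab entry is nothing fancy; something like the following, with the site path being a stand-in:

```
# Hypothetical: rebuild the index every day at 04:15, quietly.
15 4 * * * npx pagefind --site /srv/ongoing >/dev/null 2>&1
```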
To do? · Well, I could beautify the output some more, but I’m pretty happy with it after just a little work. I can customize the sort order, which by default seems to be descending order of how significant Pagefind thinks the match is. There’s a temptation to sort in reverse date order. Apparently I can also influence the significance algorithm. Anyhow, I’ll run with mostly-defaults for now.
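If I ever do go for reverse-date order, Pagefind’s data-pagefind-sort attribute looks like the way to do it; the attribute is real, the surrounding markup hypothetical:

```html
<!-- Expose the entry date as a sort key named "date". -->
<p data-pagefind-sort="date">2003-02-14</p>
```

As I read the docs, the JS search API then accepts a sort option keyed on that name, so results can come back newest-first.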
Search options · I notice that the software is pretty good at, and aggressive about, matching across verb forms and singular/plural and prefixes. Which I guess is what you want? You can apparently defeat that by enclosing a word in quotes if you want it matched exactly. Works for phrases too. I wonder what other goodies are in there; couldn’t find any docs on that subject.
Finally, there’s an excellent feature set I’ll never use; it’s smart about lots of languages. But alas, I write monolingually.
Shameful cleanup · Like I said, getting Pagefind installed and working was easy. Getting the CSS tuned up was a bit more effort. But I have to confess that I put hours and hours into hiding my dirty secrets.
You see, ongoing contains way more writing than you or Google can see. It’s set up so I can “semi-publish” pieces: there, but unlinked. There was a whole lot of this kind of stuff: photo albums from social events, pitches to employers about why they should hire various people including me, rants at employers — for example, about why Solaris should adopt the Linux userland (I was right) and why Android should include a Python SDK (I was right) — and pieces that employer PR groups convinced me to bury. One of my minor regrets about no longer being employed is I no longer get to exercise my mad PR-group-wrangling skillz.
But when your search software is just walking the file tree, it doesn’t know what’s “published” and what’s not. I ended up using my rusty shell muscles with xargs and sed and awk and even an ed(1) script. I think I got it all, but who knows; search hard enough and you might find something embarrassing. If you do, I’d sure appreciate an email.
Thanks! · To the folks who built this. Seems like a good thing.