*if you include word2vec.
Chris and I spent a couple hours the other day creating a search engine for my blog from “scratch”. Mostly he walked me through it because I only vaguely knew what word2vec was before this experiment.
The search engine we made is built on word embeddings. This refers to some function that takes a word and maps it onto N-dimensional space (in this case, N=300) where each dimension vaguely corresponds to some axis of meaning. Word2vec from Scratch is a nice blog post that shows how to train your own mini word2vec and explains the internals.
The idea behind the search engine is to embed each of my posts into this domain by adding up the embeddings for the words in the post. For a given search, we’ll embed the search the same way. Then we can rank all posts by their cosine similarities to the query.
The equation below might look scary but it’s saying that the cosine similarity, which is the cosine of the angle between the two vectors cos(theta), is defined as the dot product divided by the product of the magnitudes of each vector. We’ll walk through it all in detail.
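In symbols, with A as the query embedding and B as a post embedding (both n=300-dimensional vectors):

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$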
Cosine similarity is probably the simplest method for comparing a query embedding to document embeddings in order to rank documents. Another intuitive choice might be Euclidean distance, which would measure how far apart two vectors are in space (rather than the angle between them).
We prefer cosine similarity because it preserves our intuition that two vectors have similar meanings if they have the same proportion of each embedding dimension. If you have two vectors that point in the same direction, but one is very long and one very short, these should be considered to have the same meaning. (If two documents are about cats, but one says the word cat much more, they’re still just both about cats).
Let’s open up word2vec and embed our first words.
Embedding
We take for granted this database of the top 10,000 most popular word embeddings, which is a 12MB pickle file that vaguely looks like this:
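(The words and numbers below are made up; the real file has 10,000 entries, each a 300-element float32 array.)

```python
{
    "the": array([ 0.08, -0.21,  0.01, ...,  0.13], dtype=float32),
    "of":  array([-0.05,  0.17, -0.30, ...,  0.02], dtype=float32),
    # ... 9,998 more entries ...
}
```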
Chris sent it to me over the internet. If you unpickle it, it’s actually a NumPy data structure: a dictionary mapping strings to numpy.float32 arrays. I wrote a script to transform this pickle file into plain old Python floats and lists because I wanted to do this all by hand.
The loading code is straightforward: use the pickle library. The usual security caveats apply, but I trust Chris.
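Something like this, assuming the pickle is saved as word2vec.pkl (a name I made up):

```python
import pickle

# Unpickling can run arbitrary code, hence the security caveats: only load
# pickles from people you trust.
with open("word2vec.pkl", "rb") as f:
    raw = pickle.load(f)

# Convert the numpy.float32 arrays into plain Python lists of floats so the
# rest can be done by hand.
word2vec = {word: [float(x) for x in vec] for word, vec in raw.items()}
```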
You can print out word2vec if you like, but it’s going to be a lot of output. I learned that the hard way. Maybe print word2vec["cat"] instead. That will print out the embedding.
To embed a word, we need only look it up in the enormous dictionary. A nonsense or uncommon word might not be in there, though, so we return None in that case instead of raising an error.
To embed multiple words, we embed each one individually and then add up the embeddings pairwise. If a given word is not embeddable, ignore it. It’s only a problem if we can’t understand any of the words.
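Here is a sketch of both steps; the function names are my guesses at the spirit of the real search.py, and the error message mirrors the one that shows up later in the evaluation:

```python
def embed_word(word2vec, word):
    # A dictionary lookup; None (rather than an exception) for unknown words.
    return word2vec.get(word)

def vec_add(a, b):
    # Pairwise sum of two equal-length vectors.
    return [x + y for x, y in zip(a, b)]

def embed_words(word2vec, words):
    # Sum the embeddings of every word we recognize; ignore the rest.
    result = [0.0] * 300
    num_known = 0
    for word in words:
        embedding = embed_word(word2vec, word)
        if embedding is not None:
            result = vec_add(result, embedding)
            num_known += 1
    if num_known == 0:
        raise ValueError(f"I can't understand any of {words}")
    return result
```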
That’s the basics of embedding: it’s a dictionary lookup and vector adds.
Now let’s make our “search engine index”, or the embeddings for all of my posts.
Embedding all of the posts
Embedding all of the posts is a recursive directory traversal where we build up a dictionary mapping path name to embedding.
We do this other thing, though: normalize_text. This is because blog posts are messy and contain punctuation, capital letters, and all other kinds of nonsense. In order to get the best match, we want to put things like “CoMpIlEr” and “compiler” in the same bucket.
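A sketch of both pieces; the Markdown-only filter and the helper names are assumptions, and embed_words comes from the earlier sketch:

```python
import os
import string

def normalize_text(text):
    # Lowercase and strip punctuation so "CoMpIlEr," and "compiler" match.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    return normalize_text(text).split()

def embed_posts(word2vec, root):
    # Recursive directory traversal: map each post's path to its embedding.
    post_embeddings = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".md"):  # assumption: posts are Markdown
                continue
            path = os.path.join(dirpath, filename)
            with open(path) as f:
                post_embeddings[path] = embed_words(word2vec, tokenize(f.read()))
    return post_embeddings
```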
We’ll do the same thing for each query, too. Speaking of, we should test this out. Let’s make a little search REPL.
A little search REPL
We’ll start off by using Python’s built-in REPL-building library, code. We can make a subclass that defines a runsource method. All it really needs to do is process the source input and return a falsy value (otherwise the REPL waits for more input).
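A sketch of such a subclass; it relies on a search function defined just below, and the globals and output format are mine:

```python
import code

class SearchRepl(code.InteractiveConsole):
    def runsource(self, source, filename="<input>", symbol="single"):
        # Treat the input line as a search query rather than Python source.
        for path, score in search(word2vec, post_embeddings, source):
            print(f"{score:.3f}  {path}")
        # A falsy return value tells the REPL the "statement" is complete.
        return False
```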
Then we can define a search function that pulls together our existing functions. Just like that, we have a search:
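(A sketch: it reuses tokenize and embed_words from earlier and leans on cosine_similarity, which comes next.)

```python
def search(word2vec, post_embeddings, query, num_results=5):
    query_embedding = embed_words(word2vec, tokenize(query))
    # Score every post by the angle between its embedding and the query's,
    # then return the best few.
    scores = [
        (path, cosine_similarity(query_embedding, post_embedding))
        for path, post_embedding in post_embeddings.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:num_results]
```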
Yes, we have to do a cosine similarity. Thankfully, the Wikipedia math snippet translates almost 1:1 to Python code:
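(A no-NumPy version; the function name is my choice.)

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(x * x for x in b))
    return dot / (magnitude_a * magnitude_b)
```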
Finally, we can create and run the REPL.
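For instance (the banner and exit message are made up):

```python
if __name__ == "__main__":
    SearchRepl().interact(banner="Search my blog!", exitmsg="Bye!")
```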
This is what interacting with it looks like:
This is a sample query from a very small dataset (my blog). It’s a pretty good search result, but it’s probably not representative of the overall search quality. Chris says that I should cherry-pick “because everyone in AI does”.
Okay, that’s really neat. But most people who want to look for something on my website do not run for their terminals. Though my site is expressly designed to work well in terminal browsers such as Lynx, most people are already in a graphical web browser. So let’s make a search front-end.
A little web search
So far we’ve been running from my local machine, where I don’t mind having a 12MB file of weights sitting around. Now that we’re moving to the web, I would rather not burden casual browsers with an unexpectedly large download. So we need to get clever.
Fortunately, Chris and I had both seen this really cool blog post that talks about hosting a SQLite database on GitHub Pages. The blog post details how the author:
- compiled SQLite to Wasm so it could run on the client,
- built a virtual filesystem so it could read database files from the web,
- did some smart page fetching using the existing SQLite indexes,
- built additional software to fetch only small chunks of the database using HTTP Range requests.
That’s super cool, but again: SQLite, though small, is comparatively big for this project. We want to build things from scratch. Fortunately, we can emulate the main ideas.
We can give the word2vec dict a stable order and split it into two files. One file can just have the embeddings, no names. Another file, the index, can map every word to the byte start and byte length of the weights for that word (we figure start&length is probably smaller on the wire than start&end).
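Here is one way that split might look, sketched in Python; the blob file name and the choice of raw little-endian float32s are my assumptions, not necessarily what the site actually serves:

```python
import json
import struct

def write_index_and_blob(word2vec, blob_path="word2vec.bin", index_path="index.json"):
    index = {}
    offset = 0
    with open(blob_path, "wb") as blob:
        # Sort the words to give the dict a stable order.
        for word in sorted(word2vec):
            # Pack the embedding as raw little-endian float32s, no name.
            data = struct.pack(f"<{len(word2vec[word])}f", *word2vec[word])
            blob.write(data)
            # The index maps each word to [byte start, byte length] in the blob.
            index[word] = [offset, len(data)]
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)
```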
The cool thing about this is that index.json is dramatically smaller than the word2vec blob, weighing in at 244KB. Since that won’t change very often (how often does word2vec change?), I don’t feel so bad about users eagerly downloading the entire index. Similarly, the post_embeddings.json is only 388KB. They’re even cacheable. And automagically (de)compressed by the server and browser (to 84KB and 140KB, respectively). Both would be smaller if we chose a binary format, but we’re punting on that for the purposes of this post.
Then we can make HTTP Range requests to the server and only download the parts of the weights that we need. It’s even possible to bundle all of the ranges into one request (it’s called multipart range). Unfortunately, GitHub Pages does not appear to support multipart, so instead we download each word’s range in a separate request.
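The browser does this with fetch; here is the same idea sketched in Python with urllib, mostly to show the Range header (the blob name is the placeholder from the previous sketch):

```python
import struct
import urllib.request

def fetch_embedding(base_url, index, word):
    start, length = index[word]
    # Ask the server for only this word's bytes out of the embeddings blob.
    request = urllib.request.Request(
        f"{base_url}/word2vec.bin",
        headers={"Range": f"bytes={start}-{start + length - 1}"},
    )
    with urllib.request.urlopen(request) as response:  # expect 206 Partial Content
        data = response.read()
    # Unpack the little-endian float32s back into a plain list of floats.
    return list(struct.unpack(f"<{length // 4}f", data))
```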
Here’s the pertinent JS code, with (short, very familiar) vector functions omitted:
You can take a look at the live search page. In particular, open up the network requests tab of your browser’s console. Marvel as it only downloads a couple 4KB chunks of embeddings.
So how well does our search technology work? Let’s try to build an objective-ish evaluation.
Evaluation
We’ll design a metric that roughly tells us when our search engine is better or worse than a naive approach without word embeddings.
We start by collecting an evaluation dataset of (document, query) pairs. Right from the start we’re going to bias this evaluation by collecting this dataset ourselves, but hopefully it’ll still help us get an intuition about the quality of the search. A query in this case is just a few search terms that we think should retrieve a document successfully.
Now that we’ve collected our dataset, let’s implement a top-k accuracy metric. This metric measures the percentage of the time a document appears in the top k search results given its corresponding query.
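A sketch, assuming the dataset is a list of (path, query) pairs and reusing search from earlier:

```python
def top_k_accuracy(word2vec, post_embeddings, dataset, k):
    # Fraction of (document, query) pairs where the document appears in the
    # top k results for its query.
    hits = 0
    for path, query in dataset:
        results = search(word2vec, post_embeddings, query, num_results=k)
        if any(result_path == path for result_path, _score in results):
            hits += 1
    return hits / len(dataset)
```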
Let’s start by evaluating our search engine. Plotting top-k accuracy for various values of k gives us the following chart. Note that we get higher accuracy as we increase k: in the limit, as k approaches our number of documents, we approach 100% accuracy.
You also might notice that the accuracy increases as we increase the number of keywords. We can also see the lines getting closer together as the number of keywords increases, which indicates diminishing marginal returns for each new keyword.

Do these megabytes of word embeddings actually do anything to improve our search? We would have to compare to a baseline. Maybe that baseline is adding up the counts of all keywords in each document to rank them. We leave this as an exercise to the reader because we ran out of time :)
It would also be interesting to see how a bigger word2vec helps accuracy. While running the top-k evaluation, there is a lot of error output (I can't understand any of ['prank', ...]). These unknown words get dropped from the search. A bigger word2vec (more than 10,000 words) might contain these less-common words and therefore search better.
Wrapping up
You can build a small search engine from “scratch” with only a hundred or so lines of code. See the full search.py, which includes some of the extras for evaluation and plotting.
Future ideas
We can get fancier than simple cosine similarity. Let’s imagine that all of our documents talk about computers, but only one of them talks about compilers (wouldn’t that be sad). If one of our search terms is “computer” that doesn’t really help narrow down the search and is noise in our embeddings. To reduce noise we can employ a technique called TF-IDF (term frequency inverse document frequency) where we factor out common words across documents and pay closer attention to words that are more unique to each document.
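A sketch of what that might look like here (untested, and just one possible weighting): compute an IDF weight per word across all posts, then scale each word’s embedding by its weight before summing.

```python
import math

def compute_idf(tokenized_posts):
    # tokenized_posts: one list of words per document.
    num_docs = len(tokenized_posts)
    doc_freq = {}
    for words in tokenized_posts:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # A word that appears in every document gets a weight of log(1) = 0.
    return {word: math.log(num_docs / df) for word, df in doc_freq.items()}

def embed_words_idf(word2vec, idf, words):
    # Weighted sum: words that are rare across documents dominate the embedding.
    result = [0.0] * 300
    for word in words:
        embedding = embed_word(word2vec, word)
        if embedding is not None:
            weight = idf.get(word, 1.0)
            result = [acc + weight * x for acc, x in zip(result, embedding)]
    return result
```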