Large language models and their adjacent tools are evolving fast, and they're a powerful way to improve content classification, which is great when you're building a contextual ad network that doesn't track people. However, LLM prompts and responses can be inconsistent and unpredictable. We've taken a more reliable approach: by using (more) deterministic embeddings and grouping related embeddings into centroids, we sharpened our targeting and boosted performance with less guesswork, all without relying on any user-specific data.
Historical context and scaling topic classification
First, a little bit of background. A few years back, we built our first topic classifier, which essentially bundled content and keywords together into topics that advertisers could target and buy, similar to what they do for search ads. For example, this allowed advertisers to target DevOps-related content with relevant ads. This approach scaled well up to about 10-15 topics and gave advertisers an easily understandable way to get good contextual targeting for their campaigns.
Last year, we built a more advanced way to target content using language model embeddings, a strategy we called niche targeting (see our blog post for more developer details). It works by comparing embedding vectors to find pages semantically similar to an advertiser's landing or product page. The results were strong, about 25% better performance on average, but scale was a challenge: there simply weren't enough very closely related pages to build large campaigns. And while the results were great, explaining embeddings to marketers proved difficult, making the approach harder to sell despite its effectiveness.
Hybrid approach with embedding centroids
After generating embeddings for nearly a million pages across our network, clear clusters of related content began to emerge. Looking at the graph above, you can see that URLs for PyLint are close to, and even somewhat overlapping with, those for ESLint. URLs for Flask and Django, two Python web frameworks, form a pretty tight cluster. And web scraping tools like BeautifulSoup and Scrapy are also very close, showing that the embeddings really capture semantic meaning.
One of the powerful things about embeddings is that you can apply ordinary math to them, like averaging a group of vectors. A centroid is exactly that: the mean of a set of related embeddings. For example, averaging paragraph embeddings can yield a strong representation of an entire page, and averaging page embeddings often produces a good vector for a domain. In the graph above, the small circles represent individual web pages while the larger circles represent these domain centroids, the semantic center of the pages for each domain.
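As a quick illustration (not our production code), computing a centroid is just averaging vectors; the toy three-dimensional vectors below stand in for real embeddings with hundreds of dimensions:

```python
import numpy as np


def centroid(embeddings):
    """Return the mean vector (centroid) of a set of embedding vectors."""
    return np.asarray(embeddings, dtype=np.float32).mean(axis=0)


# Toy 3-dimensional vectors standing in for real page embeddings.
page_embeddings = [
    [0.12, -0.08, 0.33],
    [0.10, -0.02, 0.29],
    [0.15, -0.05, 0.31],
]

# Averaging the page embeddings for a domain yields that domain's centroid.
domain_centroid = centroid(page_embeddings)
```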
New content that's semantically related to domains we've already classified will land near that content in embedding space (the chart reduces this to 2D, but imagine hundreds of dimensions). This lets us easily answer questions like "what domains are most closely related to a new domain we're classifying?" or "how 'Flask-ey' is this new content?". It also scales to any number of clusters of closely related content.
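A minimal in-memory sketch of answering those questions, assuming a hypothetical `domain_centroids` dict that maps domain names to their centroid vectors:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_domains(page_embedding, domain_centroids, top_n=5):
    """Rank known domain centroids by similarity to a new page's embedding."""
    scored = [
        (domain, cosine_similarity(page_embedding, centroid))
        for domain, centroid in domain_centroids.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```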
Like our earlier topic classifier, this allows us to target ads to content advertisers care about. But unlike the old model, this approach only requires a few URLs or domains with the kinds of content an advertiser wants to target. It's also far easier to explain this type of classification to advertisers: "we target your ads to the most similar domains across our network based on your landing pages or other URLs you provide" is easy to understand, and it works pretty well.
Show me the code!
To make this concrete, here's how to generate a domain centroid from URL embeddings with pgvector and Django:
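This is a minimal sketch rather than our exact models: `PageEmbedding`, `DomainCentroid`, and the embedding dimension are placeholders. Because the pgvector Postgres extension can average vector columns, Django's `Avg` aggregate does the heavy lifting in the database:

```python
from django.db import models
from django.db.models import Avg
from pgvector.django import VectorField


class PageEmbedding(models.Model):
    """One embedding per page URL (hypothetical model)."""

    url = models.URLField(unique=True)
    domain = models.CharField(max_length=255, db_index=True)
    embedding = VectorField(dimensions=384)  # dimension is a placeholder


class DomainCentroid(models.Model):
    """The averaged embedding of all pages on a domain (hypothetical model)."""

    domain = models.CharField(max_length=255, unique=True)
    centroid = VectorField(dimensions=384)


def update_domain_centroid(domain):
    """Average a domain's page embeddings into a single centroid vector."""
    aggregate = PageEmbedding.objects.filter(domain=domain).aggregate(
        centroid=Avg("embedding")  # pgvector supports AVG() over vector columns
    )
    obj, _created = DomainCentroid.objects.update_or_create(
        domain=domain,
        defaults={"centroid": aggregate["centroid"]},
    )
    return obj
```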
When classifying new pages, it's easy to see how similar the new content's embedding is to domains where we already have an embedding. From there, domains can be classified and grouped into topics that advertisers are interested in, although we've found that taking centroids of multiple domains frequently doesn't make sense. Two web frameworks in different languages may sit somewhat far apart in embedding space, even though content very close to either of them is usually a good semantic match.
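A sketch of that lookup using pgvector's `CosineDistance`, reusing the hypothetical models above:

```python
from pgvector.django import CosineDistance


def closest_domains_for_page(page_embedding, limit=10):
    """Find the domains whose centroids are nearest to a new page's embedding."""
    return (
        DomainCentroid.objects
        .annotate(distance=CosineDistance("centroid", page_embedding))
        .order_by("distance")[:limit]
    )
```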
This approach offers the best of both worlds. It has the semantic depth of embeddings, far beyond what simple keywords can capture, combined with clarity and explainability closer to keyword-style targeting. It scales to anything an advertiser might want to target, since new pages and domains simply get embeddings and are automatically matched to the right clusters of domains and topics.
Conclusion
From the moment we started using embeddings for contextual ad targeting, we recognized they had great potential for improving performance for advertisers. Better ad performance means we can generate more money for the sites that host our ads, which is a great virtuous cycle.
With this approach using centroids, we hope to have another virtuous cycle where our classifications improve over time as we classify more content.
Simon Willison's blog post on embeddings, as well as some of his other posts and presentations, has been very influential in honing our approach. Thanks, Simon!