Big brain search developer spend years computing “perfect metric”. Get no real search relevance work done. Grug just want to know if “relevance change do what I expect”. Let big brain A/B test tell me if good.
Ok I’m not sure I can keep up the Grug language.
NDCG is overrated
I’ve written in the past that NDCG is overrated. NDCG is the metric we ostensibly use to measure search quality offline.
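For the unfamiliar, here's a minimal sketch of one common (linear-gain) formulation of NDCG, computed over a judgment list. The grades and the example query are invented for illustration.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: graded relevance, discounted by log of rank position."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades, k=10):
    """NDCG@k: DCG of the ranking we produced, normalized by the best possible DCG."""
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical 0-3 grades for one query's top results, in the order our engine ranked them
print(ndcg([3, 2, 0, 1, 0]))  # ~0.99, close to the ideal ordering
```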
That is, trying to put a distinct quality label on a search result is hard, for many reasons:
- human labelers and LLMs are often not representative of your users.
- human labelers get tired and make mistakes
- internal, influential HIPPOs may have opinions not correlated with users
- clickstreams are hard to interpret, and contain all kinds of biases that dictate why something was ‘clicked’ (the UI, the tendency for users to click high up on the page regardless, etc)
- long tail queries get few interactions, leaving us with interaction data only on the most common queries
People tie themselves in knots eliminating all these errors. And what happens? To eliminate errors, we add complexity. Complexity adds harder-to-understand errors.
Sure, getting Steve from accounting to label some search results will be error-prone. But you know exactly how Steve will screw up. You know Steve well.
But modeling out how users click search results to tease out “relevance” can be a tricky task, for all the reasons listed above.
Not to mention, the fundamental assumption here may be flawed. Labeling a single search result as relevant/not, then improving that ranking, is only one piece of search quality. Search quality means so much more: diverse search results, clear query understanding, speed, reflecting the intent back to the user, good perceived relevance, etc …
Given all these issues, which path do you go down?
Grug-brained NDCG
IMO, most people would do fine with the following algorithm:
- Identify a population of queries you want to fix
- Gather judgments for those queries from internal users with a tool (i.e. Quepid or something)
- Tune those queries so NDCG gets better (change ranking, improve query understanding, ??)
- Regression test old queries from previous iterations of this loop, making sure those didn’t break (a sketch of this check follows the list)
- GOTO 1
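Here's a rough sketch of what steps 3 and 4 could look like in practice, assuming a hypothetical judgments.csv labeled by the team and a search_fn client that returns ranked doc ids for a query; none of these names come from a real tool.

```python
import csv
import math
from collections import defaultdict

def ndcg_at_k(grades, k=10):
    """Same NDCG idea as the earlier sketch, inlined so this snippet stands alone."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def load_judgments(path="judgments.csv"):
    """Hypothetical judgment file with query,doc_id,grade columns, labeled by the team."""
    judgments = defaultdict(dict)
    with open(path) as f:
        for row in csv.DictReader(f):
            judgments[row["query"]][row["doc_id"]] = int(row["grade"])
    return judgments

def regression_check(search_fn, judgments, baseline, tolerance=0.01):
    """Flag previously tuned queries whose NDCG dropped more than `tolerance`
    below the score recorded when they were last fixed."""
    regressions = []
    for query, grades_by_doc in judgments.items():
        ranked_ids = search_fn(query)  # your search client: query -> ranked doc ids
        grades = [grades_by_doc.get(doc_id, 0) for doc_id in ranked_ids]
        score = ndcg_at_k(grades)
        if score < baseline.get(query, 0.0) - tolerance:
            regressions.append((query, baseline[query], score))
    return regressions
```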
This doesn’t measure “quality”. It measures “are we fixing the things we intend, without breaking other things?”
That is, the team is defining a (maybe wrong) goal and measuring its progress towards that goal. That’s it. That’s the tweet.
For example, if you intend your change to improve searches for products by name, then you’d ask the team to label some search results for those queries, including what the “right answer” ought to be. You then try some fixes, making sure you only solved this problem (steps 3/4), and ship if it works.
We know “the team” has error. But we can probably define that error. It boils down to the team needing to continually learn what users want from search. Which they should do anyways.
How do we measure quality then? We ship our changes to an actual A/B test. Or usability study. These are explicit studies of quality. Actual “quality” means whether we sell more products, increase DAU, or help solve business problems. This goes far beyond just a relevance change - and indeed it may be that relevance itself isn’t what holds back our search, but other search quality issues.
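To make “explicit study of quality” concrete: an A/B test ultimately boils down to comparing a business metric between control and treatment. A minimal sketch, with invented traffic numbers, might be a two-proportion z-test on something like add-to-cart rate.

```python
import math

def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """z-score for the difference in conversion rate between control (A) and treatment (B)."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se

# Invented numbers: did the relevance change move add-to-cart rate?
z = two_proportion_z(conversions_a=1_800, visitors_a=50_000,
                     conversions_b=1_950, visitors_b=50_000)
print(f"z = {z:.2f}")  # |z| > 1.96 is significant at the usual 95% level
```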
A great deal of complexity comes from strongly coupling two worlds: (a) do we solve the problem we intend to solve, and (b) does solving that problem matter to users? Tightly coupling the two with complex modeling is hard, and requires big brains. Do it last.
When to go big-brained
Of course, cases still exist when we want to go big brained. If we want to train a model, our judgment list needs to point to “true north” of quality. That’s 90% of the work, and it takes considerable care. We don’t spend our cycles on modeling or training, but on getting trustworthy training data. This is what has mattered when I’ve built ranking models – Every. single. time.
Here you want to pull out all the stops. Get big-brained data scientists to learn about topics like click models. Get a good Learning to Rank book (like, naturally, AI Powered Search chapters 10-12) and really buff up your relevance capabilities.
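As a taste of what the big-brained path involves, here's a toy illustration of the idea behind position-bias correction in click models (the examination hypothesis): divide observed click-through rate by the probability that the position was examined at all. Real click models (DBN, UBM, etc.) estimate these quantities properly from logs; the numbers below are made up.

```python
# Toy examination hypothesis: P(click) = P(examined at rank) * P(doc is attractive),
# so a debiased relevance signal is observed CTR divided by examination probability.
# These examination probabilities are invented; real click models learn them from logs.
EXAMINE_PROB = {1: 0.95, 2: 0.60, 3: 0.40, 4: 0.25, 5: 0.15}

def debiased_attractiveness(clicks, impressions, rank):
    ctr = clicks / impressions
    return ctr / EXAMINE_PROB.get(rank, 0.1)

# A doc at rank 4 with a 10% CTR looks more "relevant" than one at rank 1 with 20% CTR
print(debiased_attractiveness(clicks=100, impressions=1_000, rank=4))  # 0.40
print(debiased_attractiveness(clicks=200, impressions=1_000, rank=1))  # ~0.21
```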
But I’d argue many teams aren’t here yet. They’re just trying to make their Elasticsearch or whatever a bit better, and they’d do better to be grug-brained data scientists than waste their limited cycles on big-brained dreams that may take away from actual work tuning relevance.
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.

