In my previous article, I hypothesized that reasoning agents work best with simple search tools. In agentic search, we don’t build thick, complex search systems. Instead, I argued, we should build simple, easy-to-understand, transparent ones like grep or basic keyword search. Agents iterate and learn: they sift through results, learn about the corpus and how retrieval works, and retry with what they’ve learned.
Well, now I’ve taken the time to measure what an agent can get away with on two datasets of N=100 queries (all code here).
Below, I’ll walk through exactly what I’m comparing, how it works, and give you the code breadcrumbs in case you ever want to set up your own agentic search loop.
What are the variants?
BM25 Baseline
The BM25 baseline uses SearchArray to naively sum BM25 scores over the snowball-tokenized product name and description. I weight the title and description scores based on a random search I ran in the past (title gets 2x the weight of the description).
Boring :). But it’s more or less what we also give the agent.
For each field:
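A minimal sketch, with a toy `products` DataFrame and a simplified `snowball_tokenizer` standing in for the real code in the repo:

```python
import pandas as pd
from searcharray import SearchArray

def snowball_tokenizer(text: str) -> list[str]:
    # Stand-in: lowercase + whitespace split (the real one also applies Snowball stemming)
    return text.lower().split()

# Toy stand-in for the product catalog
products = pd.DataFrame({
    "product_name": ["red leather couch", "oak coffee table"],
    "product_description": ["A comfy red leather couch for the living room",
                            "A solid oak coffee table"],
})

# Index each text field into its own SearchArray column
products["name_snowball"] = SearchArray.index(
    products["product_name"], tokenizer=snowball_tokenizer)
products["description_snowball"] = SearchArray.index(
    products["product_description"], tokenizer=snowball_tokenizer)
```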
Then roughly, we search as follows:
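Something like this (the real code in the repo differs in the details), summing per-token BM25 scores with the title weighted 2x:

```python
import numpy as np

def search_products(query: str, top_k: int = 10) -> pd.DataFrame:
    """Naive BM25: sum per-token scores across fields, title weighted 2x."""
    scores = np.zeros(len(products))
    for token in snowball_tokenizer(query):
        scores += 2.0 * products["name_snowball"].array.score(token)
        scores += products["description_snowball"].array.score(token)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return products.iloc[top_idx]
```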
Agent setup
The agentic loop sits at the core of the experiment. Below I outline the loop visually, followed by a code outline. If you’ve never written your own agentic loop to do a task, it’s very educational. If you have the means, I highly recommend it.
You can see below that we send in a prompt and specifications for the tools the agent can use. We also share the structure we expect back (which, as I’ll discuss, might influence the reasoning):

The agent will decide it needs to call tools that are only available on the client. To do so, it responds with a request for you to call a tool and package the tool’s output back to the agent:

Eventually you’ll find no more tool call requests, so you wrap up by returning the structured results:

Or as summarized in the code below. We loop, calling the LLM, until no more tool calls are needed, at which point we return a structured output.
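Here’s a minimal sketch of that loop using the OpenAI Responses API. The `run_tool` helper is a placeholder for the dispatch into local Python functions (the pydantic plumbing mentioned below), and the tool and output definitions are covered in the next sections:

```python
import json
from openai import OpenAI

client = OpenAI()

def run_agent(prompt: str, tools: list[dict], text_format):
    """Loop: call the LLM, run any tools it requests, feed the results back,
    and stop once it replies with the final structured output."""
    input_items = [{"role": "user", "content": prompt}]
    while True:
        response = client.responses.parse(
            model="gpt-5",
            input=input_items,
            tools=tools,
            text_format=text_format,   # e.g. the SearchResults model below
        )
        tool_calls = [item for item in response.output
                      if item.type == "function_call"]
        if not tool_calls:
            return response.output_parsed   # no more tool calls -- we're done

        input_items += response.output      # keep the model's turn in context
        for call in tool_calls:
            result = run_tool(call.name, json.loads(call.arguments))  # placeholder dispatcher
            input_items.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": json.dumps(result),
            })
```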
I’m omitting a lot of pedantic pydantic plumbing to talk to Python functions, package up arguments and return values. But please explore the nitty-gritty code here.
With the outline out of the way, let’s explore the three main ingredients we plug into the agentic loop.
The prompt (+ few-shot examples)
IANAPE (I Am Not A Prompt Engineer). So I’m expecting John Berryman to improve this prompt. But here’s the prompt I pass in (this one for ESCI):
I give a sampling of 10 relevant query–product pairs, irrelevant ones, and ones in between, i.e.:
Finally, we tell the agent about a search tool that looks like the BM25 baseline above. What can be hard to grasp: the agent doesn’t get a Python function. It gets told there exists a function called search_esci that takes parameters keywords and top_k. The agent gets the docstring as a description and is told the return type it should expect. (You can learn more from OpenAI’s tool spec.)
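For illustration, the spec we hand the OpenAI Responses API for search_esci looks roughly like this (descriptions paraphrased; the real one lives in the repo):

```python
search_esci_tool = {
    "type": "function",
    "name": "search_esci",
    "description": "BM25 keyword search over ESCI product names and descriptions. "
                   "Returns the top_k best-matching products.",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {"type": "string",
                         "description": "Keywords to search for."},
            "top_k": {"type": "integer",
                      "description": "How many products to return."},
        },
        "required": ["keywords", "top_k"],
        "additionalProperties": False,
    },
    "strict": True,
}
```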
The response we get back (structured output)
Next, the structure of the search results OpenAI should return (via structured outputs). The requested outputs become part of the context, which can affect how well the search works. Specifically, if I force GPT-5 to tell me why something is relevant, I believe (perhaps wrongly) it’s forced to consider relevance more carefully. Perhaps that wisdom from ye olde manual chain-of-thought days no longer holds.
The SearchResults model, passed into text_format above, is:
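Roughly this shape — the field names beyond the ones discussed here are simplified, and the full model is in the repo:

```python
from pydantic import BaseModel, Field

class ProductResult(BaseModel):
    product_id: str
    relevance_explanation: str = Field(
        description="Why this product does (or doesn't) satisfy the query")

class SearchResults(BaseModel):
    query_intent: str = Field(
        description="What the user is actually trying to accomplish")
    results: list[ProductResult] = Field(
        description="Final ranked products, best first")
```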
I won’t go through each attribute; you can see the code here. But above you can see a bit of what I mean. I ask the agent to explain the intent behind the user’s query. Does this help the agent reason about how best to satisfy the user’s request by forcing it to think about what the user wants? Or is it a waste of tokens that just leads to context rot?
Further experiments might show how much shrinking we could do :)
Future enhancements
I feel like this is just scratching the surface. Some additional enhancements I want to explore:
More structured filters in search tool
Originally, when I focused just on the WANDS (Wayfair) data, I had a tool that also allowed category filtering of search results before ranking. This seemed to help somewhat, but I haven’t gotten a chance to systematically investigate the benefit. As part of my course, I’ve also enriched the WANDS data with a lot of additional structured attributes that might benefit from filtering (color, room, etc.).
Semantic cache of training-data query evals
ESCI has a large population of queries split between train and test. What if we simulated a case where we have reliable eval data on one set of queries (i.e. the training set) but no evals on another query set (the test queries)? This mirrors a common situation in production: we might have one set of queries with extensive clickstream or other evaluation data, and another set we’re blind on.
If we index the known eval queries into a vector index, we can add a tool that, when an unseen query arises, looks up similar queries.
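A sketch of what such a tool could look like, with OpenAI embeddings and brute-force cosine similarity standing in for a real vector index (the query list here is just a placeholder):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Placeholder: queries we already have reliable evals (clickstream, judgments) for
evaluated_queries = ["leather couch", "espresso machine", "kids desk chair"]
evaluated_vectors = embed(evaluated_queries)

def similar_eval_queries(query: str, k: int = 3) -> list[str]:
    """Tool: return the k evaluated queries most similar to an unseen query."""
    query_vector = embed([query])[0]
    similarity = evaluated_vectors @ query_vector
    return [evaluated_queries[i] for i in np.argsort(similarity)[::-1][:k]]
```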
Tool memory
I mentioned in my original post the idea of storing the LLM’s evaluation of specific tool calls, i.e. the search for “leather couch” worked well / didn’t work well to find relevant results for the user’s “suede couch” query. I’d like to see whether adding tool memory helps (or doesn’t help) the agent take shortcuts, and whether subsequent iterations of the same query improve over time as the agent builds deeper memory of the tools it used in the past for that query.
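One hypothetical shape this memory could take — a per-query log of tool calls and the LLM’s own verdict on each, surfaced on later attempts (names here are illustrative, not from my code):

```python
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    tool_name: str     # e.g. "search_esci"
    arguments: dict    # e.g. {"keywords": "leather couch", "top_k": 10}
    verdict: str       # the LLM's assessment, e.g. "mostly leather, few true suede matches"

@dataclass
class ToolMemory:
    by_query: dict[str, list[ToolCallRecord]] = field(default_factory=dict)

    def remember(self, user_query: str, record: ToolCallRecord) -> None:
        self.by_query.setdefault(user_query, []).append(record)

    def recall(self, user_query: str) -> list[ToolCallRecord]:
        """Past tool calls + verdicts for this query, to prepend to the agent's context."""
        return self.by_query.get(user_query, [])
```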
Drafting on Tool Exhaust
Finally, if we build an agent that works very well, can we reuse its tool memory in online, non-agentic search? Could we look up in that tool memory which keywords and filters tend to work well, then use them in classic search-bar-style search?
Feedback? Ideas?
This is all written in the spirit of experimentation, exploration, and happy failures 🙂 Please get in touch if you have feedback or ideas. Especially if you see where I’ve screwed up!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.