By Alec Berntson, Alina Stoica Beck, Amaia Salvador Aguilera, Arnau Quindós Sánchez, Thibault Gisselbrecht and Xianshun Chen
Agentic retrieval in Azure AI Search is a new API built to answer complex queries effectively by extracting exactly the content they need. Using conversation history and an Azure OpenAI model, the API defines and runs a query plan: it transforms complex queries, performs multiple searches at once, and combines the results into ready-to-use content for answer generation.
In this post we detail the operations that take place when the API is called and walk through the experiments and datasets we used to evaluate its relevance. We found that the agentic retrieval API automates optimal retrieval for complex user queries, so you get more relevant content with less work. This means:
- No pre- or post-retrieval work for the caller of the API: everything is handled automatically, and the returned content can be sent as-is to an LLM for answer generation.
- +16 points of answer relevance improvement (up to +33 points) when using the agentic retrieval API for complex queries as opposed to retrieval only; on-par results for simple queries for which retrieval alone is enough.
- Multiple query transformations (conversation history handling, spell correction, search query generation and paraphrasing).
- Minimal cost and latency as all query transformations are achieved with a single LLM call.
- High performance from GPT-4o, GPT-4o mini, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
Learn more about agentic retrieval in Azure AI Search, and the experiments we ran to test its capabilities.
The agentic retrieval API automatically performs a suite of operations to extract the necessary content for answer generation from an indexed set of documents. The input to the API is a query that requires content retrieval from a given index. In the context of conversations, the conversation turns are also part of the API input. The output is a string that contains the retrieved content already formatted and ready to use for LLM-based answer generation. While there are some parameters that can be tuned (e.g. maximum output size), the work of identifying, extracting and merging the necessary content for answer generation from the given index is done automatically.
In agentic scenarios, many queries require more capabilities than those available in typical retrieval systems. Traditional retrieval systems usually follow a recall + reranking pattern, where relevant documents are retrieved based on a given query. However, for complex queries issued in agentic scenarios, if the user query is used as-is without any transformations before retrieval, the most relevant documents might not be identified. In order to successfully retrieve relevant content for complex queries, it is necessary to transform the user query into one or multiple queries. In addition to transforming the query, it is also necessary to collect several pieces of content and use them together to generate an answer.
Typically, retrieval systems consist of two layers: recall (L1) and reranking (L2). The search API in Azure AI Search implements this pattern: first, the L1 layer recalls documents from customer indexes using text, vector or hybrid (a combination of the two) representations; second, the L2 layer, using semantic ranker, reorders the top 50 documents from L1 to put the best content first. We introduced several updates to the search API in previous blog posts [1] and [2], showing the importance of reranking on top of vector and/or text search. Even when using powerful embedding models like OpenAI text-embedding-3-large, L1 alone is not enough, and using semantic ranker to rerank the recalled documents drastically improves the relevance of results. However, complex queries issued in agentic settings need extra steps, which we added to the agentic retrieval API.
The agentic retrieval API uses the same two layers to retrieve the best content from an index for a given search query. However, before retrieving content, the API first transforms the input query into one or more search queries suited for retrieval, thus increasing the match between query and documents. The retrieved documents are then collected, sorted and output as one string that is ready to be passed to an LLM for answer generation. The agentic retrieval API introduces two components:
- Query planning: the input query is transformed into one or more search queries. This transformation also adds the necessary context from the user-bot conversation, if provided. Additionally, paraphrases of the query are generated, and misspellings are corrected such that matching between query and indexed documents is optimized. For instance, if a user issued the query “what about KB4048959 and waht systmes is it compatibel with?” in the context of a conversation about security updates, query planning generates the search queries “What security updates are related to KB4048959?” and “What systems is KB4048959 compatible with?”, thus including the context from the conversation, decomposing the query into separate search queries and correcting misspellings (paraphrases are also generated, but we didn’t include them here to keep the content concise). All these transformations are achieved with a single LLM call, such that latency and cost are optimized.
- Results merging: relevant documents are retrieved by running L1 and L2 on each of the search queries produced in query planning. Next, the retrieved documents are deduplicated and their content is merged into one string that is returned to the caller of the API. This enables the caller to generate a response simply by providing the returned string to an LLM, with no extra steps needed; a simplified sketch of this flow follows Figure 1 below.
Figure 1: Overview of search execution flow in agentic retrieval API.
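To make the execution flow in Figure 1 concrete, here is a minimal sketch of the two components described above: a single LLM call produces the query plan (contextualized, decomposed, spell-corrected and paraphrased search queries), each planned query runs through the L1 + L2 pipeline, and the results are deduplicated and merged into one string under the output budget. This is an illustrative outline only, not the service implementation; the `llm` and `run_search` interfaces, the document field names and the rough token estimate are all assumptions.

```python
# Illustrative outline of the agentic retrieval flow (not the service implementation).
# The `llm` and `run_search` interfaces and the document fields below are assumptions.
import json
from concurrent.futures import ThreadPoolExecutor

def plan_queries(llm, query: str, history: list[dict]) -> list[str]:
    """Single LLM call: use conversation history for context, correct misspellings,
    decompose into sub-queries, and add paraphrases."""
    prompt = (
        "Rewrite the user query into one or more self-contained search queries. "
        "Use the conversation history for context, correct misspellings and add "
        "paraphrases. Return a JSON list of strings.\n"
        f"History: {json.dumps(history)}\nQuery: {query}"
    )
    return json.loads(llm.complete(prompt))  # e.g. ["What security updates are related to KB4048959?", ...]

def agentic_retrieve(llm, run_search, query: str, history: list[dict], max_tokens: int = 5000) -> str:
    """run_search(q) returns L2-reranked docs: [{"doc_id", "content", "reranker_score"}, ...]."""
    planned = plan_queries(llm, query, history)              # query planning: one LLM call
    with ThreadPoolExecutor() as pool:                       # run all searches at once
        results = list(pool.map(run_search, planned))

    merged, seen, used = [], set(), 0
    for doc in sorted((d for r in results for d in r),
                      key=lambda d: d["reranker_score"], reverse=True):
        if doc["doc_id"] in seen:                            # deduplicate across sub-queries
            continue
        cost = len(doc["content"]) // 4                      # rough token estimate
        if used + cost > max_tokens:                         # respect the output size limit
            break
        seen.add(doc["doc_id"])
        merged.append(doc["content"])
        used += cost
    return "\n\n".join(merged)                               # ready-to-use string for answer generation
```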
We tested the agentic retrieval API and the search API on a multitude of datasets and queries that span many different types and use cases. While the search API delivers good search relevance for “classical” search queries:
- The agentic retrieval API drastically improves relevance for complex queries issued in agentic scenarios, while delivering similar relevance to the search API for classical search queries:
  - +16 points improvement in answer relevance
  - +15 points increase in result rate
- Improvements are consistent across languages and domains. Here, we show results on more than 10 domains and 6 languages.
- The largest improvements are seen for queries that are particularly difficult and that require gathering information from multiple indexed documents. Significant gains are also observed on misspelled queries and queries with low word overlap with the relevant indexed documents (+14 points for answer relevance).
In agentic scenarios, queries can be very complex and require multiple pieces of content to create a relevant answer. Since the different pieces of content can come from multiple indexed documents, classical information retrieval metrics (e.g. NDCG) cannot measure results accurately: such metrics look at one retrieved document at a time and evaluate its relevance independently of the other documents, while complex queries require several documents together.
Following existing work for RAG and agentic retrieval, we use the “RAG triad” [3], [4], [5] to evaluate the performance of the agentic retrieval API compared to the search API. The RAG triad consists of 3 metrics:
- Content relevance: how relevant is the retrieved content?
- Answer relevance: how relevant is the LLM-generated answer?
- Groundedness: to what extent is the generated answer based on the retrieved content rather than hallucinated?
We use GPT-4o for all LLM-based evaluations. To measure relevance, we prompt the LLM and ask it to determine how relevant a given text is for a query. The LLM takes in a query-text pair and outputs a relevance score that we rescale to 0-100; see [2] for more details. To measure groundedness, we ask the LLM to determine how much of the generated answer is present in the retrieved content, a score that we also rescale to 0-100.
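For illustration, a relevance judge of this kind can be set up roughly as below. The prompt wording, the 0-4 rating scale and the use of the `openai` Python client are our own simplification, not the exact evaluation setup used in the experiments (which ran GPT-4o through Azure OpenAI).

```python
# Simplified LLM-as-judge relevance scorer; the prompt and the 0-4 scale are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; the experiments used Azure OpenAI GPT-4o

JUDGE_PROMPT = (
    "Rate how relevant the following content is for answering the query.\n"
    "Reply with a single integer from 0 (irrelevant) to 4 (perfectly relevant).\n\n"
    "Query: {query}\n\nContent: {content}"
)

def relevance_score(query: str, content: str, model: str = "gpt-4o") -> float:
    """Return an LLM-judged relevance score rescaled to 0-100."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, content=content)}],
    )
    raw = int(response.choices[0].message.content.strip())  # integer rating from the judge
    return raw / 4 * 100                                     # rescale to 0-100
```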
We compare the performance of the agentic retrieval API to that of the search API. To make comparisons fair, we keep all possible parameters constant: we issue the queries against the same indexes with the same search configurations, we use strings of the same length for answer generation, and we use the same models and prompts to generate answers and to evaluate results. For the agentic retrieval API, the string to use for answer generation is automatically computed and output by the API; its maximum length can be configured by the caller (we used the default of 5000 tokens). For the search API, the result of the API call is a sorted list of indexed documents. To create the string used for answer generation, we append documents from the output list until the given maximum length is reached.
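A minimal sketch of that concatenation step for the search API baseline, assuming a crude characters-to-tokens estimate in place of a real tokenizer:

```python
def build_context(ranked_docs: list[str], max_tokens: int = 5000) -> str:
    """Append documents in rank order until the token budget is reached,
    mirroring the maximum output length used by the agentic retrieval API."""
    parts, used = [], 0
    for doc in ranked_docs:
        cost = len(doc) // 4          # rough token estimate; a tokenizer is used in practice
        if used + cost > max_tokens:
            break
        parts.append(doc)
        used += cost
    return "\n\n".join(parts)
```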
To measure content relevance, we use the input query, the conversation turns if available, and the content used for answer generation (the output string in the case of the agentic retrieval API, and the concatenated documents in the case of the search API). We ask the evaluation LLM to determine how relevant the content is for the given query, thus obtaining a score that we rescale to 0-100.
To measure answer relevance and groundedness, we first generate an answer by prompting an LLM using the retrieved content. The LLM has the option of not generating an answer (returning instead “Sorry, I could not find an answer”) if the content presented to it is not relevant for the input query. We use the input query and the generated answer to compute answer relevance by prompting the evaluation LLM, thus obtaining a score that we rescale to 0-100.
Additionally, we use the retrieved content to measure groundedness. We ask the evaluation LLM to determine to what extent the information in the generated answer is present in the string used for generation. We obtain a score that we also rescale to 0-100. If there is no answer for a given query (the answer generating LLM responded with “Sorry, I could not find an answer”), we skip the query and do not compute groundedness.
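Per query, the evaluation therefore proceeds roughly as in the outline below. It reuses `client` and `relevance_score` from the relevance-scorer sketch above; the answer-generation and groundedness prompts and the literal "no answer" sentinel matching are our own assumptions, not the exact harness we ran.

```python
# Per-query RAG-triad evaluation sketch; reuses `client` and `relevance_score`
# from the relevance-scorer example above. Prompts and the sentinel are assumptions.
NO_ANSWER = "Sorry, I could not find an answer"

def ask(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content.strip()

def generate_answer(query: str, context: str) -> str:
    return ask(f"Answer the query using only the content below. If the content is not "
               f"relevant to the query, reply exactly: {NO_ANSWER}\n\n"
               f"Content:\n{context}\n\nQuery: {query}")

def groundedness_score(answer: str, context: str) -> float:
    raw = int(ask("On a scale from 0 to 4, how much of the information in the answer "
                  "is present in the content? Reply with a single integer.\n\n"
                  f"Content:\n{context}\n\nAnswer:\n{answer}"))
    return raw / 4 * 100                                   # rescale to 0-100

def evaluate_query(query: str, retrieved: str) -> dict:
    """Compute the RAG triad for one query; groundedness is skipped when no answer is produced."""
    answer = generate_answer(query, retrieved)             # the LLM may decline to answer
    metrics = {"content_relevance": relevance_score(query, retrieved),
               "answered": not answer.startswith(NO_ANSWER)}
    if metrics["answered"]:
        metrics["answer_relevance"] = relevance_score(query, answer)
        metrics["groundedness"] = groundedness_score(answer, retrieved)
    return metrics
```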
As in our previous blog post [2], we used several datasets that were created specifically to align with production scenarios:
- One ("Customer") with several document sets provided with permission from Azure customers. These documents are usually hundreds of pages long, so require chunking before vectorization.
- One ("Support") is sourced from hundreds of thousands of public support/knowledge base articles present in many different languages, out of which we used 8 languages
- One (Multi-industry, Multi-language – "MIML") is a collection of 60 indexes that we created from publicly available documents. They represent typical documents from 10 customer segments and 6 languages (we dropped Traditional Chinese from our previous dataset because the data already included the more widely used Simplified Chinese), each segment having approximately 1000 documents.
We also added several new datasets so that we cover scenarios where very technical vocabulary is used:
- One (“FDA”) is a collection of medication documents written in English.
- One (“DAYI”) is a collection of medication documents written in Chinese.
- One (“Arxiv”) is a collection of scientific documents.
Finally, we used the publicly available MT-RAG dataset to test conversational scenarios. For this dataset, we used queries, conversations and documents as present in the data.
We used many different techniques to collect and generate queries for the different datasets (except for MT-RAG, for which we used the queries provided in the public data). We used two main types of queries:
- “Classical” search queries that were either:
  - Real-user queries issued in Bing
  - Generated by providing an LLM with some part of a document. These queries were created such that multiple types are covered (question, keyword, web-like, concept-seeking etc., as described in previous blog posts [1] and [2]). Misspelled and paraphrased queries were also created.
- “Complex” queries that can only be answered by retrieving information from multiple indexed documents. To create these queries, we used many different techniques from the literature, with some adaptations to use LLMs instead of manual steps.
For each dataset and each language present in the data, we used between 1000 and 5000 queries of each type (classical and complex).
For all experiments, we used hybrid search for L1 (BM25 and OpenAI text-embedding-3-large with 3072 dimensions) and semantic ranker for L2 (we reranked the top 50 documents retrieved by L1). We used this configuration because it provides high-quality search results, as described in [2]. This configuration is available as part of both the search API and the agentic retrieval API.
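For reference, a single L1 + L2 query with this configuration can be issued through the azure-search-documents Python SDK roughly as shown below. The service endpoint, index name, vector field and semantic configuration names are placeholders, the embedding call is shown through the `openai` client for brevity, and the index is assumed to already contain 3072-dimensional text-embedding-3-large vectors.

```python
# Hybrid (BM25 + vector) retrieval with semantic reranking, i.e. L1 + L2.
# Endpoint, index, field and configuration names are placeholders for your own setup.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

search_client = SearchClient(endpoint="https://<service>.search.windows.net",
                             index_name="<index>",
                             credential=AzureKeyCredential("<api-key>"))
openai_client = OpenAI()

def hybrid_semantic_search(query: str, top: int = 50):
    embedding = openai_client.embeddings.create(           # 3072-dim query vector
        model="text-embedding-3-large", input=query).data[0].embedding
    return list(search_client.search(
        search_text=query,                                 # BM25 part of L1
        vector_queries=[VectorizedQuery(vector=embedding,  # vector part of L1
                                        k_nearest_neighbors=top,
                                        fields="content_vector")],
        query_type="semantic",                             # L2: semantic ranker
        semantic_configuration_name="<semantic-config>",
        top=top))                                          # rerank the top 50 documents
```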
For all results except those in Table 5, we used GPT-4o for query planning and answer generation, and we set the maximum length of the string used to generate the answer to 5000 tokens (the default value).
Table 1 compares the performance of the two APIs on multiple datasets with complex and classical search queries. While the performance of the two APIs is similar on classical search queries, content and answer relevance are drastically higher for the agentic retrieval API on complex queries.
| Query Type | Dataset | Answer Relevance: Search API | Answer Relevance: Agentic Retrieval API | Answer Relevance: Delta | Content Relevance: Search API | Content Relevance: Agentic Retrieval API | Content Relevance: Delta |
|---|---|---|---|---|---|---|---|
| Classical | MIML | 87.12 | **87.89** | 0.76 | 77.22 | **77.95** | 0.74 |
| | Support | 66.85 | **67.44** | 0.58 | 53.17 | **53.72** | 0.55 |
| | Content | 77.58 | **77.77** | 0.19 | **68.95** | 68.89 | -0.06 |
| | DAYI | **74.68** | 73.26 | -1.42 | **68.22** | 66.67 | -1.55 |
| | FDA | **63.03** | 62.41 | -0.62 | **55.73** | 55.28 | -0.45 |
| | Arxiv | 68.97 | **69.66** | 0.68 | **58.97** | 57.40 | -1.58 |
| | AVG Classical | 73.04 | **73.07** | 0.03 | **63.71** | 63.32 | -0.39 |
| Complex | MIML | 46.94 | **57.13** | 11.25 | 32.16 | **39.46** | 7.30 |
| | Support | 43.92 | **61.06** | 17.15 | 32.48 | **44.88** | 12.40 |
| | Content | 46.18 | **58.99** | 12.81 | 33.93 | **42.82** | 8.89 |
| | DAYI | 40.58 | **48.52** | 7.94 | 31.21 | **36.75** | 5.54 |
| | FDA | 38.25 | **71.25** | 33.00 | 30.84 | **58.33** | 27.50 |
| | Arxiv | 33.43 | **48.83** | 15.39 | 25.55 | **35.74** | 10.19 |
| | AVG Complex | 41.55 | **57.63** | 16.26 | 31.03 | **43.00** | 11.97 |
| Conversational | MT-RAG | 77.31 | **78.74** | 1.43 | 69.35 | **71.49** | 2.14 |
| AVG | | 58.83 | **66.38** | 7.63 | 49.06 | **54.57** | 5.51 |
Table 1: Content and answer relevance for the two APIs across datasets. Larger values in bold.
To compare groundedness, we look at queries for which groundedness can be computed on both sides, i.e. queries for which both APIs return content that leads to an answer being generated. Table 2 presents this comparison for the two APIs, showing that the generated answer is equally grounded for both APIs.
| Query Type | Dataset | Search API | Agentic Retrieval API | Delta |
|---|---|---|---|---|
| Classical | MIML | 83.58 | 83.52 | -0.06 |
| | Support | 76.72 | 76.68 | -0.03 |
| | Content | 83.01 | 81.85 | -1.17 |
| | DAYI | 84.12 | 83.00 | -1.12 |
| | FDA | 85.26 | 85.69 | 0.43 |
| | Arxiv | 72.05 | 73.05 | 1.00 |
| | AVG Classical | 80.79 | 80.63 | -0.16 |
| Complex | MIML | 72.82 | 73.49 | 0.67 |
| | Support | 76.38 | 76.21 | -0.18 |
| | Content | 72.75 | 74.10 | 1.35 |
| | DAYI | 74.21 | 73.85 | -0.35 |
| | FDA | 78.22 | 78.34 | 0.12 |
| | Arxiv | 70.92 | 71.00 | 0.07 |
| | AVG Complex | 74.22 | 74.50 | 0.28 |
| Conversational | MT-RAG | 81.63 | 82.56 | 0.93 |
| AVG | | 77.82 | 77.95 | 0.13 |
Table 2: Groundedness for the two APIs across datasets for queries with generated answers on both sides.
Besides the RAG triad, we compare the percentage of queries for which the two APIs return content that leads to a generated answer. Table 3 shows this comparison. The agentic retrieval API returns useful content for answer generation for a much higher percentage of queries. Furthermore, for queries where the agentic retrieval API leads to an answer but the search API leads to none, all three metrics have high values (as shown by Table 4). This means that, by using the agentic retrieval API, one can get good answers for many queries for which the search API has no answer.
| Query Type | Dataset | Search API | Agentic Retrieval API | Delta |
|---|---|---|---|---|
| Classical | MIML | 90.37 | 91.18 | 0.81 |
| | Support | 72.21 | 73.03 | 0.82 |
| | Content | 80.66 | 80.89 | 0.23 |
| | DAYI | 81.46 | 80.64 | -0.82 |
| | FDA | 65.77 | 65.06 | -0.71 |
| | Arxiv | 71.51 | 72.60 | 1.10 |
| | AVG Classical | 77.00 | 77.23 | 0.24 |
| Complex | MIML | 54.49 | 63.96 | 9.46 |
| | Support | 48.08 | 65.33 | 17.25 |
| | Content | 53.76 | 66.00 | 12.24 |
| | DAYI | 52.92 | 58.17 | 5.25 |
| | FDA | 45.58 | 76.08 | 30.50 |
| | Arxiv | 39.60 | 54.36 | 14.77 |
| | AVG Complex | 49.07 | 63.98 | 14.91 |
| Conversational | MT-RAG | 80.05 | 81.93 | 1.88 |
| AVG | | 64.34 | 71.48 | 7.14 |
Table 3: Percentage of queries for which an answer can be generated based on the retrieved content for the two APIs across datasets.
| Query Type | Dataset | Answer Relevance | Content Relevance | Groundedness |
|---|---|---|---|---|
| Classical | MIML | 82.85 | 67.60 | 76.49 |
| | Support | 72.57 | 52.50 | 66.25 |
| | Content | 86.26 | 66.61 | 75.77 |
| | DAYI | 63.24 | 53.43 | 77.84 |
| | FDA | 82.19 | 63.70 | 73.84 |
| | Arxiv | 78.13 | 54.69 | 71.88 |
| | AVG Classical | 77.54 | 59.76 | 73.68 |
| Complex | MIML | 84.36 | 53.98 | 70.25 |
| | Support | 92.87 | 63.80 | 75.26 |
| | Content | 89.44 | 58.17 | 71.05 |
| | DAYI | 81.08 | 58.49 | 67.89 |
| | FDA | 94.14 | 73.86 | 77.21 |
| | Arxiv | 88.36 | 58.30 | 68.18 |
| | AVG Complex | 88.38 | 61.10 | 71.64 |
| Conversational | MT-RAG | 89.52 | 71.23 | 82.03 |
| AVG | | 83.04 | 61.15 | 73.40 |
Table 4: The RAG triad for queries for which only the agentic retrieval API leads to an answer being generated.
Table 5 shows the performance of the agentic retrieval API when different models are used for query planning: we use the GPT-4o and GPT-4.1 model families for this comparison. We observe very similar performance across models, with only GPT-4.1 nano performing slightly worse than the rest.
| Query Type | Dataset | Search API | Agentic: GPT-4o | Agentic: GPT-4o mini | Agentic: GPT-4.1 | Agentic: GPT-4.1 mini | Agentic: GPT-4.1 nano |
|---|---|---|---|---|---|---|---|
| Classical | MIML | 87.12 | 87.89 | 86.90 | 87.73 | 87.53 | 87.19 |
| | Support | 66.85 | 67.44 | 68.18 | 69.65 | 68.52 | 68.16 |
| | Content | 77.58 | 77.77 | 77.28 | 78.74 | 78.18 | 78.31 |
| | DAYI | 74.68 | 73.26 | 72.99 | 73.93 | 74.34 | 73.11 |
| | Arxiv | 68.97 | 69.66 | 67.60 | 68.77 | 69.45 | 68.01 |
| | FDA | 63.03 | 62.41 | 62.23 | 63.13 | 63.29 | 63.31 |
| | AVG Classical | 73.04 | 73.07 | 72.53 | 73.66 | 73.55 | 73.02 |
| Complex | MIML | 46.94 | 57.13 | 57.79 | 59.42 | 57.18 | 54.56 |
| | Support | 43.92 | 61.06 | 62.17 | 63.10 | 59.19 | 55.52 |
| | Content | 46.18 | 58.99 | 60.32 | 61.41 | 58.96 | 58.14 |
| | DAYI | 40.58 | 48.52 | 54.94 | 56.71 | 56.25 | 52.98 |
| | Arxiv | 33.43 | 48.83 | 48.72 | 50.31 | 49.10 | 43.23 |
| | FDA | 38.25 | 71.25 | 70.15 | 71.27 | 71.52 | 62.88 |
| | AVG Complex | 41.16 | 57.41 | 59.01 | 60.37 | 58.70 | 54.55 |
| Conversational | MT-RAG | 77.31 | 78.74 | 80.06 | 81.86 | 80.61 | 81.04 |
| AVG | | 58.65 | 66.28 | 66.87 | 68.16 | 67.24 | 65.11 |
Table 5: Answer relevance for the agentic retrieval API using different models for query planning. GPT-4o was used for answer generation everywhere.
Table 6 breaks down the answer relevance obtained with the two APIs by query type for complex queries. The definitions of the different query types can be found in the Appendix. The agentic retrieval API increases answer relevance on all query types, and on some segments by more than 20 points.
| Category | MIML: Search API | MIML: Agentic Retrieval API | MIML: Delta | Support: Search API | Support: Agentic Retrieval API | Support: Delta |
|---|---|---|---|---|---|---|
| Aggregation | 68.57 | 72.17 | 3.60 | 48.36 | 68.44 | 20.08 |
| Analytical | 35.88 | 39.38 | 3.50 | 15.79 | 42.11 | 26.32 |
| Comparison | 39.33 | 57.86 | 18.53 | 38.24 | 58.14 | 19.90 |
| Compound | 47.61 | 62.56 | 14.94 | 41.86 | 68.02 | 26.16 |
| Contextual | 44.92 | 48.89 | 3.97 | 40.77 | 51.19 | 10.42 |
| Conversational | 55.17 | 59.56 | 4.39 | 45.13 | 64.31 | 19.18 |
| Exploratory | 48.62 | 55.49 | 6.87 | 48.12 | 69.64 | 21.52 |
| Factual | 52.62 | 58.14 | 5.52 | 47.85 | 66.40 | 18.55 |
| Filters | 50.40 | 56.54 | 6.14 | 38.94 | 46.63 | 7.69 |
| Index Interrogation | 63.94 | 69.59 | 5.65 | 50.56 | 63.89 | 13.33 |
| Misspellings | 36.43 | 53.05 | 16.62 | 55.21 | 70.14 | 14.93 |
| Range | 46.56 | 47.73 | 1.17 | 29.21 | 37.11 | 7.89 |
| Snippet | 49.20 | 58.12 | 8.92 | 27.78 | 59.72 | 31.94 |
| Terse | 52.42 | 60.26 | 7.85 | 36.25 | 63.13 | 26.88 |
Table 6: Answer relevance breakdown by category on the MIML and Support complex queries.
Finally, we analyze the performance of the agentic retrieval API as compared to the search API across languages. As shown by Table 7, the agentic retrieval API provides significant gains on complex queries for all languages on which it was tested. On classical queries, the agentic retrieval API matches the performance of the search API across languages.
| Language | MIML (Complex): Search API | MIML (Complex): Agentic API | MIML (Complex): Delta | Support (Classical): Search API | Support (Classical): Agentic API | Support (Classical): Delta |
|---|---|---|---|---|---|---|
| German | 50.99 | 58.78 | 7.79 | 78.28 | 78.91 | 0.63 |
| English | 42.75 | 60.50 | 17.75 | 77.93 | 78.07 | 0.14 |
| Spanish | 58.69 | 63.57 | 4.88 | 77.64 | 80.65 | 3.00 |
| French | 46.42 | 56.68 | 10.26 | 76.07 | 76.91 | 0.84 |
| Japanese | 45.24 | 46.85 | 1.61 | 57.33 | 56.89 | -0.43 |
| Chinese | 37.54 | 56.36 | 18.82 | 64.20 | 64.28 | 0.08 |
Table 7: Answer relevance breakdown by language on the MIML complex queries and Support classical queries.
The agentic retrieval API is now in public preview in select regions. You can learn more about it here.
[2] Raising the bar for RAG excellence: introducing generative query rewriting and new ranking model
[4] Building and Evaluating Advanced RAG Applications - DeepLearning.AI
[5] Using the RAG Triad for RAG evaluation | DeepEval - The Open-Source LLM Evaluation Framework
Complex queries can belong to one or several of the following types:
| Type | Description | Example |
|---|---|---|
| Aggregation | Queries that seek lists or sets of items, or ask for summaries | “what are the average response time across regions” |
| Analytical | Queries that require a deeper analysis or interpretation of data or information to draw conclusions | “Analyze the impact of social media on mental health” |
| Comparison | Queries looking for similarities, differences, relationships etc. between attributes or entities | “cost of plan A vs cost of plan B” |
| Compound | Queries with multiple subjects or questions | “is it possible to do __? what about in __ circumstance” |
| Complex | Queries requiring information spread across multiple indexed documents to answer | “how do I set up an Azure Search index with vector and hybrid search” |
| Contextual | Queries that require understanding the context or background information to provide a relevant and accurate response | “What was the significance of the Berlin Wall during the Cold War” |
| Conversational | Queries issued as part of a dialog, thus containing conversational words like “what do you think”, “please help me find” etc. | “how do I troubleshoot the message ‘Your account is blocked due to xyz’” |
| Exploratory | Abstract questions that require multiple sentences to answer | “Why should I use semantic search to rank results?” |
| Factual | Fact-seeking queries, usually with a single, clear answer | “What is the capital of France?” |
| Filters | Queries with filter conditions that scope the subject/answer | “MSFT q2 2025 revenue” |
| Index interrogation | Queries seeking information at the entire index level | “what policies are available?” |
| Misspellings | Queries with one or more misspelled words | “How mny documents are samantically r4nked” |
| Range | Queries asking for data between given dates, greater than a given value, etc. | “widgets that cost less than $1000 per unit” |
| Snippet | Queries looking for a specific substring (exact match of a substring from a document) | “Back your generative AI apps with information retrieval that leverages the strengths of both keyword and similarity search” |
| Terse | Shortened queries similar to those commonly entered into a search engine | “Best retrieval concept queries” |