Authors: Apoorva Joshi, Zhenmei Shi, Akshay Goindani, Hong Liu
Research Leads: Zhenmei Shi, Akshay Goindani, Hong Liu
Large language models are increasingly being used for a broad range of tasks, including reranking, but they may not be the optimal choice when considering practical constraints like cost, latency, and accuracy in production applications.
In this blog post, we put our latest reranker model, rerank-2.5, to the test against some of the best-performing LLMs on the market to see whether LLMs are actually good rerankers. Our studies show the following:
- Purpose-built rerankers, such as rerank-2.5, are up to 60x cheaper, 48x faster, and achieve up to 15% better reranking accuracy (NDCG@10) than state-of-the-art LLMs.
- First-stage retrieval matters—pairing strong first-stage retrieval methods with specialized rerankers yields the best reranking quality.
- While long-context LLMs can rerank all candidate documents in a single pass, they underperform purpose-built rerankers.
Before we get to our experiments, let’s briefly discuss why reranking is crucial in AI applications. Two-stage retrieval systems have long been the standard in search applications—the first stage quickly retrieves potentially relevant documents using techniques such as lexical, vector, or hybrid search, while the second stage reranks these results in an improved order of relevance.
This two-stage approach is especially critical for RAG and other applications requiring LLMs to process large amounts of information (such as multi-document analysis, deep research, and code understanding) because LLMs suffer from the “lost in the middle” problem wherein their performance degrades substantially when relevant information is buried deeper in the context window. Reranking places the most relevant documents at the top, maximizing LLM performance on these tasks.

Fig 1: Two-stage retrieval
Read more about the importance of reranking in LLM applications in this blog post.
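To make the pipeline in Fig 1 concrete, here is a minimal sketch of a two-stage retrieval function in Python. The first_stage_search and rerank callables are hypothetical placeholders for any retriever (lexical, vector, or hybrid) and any reranking model; they do not correspond to a specific API.

```python
from typing import Callable, List, Tuple

def two_stage_retrieval(
    query: str,
    corpus: List[str],
    first_stage_search: Callable[[str, List[str], int], List[str]],  # e.g., BM25 or vector search
    rerank: Callable[[str, List[str]], List[Tuple[str, float]]],     # e.g., a cross-encoder reranker
    first_stage_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    # Stage 1: cheap, recall-oriented retrieval over the full corpus.
    candidates = first_stage_search(query, corpus, first_stage_k)
    # Stage 2: precision-oriented reranking of the candidate pool.
    scored = rerank(query, candidates)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep the top results so the most relevant documents sit at the top of the LLM's context.
    return [doc for doc, _ in scored[:final_k]]
```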
We currently see two common approaches for reranking:
- Using specialized rerankers: Pass the query and the results from first-stage retrieval to a specialized reranking model, typically a cross-encoder, that jointly processes each query-document pair to produce relevance scores for reordering.
- Using off-the-shelf LLMs as rerankers: Pass the query and the results from first-stage retrieval to an LLM, and prompt it to reorder the list in decreasing order of relevance.
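The sketch below illustrates the shape of each approach; either function could serve as the rerank step in the pipeline sketch above. The score_pair and generate callables are illustrative placeholders, not the exact models or APIs used in our experiments.

```python
from typing import Callable, List, Tuple

# Approach 1: a specialized reranker (typically a cross-encoder) scores each
# (query, document) pair directly and sorts by score.
def rerank_with_cross_encoder(
    query: str, docs: List[str], score_pair: Callable[[str, str], float]
) -> List[str]:
    scored: List[Tuple[str, float]] = [(doc, score_pair(query, doc)) for doc in docs]
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

# Approach 2: an off-the-shelf LLM is prompted to emit a permutation of document indices.
def rerank_with_llm(
    query: str, docs: List[str], generate: Callable[[str], str]
) -> List[str]:
    numbered = "\n".join(f"Document [{i}]: {doc}" for i, doc in enumerate(docs))
    prompt = (
        f"Query: {query}\n{numbered}\n"
        "Rank the documents by relevance to the query. Respond only with the indices "
        "in decreasing order of relevance, e.g. [2] > [0] > [1]."
    )
    # Real implementations must also repair incomplete or duplicated permutations.
    order = [int(tok.strip("[] ")) for tok in generate(prompt).split(">")]
    return [docs[i] for i in order if 0 <= i < len(docs)]
```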
While a purpose-built reranking model might seem like the obvious choice, there are several perceived advantages to using LLMs as rerankers: convenience and accessibility, strong out-of-the-box performance across diverse domains, and the ability to explain their ranking decisions. However, previous studies compare LLMs against outdated rerankers on academic datasets, and on top of weak first-stage retrieval methods. There has been little empirical evidence showing whether modern LLMs or purpose-built rerankers perform better on real-world datasets with strong first-stage retrieval baselines.
We conducted a comprehensive study benchmarking state-of-the-art LLMs against our latest rerank-2.5 model on metrics such as cost, latency, and accuracy on real-world datasets, and on top of different first-stage retrieval methods such as embedding-based retrieval and lexical search.
Evaluation setup
- Datasets: 13 real-world datasets spanning 8 domains: technical documentation, code, law, finance, web reviews, medical, conversations, and datasets in the wild. Detailed information about each of the domains can be found in the appendix.
- First-stage retrieval methods: We perform reranking as a second stage on top of three first-stage retrieval methods: lexical search using the BM25 algorithm, vector search using voyage-3-large, and vector search using voyage-3-lite, a lightweight version of voyage-3-large optimized for lower cost and latency.
- Rerankers: We compare rerank-2.5 and rerank-2.5-lite against Cohere’s latest rerank-v3.5, as well as LLMs such as GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Gemini 2.0 Flash, and Qwen 3 32B.
NOTE: For LLMs, given a list of candidate documents, we prompt them to output the document indices in decreasing order of relevance. We adopt sliding window reranking following RankLLM, with a window size of 20. For Gemini 2.0 Flash, we also report results from using its 1M token context window to process all candidate documents at once.
- Metrics: We use Normalized Discounted Cumulative Gain (NDCG) @10 as our main metric to measure reranking performance. NDCG is a standard metric used to evaluate ranking quality in recommendation and information retrieval systems. A higher NDCG indicates a better ability to rank relevant items higher in a list of retrieved results. NDCG@10 measures how well the top 10 results are ordered by relevance.
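For reference, one standard formulation of the metric (a textbook definition rather than our exact evaluation code) is:

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},
$$

where $\mathrm{rel}_i$ is the graded relevance of the document at rank $i$ and $\mathrm{IDCG@10}$ is the DCG@10 of the ideal (perfectly sorted) ranking, so a perfect ordering scores 1.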
Additional details about the RankLLM implementation, LLM prompts, etc., are available in the appendix.
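Since sliding-window reranking determines how the LLM baselines handle candidate lists longer than one window, here is a minimal sketch of the idea, assuming a window size of 20. The stride of 10 and the llm_rank_window helper are illustrative assumptions, not the exact RankLLM implementation.

```python
from typing import Callable, List

def sliding_window_rerank(
    query: str,
    docs: List[str],
    llm_rank_window: Callable[[str, List[str]], List[int]],  # hypothetical: returns window indices, best first
    window_size: int = 20,
    stride: int = 10,
) -> List[str]:
    """Rerank a long candidate list with an LLM that only sees `window_size` documents at a time.

    Windows are processed from the bottom of the list to the top, so documents the LLM
    ranks highly can bubble up across overlapping windows.
    """
    ranked = list(docs)
    start = max(len(ranked) - window_size, 0)
    while True:
        end = min(start + window_size, len(ranked))
        window = ranked[start:end]
        order = llm_rank_window(query, window)          # e.g., [3, 0, 7, ...]
        ranked[start:end] = [window[i] for i in order]  # rewrite the window in the LLM's order
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked
```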
Results
This section highlights key findings from our experiments. All the evaluation results are available in this spreadsheet.
Specialized rerankers improve first-stage retrieval more than LLM rerankers

Fig 2: NDCG@10 scores across different domains
rerank-2.5 and rerank-2.5-lite consistently emerge as the top-performing rerankers, regardless of the domain and the first-stage retrieval method used. Averaged across the three first-stage retrieval methods and all datasets, rerank-2.5 outperforms GPT-5, Gemini 2.5 Pro, and Qwen 3 32B by 12.61%, 13.43% and 14.78%, respectively.
We also measured the impact of increasing the number of candidate documents used for reranking. Intuitively, one might expect that giving a reranker access to more documents from first-stage retrieval would improve ranking quality, since it has a larger pool of potentially relevant results to choose from.

Fig 3: Impact of reranking on strong first-stage retrieval
On top of voyage-3-large, we observed different behaviors across different models:
- rerank-2.5 provides the highest improvement in NDCG@10, increasing voyage-3-large’s performance from 81.58% to 83.02%. This further improves to 84.4% as the number of documents seen by the reranker increases from 10 to 50, followed by a slight dropoff when expanding to 200 documents.
- Similar to rerank-2.5, rerank-2.5-lite improves the NDCG@10 of voyage-3-large from 81.58% to 82.55%. This further increases to 83.33% as the number of documents seen by the reranker goes from 10 to 50, followed by a slight dropoff beyond 50 documents.
- Qwen 3 32B and Gemini 2.0 Flash actually degrade performance, with NDCG@10 dropping to 80.63% and 79.49%, respectively, from the baseline of 81.58%. This further drops to 79.99% and 75.97% as the number of documents increases from 10 to 200.
For all models, we see diminishing returns from increasing the number of documents for reranking beyond 100. This indicates that there’s a practical limit where additional documents cease to provide benefits.


Fig 4: Impact of reranking on weaker first-stage retrieval techniques
We also evaluated the models with “weaker” (compared to voyage-3-large) first-stage retrieval techniques, such as vector search using voyage-3-lite and lexical search. Both specialized rerankers and LLMs improve the ranking quality of these weaker retrieval techniques. However, on average, rerank-2.5 and rerank-2.5-lite offer higher gains of 36.74% and 35.53% from the baseline, compared to 29.8% and 24.66% from Qwen 3 32B and Gemini 2.0 Flash. Even here, we observe that performance flattens out beyond a certain number of documents.
The performance of the LLM rerankers is worth noting. While we observe that they degrade the performance of stronger first-stage retrieval methods, we find that they offer significant improvements over weaker retrieval techniques. We hypothesize that the general-purpose nature of LLMs and their positional biases conflict with high-quality initial rankings, yet they can still add value when the first-stage retrieval leaves substantial room for improvement.
Long context windows are not useful for LLM rerankers
The 1 million token context window of Gemini 2.0 Flash enables reranking all top 100 candidate documents in a single pass. Therefore, we also compare sliding window reranking and single-pass reranking with Gemini 2.0 Flash. Surprisingly, single-pass reranking with the top 100 documents significantly underperforms sliding window reranking, undermining the value proposition of longer context windows for reranking applications.

Fig 5: Sliding windows vs single-pass performance
As shown above, sliding window reranking outperforms single-pass reranking by 26.6%, 25.27%, and 22.2% when applied on top of voyage-3-large, voyage-3-lite, and lexical search as first-stage retrievers, respectively.
Pair strong first-stage retrieval with specialized rerankers for the best results
Another observation from Fig. 2 is that the first-stage retrieval quality sets the upper bound for the overall retrieval quality of the system. As shown above, on average, first-stage retrieval using voyage-3-large provides a strong initial NDCG of 81.58%. Adding rerank-2.5 improves performance by 3.36%, while the same reranker delivers much larger gains of 15.02% and 47.57% when applied to weaker first-stage retrievers like voyage-3-lite and lexical search, respectively.
While the reranker works harder to improve weaker initial results, the final system performance is ultimately bounded by the quality of the first-stage retrieval. Hence, voyage-3-large + rerank-2.5 achieves the highest overall NDCG@10 at 84.32%, followed by voyage-3-lite + rerank-2.5 at 82.54%, and lexical search + rerank-2.5 at 68.34%.
The takeaway is clear: to maximize overall system performance, pair your strongest first-stage retriever with your best reranking model.
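As a usage-level sketch, that pairing can look roughly like the following with the voyageai Python client. Treat the exact method names, parameters, and the toy corpus as illustrative assumptions; see the docs for the definitive API.

```python
# A minimal sketch of pairing vector search (voyage-3-large) with rerank-2.5.
# Assumes the voyageai Python client and a VOYAGE_API_KEY in the environment;
# the corpus and similarity computation are toy stand-ins for a real vector store.
import voyageai

vo = voyageai.Client()

corpus = [
    "Cross-encoders jointly encode the query and each document to score relevance.",
    "BM25 is a classic lexical retrieval function.",
    "Reranking reorders first-stage candidates so the most relevant documents come first.",
]
query = "How do cross-encoder rerankers work?"

# Stage 1: embed documents and query, then take the nearest neighbors by dot product.
doc_embeds = vo.embed(corpus, model="voyage-3-large", input_type="document").embeddings
query_embed = vo.embed([query], model="voyage-3-large", input_type="query").embeddings[0]

scores = [sum(q * d for q, d in zip(query_embed, emb)) for emb in doc_embeds]
order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
candidates = [corpus[i] for i in order][:100]  # first-stage candidate pool

# Stage 2: rerank the candidates with rerank-2.5 and keep the top 10.
reranking = vo.rerank(query, candidates, model="rerank-2.5", top_k=10)
top_docs = [r.document for r in reranking.results]
```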
Specialized rerankers are much cheaper than LLMs for reranking

Fig 6: Model cost vs accuracy for different reranking models
As shown above, specialized rerankers are more cost-effective than LLMs. This is because rerankers are small, tailored models for information retrieval, while LLMs are general-purpose models and orders of magnitude larger than rerankers.
rerank-2.5-lite offers the best tradeoff between cost and accuracy at $0.02 per 1M tokens, while achieving an NDCG@10 of 83.12%. rerank-2.5 offers the best reranking performance with an NDCG@10 of 84.32%. LLMs cost 25-60x more than rerank-2.5, with the cost of LLMs ranging from $1.25-$3 per 1M tokens compared to $0.05 for rerank-2.5.
Specialized rerankers are much faster than LLMs at reranking

Fig 7: Model cost vs latency for different reranking models
As shown above, rerankers are also several orders of magnitude faster than LLMs. Across 200 queries, Cohere’s rerank-v3.5 is the fastest model; however, rerank-2.5 and rerank-2.5-lite offer better tradeoffs between accuracy and speed. Once again, rerank-2.5 offers the best reranking performance, while being 9x, 36x, and 48x faster than Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro, respectively.
This observation is not surprising given that LLMs are typically much larger models with more complex architectures, and have the computational overhead of text generation, while rerankers are optimized specifically for relevance scoring. Especially with sliding window reranking, it takes LLMs several iterations to generate final results.
In this blog post, we evaluated LLMs against specialized rerankers. Despite the growing trend of using LLMs for reranking, our findings reveal significant limitations—LLMs perform poorly when paired with strong first-stage retrieval methods, have significantly higher cost and latency, and are highly sensitive to implementation details such as document ordering and prompt design.
Specialized rerankers like rerank-2.5 offer superior performance, lower cost and latency, and consistent behavior—making them the clear choice for production applications.
To learn more about rerank-2.5 and rerank-2.5-lite, head over to the docs.
Follow us on X (Twitter) and LinkedIn to stay up-to-date with our latest releases.
Evaluation dataset details
| Category | Description | Datasets |
| --- | --- | --- |
| TECH | Technical documentation | 5G |
| CODE | Code snippets, algorithms | LeetCodePython-rtl |
| LAW | Cases, statutes | LegalSummarization, LegalBench Corporate Lobbying |
| FINANCE | SEC filings, finance QA | RAG benchmark (Apple-10K-2022), FinanceBench, TAT-QA-rtl, HC3 Finance |
| WEB | Web content, reviews | Movie Reviews |
| CONVERSATION | Meeting transcripts, dialogues | Dialog Sum |
| HEALTHCARE | Health conversations | Mental Health Consult |
| Datasets in the wild | Real-world applications | Real-world 1, Real-world 2 |
LLM reranking prompt
For LLM reranking, we follow the implementation of RankLLM with the following prompt:
```python
prompt = (
    "You are an outstanding AI researcher with strong expertise in fine-grained data ranking. "
    "Please rank the following documents based on their relevance to a given query.\n"
    "I will provide you with a set of documents, each identified by a numerical tag in brackets—for example: Document [0], Document [1], Document [2], Document [3], and so on.\n"
    "**Instructions**:\n"
    "1. List all document identifiers in descending order of relevance.\n"
    "2. Only respond with the ranking results, do not say any word or explain.\n"
    "3. Present the final ranking in the following JSON format:\n"
    '{"order": "[3] > [7] > [9] > [1] > [0] > [8] > ... > [6]"}\n'
    "In above example, Document [3] is the most relevant and Document [6] is the most irrelevant."
    "**Important Requirements**:\n"
    "1. All provided documents must be included in the output.\n"
    "2. Each identifier should appear exactly once in the ranking.\n"
    "3. The total number of identifiers in the output must match the number of input documents.\n"
)
```
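For completeness, the snippet below shows one plausible way to attach a query and a window of candidate documents to this prompt before calling a chat model; the exact message formatting used in our runs may differ.

```python
# Illustrative only: pair the ranking prompt above with a query and a window of documents.
# The message structure is an assumption, not the exact RankLLM formatting.
def build_messages(prompt: str, query: str, window: list[str]) -> list[dict]:
    numbered = "\n".join(f"Document [{i}]: {doc}" for i, doc in enumerate(window))
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": f"Query: {query}\n\n{numbered}"},
    ]
```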
