RAG has come a long way since the days of naive chunk retrieval; now agentic strategies are table stakes.
Nowadays, an AI engineer has to be aware of a plethora of techniques and terminology covering the data-retrieval aspects of agentic systems: hybrid search, CRAG, Self-RAG, HyDE, deep research, reranking, multi-modal embeddings, and RAPTOR, just to name a few.
As we’ve built the Retrieval services in LlamaCloud, we’ve chosen to abstract several of these techniques into our API, exposing only a few top-level hyperparameters for controlling these algorithms. In this blog post, we will showcase these techniques and explain how and when to use them. We will build on them one by one, ending with a fully agentic retrieval system that can intelligently query multiple knowledge bases at once.
Starting with the basics
You can’t talk about RAG without talking about “naive top-k retrieval”. In this basic approach, document chunks are stored in a vector database, and query embeddings are matched with the k most similar chunk embeddings.
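To make the mechanics concrete, here is a minimal, self-contained sketch of naive top-k retrieval over precomputed embeddings. The three-dimensional vectors are toy values standing in for real embedding-model output, and the chunk IDs are hypothetical; a vector database does the same similarity ranking at scale, usually with approximate nearest-neighbor search.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def naive_top_k(
    query_embedding: list[float],
    chunk_embeddings: dict[str, list[float]],
    k: int = 2,
) -> list[str]:
    """Return the IDs of the k chunks most similar to the query embedding."""
    scored = (
        (cosine_similarity(query_embedding, emb), chunk_id)
        for chunk_id, emb in chunk_embeddings.items()
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [chunk_id for _, chunk_id in best]


# Toy embeddings (hypothetical values, not produced by a real model).
chunks = {
    "hq_chunk": [0.9, 0.1, 0.0],
    "revenue_chunk": [0.1, 0.9, 0.2],
    "outlook_chunk": [0.2, 0.7, 0.6],
}
query = [1.0, 0.0, 0.1]
print(naive_top_k(query, chunks, k=2))  # ['hq_chunk', 'outlook_chunk']
```

Note that the ranking only sees vector similarity: the retriever has no notion of which file a chunk came from, which is the limitation the file-level modes later in this post address.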
 Naive top-k RAG
Here’s a basic code snippet to index a simple folder of PDFs:
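```python
import os

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# Assumes LLAMA_CLOUD_API_KEY is set in the environment;
# replace project_name with the name of your LlamaCloud project.
project_name = "Default"

# Create the index with no documents; files are uploaded below.
financial_index = LlamaCloudIndex.from_documents(
    documents=[],
    name="Financial Reports",
    project_name=project_name,
)

financial_reports_directory = "./data/financial_reports"
for file_name in os.listdir(financial_reports_directory):
    file_path = os.path.join(financial_reports_directory, file_name)
    # Queue each file for ingestion without blocking on it...
    financial_index.upload_file(file_path, wait_for_ingestion=False)

# ...then block once until all queued files have been ingested.
financial_index.wait_for_completion()
```

Once indexing is complete, you can start retrieving these chunks with one more line: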
```python
query = "Where is Microsoft headquartered?"

# Retrieve the top-k most similar chunks...
nodes = financial_index.as_retriever().retrieve(query)

# ...or retrieve and synthesize an answer with an LLM in one call.
response = financial_index.as_query_engine().query(query)
```

Going slightly beyond this naive chunk retrieval mode, there are two more modes for when you want to retrieve the entire contents of relevant files:
- files_via_metadata - use this mode when you want to handle queries that mention a specific filename or pathname, e.g. “What does the 2024_MSFT_10K.pdf file say about the financial outlook of MSFT?”.
- files_via_content - use this mode when you want to handle queries that ask general questions about a topic rather than about a particular set of files, e.g. “What is the financial outlook of MSFT?”.
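The difference between the two file-level modes can be illustrated with a toy, in-memory sketch. This is not how LlamaCloud implements them: the corpus, the filename-matching rule, and the word-overlap score (a crude stand-in for embedding similarity) are all assumptions for illustration. What it shows is that both modes return whole files rather than chunks and differ only in how those files are selected.

```python
import re

# Hypothetical corpus: file name -> full file contents.
FILES = {
    "2024_MSFT_10K.pdf": "MSFT financial outlook remains strong, driven by cloud growth.",
    "2024_AAPL_10K.pdf": "AAPL revenue grew on services and iPhone sales.",
}


def _tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for embedding similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def files_via_metadata(query: str) -> list[str]:
    """Select files whose name appears verbatim in the query."""
    return [name for name in FILES if name.lower() in query.lower()]


def files_via_content(query: str, top_n: int = 1) -> list[str]:
    """Select the top_n files whose contents share the most words with the query."""
    ranked = sorted(
        FILES,
        key=lambda name: len(_tokens(query) & _tokens(FILES[name])),
        reverse=True,
    )
    return ranked[:top_n]


# Both modes hand back entire files, not chunks.
for name in files_via_metadata(
    "What does the 2024_MSFT_10K.pdf file say about the financial outlook of MSFT?"
):
    print(name, "->", FILES[name])
for name in files_via_content("What is the financial outlook of MSFT?"):
    print(name, "->", FILES[name])
```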
 Multiple retrieval modes
While chunk retrieval is the default mode, you can use one of the other retrieval modes via the retrieval_mode kwarg:
```python
files_via_metadata_nodes = financial_index.as_retriever(
    retrieval_mode="files_via_metadata"
).retrieve(query)
files_via_content_nodes = financial_index.as_retriever(
    retrieval_mode="files_via_content"
).retrieve(query)
```

Level up: Auto Mode
Now that we understand how and when to use each retrieval mode, you’re equipped to answer all types of questions about your knowledge base!
However, many applications can’t know in advance which type of question is being asked, because most of the time the questions come from end users. You need a way to decide which retrieval mode is most appropriate for each query.
Enter the 4th retrieval mode - auto_routed mode! As the name suggests, this mode uses a lightweight agent to determine which of the other 3 retrieval modes to use for a given query.
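Conceptually, the router only has to map a query to one of the other three modes. The sketch below uses hand-written rules as a hypothetical stand-in for the lightweight LLM agent behind auto_routed; the regular expression, the cue words, and the mode string "chunks" for the default chunk mode are assumptions for illustration only.

```python
import re

# Hypothetical cue words suggesting a broad, topic-level question.
BROAD_CUES = ("summarize", "summary", "overview", "outlook", "overall")

# Matches tokens that look like file names, e.g. "2024_MSFT_10K.pdf".
FILENAME_PATTERN = re.compile(r"\b[\w-]+\.(pdf|pptx?|docx?|txt)\b", re.IGNORECASE)


def route_retrieval_mode(query: str) -> str:
    """Rule-based stand-in for the LLM agent behind auto_routed mode."""
    if FILENAME_PATTERN.search(query):
        # A specific file is named: fetch it by its metadata.
        return "files_via_metadata"
    if any(cue in query.lower() for cue in BROAD_CUES):
        # A general, topic-level question: fetch whole relevant files by content.
        return "files_via_content"
    # Otherwise a targeted factual question: plain chunk retrieval.
    return "chunks"


for q in [
    "What does the 2024_MSFT_10K.pdf file say about the financial outlook of MSFT?",
    "What is the financial outlook of MSFT?",
    "Where is Microsoft headquartered?",
]:
    print(route_retrieval_mode(q), "<-", q)
```

Rules like these break down quickly on real user phrasing, which is the motivation for delegating the decision to an LLM.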
 Agentically auto-routed retrieval
Using this mode is just as simple as using any of the other modes:
```python
nodes = financial_index.as_retriever(retrieval_mode="auto_routed").retrieve(
    "Where is Microsoft headquartered?"
)
# Inspect which retrieval mode was chosen for this query
print(nodes[0].metadata["retrieval_mode"])
```

Expanding beyond a single knowledge base
With auto_routed mode, we have a lightweight agentic system that can competently answer a variety of questions. However, its search space is restricted: it can only retrieve data that has been ingested into a single index.
If all of your documents share the same format (e.g. they’re all SEC 10-K filings), it may actually be appropriate to ingest them all through a single index, since its parsing and chunking configuration can be highly optimized for that homogeneous set of documents. In most cases, however, your overall knowledge base will span a wide variety of document types: SEC filings, meeting notes, customer service requests, and so on. These other formats call for separate indices whose parsing and chunking settings are optimized for each subset of documents.
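One simple way to act on this is to partition incoming files by type before upload, so that each group goes to an index configured for it. The sketch below keys on file extension, which is a simplifying assumption: SEC filings and meeting notes may both be PDFs and still deserve separate indices. The index names and the extension mapping are hypothetical.

```python
import os
from collections import defaultdict

# Hypothetical mapping from file extension to the name of the index that should ingest it.
EXTENSION_TO_INDEX = {
    ".pdf": "Financial Reports",
    ".ppt": "Slides",
    ".pptx": "Slides",
}


def partition_by_index(
    file_names: list[str], default_index: str = "Unsorted"
) -> dict[str, list[str]]:
    """Group file names by the index they should be uploaded to."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in file_names:
        extension = os.path.splitext(name)[1].lower()
        groups[EXTENSION_TO_INDEX.get(extension, default_index)].append(name)
    return dict(groups)


print(partition_by_index(["2024_MSFT_10K.pdf", "q4_kickoff.pptx", "notes.txt"]))
```

Each resulting group could then be uploaded with upload_file to its own LlamaCloudIndex, in the same way as the ingestion snippets in this post.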
Let’s say you have your SEC filings in the financial_index from the prior code snippets, and additionally have created a slides_index that has ingested .ppt PowerPoint files from a folder of slide shows.
```python
import os

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

slides_index = LlamaCloudIndex.from_documents(
    documents=[],
    name="Slides",
    project_name=project_name,
)

slides_directory = "./data/slides"
for file_name in os.listdir(slides_directory):
    file_path = os.path.join(slides_directory, file_name)
    slides_index.upload_file(file_path, wait_for_ingestion=False)

# Block until all slide decks have been ingested
slides_index.wait_for_completion()
```

Your application may now have users asking questions about both the SEC filings you’ve ingested in financial_index and the meeting slide shows you’ve ingested in slides_index.
This is where our Composite Retrieval API shines! It provides a single retrieval endpoint for fetching relevant content from many indices, not just one. It also exposes a lightweight agent layer that lets clients specify a name and description for each sub-index. These parameters help you control how the agent routes a question among the indices you’ve added to your composite retriever.
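To see why the descriptions matter, here is a toy version of description-based routing. The composite retriever delegates this choice to an LLM; the word-overlap scorer below is a deliberately crude, hypothetical substitute, and the index names and descriptions simply mirror the ones used in the next snippet. Whatever the scorer, the routing decision can only be as good as the descriptions it reads.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set, used as a crude stand-in for semantic understanding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route_to_indices(
    query: str, descriptions: dict[str, str], max_indices: int = 1
) -> list[str]:
    """Pick the sub-indices whose descriptions share the most words with the query."""
    scores = {name: len(tokens(query) & tokens(desc)) for name, desc in descriptions.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Drop indices with no overlap at all; a caller could fall back to querying every index.
    return [name for name in ranked[:max_indices] if scores[name] > 0]


sub_indices = {
    "Slides": "Information source for slide shows presented during team meetings",
    "Financial Reports": "Information source for company financial reports",
}
print(route_to_indices("What did the team meetings slide shows say about the roadmap?", sub_indices))
print(route_to_indices("Summarize the company financial reports for 2024", sub_indices))
```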
```python
from llama_cloud import CompositeRetrievalMode
from llama_index.indices.managed.llama_cloud import LlamaCloudCompositeRetriever

composite_retriever = LlamaCloudCompositeRetriever(
    name="My App Retriever",
    project_name=project_name,
    create_if_not_exists=True,
    # ROUTED: let the agent decide which sub-indices to query
    mode=CompositeRetrievalMode.ROUTED,
    rerank_top_n=5,
)

# The descriptions are what the routing agent reads when choosing sub-indices
composite_retriever.add_index(
    slides_index,
    description="Information source for slide shows presented during team meetings",
)
composite_retriever.add_index(
    financial_index,
    description="Information source for company financial reports",
)

nodes = composite_retriever.retrieve(
    "What was the key feature of the highest revenue product in 2024 Q4?"
)
```

Piecing Together a Knowledge Agent
Now that we know how to use agents at both the single-index and multi-index levels, we can put together one system that performs agentic retrieval at every step, using an LLM to optimize each layer of the search path.
The system works like this:
- At the top layer, the composite retriever uses LLM-based classification to decide which sub-index (or indices) are relevant for the given query.
- At the sub-index level, the auto_routed retrieval mode determines the most appropriate retrieval method (e.g., chunk, files_via_metadata, or files_via_content) for the query.
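The two-layer flow can be sketched end to end in a few lines. Here both decisions are made by simple hypothetical rules rather than LLM calls, and the index names, descriptions, and mode strings are assumptions; the point is the shape of the pipeline: first choose the sub-indices, then choose a retrieval mode within each one.

```python
import re

# Hypothetical sub-index descriptions read by the top layer.
SUB_INDICES = {
    "Financial Reports": "Detailed financial reports, including SEC filings and revenue analysis",
    "Slides": "Slide shows from team meetings, covering product updates and project insights",
}

FILENAME_PATTERN = re.compile(r"\b[\w-]+\.(pdf|pptx?)\b", re.IGNORECASE)


def tokens(text: str) -> set[str]:
    """Lowercase word set used for crude matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def choose_indices(query: str) -> list[str]:
    """Top layer: keep every sub-index whose description overlaps the query."""
    return [name for name, desc in SUB_INDICES.items() if tokens(query) & tokens(desc)]


def choose_mode(query: str) -> str:
    """Sub-index layer: rule-based stand-in for auto_routed mode selection."""
    if FILENAME_PATTERN.search(query):
        return "files_via_metadata"
    if "summarize" in query.lower() or "overview" in query.lower():
        return "files_via_content"
    return "chunks"


def plan_retrieval(query: str) -> list[tuple[str, str]]:
    """Return (sub-index, retrieval mode) pairs describing how the query would be served."""
    return [(index_name, choose_mode(query)) for index_name in choose_indices(query)]


print(plan_retrieval("What does the Q4 2024 financial report say about revenue growth?"))
```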
 Retrieval routed agentically across multiple auto-routed indexes
Here’s the code implementation:
```python
from llama_cloud import CompositeRetrievalMode
from llama_index.indices.managed.llama_cloud import LlamaCloudCompositeRetriever

composite_retriever = LlamaCloudCompositeRetriever(
    name="Knowledge Agent",
    project_name=project_name,
    create_if_not_exists=True,
    mode=CompositeRetrievalMode.ROUTED,
    rerank_top_n=5,
)

composite_retriever.add_index(
    financial_index,
    description="Detailed financial reports, including SEC filings and revenue analysis",
)
composite_retriever.add_index(
    slides_index,
    description="Slide shows from team meetings, covering product updates and project insights",
)

query = "What does the Q4 2024 financial report say about revenue growth?"
nodes = composite_retriever.retrieve(query)
for node in nodes:
    print(f"Retrieved from: {node.metadata['retrieval_mode']} - {node.text}")
```

This setup ensures that retrieval decisions are intelligently routed at each layer, using LLM-based classification to handle complex queries across multiple indices and retrieval modes. The result is a fully agentic retrieval system capable of adapting dynamically to diverse user queries.
Naive RAG is dead, agentic retrieval is the future
Agents have become an essential part of modern applications. For these agents to operate effectively and autonomously, they need precise and relevant context at their fingertips. This is why sophisticated data retrieval is crucial for any agent-based system. LlamaCloud serves as the backbone for these intelligent systems, providing reliable, accurate context when and where agents need it most.
What’s next? You can learn more about LlamaCloud or sign up today and get started with 10,000 credits for free!