Genie is Uber’s internal on-call copilot, designed to provide real-time support for thousands of queries across multiple help channels in Slack®. It enables users to receive prompt responses with proper citations from Uber’s internal documentation. It also improves the productivity of on-call engineers and subject matter experts (SMEs) by reducing the effort required to address common, ad-hoc queries.
While Genie streamlines the development of an LLM-powered on-call Slack bot, ensuring the accuracy and relevance of its responses remains a significant challenge. This blog details our efforts to improve Genie’s answer quality to near-human precision, allowing SMEs to rely on it for most queries without concern over potential misinformation in the engineering security and privacy domain.
Motivation
Genie has revolutionized on-call assistance within Uber by enabling domain teams to deploy an LLM-powered Slack bot overnight using a configurable framework. This framework seamlessly integrates with nearly all internal knowledge sources, including the engineering wiki, Terrablob PDFs, Google Docs™, and custom documents. Additionally, it supports the full RAG (Retrieval-Augmented Generation) pipeline, covering document loading, processing, vector storage, retrieval, and answer generation.
While this system is advanced from a machine learning infrastructure perspective, delivering highly precise and relevant responses to domain-specific queries remains an area for improvement. To assess whether the Genie-powered on-call bot was ready for deployment across all Slack channels related to engineering security and privacy, SMEs curated a golden set of 100+ test queries based on their extensive experience handling domain engineers’ inquiries.
When Genie was integrated with Uber’s repository of 40+ engineering security and privacy policy documents, stored as PDFs, and tested against the golden test set, the results revealed significant gaps in accuracy. SMEs determined that response quality didn’t meet the standards required for a broader deployment. Many answers were incomplete or inaccurate, or failed to retrieve the relevant details from the knowledge base. It was clear that significant improvements were needed to ensure response accuracy and reliability before the on-call copilot could be rolled out across critical security and privacy Slack channels.
In this blog, we share our journey of improving the quality of responses by increasing the percentage of acceptable answers by a relative 27% and reducing incorrect advice by a relative 60% through our transition from a traditional RAG architecture to an enhanced agentic RAG approach.
Architecture
RAG, introduced by Lewis et al. in 2020, transformed LLMs’ effectiveness in domain-specific NLP tasks. However, recent studies highlight retrieval challenges, particularly in Q&A setups where ambiguous or context-lacking queries hinder accurate document retrieval. When irrelevant content is retrieved, LLMs struggle to generate correct answers, often leading to errors or hallucinations.
To tackle this issue, we adopted an agentic RAG approach: we introduce LLM-powered agents that perform several pre-retrieval and post-processing steps to make retrieval and answer generation more accurate and relevant. Figure 1 shows the agentic RAG workflow, including the enriched document processing. We refer to this workflow as EAg-RAG (Enhanced Agentic RAG).

Like any RAG architecture, EAg-RAG consists of two key components: offline document processing and near-real-time answer generation. We’ve introduced several improvements in both, as discussed in the following sections.
Enriched Document Processing
“The quality of a model depends on the quality of its assumptions and the quality of its data.”
– Judea Pearl
In today’s rapidly evolving and growing landscape of LLM-powered applications, this fundamental principle is often overlooked. As generative AI continues to advance at an unprecedented pace, ensuring robust assumptions and high-quality data remains essential for building reliable and effective AI systems.
As part of our efforts to improve the performance of the Genie chatbot, we evaluated the quality of processed documents before converting them into embedding vectors and storing them in the vector database as part of the RAG pipeline. During this assessment, we discovered that existing PDF loaders often fail to correctly capture structured text and formatting (such as bullet points and tables). This issue negatively impacts downstream processes like chunking, embedding, and retrieval.

For example, many of our policy documents contain complex tables spanning more than five pages, including nested table cells. When processed using traditional PDF loaders (such as SimpleDirectoryLoader from LlamaIndex™ and PyPDFLoader from LangChain™), the extracted text loses its original formatting. As a result, many table cells become isolated, stand-alone text, disconnected from their row and column contexts. This fragmentation poses challenges for chunking, as the model may split a table into multiple chunks incorrectly. Additionally, during retrieval, the lack of contextual information often prevents semantic search from fetching the correct cell values.

We experimented with several state-of-the-art PDF loaders, including PdfPlumber, PyMuPDF®, and LlamaIndex LlamaParse. While some of these tools provided better-formatted extractions, we were unable to find a universal solution that worked across all of our policy documents.
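To see the problem concretely, here’s a minimal sketch using LangChain’s PyPDFLoader (the file path is a placeholder); cells from multi-page tables typically come back as isolated lines of text:

```python
from langchain_community.document_loaders import PyPDFLoader

# "policy.pdf" is a placeholder path for one of the policy documents.
pages = PyPDFLoader("policy.pdf").load()

# Table cells tend to appear as stand-alone lines, stripped of the row and
# column headers that give them meaning.
print(pages[0].page_content[:500])
```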
To address this challenge, we transitioned from PDFs to Google Docs, using HTML formatting for more accurate text extraction. Additionally, Google Docs offers built-in access control, which is crucial for security-sensitive applications. Access control metadata can be indexed and used during answer generation to prevent unauthorized access to restricted information. However, even with HTML formatting, traditional document loaders such as html2text, and even the state-of-the-art Markdownify from LangChain, left plenty of room for improvement when extracting Google Docs content as markdown text, especially when it came to formatting tables correctly.
To provide an example, we’ve generated a mock table with a nested structure, as shown in Figure 2.

After parsing this Google document into markdown-formatted text with html2text, we chunked it using MarkdownTextSplitter. When we did this, the row and column context of the table cells was often missing, as shown in Figure 3.


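Below is a rough sketch of that baseline pipeline, assuming the variable `html` holds the exported HTML of the Google Doc; the chunk sizes are illustrative:

```python
import html2text
from langchain_text_splitters import MarkdownTextSplitter

converter = html2text.HTML2Text()
converter.body_width = 0              # avoid hard-wrapping long table rows
markdown = converter.handle(html)     # `html`: exported HTML of the Google Doc

splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([markdown])
# Chunk boundaries frequently fall inside the table, separating cells from the
# row and column headers they belong to.
```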
To address this issue, we built a custom Google document loader using the Google® Python API, extracting paragraphs, tables, and the table of contents recursively. For tables and structured text like bullet points, we integrated an LLM-powered enrichment process, prompting the LLM to convert extracted table contents into markdown-formatted tables. Additionally, we enriched the metadata with identifiers to distinguish table-containing text chunks, ensuring they remain intact during chunking. We also added a two-line summary and a few keywords for each table so that semantic search could retrieve the corresponding chunk more reliably. Figure 4 shows the improvements achieved through these enhancements.

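The sketch below gives a flavor of such a loader, using the Google Docs API and any LangChain chat model; the class name, prompt wording, and metadata keys are illustrative, not Uber’s internal implementation:

```python
from googleapiclient.discovery import build
from langchain_core.documents import Document


class GoogleDocLoader:
    """Walks a Google Doc's structural elements and enriches tables with an LLM."""

    def __init__(self, credentials, llm=None):
        self.service = build("docs", "v1", credentials=credentials)
        self.llm = llm  # any LangChain chat model; optional in this sketch

    def load(self, doc_id):
        doc = self.service.documents().get(documentId=doc_id).execute()
        text = self._extract(doc.get("body", {}).get("content", []))
        metadata = {"title": doc.get("title", ""), "doc_id": doc_id}
        return [Document(page_content=text, metadata=metadata)]

    def _extract(self, elements):
        # Recursively collect paragraphs, tables, and the table of contents.
        text = ""
        for el in elements:
            if "paragraph" in el:
                for run in el["paragraph"].get("elements", []):
                    text += run.get("textRun", {}).get("content", "")
            elif "table" in el:
                text += self._table_to_markdown(el["table"])
            elif "tableOfContents" in el:
                text += self._extract(el["tableOfContents"].get("content", []))
        return text

    def _table_to_markdown(self, table):
        # Flatten cells row by row, then ask the LLM to rebuild a markdown table
        # with a two-line summary and a few keywords appended.
        rows = []
        for row in table.get("tableRows", []):
            cells = [self._extract(cell.get("content", [])).strip()
                     for cell in row.get("tableCells", [])]
            rows.append(" | ".join(cells))
        raw = "\n".join(rows)
        if self.llm is None:
            return raw
        prompt = ("Rewrite the following table as a markdown table, then add a "
                  "two-line summary and a few keywords:\n\n" + raw)
        return self.llm.invoke(prompt).content
```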
The same approach can also be applied to other text that requires additional formatting and structure before being converted to embeddings.
To further enhance chatbot accuracy, we focused on improving text extraction and formatting as well as enriching the metadata. In addition to standard metadata attributes such as title, URL, and IDs, we’ve introduced several custom attributes. Leveraging the remarkable capabilities of LLMs to summarize large documents, we incorporated document summaries, a set of FAQs, and relevant keywords into the metadata. The FAQs and keywords were added after the chunking step, ensuring they dynamically align with specific chunks, whereas the document summary remains consistent across all chunks originating from the same document.
These enriched metadata serve two purposes. First, certain metadata attributes are used in the precursor or post-processing steps of the semantic retrieval process to refine the extracted context, making it more relevant and clear for the answer-generating LLM. Second, attributes such as FAQs and keywords are directly employed in the retrieval process itself to enhance the accuracy of the retrieval engine.
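A hedged sketch of this enrichment step is shown below; the choice of ChatOpenAI, the prompt wording, and the metadata attribute names are assumptions for illustration:

```python
from langchain_openai import ChatOpenAI  # any chat model works here

llm = ChatOpenAI(model="gpt-4o")         # assumed model; swap in your own


def enrich_chunks(doc_text, chunks):
    # One summary per document, shared by every chunk of that document.
    summary = llm.invoke(
        "Summarize this policy document in 3-4 sentences:\n\n" + doc_text
    ).content
    for chunk in chunks:                 # chunks are LangChain Documents
        # FAQs and keywords are generated after chunking, per chunk.
        enrich = llm.invoke(
            "For the passage below, list 3 FAQs it answers and 5 keywords.\n\n"
            + chunk.page_content
        ).content
        chunk.metadata.update({
            "summary": summary,
            "faqs_and_keywords": enrich,
        })
    return chunks
```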
After enriching and chunking the extracted documents, we index them and generate embeddings for each chunk, storing them in a vector store using the pipeline configurations detailed in this blog. Additionally, we save artifacts like document lists (titles and summaries) and FAQs from the enrichment process in an offline feature store for later use in answer generation.
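As an illustration, with a FAISS store and OpenAI embeddings standing in for the configured vector database and embedding model, the indexing and artifact step might look like this:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# `chunks` are the enriched Documents from the sketches above.
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Artifact reused later by the pre-processing agents: document titles mapped
# to their summaries, saved alongside the FAQs in the offline feature store.
doc_list = {c.metadata["title"]: c.metadata["summary"] for c in chunks}
```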
Agentic RAG Answer Generation
Traditionally, answer generation involves two steps: retrieving semantically relevant document chunks via vector search and passing them, along with the user’s query and instructions, to an LLM. However, in domain-specific cases like Uber’s internal security and privacy channels, document chunks often have subtle distinctions, not only within the same policy document but also across multiple documents. These distinctions can include variations in data retention policies, data classification, and sharing protocols across different personas and geographies. Simple semantic similarity can lead to retrieving irrelevant context, reducing the accuracy of the final answer.
To address this, we introduced LLM-powered agents in the pre-retrieval and post-processing steps to improve context relevance and enhance extracted content before answer generation. This agentic RAG approach has significantly improved answer quality and opened avenues for further targeted enhancements.
In the pre-processing step, we use two agents: Query Optimizer and Source Identifier. Query Optimizer refines the query when it lacks context or is ambiguous. It also breaks down complex queries into multiple simpler queries for better retrieval. Source Identifier then processes the optimized query to narrow down the subset of policy documents most likely to contain relevant answers.
To achieve this, both agents use the document list artifact (titles, summaries, and FAQs) fetched from the offline store as context. Additionally, we provide few-shot examples to improve in-context learning for the Source Identifier. The output of the pre-processing step is an optimized query and a subset of document titles, which are then used to restrict the retrieval search within the identified document set.
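A minimal sketch of the two pre-processing agents follows; the prompts, the choice of model, and the `preprocess` helper are illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")  # assumed model

optimizer_prompt = ChatPromptTemplate.from_template(
    "Rewrite the user query so it is self-contained and unambiguous. "
    "If it contains several questions, split it into simpler ones.\n"
    "Query: {query}"
)

source_prompt = ChatPromptTemplate.from_template(
    "Given these documents (title: summary):\n{doc_list}\n\n"
    "Examples:\n{few_shot_examples}\n\n"
    "Return the titles most likely to answer: {query}"
)


def preprocess(query, doc_list, few_shot_examples):
    # Query Optimizer: refine and, if needed, decompose the query.
    optimized = (optimizer_prompt | llm).invoke({"query": query}).content
    # Source Identifier: narrow down candidate documents for retrieval.
    sources = (source_prompt | llm).invoke({
        "query": optimized,
        "doc_list": doc_list,
        "few_shot_examples": few_shot_examples,
    }).content
    return optimized, sources
```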
To further refine retrieval, we introduced an additional BM25-based retriever alongside traditional vector search. This retriever fetches the most relevant document chunks using enriched metadata, which includes summaries, FAQs, and keywords for each chunk. The final retrieval output is the union of results from the vector search and the BM25 retriever, which is then passed to the post-processing step.
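The sketch below shows one way to express this hybrid retrieval in LangChain, reusing the enriched chunks and vector store from the earlier sketches; the metadata keys are the illustrative ones introduced above:

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

# Index the enriched metadata (summary, FAQs, keywords) alongside the chunk
# text so keyword matches on those fields also surface the chunk.
bm25_docs = [
    Document(
        page_content=c.page_content + "\n" + c.metadata.get("summary", "")
        + "\n" + c.metadata.get("faqs_and_keywords", ""),
        metadata=c.metadata,
    )
    for c in chunks
]
bm25 = BM25Retriever.from_documents(bm25_docs)
bm25.k = 5
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})


def retrieve(query):
    # Union of keyword (BM25) and semantic (vector) hits; de-duplication and
    # ordering happen in the post-processing step.
    return bm25.invoke(query) + vector_retriever.invoke(query)
```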
The Post-Processor Agent performs two key tasks: de-duplication of retrieved document chunks and structuring the context based on the positional order of chunks within the original documents.
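A simple version of this post-processing might look like the following, assuming each chunk carries illustrative `doc_id` and `chunk_index` metadata added during document processing:

```python
def postprocess(chunks):
    # De-duplicate chunks retrieved by both BM25 and vector search.
    seen, unique = set(), []
    for doc in chunks:
        key = (doc.metadata.get("doc_id"), doc.metadata.get("chunk_index"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    # Restore the order chunks had in their source documents so the context
    # reads coherently for the answer-generating LLM.
    unique.sort(key=lambda d: (d.metadata.get("doc_id", ""),
                               d.metadata.get("chunk_index", 0)))
    return "\n\n".join(d.page_content for d in unique)
```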
Finally, the original user query, optimized auxiliary queries, and post-processed retrieved context are passed to the answer-generating LLM, along with specific instructions for answer construction. The generated answer is then shared with the user through the Slack® interface.
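As a rough sketch, the final call could be composed like this, with the instruction text being illustrative rather than the exact production prompt:

```python
answer_prompt = (
    "Answer the user's question using only the policy context below, and "
    "cite the source documents you relied on.\n\n"
    "Original question: {query}\n"
    "Optimized queries: {optimized}\n"
    "Context:\n{context}"
)


def build_answer(llm, query, optimized, context):
    # `llm` is any LangChain chat model; the response is posted back to Slack.
    return llm.invoke(answer_prompt.format(
        query=query, optimized=optimized, context=context)).content
```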
Challenges
Improving accuracy in a RAG-powered system typically involves refining prompt instructions, adjusting retrieval configurations, and using advanced PDF parsers like LlamaParse instead of basic loaders. However, our use case presented two key challenges that led us to adopt the enhanced agentic RAG architecture:
- High SME involvement and slow evaluation. While Genie’s modular framework allowed easy experimentation, assessing improvements required significant SME bandwidth, often taking weeks.
- Marginal gains and plateauing accuracy. Many experiments yielded only slight accuracy improvements before plateauing, with no clear path for further enhancement.
At the start of development, overcoming these challenges was critical to ensuring Genie could reliably support security and privacy teams without risking inaccurate guidance. To address them, we introduced two things: automated evaluation with generative AI and the agentic RAG framework. The automated evaluation reduced experiment evaluation time from weeks to minutes, enabling faster iterations and more effective directional experimentation. Unlike traditional RAG, the agentic RAG approach allows seamless integration of different agents, making it easier to test and assess incremental improvements quickly.
LLM-as-Judge for Automation of Batch Evaluation
In recent years, the LLM-as-a-Judge framework has been widely adopted to automate evaluation, identify improvement areas, and enhance performance. As shown in Figure 5 and detailed in the referenced paper, we use an LLM to assess chatbot responses (x) within a given context (C), producing structured scores, correctness labels, and AI-generated reasoning and feedback.
We apply this approach to automate bot response evaluation, ensuring alignment with SME quality standards (Figure 6). The process consists of three stages:
- One-time manual SME review. SMEs provide high-quality responses or feedback on chatbot-generated answers (SME responses).
- Batch execution. The chatbot generates responses based on its current version.
- LLM evaluation. The LLM-as-Judge module evaluates chatbot responses using the user query, SME response, and evaluation instructions as context (C), along with additional content retrieved from source documents via the latest RAG pipeline.
Integrating these additional documents enhances the LLM’s domain awareness, improving evaluation reliability—particularly for domain-specific complex topics like engineering security and privacy policies at Uber.

In our use case, the LLM-as-Judge module scores responses on a 0-5 scale, with 5 being the highest quality. It also provides reasoning for its evaluations, enabling us to incorporate feedback into future experiments.
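The sketch below shows how such a judge can be implemented with structured output; the rubric wording, field names, and choice of ChatOpenAI are illustrative assumptions:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Judgement(BaseModel):
    score: int = Field(description="0-5, where 5 matches SME answer quality")
    correct: bool = Field(description="Is the bot response factually correct?")
    reasoning: str = Field(description="Short justification and feedback")


judge = ChatOpenAI(model="gpt-4o").with_structured_output(Judgement)


def evaluate(query, sme_response, bot_response, retrieved_context):
    prompt = (
        "You are grading an on-call chatbot answer against an SME answer.\n"
        f"Question: {query}\n"
        f"SME answer: {sme_response}\n"
        f"Retrieved policy context: {retrieved_context}\n"
        f"Bot answer: {bot_response}\n"
        "Score the bot answer from 0 to 5 and explain your reasoning."
    )
    return judge.invoke(prompt)
```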
Developing Agentic RAG Using LangChain and LangGraph
We built most components of the agentic RAG framework using Langfx, Uber’s internal LangChain-based service within Michelangelo. For agent development and workflow orchestration, we used LangChain LangGraph™, a scalable yet developer-friendly framework for agentic AI workflows. While our current implementation follows a sequential flow (Figure 1), integrating with LangGraph allows for future expansion into more complex agentic frameworks.
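A minimal sketch of wiring the agents into a sequential LangGraph workflow is shown below; the node bodies are placeholders for the agent calls sketched earlier:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class BotState(TypedDict, total=False):
    query: str
    optimized_query: str
    sources: list
    context: str
    answer: str


# Each node wraps one agent; the return values here are placeholders.
def optimize_query(state: BotState):
    return {"optimized_query": state["query"]}   # Query Optimizer agent


def identify_sources(state: BotState):
    return {"sources": []}                       # Source Identifier agent


def retrieve_chunks(state: BotState):
    return {"context": ""}                       # BM25 + vector retrieval


def post_process(state: BotState):
    return {"context": state["context"]}         # Post-Processor agent


def generate_answer(state: BotState):
    return {"answer": ""}                        # answer-generating LLM


graph = StateGraph(BotState)
for name, node in [("optimize_query", optimize_query),
                   ("identify_sources", identify_sources),
                   ("retrieve_chunks", retrieve_chunks),
                   ("post_process", post_process),
                   ("generate_answer", generate_answer)]:
    graph.add_node(name, node)
graph.add_edge(START, "optimize_query")
graph.add_edge("optimize_query", "identify_sources")
graph.add_edge("identify_sources", "retrieve_chunks")
graph.add_edge("retrieve_chunks", "post_process")
graph.add_edge("post_process", "generate_answer")
graph.add_edge("generate_answer", END)
app = graph.compile()

# Example: app.invoke({"query": "How long can driver location data be retained?"})
```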
Use Cases at Uber
The EAg-RAG framework was tested with the on-call copilot Genie in the engineering security and privacy domains, where it showed a significant improvement in the accuracy and relevancy of answers on the golden test set. With these improvements, the copilot bot can now scale across multiple security and privacy help channels to provide real-time responses to common user queries. This has led to a measurable reduction in the support load for on-call engineers and SMEs, allowing them to focus on more complex and high-value tasks, ultimately increasing overall productivity for Uber Engineering. Additionally, by showing that better-quality source documentation enables improved bot performance, this development encourages teams to maintain more accurate and useful internal docs. These enhancements aren’t limited to the security and privacy domain. They’ve been designed as configurable components within the Michelangelo Genie framework, making them easily adoptable by other domain teams across Uber.
Next Steps
This blog represents an early step in Uber’s agentic AI evolution. As requirements evolve, more complex architectures may be needed. For example, our custom Google Docs plugin and document enrichment currently support only textual content; the same approach can be extended to extract and enrich multi-modal content, including images. In the answer generation step, instead of single-step query optimization, an iterative Chain-of-RAG approach could enhance performance, especially for multi-hop reasoning queries. Additionally, a self-critique agent could be introduced after answer generation to dynamically refine responses and further reduce hallucinations. Further, to stay flexible across both simple and complex queries, we’d like to expose many of these features as tools, allowing LLM-powered agents to choose which tools to use based on the type and complexity of each query. With the development described in this blog, we aim to establish a foundation for building agentic RAG systems for Q&A automation at Uber, paving the way for future advancements.
By leveraging enriched document processing and the Agentic RAG framework, we’ve shown how the EAg-RAG architecture significantly improves answer quality. As we roll out these improvements across multiple help channels within Uber’s internal Slack, we aim to observe how better answers help users get accurate guidance faster and provide peace of mind to SMEs and on-call engineers by resolving common queries through the on-call copilot.
Cover photo attribution: Clicked by Arnab Chakraborty.
Google® and Google Docs™ are trademarks of Google LLC and this blog post is not endorsed by or affiliated with Google in any way.
LangChain™ and LangGraph™ are trademarks of Langchain, Inc.
LlamaIndex™ is a trademark of LlamaIndex, Inc.
PyMuPDF®, Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software Inc.
Slack® is a registered trademark and service mark of Slack Technologies, Inc.