In this post, we explore the latest Agentic metrics introduced in the Azure AI Evaluation library, a Python library designed to assess generative AI systems with both traditional NLP metrics (like BLEU and ROUGE) and AI-assisted evaluators (such as relevance, coherence, and safety). With the rise of agentic systems, the library now includes purpose-built evaluators for complex agent workflows. We’ll focus on three key metrics: Task Adherence, Tool Call Accuracy, and Intent Resolution—each capturing a critical dimension of an agent’s performance.
To see these evaluation strategies in practice, check out AgenticEvals, a simple public repo that showcases these metrics in action, using Semantic Kernel for the agentic/orchestration layer and the Azure AI Evaluation library for evaluation.
Generative AI systems don’t fit neatly into the evaluation mold of traditional machine learning. In classical ML tasks such as classification or regression, we rely on objective metrics – accuracy, precision, recall, F1-score, etc. – which compare predictions to a single ground truth. Generative AI, by contrast, produces open-ended outputs (free-form text, code, images) where there may be many acceptable answers and quality is subjective.
Moreover, agentic systems add yet another layer of complexity. They don’t just generate output – they reason over tasks, break them into subtasks, invoke tools, make decisions, and adapt. We need to evaluate not only the final answer but the entire process. This includes how well the agent understands the user’s goal, whether it chooses the right tools, and if it follows the intended path to completion.
The Azure AI Evaluation library now supports metrics tailored for evaluating agentic behaviors. Let’s explore the three key metrics that help developers and researchers build more reliable AI agents.
Task Adherence – Is the Agent Answering the Right Question?
Task Adherence evaluates how well the agent's final response satisfies the original user request. It goes beyond surface-level correctness and looks at whether the response is relevant, complete, and aligned with the user's expectations. It uses an LLM-based evaluator to score adherence based on a natural-language prompt, so open-ended answers are assessed with contextual understanding. For example, if a user asks for a list of budget hotels in London, a response that lists luxury resorts would score poorly even though those resorts are technically located in London.
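The snippet below is a minimal sketch of scoring a single interaction with the library's TaskAdherenceEvaluator (a preview API at the time of writing); the endpoint, key, deployment name, and the example query/response are placeholders.

```python
from azure.ai.evaluation import AzureOpenAIModelConfiguration, TaskAdherenceEvaluator

# Configuration for the LLM judge (placeholder values).
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="gpt-4o",
)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

# Score how well the agent's final answer satisfies the original request.
result = task_adherence(
    query="Find budget hotels in central London for next weekend.",
    response="Here are three budget-friendly hotels in central London: ...",
)
print(result)  # e.g. a score plus a natural-language reason from the LLM judge
```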
Tool Call Accuracy – Is the Agent Using Tools Correctly?
This metric focuses on the agent's procedural accuracy when invoking tools. It examines whether the right tool was selected for each step and whether the inputs to that tool were appropriate and correctly formatted. The evaluator considers the context of the task and determines whether the agent's tool interactions were logically consistent and goal-aligned. This helps identify subtle flaws, such as correct answers that were reached via poor tool usage or unnecessary API calls.
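Here is a hedged sketch of scoring Tool Call Accuracy on a single trace, reusing the model_config from the previous snippet. The tool_definitions and tool_calls payloads are illustrative; in practice they should match the schema your agent framework emits.

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

# The tools the agent had available, and the calls it actually made (hypothetical).
tool_definitions = [
    {
        "name": "search_hotels",
        "description": "Search hotels by city and maximum nightly price.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "max_price": {"type": "number"},
            },
        },
    }
]

tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_1",
        "name": "search_hotels",
        "arguments": {"city": "London", "max_price": 150},
    }
]

result = tool_call_accuracy(
    query="Find budget hotels in central London.",
    tool_calls=tool_calls,
    tool_definitions=tool_definitions,
)
print(result)  # e.g. a score with the evaluator's reasoning about each call
```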
Intent Resolution – Did the Agent Understand the User’s Goal?
Intent Resolution assesses whether the agent’s initial actions reflect a correct understanding of the user’s underlying need. It evaluates the alignment between the user’s input and the agent’s plan or early decisions. A high score means the agent correctly inferred what the user meant and structured its response accordingly. This metric is essential for diagnosing failure cases where the output seems fluent but addresses the wrong objective – a common issue in multi-step or ambiguous tasks.
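A similar sketch for Intent Resolution, again reusing model_config; the customer-support query and response are made up for illustration.

```python
from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

# Did the agent correctly infer what the user actually needed?
result = intent_resolution(
    query="My order hasn't arrived yet, what can I do?",
    response=(
        "I checked your order: it shipped yesterday and should arrive by Friday. "
        "If it doesn't, I can open a replacement request for you."
    ),
)
print(result)  # e.g. a score with an explanation of how well the intent was resolved
```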
Running a Full Evaluation with Azure AI Evaluation
To evaluate an agent's performance at scale, the Azure AI Evaluation library supports batch evaluation via the evaluate() function. It allows you to run multiple evaluators, such as Task Adherence, Tool Call Accuracy, and Intent Resolution, over a structured dataset.
You can specify a JSONL dataset representing interaction logs, and the library will apply each evaluator to the corresponding fields. The results can be exported directly to an Azure AI Foundry Project Evaluation workspace, enabling seamless tracking, visualization, and comparison of evaluation runs.
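Below is a sketch of a batch run, assuming the three evaluators instantiated above and a hypothetical agent_traces.jsonl whose fields match the evaluators' inputs; the azure_ai_project values are placeholders for your own Azure AI Foundry project.

```python
from azure.ai.evaluation import evaluate

results = evaluate(
    data="agent_traces.jsonl",  # one interaction log per line
    evaluators={
        "task_adherence": task_adherence,
        "tool_call_accuracy": tool_call_accuracy,
        "intent_resolution": intent_resolution,
    },
    # Optional: log the run to an Azure AI Foundry project for tracking and comparison.
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>",
    },
    output_path="./agent_eval_results.json",
)
print(results["metrics"])  # aggregate scores across all rows
```

Each line of the JSONL file maps its fields (query, response, tool_calls, and so on) to the evaluator inputs, and the returned dictionary includes aggregate metrics alongside per-row results.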
As AI agents become more autonomous and embedded in real-world workflows, robust evaluation is key to ensuring they’re acting responsibly and effectively. The new Agentic metrics in Azure AI Evaluation give developers the tools to systematically assess agent behavior—not just outputs. Task Adherence, Tool Call Accuracy, and Intent Resolution offer a multi-dimensional view of performance that aligns with how modern agents operate.
To try these metrics in action, check out AgenticEvals on GitHub—a simple repo demonstrating how to evaluate agent traces using these metrics.
Call to Action: Start evaluating your own agents using the Azure AI Evaluation library. The metrics are easy to integrate and can surface meaningful insights into your agent’s behavior. With the right evaluation tools, we can build more transparent, effective, and trustworthy AI systems.