April 4th, 2025

Unlocking the Power of Agentic Applications: New Evaluation Metrics for Quality and Safety

Chang Liu
Senior Product Manager

We’re excited to announce the launch of new evaluation metrics in Azure AI Foundry, designed to help developers assess both the quality and safety of agentic applications. These metrics provide deeper insights into the performance of agent workflows, enabling teams to ensure transparency and optimize their AI systems—whether it’s for evaluating task adherence, tool call accuracy, or risk and safety aspects like code vulnerability and ungrounded attributes.

Agentic applications can be powerful productivity assistants and enablers. They can plan, execute actions, and interact with human stakeholders or other agents to carry out complex workflows for business needs. However, evaluating and optimizing the performance of individual agents is a critical challenge. To build production-ready agentic applications with observability and transparency, developers need tools to assess not only the final output of an agent's workflow but also the quality and efficiency of the workflow itself. For example, a typical agentic workflow might look like this:

Triggered by a user query such as "weather tomorrow", the agentic workflow may include multiple steps: reasoning through the user's intent, calling tools, and using retrieval-augmented generation to produce a final response. In this process, assessing the quality and safety of each step, along with the final output, is crucial.

These evaluation tools complement the new AI red teaming capabilities in Azure AI Foundry released today, which enable organizations to automate red teaming campaigns alongside regular risk and safety evaluations.

New risk and safety evaluation metrics

  • Code Vulnerability: Measures whether AI generates code with security vulnerabilities, such as code injection, tar-slip, SQL injection, stack trace exposure, and other risks across Python, Java, C++, C#, Go, JavaScript, and SQL.
  • Ungrounded Attributes: Measures the frequency and severity of an application generating text responses that contain ungrounded inferences about a person’s attributes, such as their demographics or emotional state.

Both evaluators are now available in Azure AI Foundry via the Azure AI Evaluations SDK:
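Below is a minimal sketch of invoking both evaluators from Python. The project details and sample inputs are placeholders, and exact parameter names may vary across SDK preview versions.

```python
# Minimal sketch: running the two new risk and safety evaluators.
# Assumes the azure-ai-evaluation and azure-identity packages are installed;
# the project details and sample inputs below are placeholders.
from azure.ai.evaluation import CodeVulnerabilityEvaluator, UngroundedAttributesEvaluator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
credential = DefaultAzureCredential()

# Code Vulnerability: flags insecure patterns in model-generated code.
code_vulnerability = CodeVulnerabilityEvaluator(
    azure_ai_project=azure_ai_project, credential=credential
)
print(code_vulnerability(
    query="Write a function that looks up a user by name in our database.",
    response='cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")',
))  # the result includes a label and reasoning, e.g. flagging SQL injection

# Ungrounded Attributes: flags inferred personal attributes not supported by context.
ungrounded_attributes = UngroundedAttributesEvaluator(
    azure_ai_project=azure_ai_project, credential=credential
)
print(ungrounded_attributes(
    query="How is the customer feeling?",
    context="The customer asked when their order would arrive.",
    response="The customer is furious and likely to cancel.",
))
```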

Learn more in our documentation and try out these new evaluators using a sample exercise on GitHub for code vulnerability and ungrounded attributes.

New metrics to assess agentic applications for quality

  • Intent Resolution: Measures how well the agent identifies the user’s request, including how well it scopes the user’s intent, asks clarifying questions, and reminds end users of its scope of capabilities.
  • Tool Call Accuracy: Evaluates the agent’s ability to select the appropriate tools, process correct parameters from previous steps, and call tools in the optimal order.
  • Task Adherence: Measures how well the agent’s response adheres to its assigned tasks, according to its system message and prior steps.
  • Response Completeness: Measures how comprehensive the agent’s response is compared to the ground truth provided by the user’s input.

Learn more in our documentation and try these out using a sample exercise on GitHub for intent resolution, tool call accuracy, task adherence, and response completeness, as well as an end-to-end evaluation experience for Azure AI agents.
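As an illustration, here is a minimal sketch of running two of these quality evaluators with the SDK. These evaluators are GPT-assisted, so a judge-model deployment is required; the configuration values and sample inputs below are placeholders.

```python
# Minimal sketch: running two of the four quality evaluators. These are
# GPT-assisted, so a judge-model deployment is required; the configuration
# values and sample inputs below are placeholders.
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<judge-model-deployment>",
}

# Did the agent resolve what the user actually asked for?
intent_resolution = IntentResolutionEvaluator(model_config=model_config)
print(intent_resolution(
    query="What are your opening hours?",
    response="We are open Monday through Friday, 9am to 6pm.",
))

# Did the response cover everything in the ground truth?
response_completeness = ResponseCompletenessEvaluator(model_config=model_config)
print(response_completeness(
    response="We are open Monday through Friday.",
    ground_truth="We are open Monday through Friday, 9am to 6pm.",
))

# ToolCallAccuracyEvaluator and TaskAdherenceEvaluator follow the same
# pattern, additionally taking the tool definitions and tool calls to score.
```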

Metrics to assess applications and agents for quality, risk, and safety

  • Risk and Safety Metrics: Measures an application’s predisposition to generate harmful or inappropriate content, including its vulnerability to both direct and indirect prompt injection attacks.
  • Performance and Quality Metrics: Assesses qualities such as fluency, coherence, groundedness, and relevance for Retrieval-Augmented Generation (RAG) and text-based applications.
  • Custom Metrics: Enables developers to measure performance against specific needs and objectives not covered by predefined evaluation metrics; see the sketch below for how built-in and custom metrics combine in a single run.
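Built-in and custom metrics can be combined in a single batch run through the SDK’s evaluate() function. The following is a minimal sketch; the data.jsonl file and the answer-length metric are hypothetical illustrations.

```python
# Minimal sketch: a batch run combining a built-in evaluator with a custom
# code-based metric via evaluate(). The data.jsonl file (one JSON object per
# line with "query", "context", and "response" fields) and the answer-length
# metric are hypothetical illustrations.
from azure.ai.evaluation import evaluate, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "<judge-model-deployment>",
}

def answer_length(response: str) -> dict:
    """Custom metric: evaluate() accepts plain callables alongside built-ins."""
    return {"answer_length": len(response)}

result = evaluate(
    data="data.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "answer_length": answer_length,
    },
)
print(result["metrics"])  # aggregate scores across the dataset
```

Because custom metrics are plain callables, they can be versioned alongside application code and run locally or in CI/CD pipelines.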

Developers using Azure AI Agent Service can seamlessly evaluate their agents via our converter support for Azure AI agent threads. While we plan to add more converter support for different agent data formats, developers can still use the evaluators’ low-level API to evaluate their agent data.
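For example, a minimal sketch of the converter flow might look like this; the connection string, thread ID, and run ID are placeholders.

```python
# Minimal sketch: converting an Azure AI Agent Service thread into
# evaluator-ready data. The connection string, thread ID, and run ID are
# placeholders; assumes the azure-ai-projects package is installed.
from azure.ai.evaluation import AIAgentConverter
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient.from_connection_string(
    conn_str="<project-connection-string>",
    credential=DefaultAzureCredential(),
)

converter = AIAgentConverter(project_client)

# Produces the query/response/tool-call fields the agent quality
# evaluators consume, ready to pass to evaluate() or to a single evaluator.
evaluation_data = converter.convert(thread_id="<thread-id>", run_id="<run-id>")
```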

These capabilities empower cross-functional teams with built-in tools and templates that integrate responsible AI practices into existing workflows, help deliver more trustworthy applications with enterprise-grade privacy, security, and compliance, and put evaluators to work measuring output performance at scale across a broader range of risks.

Pricing

The quality evaluators (intent resolution, tool call accuracy, task adherence, response completeness) are available at no additional cost. However, please note that these evaluations use GPT-assisted measurement, where Azure OpenAI models serve as the judge for the evaluations. You will be charged for the underlying LLMs used as judges in these evaluators, as well as for any agents and tools associated with the Azure AI Agent Service, in line with Azure OpenAI and Azure AI Agent Service pricing.

For the risk and safety evaluators (code vulnerability and ungrounded attributes), you will be billed based on the consumption of Safety Evaluations as follows:

Service Name             Safety Evaluations      Price per 1K Tokens (USD)
Azure Machine Learning   Input token pricing     $0.02
Azure Machine Learning   Output token pricing    $0.06
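
As a hypothetical illustration, a safety evaluation run that consumes 100,000 input tokens and 20,000 output tokens would be billed approximately 100 × $0.02 + 20 × $0.06 = $3.20.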

For Azure pricing details, please visit our pricing page and click on the tab labeled “Complete AI Toolchain” and find “Automated Evaluations” and “Azure AI Agent Service”.

Looking Forward

These new capabilities represent our belief that trustworthy AI isn’t a checkbox—it’s a continuous practice that should be deeply embedded in the AI development process.

Azure AI Foundry evaluators are built to work with any dataset, model, or endpoint. They are robust enough for experienced data scientists while remaining accessible to AI developers who are just starting their journey. Our goal is to help development teams easily adopt evaluations as a core practice throughout the AI development lifecycle—from model selection to post-production monitoring—whether running evaluations locally, in the cloud, or within CI/CD pipelines.

If your organization is brand new to evaluations for generative AI, we encourage you to start with the documentation and GitHub samples linked above.
