{"id":16541,"date":"2026-01-16T00:00:00","date_gmt":"2026-01-16T08:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16541"},"modified":"2026-01-16T04:38:33","modified_gmt":"2026-01-16T12:38:33","slug":"slm-function-calling-with-azure-ai-evaluation-sdk","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/slm-function-calling-with-azure-ai-evaluation-sdk\/","title":{"rendered":"Evaluate Small Language Model Function Calling using the Azure AI Evaluation SDK"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>In recent years, Small Language Models (SLMs) have become increasingly important. SLMs are language models that have far fewer parameters and use significantly fewer resources than their Large Language Model (LLM) counterparts, while still achieving comparable performance on specific tasks.\nSLMs are ideal for environments where resources are constrained, speed is paramount, or internet access may be limited. Examples include:<\/p>\n<ul>\n<li>Edge devices &#8211; e.g., IoT sensors, smartphones, and embedded systems where compute power and memory are limited.<\/li>\n<li>On-premise deployments &#8211; where organisations have privacy or latency requirements but limited GPU\/CPU capacity.<\/li>\n<li>Offline or low-connectivity settings &#8211; e.g., field operations, remote research stations, or military applications where internet access is intermittent.<\/li>\n<li>Browser or mobile applications &#8211; where models must run efficiently on-device to reduce API calls and ensure responsiveness.<\/li>\n<li>Cost-sensitive production systems &#8211; where serving large LLMs continuously would be too expensive to be feasible.<\/li>\n<li>Low-latency applications &#8211; e.g., real-time customer support or conversational agents that must respond instantly.<\/li>\n<\/ul>\n<p>Function calling techniques can be applied to SLMs to allow them to interface with external services. 
For example, a smartphone interacting with its apps (&#8220;<em>Turn on flash and take a photo<\/em>&#8221;), or Heating, Ventilation, and Air Conditioning (HVAC) controls within a car (&#8220;<em>I can&#8217;t see out of the windscreen<\/em>&#8221;).\nWith function calling, the SLM is provided with a JSON specification of functions (or tools) that it can call.\nBased on the user&#8217;s prompt, the model can choose to call one or more of the functions by providing the names and arguments of each function in JSON format, for example:<\/p>\n<p>User Prompt: <em>&#8220;I want to take a photo.&#8221;<\/em><\/p>\n<p>Model Output:<\/p>\n<pre><code class=\"language-json\">[{\"name\": \"open_application\", \"arguments\": {\"name\": \"Camera\"}}]<\/code><\/pre>\n<p>The SLM&#8217;s output is then parsed by the application and each function is called within the application code:<\/p>\n<pre><code class=\"language-py\">open_application(name=\"Camera\")<\/code><\/pre>\n<p>If the model calls the wrong function, uses the wrong arguments, or even attempts to call a function that does not exist, the application may behave unexpectedly.\nEnsuring that these models reliably call the correct functions with proper arguments is critical for operational safety, efficiency and a good end-user experience.<\/p>\n<p>To achieve more accurate function calls, models can be further fine-tuned to support function calling for specific scenarios.\nThis is done by providing a dataset of user prompts and the expected function calls, and modifying the model&#8217;s parameters to better align with the expected outputs.<\/p>\n<blockquote><p>For more information on fine-tuning SLMs, such as Phi-4, please refer to the <a href=\"https:\/\/github.com\/microsoft\/PhiCookBook\">microsoft\/PhiCookBook<\/a>.<\/p><\/blockquote>\n<p>When fine-tuning, it is important to evaluate the performance of the new model by calculating <a href=\"#metrics\">metrics<\/a> related to the model&#8217;s function calling 
capabilities.\nThis helps to build confidence in a model&#8217;s performance before it is deployed.<\/p>\n<p>This post details how to use the Azure AI Evaluation SDK to:<\/p>\n<ul>\n<li>Evaluate a model&#8217;s function calling performance by calculating function calling metrics.<\/li>\n<li>View and compare results in Microsoft Foundry.<\/li>\n<\/ul>\n<blockquote><p><em>This example was run using a <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes\/gpu-accelerated\/nca100v4-series\">Standard_NC24ads_A100_v4<\/a> VM powered by an NVIDIA A100 PCIe GPU, available on Azure.<\/em><\/p><\/blockquote>\n<h3>Running the code<\/h3>\n<p>If running the code as a script, wrap the code inside a <code>main()<\/code> function and run it as follows:<\/p>\n<pre><code class=\"language-py\">def main():\r\n    ...\r\n\r\nif __name__ == \"__main__\":\r\n    main()<\/code><\/pre>\n<h1>Model<\/h1>\n<p>Within this post, we use <a href=\"https:\/\/huggingface.co\/microsoft\/Phi-4-mini-instruct\">Phi-4-mini-instruct<\/a>, which is available on Hugging Face. 
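<\/p>\n<p>When Phi-4 generates a function call, it wraps the JSON in special tool-call tokens. As an illustration, a (hypothetical) completion for the earlier photo example would look like:<\/p>\n<pre><code>&lt;|tool_call|&gt;[{\"name\": \"open_application\", \"arguments\": {\"name\": \"Camera\"}}]&lt;|\/tool_call|&gt;<\/code><\/pre>\n<p>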
Phi-4-mini-instruct supports function calling by passing the function specifications within the system prompt, as explained in the model card.<\/p>\n<p>To begin with, install the required dependencies using pip:<\/p>\n<pre><code class=\"language-sh\">pip install 'accelerate==1.6.0' 'azure-ai-evaluation==1.13.7' 'datasets==3.5.1' 'huggingface_hub[cli]==0.30.2' 'torch==2.7.0' 'transformers==4.51.3'<\/code><\/pre>\n<p>We use the <code>transformers<\/code> library to load the model and its tokenizer.<\/p>\n<pre><code class=\"language-py\">import torch\r\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\r\n\r\ntorch.random.manual_seed(0)  # Optional: Set the random seed for reproducible results\r\n\r\nMODEL_NAME = \"microsoft\/Phi-4-mini-instruct\"\r\n\r\n# Download the model and tokenizer from Hugging Face\r\nmodel = AutoModelForCausalLM.from_pretrained(\r\n    MODEL_NAME,\r\n    device_map=\"auto\",\r\n    torch_dtype=\"auto\",\r\n    trust_remote_code=True,\r\n)\r\ntokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)<\/code><\/pre>\n<h1>Dataset<\/h1>\n<p>To perform an evaluation, we first need a dataset that can be used as an input to the SLM, from which we can compare the model outputs to the expected function calls.\nA valid dataset may be sourced from multiple locations, including:<\/p>\n<ul>\n<li>Private company data<\/li>\n<li>Public data sources (e.g. Hugging Face)<\/li>\n<li>Synthetic data (e.g. 
using an LLM)<\/li>\n<\/ul>\n<p>This post uses the <a href=\"https:\/\/huggingface.co\/datasets\/Salesforce\/xlam-function-calling-60k\">Salesforce\/xlam-function-calling-60k<\/a> dataset, which is available for use under the <a href=\"https:\/\/creativecommons.org\/licenses\/by\/4.0\/\">CC-BY-4.0 licence<\/a>.\nThe data was collected by <a href=\"https:\/\/apigen-pipeline.github.io\/\">APIGen<\/a>, as presented in the paper <a href=\"https:\/\/arxiv.org\/abs\/2406.18518\">APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets<\/a>.<\/p>\n<p>In a real-world scenario, the dataset would likely be split into separate sets for training, validation and testing.\nHowever, for this example, we only require a single test set to evaluate the model&#8217;s performance. We take the first 100 rows from the dataset to reduce the time taken to run the full evaluation.<\/p>\n<blockquote><p>As the dataset is gated, you will need to accept the dataset conditions and log in to the Hugging Face CLI before loading it.<\/p><\/blockquote>\n<pre><code class=\"language-py\">from datasets import Dataset, load_dataset\r\n\r\nDATASET = \"Salesforce\/xlam-function-calling-60k\"\r\n\r\n# Download the dataset from Hugging Face\r\n# The dataset contains one split, named `train`, of which we take the first 100 rows\r\ndataset = load_dataset(DATASET, split=\"train[:100]\")<\/code><\/pre>\n<p>The dataset includes the following headers:<\/p>\n<ul>\n<li><code>id<\/code>: The row ID.<\/li>\n<li><code>query<\/code>: The user query.<\/li>\n<li><code>answers<\/code>: The expected model output containing the function calls and arguments (ground truth) in JSON format.<\/li>\n<li><code>tools<\/code>: The JSON function schemas of the tools available to the model. 
This includes each function&#8217;s name, description and parameters, including parameter names, types, descriptions and default values.<\/li>\n<\/ul>\n<p>Each row in the dataset needs to be transformed into messages which will be passed to the SLM and formatted according to the model&#8217;s chat template.\nWe use a generator so that the entire dataset does not need to be loaded into memory at once, as the dataset may be large.<\/p>\n<pre><code class=\"language-py\">def get_inputs(dataset: Dataset):\r\n    for data in dataset:\r\n        yield [\r\n            {\r\n                \"role\": \"system\",\r\n                \"content\": \"You are a helpful assistant with some tools.\",\r\n                \"tools\": data[\"tools\"],\r\n            },\r\n            {\r\n                \"role\": \"user\",\r\n                \"content\": data[\"query\"],\r\n            },\r\n        ]<\/code><\/pre>\n<h1>Inference<\/h1>\n<p>With the model and dataset prepared, we can run inference using the SLM. 
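<\/p>\n<p>Each conversation yielded by <code>get_inputs<\/code> is a two-message list that the pipeline formats using the model&#8217;s chat template. For a single hypothetical row (the tool schema below is made up for illustration), it has the following shape:<\/p>\n<pre><code class=\"language-py\"># Shape of one element yielded by get_inputs() for a hypothetical dataset row\r\nsample_input = [\r\n    {\r\n        \"role\": \"system\",\r\n        \"content\": \"You are a helpful assistant with some tools.\",\r\n        \"tools\": '[{\"name\": \"open_application\", \"description\": \"Open an app\", \"parameters\": {}}]',\r\n    },\r\n    {\"role\": \"user\", \"content\": \"I want to take a photo.\"},\r\n]<\/code><\/pre>\n<p>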
We iterate through the model outputs, combine the generated text with the data from the dataset, and save this to a temporary file.\nWe use a temporary file, as the inference results are only an intermediate artifact that is then passed to the evaluation.<\/p>\n<pre><code class=\"language-py\">import json\r\nimport tempfile\r\n\r\n# Create a pipeline to use for inference, based on the model and tokenizer\r\npipe = pipeline(\r\n    task=\"text-generation\",  # Phi-4 is a model of task type `text-generation`\r\n    model=model,\r\n    tokenizer=tokenizer,\r\n)\r\n\r\n# Full set of generation arguments can be found at https:\/\/huggingface.co\/docs\/transformers\/en\/main_classes\/text_generation\r\ngeneration_args = {\r\n    \"max_new_tokens\": 500,\r\n    \"return_full_text\": False,\r\n    \"do_sample\": False,\r\n}\r\n\r\noutputs = pipe(get_inputs(dataset), **generation_args)\r\n\r\nwith tempfile.NamedTemporaryFile(mode=\"w\", suffix=\".jsonl\", delete=False) as tmp:\r\n    for i, output in enumerate(outputs):\r\n        print(f\"{i}: {output}\")\r\n        data = dataset[i] | output[0]  # Combine the dataset row with the model output\r\n        tmp.write(json.dumps(data) + \"\\n\")\r\n\r\nprint(f\"Inference results saved to {tmp.name}\")<\/code><\/pre>\n<h1>Metrics<\/h1>\n<p>Before running the evaluation on our inference results, we need to create an evaluator to calculate the metrics related to function calling.<\/p>\n<p>Within the Azure AI Evaluation SDK, an evaluator can be either a function or a callable class which computes and returns metrics. 
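<\/p>\n<p>For instance, a minimal custom evaluator can be a plain function that returns a dict of metrics (a hypothetical example for illustration, not the evaluator used below):<\/p>\n<pre><code class=\"language-py\"># Hypothetical minimal evaluator: a plain function returning a metric dict\r\ndef answer_length_evaluator(generated_text: str) -&gt; dict:\r\n    return {\"answer_length\": len(generated_text)}\r\n\r\nanswer_length_evaluator(\"hello\")  # {'answer_length': 5}<\/code><\/pre>\n<p>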
The SDK includes many evaluators out of the box; however, for this example we create a custom evaluator for the following metrics:<\/p>\n<ul>\n<li><code>valid_json<\/code>\n<ul>\n<li>Checks whether the model generated a tool call that can be parsed as JSON.<\/li>\n<li>If the response cannot be parsed, the system will fail before execution, regardless of whether the model understood the query or not.<\/li>\n<\/ul>\n<\/li>\n<li><code>exact_function_call<\/code>\n<ul>\n<li>Checks whether the generated function call matches the expected function call in the dataset.<\/li>\n<li>It reflects how well the model interprets the natural-language query and chooses the correct function(s) to call with the correct arguments.<\/li>\n<\/ul>\n<\/li>\n<li><code>valid_function_names<\/code>\n<ul>\n<li>Checks whether each function name in the generated tool call is a valid function name within the provided tools.<\/li>\n<li>It shows how closely the model adheres to the provided function specifications, and whether the model is attempting to call functions that do not exist.<\/li>\n<li>This metric is more specific than <code>exact_function_call<\/code>, and is one of many possible metrics that could be used to identify why <code>exact_function_call<\/code> may be failing.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Each metric will have a value of either 0 or 1.<\/p>\n<p>It is possible to create and use multiple evaluators, with each evaluator calculating one or multiple metrics. 
In this example, we have one evaluator which calculates all three metrics.<\/p>\n<blockquote><p>The <code>FunctionCallEvaluator<\/code> class should be defined in a separate file, for example <code>function_call_evaluator.py<\/code>.<\/p><\/blockquote>\n<pre><code class=\"language-py\"># function_call_evaluator.py\r\nimport json\r\nimport re\r\nfrom dataclasses import dataclass\r\nfrom typing import Literal\r\n\r\n@dataclass\r\nclass Result:\r\n    valid_json: int = 0\r\n    valid_function_names: int = 0\r\n    exact_function_call: int = 0\r\n\r\nclass FunctionCallEvaluator:\r\n    def __init__(self):\r\n        pass\r\n\r\n    def parse_function_call(self, text: str) -&gt; str | None:\r\n        # This format is specific to Phi-4 function calling\r\n        match = re.fullmatch(r\"&lt;\\|tool_call\\|&gt;(.*)&lt;\\|\/tool_call\\|&gt;\", text)\r\n\r\n        return match and match.group(1)\r\n\r\n    def __call__(self, answers: str, tools: str, generated_text: str) -&gt; Result:\r\n        function_calls = self.parse_function_call(generated_text)\r\n\r\n        if function_calls is None:\r\n            return Result()\r\n\r\n        return Result(\r\n            valid_json=self.valid_json(function_calls),\r\n            valid_function_names=self.valid_function_names(function_calls, tools),\r\n            exact_function_call=self.exact_function_call(function_calls, answers),\r\n        )\r\n\r\n    def valid_json(self, function_calls: str) -&gt; Literal[0, 1]:\r\n        try:\r\n            # Check if the JSON can be parsed\r\n            json.loads(function_calls)\r\n            return 1\r\n        except json.JSONDecodeError:\r\n            return 0\r\n\r\n    def exact_function_call(self, function_calls: str, answers: str) -&gt; Literal[0, 1]:\r\n        answers_list = json.loads(answers)\r\n\r\n        try:\r\n            # Check if the generated function calls are the same as the `answers` from the dataset\r\n            if json.loads(function_calls) == 
answers_list:\r\n                return 1\r\n        except json.JSONDecodeError:\r\n            pass\r\n\r\n        return 0\r\n\r\n    def valid_function_names(self, function_calls: str, tools: str) -&gt; Literal[0, 1]:\r\n        tools_list: list[dict] = json.loads(tools)\r\n        tool_names: list[str] = [tool[\"name\"] for tool in tools_list]\r\n\r\n        try:\r\n            function_calls_list = json.loads(function_calls)\r\n\r\n            # Check if every tool call function name is present in the provided tools\r\n            if all(call[\"name\"] in tool_names for call in function_calls_list):\r\n                return 1\r\n        except (json.JSONDecodeError, KeyError, TypeError):\r\n            pass\r\n\r\n        return 0<\/code><\/pre>\n<p>There are many other metrics that could be calculated, such as checking for required parameters, adherence to the function schema, etc., as well as non-functional metrics such as overall latency and time to first token.<\/p>\n<h1>Evaluation<\/h1>\n<p>With our metrics defined and inference completed, we can now use the <code>evaluate()<\/code> function from the Azure AI Evaluation SDK to run our evaluation and calculate the function calling metrics.<\/p>\n<p>The <code>azure_ai_project<\/code> variable is used to specify the details of the Microsoft Foundry project where the results will be uploaded.\nChoose either the Foundry format or the hub-based format, depending on your <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/what-is-azure-ai-foundry?view=foundry-classic#types-of-projects\">type of project<\/a>.<\/p>\n<pre><code class=\"language-py\">from datetime import datetime, UTC\r\n\r\nfrom azure.ai.evaluation import evaluate\r\n\r\nfrom function_call_evaluator import FunctionCallEvaluator  # Import the evaluator from the function_call_evaluator.py file\r\n\r\n# For Foundry projects\r\nazure_ai_project = 
\"https:\/\/&lt;AZURE_FOUNDRY_NAME&gt;.services.ai.azure.com\/api\/projects\/&lt;AZURE_FOUNDRY_PROJECT_NAME&gt;\"\r\n\r\n# For hub-based projects\r\nazure_ai_project = {\r\n    \"subscription_id\": \"&lt;AZURE_SUBSCRIPTION_ID&gt;\",\r\n    \"resource_group_name\": \"&lt;AZURE_RESOURCE_GROUP_NAME&gt;\",\r\n    \"project_name\": \"&lt;AZURE_FOUNDRY_PROJECT_NAME&gt;\",\r\n}\r\n\r\nresults = evaluate(\r\n    data=tmp.name,\r\n    evaluation_name=f\"{MODEL_NAME}-eval-{datetime.now(UTC).strftime('%Y%m%d-%H%M%S')}\",\r\n    evaluators={\r\n        \"function_calling\": FunctionCallEvaluator(),\r\n    },\r\n    azure_ai_project=azure_ai_project,\r\n    output_path=\".\/results.json\",\r\n)<\/code><\/pre>\n<p>The <code>evaluate()<\/code> function orchestrates the evaluation by taking the dataset in JSON lines format. For each row in the dataset, it calls each evaluator function to calculate metrics for that particular row. The evaluation results are saved to a file and returned as a dict to the <code>results<\/code> variable, allowing further results processing as required.\nBy providing Microsoft Foundry project details, we can also view the evaluation results in Microsoft Foundry.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2026\/01\/evaluation-overall-results-1.webp\" alt=\"Evaluation results in Microsoft Foundry\" \/><\/p>\n<p>We can see the overall average value for each metric, as well as the breakdown of each row in the dataset.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2026\/01\/evaluation-results-breakdown-1.webp\" alt=\"Breakdown of evaluation results in Microsoft Foundry\" \/><\/p>\n<p>Microsoft Foundry also allows the comparison of multiple runs, which can be useful when experimenting with different generation parameters or different models.<\/p>\n<p><img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2026\/01\/evaluation-comparison-1.webp\" alt=\"Comparison of evaluation results in Microsoft Foundry\" \/><\/p>\n<h1>Conclusion<\/h1>\n<p>A systematic and robust evaluation framework is critical for ensuring SLMs reliably perform function calling in production, helping teams build confidence in fine-tuned models and catch issues before deployment.\nAn evaluation framework also serves as a regression testing suite, ensuring that any further modifications to the model, prompts, function schemas, etc. do not degrade model performance.\nThe Azure AI Evaluation SDK is a powerful tool for augmenting such a framework, making it easy to assess the performance of generative AI applications, including SLMs used for function calling. Its integration with Microsoft Foundry allows for easy visualisation and comparison of results, aiding experimentation and evaluation, and also facilitates collaboration between team members.\nThis post includes a simple example of how to use the SDK to evaluate Phi-4-mini-instruct. 
The same approach can be built upon to evaluate other base and fine-tuned models, as well as other metrics, datasets and even model runtimes, such as <a href=\"https:\/\/github.com\/microsoft\/onnxruntime-genai\">ONNX Runtime GenAI<\/a>.<\/p>\n<h1>References<\/h1>\n<ul>\n<li><a href=\"https:\/\/techcommunity.microsoft.com\/blog\/educatordeveloperblog\/welcome-to-the-new-phi-4-models---microsoft-phi-4-mini--phi-4-multimodal\/4386037\">Phi-4<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2406.18518\">APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This blog details how the Azure AI Evaluation SDK can be used to assess the performance of a small language model for function calling, such as Phi-4-mini-instruct, and view the results in Microsoft Foundry.<\/p>\n","protected":false},"author":147904,"featured_media":16553,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,19],"tags":[3400],"class_list":["post-16541","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-machine-learning","tag-ise"],"acf":[],"blog_post_summary":"<p>This blog details how the Azure AI Evaluation SDK can be used to assess the performance of a small language model for function calling, such as Phi-4-mini-instruct, and view the results in Microsoft 
Foundry.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16541","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/147904"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16541"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16541\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16553"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16541"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16541"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16541"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}