{"id":16568,"date":"2026-02-20T00:00:00","date_gmt":"2026-02-20T08:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16568"},"modified":"2026-02-20T08:57:26","modified_gmt":"2026-02-20T16:57:26","slug":"using-codes-to-increase-adherence-to-prompts","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/using-codes-to-increase-adherence-to-prompts\/","title":{"rendered":"Using Codes to Increase Adherence to Prompts"},"content":{"rendered":"<h2>Introduction: The Problem<\/h2>\n<p>Agentic systems have some discretion in the parameters they send to tooling, but there are cases, such as experimentation, when you need 100% adherence to a set of parameters.<\/p>\n<p>In practice, this tension exists because modern LLM-based agents are optimized for semantic correctness and helpfulness, not for strict schema compliance. Even when instructions are explicit, models may \u201chelpfully\u201d adjust parameters based on inferred intent, prior training patterns, or perceived optimization opportunities.<\/p>\n<p>For example, imagine your agent can call a search tool providing a query, a top_k parameter, and a threshold. Then imagine you are running an experiment to see how varying top_k impacts your retrieval performance. You might write in your prompt that your agent should always set top_k=10, but how often does it follow that instruction?<\/p>\n<p>In our testing across multiple OpenAI models\u2014from GPT-4o-mini through GPT-5-mini\u2014we observed the same class of problem. For example, with GPT-4o-mini and a prompt like this&#8230;<\/p>\n<pre><code class=\"language-text\">You MUST use the following format for retrieval tool calls:\r\n\r\n{\r\n    \"query\": \"&lt;query&gt;\",\r\n    \"top\": 9\r\n}<\/code><\/pre>\n<p>&#8230;we saw 68.4% adherence to the top_k=9 instruction over 1135 calls. 
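Adherence figures like this can be computed directly from logged tool calls. A minimal sketch in Python (the log structure and field names here are illustrative assumptions, not our actual telemetry):

```python
# Minimal sketch: computing parameter adherence from logged tool calls.
# The log structure and field names are illustrative assumptions.
logged_calls = [
    {"query": "How do I reset the device?", "top": 9},
    {"query": "Where is the settings menu?", "top": 5},  # model deviated
    {"query": "Can I undo my last action?", "top": 9},
]

EXPECTED_TOP = 9

adherent = sum(1 for call in logged_calls if call.get("top") == EXPECTED_TOP)
adherence = adherent / len(logged_calls)
print(f"Adherence: {adherence:.1%} over {len(logged_calls)} calls")
# → Adherence: 66.7% over 3 calls
```

Tracking this number per experiment run is what surfaces the problem in the first place.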
Unfortunately, this is a significant deviation from the desired behavior and makes it extremely difficult to run controlled experiments.<\/p>\n<p>This type of issue might be more difficult to detect if you don&#8217;t have specific metrics validating the parameters that were sent.<\/p>\n<p>It is worth noting that this problem is primarily relevant during experimentation, where you need to sweep across parameter permutations and compare results reliably. In a production system, you would typically lock the configuration server-side and avoid exposing tunable parameters to the agent at all.<\/p>\n<h2>The Journey: Our Approach and Solution<\/h2>\n<p>We had a hypothesis that if we used a code word that didn&#8217;t have any semantic meaning in this context, the model would be less likely to deviate from the instruction.<\/p>\n<p>This hypothesis was grounded in how language models reason. Parameters like top, k, or threshold have strong semantic associations in training data. Models \u201cunderstand\u201d what they do and may attempt to optimize them. A meaningless token, by contrast, offers no such affordances.<\/p>\n<p>So instead of using &#8220;top&#8221; as the parameter name, we used &#8220;dinosaur&#8221;.<\/p>\n<pre><code class=\"language-text\">You MUST use the following format for retrieval tool calls:\r\n\r\n{\r\n    \"query\": \"&lt;query&gt;\",\r\n    \"code\": \"dinosaur\"\r\n}<\/code><\/pre>\n<p>Crucially, the model was never told what \u201cdinosaur\u201d meant. It was simply instructed that this value must be provided exactly. From the model\u2019s perspective, there was no incentive\u2014or basis\u2014to modify it.<\/p>\n<p>Each code word maps to an entire permutation of parameters, not just a single value. For example, &#8220;dinosaur&#8221; might resolve to top_k=9, threshold=0.3, and search_mode=&#8220;vector&#8221;, while &#8220;caterpillar&#8221; might resolve to top_k=18, threshold=0.2, and search_mode=&#8220;hybrid&#8221;. 
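The mapping itself can be as simple as a dictionary keyed by code word. A minimal sketch (the code words match the examples above, but the table structure and the resolve_code helper are illustrative, not our production implementation):

```python
# Minimal sketch: resolving an opaque code word to a full parameter permutation.
# The table structure and the resolve_code helper are illustrative.
CODE_CONFIGS = {
    "dinosaur":    {"top_k": 9,  "threshold": 0.3, "search_mode": "vector"},
    "caterpillar": {"top_k": 18, "threshold": 0.2, "search_mode": "hybrid"},
}

def resolve_code(code: str) -> dict:
    """Map an opaque code word to its stored retrieval configuration."""
    if code not in CODE_CONFIGS:
        # fail loudly: an unknown code means the agent mangled the instruction
        raise ValueError(f"Unknown experiment code: {code!r}")
    return CODE_CONFIGS[code]

params = resolve_code("dinosaur")
print(params["top_k"])  # → 9
```

Raising on an unknown code is deliberate: a silent fallback would hide exactly the kind of deviation the experiment needs to detect.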
This allows you to define a full experiment matrix of configurations, each identified by a single opaque code word.<\/p>\n<p>Inside the retrieval service (a proxy in front of <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/ai-services\/ai-search\">Azure AI Search<\/a>), we look up all the parameters associated with the code word and map them to the actual parameters needed for the retrieval call.<\/p>\n<p>This effectively decouples agent behavior from experimental configuration. The agent emits an opaque identifier, and all experimental control is handled server-side, where determinism can be enforced.<\/p>\n<p>Using this method, we saw 100% adherence to the code (top_k=9) instruction over 1135 calls.<\/p>\n<h2>Stability<\/h2>\n<p>We continued to use this method across five more sprints and dozens of experiments. In every case, we saw 100% adherence to the code instruction.<\/p>\n<p>This result held across different prompts, query distributions, and experiment types. Importantly, we did not observe any degradation over time, suggesting that the approach is robust rather than prompt-fragile.<\/p>\n<h2>Open Questions<\/h2>\n<p>We never tested code words like &#8220;top-k=9&#8221; that have some semantic meaning in this context. It would be interesting to see whether the model would still adhere 100% of the time or whether it would start to deviate again.<\/p>\n<p>Our expectation is that partial semantic grounding would reintroduce optimization behavior. Even weakly meaningful tokens may invite reinterpretation, especially under longer conversations or higher-temperature settings. This remains an open area for systematic testing.<\/p>\n<h2>Alternatives<\/h2>\n<p>Ideally, we would simply not expose the parameters over which we need strict control.<\/p>\n<p>From a systems-design perspective, this is the cleanest solution. If the agent cannot express a parameter, it cannot modify it. 
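To illustrate, here is a sketch of two OpenAI-style function-calling schemas (both hypothetical, not taken from our system): exposing top_k gives the model something to second-guess, while omitting it removes the option entirely.

```python
# Sketch: two hypothetical tool schemas in OpenAI-style function-calling format.
exposed_schema = {
    "name": "search",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer"},  # the model can reinterpret this
        },
        "required": ["query", "top_k"],
    },
}

locked_schema = {
    "name": "search",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},  # top_k is fixed server-side
        },
        "required": ["query"],
    },
}
```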
However, current agent frameworks often require parameters to be surfaced explicitly in tool schemas.<\/p>\n<p>When running an experiment in <a href=\"https:\/\/ai.azure.com\/\">Azure AI Foundry<\/a>, we were creating an agent for each permutation. If the tool call (OpenAPI) included a header containing the agent ID, the retrieval service could look up the parameters associated with that agent ID. This would eliminate the need for code words entirely. However, this is not currently supported in Azure AI Foundry.<\/p>\n<p>The product group suggested two other alternatives:<\/p>\n<ul>\n<li>Use an MCP server and follow <a href=\"https:\/\/github.com\/Azure\/azure-sdk-for-python\/blob\/main\/sdk\/ai\/azure-ai-projects\/samples\/agents\/tools\/sample_agent_mcp.py\">this pattern<\/a> for adding extra information to the body.<\/li>\n<li>Call a Python function instead of OpenAPI. This would allow you to pass parameters directly to the function without needing to include them in the prompt.\n<p>Ultimately, we did switch to this approach because it was required for GPT\u20115 models anyway, which do not support OpenAPI tool calls.<\/p><\/li>\n<\/ul>\n<h3>Python Function Tool Calls<\/h3>\n<p>If you want to implement Python Function Tool Calls in <a href=\"https:\/\/ai.azure.com\/\">Azure AI Foundry<\/a>, this pattern gives you strict control over agent behavior while still allowing the model to plan tool usage correctly.\nThe key idea is separating the agent\u2019s reasoning surface from the actual execution logic. The agent only needs to know that a function exists and what its signature looks like\u2014it should not control how parameters are interpreted, authenticated, or executed.\nTo achieve this, we define a dummy function during agent creation. 
This function is never executed; it exists purely to provide a contract that the agent can reason about.<\/p>\n<pre><code class=\"language-python\"># Create retrieval tool\r\ndef call_retrieval_tool(query: str, code: str) -&gt; str:\r\n    \"\"\"\r\n    Performs a document retrieval operation using stored search configurations.\r\n\r\n    :param query: The search query text should be one or more complete sentences in the form of a question. EXAMPLE: When clicking on a feature, is it possible to spell out possible commands instead just having icons?\r\n    :param code: Code identifier for the stored search configuration to apply to the query. EXAMPLE: vanilla\r\n    :return: Search results containing matching documents and metadata.\r\n    \"\"\"\r\n    return \"\"\r\n\r\n# Define tool resources\r\nuser_functions = { call_retrieval_tool }\r\nfunctions = FunctionTool(functions=user_functions)\r\ntools = functions.definitions\r\n\r\n# Build agent creation kwargs\r\nagent_kwargs = {\r\n    \"model\": AGENT_MODEL,\r\n    \"name\": AGENT_NAME,\r\n    \"instructions\": PROMPT,\r\n    \"tools\": tools,\r\n    \"tool_resources\": None,\r\n    \"description\": DESCRIPTION,\r\n}\r\n\r\n# Create agent\r\nagent = project_client.agents.create_agent(**agent_kwargs)<\/code><\/pre>\n<p>There are several important details worth calling out here:<\/p>\n<p>The function body is intentionally empty. The agent will never execute this function.\nAzure AI Foundry requires the function signature and docstring so the model can:<\/p>\n<ul>\n<li>Decide when to call the tool<\/li>\n<li>Produce a syntactically valid function call<\/li>\n<\/ul>\n<p>The code parameter is treated as an opaque identifier by the agent. It has no semantic meaning from the model\u2019s perspective, which is exactly what we want.<\/p>\n<p>At runtime, this dummy function is replaced by the real implementation, which lives in the inference service. 
This is where authentication, parameter mapping, and network access are handled\u2014outside the agent\u2019s control.<\/p>\n<pre><code class=\"language-python\"># define the retrieval tool function\r\n# NOTE: it's not great that requests is synchronous, but this call\r\n# cannot be async because the Azure SDK FunctionTool interface is sync\r\n# NOTE: this function is defined inside our inference-service class, so\r\n# self._credential and self._session are initialized elsewhere\r\ndef call_retrieval_tool(query: str, code: str) -&gt; str:\r\n    \"\"\"\r\n    Performs a document retrieval operation using stored search configurations.\r\n\r\n    :param query: The search query text should be one or more complete sentences in the form of a question. EXAMPLE: When clicking on a feature, is it possible to spell out possible commands instead just having icons?\r\n    :param code: Code identifier for the stored search configuration to apply to the query. EXAMPLE: vanilla\r\n    :return: Search results containing matching documents and metadata.\r\n    \"\"\"\r\n    try:\r\n        # Get access token with the specified scope\r\n        token = self._credential.get_token(PERMISSIONS_SCOPE)\r\n        retrieval_access_token = token.token\r\n\r\n        # Make the POST request\r\n        headers = {\r\n            \"Authorization\": f\"Bearer {retrieval_access_token}\",\r\n            \"Content-Type\": \"application\/json\"\r\n        }\r\n        payload = {\r\n            \"query\": query,\r\n            \"code\": code\r\n        }\r\n        response = self._session.post(RETRIEVAL_SERVICE_URL, headers=headers, json=payload, timeout=30)\r\n        response.raise_for_status()\r\n\r\n        return response.text\r\n    except Exception as e:\r\n        logger.error(f\"Failed to call retrieval tool: {e}\")\r\n        return json.dumps({\"error\": str(e)})\r\n\r\n# configure the function tool for retrieval calls\r\nuser_functions = { call_retrieval_tool }\r\nfunctions_tool = FunctionTool(functions=user_functions)\r\ntoolset = 
ToolSet()\r\ntoolset.add(functions_tool)\r\nself._client.agents.enable_auto_function_calls(toolset)\r\n\r\n# run the inference\r\nwith self._client.agents.runs.stream(\r\n    thread_id=thread.id, agent_id=agent.id\r\n) as stream:\r\n    # process the stream...\r\n    pass<\/code><\/pre>\n<p>In this setup, the agent emits a function call with { query, code }. The inference service then:<\/p>\n<ul>\n<li>Authenticates the request<\/li>\n<li>Resolves the code into concrete experiment parameters (e.g. top_k = 9)<\/li>\n<li>Executes retrieval using those parameters<\/li>\n<\/ul>\n<p>The agent never sees or reasons about the real parameters. This creates a hard boundary between agent intent and system configuration. The model cannot \u201coptimize\u201d or reinterpret values like top_k, because they are no longer expressed as meaningful language\u2014only as data.<\/p>\n<h2>Metrics<\/h2>\n<p>When creating an experiment, it is important to think about how you will validate that the experiment was run successfully. For example, if your experiment varies top_k, you should have a metric that validates that the model sent the correct top_k parameter to the search tool.<\/p>\n<p>Strict adherence to this principle is what allowed us to immediately detect that the agent was not sending the expected parameters reliably.<\/p>\n<h2>Conclusion<\/h2>\n<p>Using code words to represent parameters that need strict adherence is an effective strategy to ensure agentic systems follow instructions precisely.<\/p>\n<p>More broadly, this approach highlights a key design principle for agentic systems: if you need determinism, remove semantics. 
Treat configuration as data, not language.<\/p>\n<p>This approach has proven to be reliable across multiple experiments and sprints, providing a robust solution for controlled experimentation with agentic systems.<\/p>\n<h2>Attribution<\/h2>\n<p>The image used in this post was created using ChatGPT 5.2 (an OpenAI model).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Agentic systems have some discretion in the parameters they send to tooling, but there are cases, such as experimentation, when you need 100% adherence to a set of parameters.<\/p>\n","protected":false},"author":118226,"featured_media":16579,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[3639,3638,110],"class_list":["post-16568","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-agentic-systems","tag-agents","tag-bots"],"acf":[],"blog_post_summary":"<p>Agentic systems have some discretion in the parameters they send to tooling, but there are cases, such as experimentation, when you need 100% adherence to a set of 
parameters.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/118226"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16568"}],"version-history":[{"count":1,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16568\/revisions"}],"predecessor-version":[{"id":16580,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16568\/revisions\/16580"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16579"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}