February 20th, 2026

Using Codes to Increase Adherence to Prompts

Introduction: The Problem

Agentic systems have some discretion in the parameters they send to tooling, but there are cases, such as experimentation, when you need 100% adherence to a set of parameters.

In practice, this tension exists because modern LLM-based agents are optimized for semantic correctness and helpfulness, not for strict schema compliance. Even when instructions are explicit, models may “helpfully” adjust parameters based on inferred intent, prior training patterns, or perceived optimization opportunities.

For example, imagine your agent can call a search tool providing a query, a top_k parameter, and a threshold. Then imagine you are running an experiment to see how varying top_k impacts your retrieval performance. You might write in your prompt that your agent should always set top_k=10, but how often does it follow that instruction?
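For illustration, such a tool might be exposed to the agent as a function whose signature carries the tunable parameters (the name and defaults below are hypothetical, not the exact tool we used):

# Hypothetical retrieval tool surface exposed to the agent.
# The experiment needs top_k pinned to a single value per run,
# but nothing in this signature stops the agent from choosing its own value.
def search_documents(query: str, top_k: int = 10, threshold: float = 0.5) -> str:
    """Retrieve documents matching the query using the supplied parameters."""
    ...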

In our testing across multiple OpenAI models—from GPT-4o-mini through GPT-5-mini—we observed the same class of problem. For example, with GPT-4o-mini and a prompt like this…

You MUST use the following format for retrieval tool calls:

{
    "query": "<query>",
    "top": 9
}

…we saw 68.4% adherence to the top_k=9 instruction over 1135 calls. Unfortunately, this is a significant deviation from the desired behavior and makes it extremely difficult to run controlled experiments.

This type of issue might be more difficult to detect if you don’t have specific metrics validating the parameters that were sent.

It is worth noting that this problem is primarily relevant during experimentation, where you need to sweep across parameter permutations and compare results reliably. In a production system, you would typically lock the configuration server-side and avoid exposing tunable parameters to the agent at all.

The Journey: Our Approach and Solution

We had a hypothesis: if we used a code word that didn’t have any semantic meaning in this context, the model would be less likely to deviate from the instruction.

This hypothesis was grounded in how language models reason. Parameters like top, k, or threshold have strong semantic associations in training data. Models “understand” what they do and may attempt to optimize them. A meaningless token, by contrast, offers no such affordances.

So instead of using “top” as the parameter name, we used “dinosaur”.

You MUST use the following format for retrieval tool calls:

{
    "query": "<query>",
    "code": "dinosaur"
}

Crucially, the model was never told what “dinosaur” meant. It was simply instructed that this value must be provided exactly. From the model’s perspective, there was no incentive—or basis—to modify it.

Each code word maps to an entire permutation of parameters, not just a single value. For example, “dinosaur” might resolve to top_k=9, threshold=0.3, and search_mode="vector", while “caterpillar” might resolve to top_k=18, threshold=0.2, and search_mode="hybrid". This allows you to define a full experiment matrix of configurations, each identified by a single opaque code word.

Inside the retrieval service (a proxy in front of Azure AI Search), we look up all the parameters associated with the code word and map them to the actual parameters needed for the retrieval call.

This effectively decouples agent behavior from experimental configuration. The agent emits an opaque identifier, and all experimental control is handled server-side, where determinism can be enforced.
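As a minimal sketch, the proxy’s lookup can be a dictionary keyed by code word (the table below just encodes the example permutations above; the function name is illustrative, and the real proxy also handles authentication and mapping to Azure AI Search parameters):

# Hypothetical server-side mapping from opaque code words to full
# parameter permutations. The agent only ever sees the code word.
EXPERIMENT_CONFIGS = {
    "dinosaur": {"top_k": 9, "threshold": 0.3, "search_mode": "vector"},
    "caterpillar": {"top_k": 18, "threshold": 0.2, "search_mode": "hybrid"},
}

def resolve_code(code: str) -> dict:
    """Resolve a code word into concrete retrieval parameters, failing loudly on unknown codes."""
    try:
        return EXPERIMENT_CONFIGS[code]
    except KeyError:
        raise ValueError(f"Unknown experiment code: {code!r}")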

Using this method, we saw 100% adherence to the code-word instruction (which resolved to top_k=9) over 1135 calls.

Stability

We continued to use this method across 5 more sprints and dozens of experiments. In every case, we saw 100% adherence to the code instruction.

This result held across different prompts, query distributions, and experiment types. Importantly, we did not observe any degradation over time, suggesting that the approach is robust rather than prompt-fragile.

Open Questions

We never tested code words like “top-k=9” that have some semantic meaning in this context. It would be interesting to see if the model would still adhere 100% of the time or if it would start to deviate again.

Our expectation is that partial semantic grounding would reintroduce optimization behavior. Even weakly meaningful tokens may invite reinterpretation, especially under longer conversations or higher-temperature settings. This remains an open area for systematic testing.

Alternatives

Ideally, we would just not expose the parameters we want to exert strict control over.

From a systems-design perspective, this is the cleanest solution. If the agent cannot express a parameter, it cannot modify it. However, current agent frameworks often require parameters to be surfaced explicitly in tool schemas.

When running an experiment in Azure AI Foundry, we were creating an agent for each permutation. If the tool call (OpenAPI) included a header containing the agent ID, the retrieval service could look up the parameters associated with that agent ID. This would eliminate the need for code words entirely. However, this is not currently supported in Azure AI Foundry.

The product group suggested two other alternatives:

  • Use an MCP server and follow this pattern for adding extra information to the request body.
  • Call a Python function instead of OpenAPI. This would allow you to pass parameters directly to the function without needing to include them in the prompt.

    Ultimately, we did switch to this approach because it was required for GPT‑5 models anyway, which do not support OpenAPI tool calls.

Python Function Tool Calls

If you want to implement Python Function Tool Calls in Azure AI Foundry, this pattern gives you strict control over agent behavior while still allowing the model to plan tool usage correctly. The key idea is separating the agent’s reasoning surface from the actual execution logic. The agent only needs to know that a function exists and what its signature looks like—it should not control how parameters are interpreted, authenticated, or executed. To achieve this, we define a dummy function during agent creation. This function is never executed; it exists purely to provide a contract that the agent can reason about.

# NOTE: depending on SDK version, FunctionTool is imported from
# azure.ai.agents.models or azure.ai.projects.models
from azure.ai.agents.models import FunctionTool

# Create retrieval tool
def call_retrieval_tool(query: str, code: str) -> str:
    """
    Performs a document retrieval operation using stored search configurations.

    :param query: The search query text should be one or more complete sentences in the form of a question. EXAMPLE: When clicking on a feature, is it possible to spell out possible commands instead just having icons?
    :param code: Code identifier for the stored search configuration to apply to the query. EXAMPLE: vanilla
    :return: Search results containing matching documents and metadata.
    """
    return ""

# Define tool resources
user_functions = { call_retrieval_tool }
functions = FunctionTool(functions=user_functions)
tools = functions.definitions

# Build agent creation kwargs
agent_kwargs = {
    "model": AGENT_MODEL,
    "name": AGENT_NAME,
    "instructions": PROMPT,
    "tools": tools,
    "tool_resources": None,
    "description": DESCRIPTION,
}

# Create agent
agent = project_client.agents.create_agent(**agent_kwargs)

There are several important details worth calling out here:

The function body is intentionally a no-op (it just returns an empty string), and the agent will never execute it. Azure AI Foundry requires the function signature and docstring so the model can:

  • Decide when to call the tool
  • Produce a syntactically valid function call

The code parameter is treated as an opaque identifier by the agent. It has no semantic meaning from the model’s perspective, which is exactly what we want.

At runtime, this dummy function is replaced by the real implementation, which lives in the inference service. This is where authentication, parameter mapping, and network access are handled—outside the agent’s control.

# define the retrieval tool function
# NOTE: it's not great that requests is synchronous, but this call
# cannot be async because the Azure SDK FunctionTool interface is sync
# (assumes json, logging, and requests, plus FunctionTool/ToolSet from
# the Azure AI Foundry SDK, are imported at module level)
def call_retrieval_tool(query: str, code: str) -> str:
    """
    Performs a document retrieval operation using stored search configurations.

    :param query: The search query text should be one or more complete sentences in the form of a question. EXAMPLE: When clicking on a feature, is it possible to spell out possible commands instead just having icons?
    :param code: Code identifier for the stored search configuration to apply to the query. EXAMPLE: vanilla
    :return: Search results containing matching documents and metadata.
    """
    try:
        # Get access token with the specified scope
        token = self._credential.get_token(PERMISSIONS_SCOPE)
        retrieval_access_token = token.token

        # Make the POST request
        headers = {
            "Authorization": f"Bearer {retrieval_access_token}",
            "Content-Type": "application/json"
        }
        payload = {
            "query": query,
            "code": code
        }
        response = self._session.post(RETRIEVAL_SERVICE_URL, headers=headers, json=payload, timeout=30)
        response.raise_for_status()

        return response.text
    except Exception as e:
        logger.error(f"Failed to call retrieval tool: {e}")
        return json.dumps({"error": str(e)})

# configure the function tool for retrieval calls
user_functions = { call_retrieval_tool }
functions_tool = FunctionTool(functions=user_functions)
toolset = ToolSet()
toolset.add(functions_tool)
self._client.agents.enable_auto_function_calls(toolset)

# run the inference
with self._client.agents.runs.stream(
    thread_id=thread.id, agent_id=agent.id
) as stream:
    # process the stream...
    pass

In this setup:

The agent emits a function call with { query, code }. The inference service:

  • Authenticates the request
  • Resolves the code into concrete experiment parameters (e.g. top_k = 9)
  • Executes retrieval using those parameters

The agent never sees or reasons about the real parameters. This creates a hard boundary between agent intent and system configuration. The model cannot “optimize” or reinterpret values like top_k, because they are no longer expressed as meaningful language—only as data.
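To close the loop, here is a rough sketch of what the retrieval service does with { query, code } (assuming the azure-search-documents SDK; the endpoint, index, key, and the resolve_code lookup from the earlier sketch are placeholders, and only top_k is mapped for brevity):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder configuration; real values come from the service's settings.
SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
SEARCH_INDEX = "<your-index>"
SEARCH_API_KEY = "<your-key>"

search_client = SearchClient(
    endpoint=SEARCH_ENDPOINT,
    index_name=SEARCH_INDEX,
    credential=AzureKeyCredential(SEARCH_API_KEY),
)

def handle_retrieval_request(query: str, code: str) -> list[dict]:
    """Resolve the code word server-side and run the real search call."""
    params = resolve_code(code)  # hypothetical lookup, as sketched earlier
    # Only top_k is shown; threshold and search_mode would be translated
    # into the corresponding search options in the same way.
    results = search_client.search(search_text=query, top=params["top_k"])
    return [dict(result) for result in results]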

Metrics

When creating an experiment, it is important to think about how you will validate that the experiment ran successfully. For example, if your experiment varies top_k, you should have a metric that validates the model sent the correct top_k parameter to the search tool.

Strict adherence to this concept is what allowed us to immediately detect that the agent was not sending the expected parameters reliably.
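As a minimal sketch, this kind of check can be computed directly from logged tool-call payloads (the log format and field name below are assumptions, not our exact telemetry schema):

# Hypothetical adherence metric over logged retrieval tool calls.
# Each log entry is assumed to record the parameters actually sent.
def top_k_adherence(tool_call_logs: list[dict], expected_top_k: int) -> float:
    """Return the fraction of retrieval calls that used the expected top_k."""
    if not tool_call_logs:
        return 0.0
    compliant = sum(1 for call in tool_call_logs if call.get("top_k") == expected_top_k)
    return compliant / len(tool_call_logs)

# A run like the one described above (68.4% adherence) would return roughly 0.684 here.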

Conclusion

Using code words to represent parameters that need strict adherence is an effective strategy to ensure agentic systems follow instructions precisely.

More broadly, this approach highlights a key design principle for agentic systems: if you need determinism, remove semantics. Treat configuration as data, not language.

This approach has proven to be reliable across multiple experiments and sprints, providing a robust solution for controlled experimentation with agentic systems.

Attribution

The image used in this post was created using ChatGPT 5.2 (an OpenAI model).

Author