What's New in Microsoft Foundry Fine-Tuning | April 2026

This month we’re shipping three updates that make Reinforcement Fine-Tuning (RFT) more accessible, more powerful, and easier to get right:

Global Training for o4-mini: train from 13 Azure regions today, at lower per-token rates than Standard training.
New model graders: GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano are now available as model graders, giving you more flexibility and cost control when scoring model outputs.
RFT best practices: a distilled guide to help you design graders, prepare data, and avoid common pitfalls.

Read on for the details.

Global Training for o4-mini

Global Training lets you customize models from more regions at the same affordable pricing as our other Global offerings. With this update, o4-mini joins the list of models you can train globally:

Train from anywhere: launch fine-tuning jobs for o4-mini from 13 Azure regions today, expanding to all fine-tuning regions by the end of April.
Save on training costs: benefit from lower per-token training rates than Standard training.
Same quality, broader reach: identical training infrastructure and model quality regardless of the region you start from.

Currently available regions: East US 2, North Central US, West US 3, Australia East, France Central, Germany West Central, Switzerland North, Norway East, Poland Central, Spain Central, Italy North, Switzerland West, and Sweden Central.

o4-mini is one of the most popular models for reasoning-intensive and agentic workloads. Adding Global Training support makes it significantly more cost-effective to customize at scale—especially for teams spread across multiple geographies.

Create an o4-mini Global Training Job via REST API

curl -X POST "https://<your-resource>.openai.azure.com/openai/fine_tuning/jobs?api-version=2025-04-01-preview" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -d '{
    "model": "o4-mini",
    "training_file": "<your-training-file-id>",
    "method": {
      "type": "reinforcement",
      "reinforcement": {
        "grader": {
          "type": "string_check",
          "name": "answer-check",
          "input": "{{sample.output_text}}",
          "reference": "{{item.reference_answer}}",
          "operation": "eq"
        }
      }
    },
    "hyperparameters": {
      "n_epochs": 2,
      "compute_multiplier": 1.0
    },
    "trainingType": "globalstandard"
  }'
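
If you script job submission, the same request can be assembled in Python. The sketch below uses only the standard library and builds the request without sending it. The helper name build_global_rft_request is hypothetical, and the endpoint, API version, and body fields simply mirror the curl example above rather than adding anything new.

```python
import json
import urllib.request


def build_global_rft_request(resource, api_key, training_file_id):
    """Build (but do not send) the fine-tuning request from the curl example."""
    url = (
        f"https://{resource}.openai.azure.com/openai/fine_tuning/jobs"
        "?api-version=2025-04-01-preview"
    )
    body = {
        "model": "o4-mini",
        "training_file": training_file_id,
        "method": {
            "type": "reinforcement",
            "reinforcement": {
                "grader": {
                    "type": "string_check",
                    "name": "answer-check",
                    "input": "{{sample.output_text}}",
                    "reference": "{{item.reference_answer}}",
                    "operation": "eq",
                }
            },
        },
        "hyperparameters": {"n_epochs": 2, "compute_multiplier": 1.0},
        "trainingType": "globalstandard",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder values; substitute your own resource name, key, and file ID.
request = build_global_rft_request("my-resource", "placeholder-key", "file-placeholder")
print(request.get_method(), request.full_url)
```

To submit the job, pass the returned object to urllib.request.urlopen and parse the JSON response.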

Learn more about fine-tuning

New Model Graders: GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano

Graders are the engine of RFT—they define the reward signal your model optimizes against. Until now, model-based graders were limited to a smaller set of models. Starting this month, three additional models are available as graders:

GPT-4.1
GPT-4.1-mini
GPT-4.1-nano

When to Use Model Graders

Deterministic graders (string-match, Python, endpoint-based) should remain your default—they are faster, cheaper, and more reproducible. Reach for model graders when:

The task output is open-ended or subjective (e.g., summarization quality, tone adherence, multi-step reasoning coherence) and cannot be reduced to a simple string check.
You need to score partial credit across multiple dimensions—such as factual accuracy, completeness, and safety—in a single grading pass.
You are building an agentic workflow where tool-call correctness depends on semantic context that pattern matching cannot capture.

Choosing the Right Model Grader

Start with GPT-4.1-nano for initial iterations; its low cost lets you run more experiments with faster feedback loops.
Upgrade to GPT-4.1-mini once your grading rubric is stable and you need higher fidelity.
Reserve GPT-4.1 for production grading or complex rubrics where every scoring decision counts.

Tip: You can mix grader types within a single RFT job. For example, use a string-match grader for the “correct answer” dimension and a GPT-4.1-mini model grader for evaluating the “reasoning quality” dimension.
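
As a rough illustration of the tip above, the sketch below assembles a combined grader as a Python dictionary. It assumes that the multi and score_model grader types, and the calculate_output formula field, are accepted by your resource in the shape used by the OpenAI graders API. The grader names, rubric prompt, weights, and model identifier are illustrative placeholders, and combine is a hypothetical helper.

```python
import json

# Deterministic check for the "correct answer" dimension.
correctness = {
    "type": "string_check",
    "name": "correct_answer",
    "input": "{{sample.output_text}}",
    "reference": "{{item.reference_answer}}",
    "operation": "eq",
}

# Model grader for the "reasoning quality" dimension (assumed schema).
reasoning = {
    "type": "score_model",
    "name": "reasoning_quality",
    "model": "gpt-4.1-mini",  # may need to match your deployment or model version
    "input": [
        {
            "role": "user",
            "content": (
                "Rate from 0 to 1 how coherent the reasoning in this answer is. "
                "Reply with the number only.\n\nAnswer: {{sample.output_text}}"
            ),
        }
    ],
    "range": [0, 1],
}


def combine(graders, weights):
    """Build a multi grader whose score is a weighted sum of its sub-graders."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    formula = " + ".join(f"{weights[g['name']]} * {g['name']}" for g in graders)
    return {
        "type": "multi",
        "graders": {g["name"]: g for g in graders},
        "calculate_output": formula,
    }


multi_grader = combine(
    [correctness, reasoning],
    {"correct_answer": 0.7, "reasoning_quality": 0.3},
)
print(json.dumps(multi_grader, indent=2))
```

Weighting the deterministic check more heavily keeps most of the reward reproducible while still letting the model grader contribute a finer-grained signal.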

Reinforcement Fine-Tuning Best Practices

Whether you are using the new model graders or deterministic ones, the following best practices will help you get the most out of RFT.

When to Use RFT

RFT improves reasoning accuracy and decision quality in tasks where outputs can be clearly evaluated and scored. It is especially effective when:

Tool-calling accuracy matters — the model must select and invoke the right tools with correct parameters.
Policy or rubric enforcement — outputs need to follow specific business rules that a grader can validate.
Structured data extraction — correctness is unambiguous and can be scored deterministically.

Not a fit for style or tone. If you need formatting, voice, or stylistic adjustments, prefer prompt engineering, structured outputs, or supervised fine-tuning (SFT).

Step 1: Define the Objective

Start by clearly stating the task and what success looks like. Then design a grader that reflects real task quality as reliably as possible. The grader is the primary driver of RFT success—invest disproportionate effort here.

Step 2: Establish a Baseline

Before training, run a baseline evaluation on a small set of examples (10–100 samples) so you understand starting performance and can measure real improvement. Evaluate using a base model (for example, o4-mini) and experiment with system prompts to reach the best possible performance before fine-tuning.
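
A baseline can be as simple as averaging a grader-like score over a handful of rows. The sketch below assumes an exact-match task; fake_model is a hypothetical stand-in for a call to your base model deployment, and the question and reference_answer field names are illustrative.

```python
def exact_match(output, reference):
    """Return 1.0 on strict equality, intended to mirror a string_check 'eq' grader."""
    return 1.0 if output == reference else 0.0


def baseline_score(rows, generate):
    """Average exact-match score of `generate` over the evaluation rows."""
    if not rows:
        raise ValueError("need at least one evaluation row")
    scores = [exact_match(generate(r["question"]), r["reference_answer"]) for r in rows]
    return sum(scores) / len(scores)


# Hypothetical stand-in for the base model; replace with a real model call.
def fake_model(question):
    canned = {"Capital of France?": "Paris", "Capital of Spain?": "Lisbon"}
    return canned.get(question, "")


rows = [
    {"question": "Capital of France?", "reference_answer": "Paris"},
    {"question": "Capital of Spain?", "reference_answer": "Madrid"},
]
print(f"baseline accuracy: {baseline_score(rows, fake_model):.2f}")
```

Scoring the baseline with the same logic you plan to use as the training grader keeps the before and after numbers comparable.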

Learn about Foundry Evaluation

Step 3: Design Effective Graders

The grader determines what the model optimizes for. Follow these principles:

Use the simplest grader that works. If validating an exact-match answer (a number, a multiple-choice letter), use a string-match grader rather than a model-based or Python grader.
Prefer deterministic checks. String validation, code/Python-based graders, and endpoint-based graders are more reliable than model-based grading.
Aim for well-distributed rewards. Rewards that are too sparse or too uniform produce weak learning signals that limit model improvement.
Validate on real-world inputs. Check the grader against diverse, real-world inputs rather than relying only on synthetic data.
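
One way to check the reward-distribution principle before training is to run the grader over baseline outputs and summarize the scores. The sketch below assumes rewards in the range 0 to 1; the reward_summary helper and its thresholds are hypothetical starting points, not documented limits.

```python
from statistics import mean, pstdev


def reward_summary(rewards, low=0.05, high=0.95):
    """Summarize grader scores in [0, 1] and flag weak learning signals."""
    if not rewards:
        raise ValueError("need at least one reward")
    avg = mean(rewards)
    spread = pstdev(rewards)
    frac_zero = sum(r == 0 for r in rewards) / len(rewards)
    if spread == 0:
        verdict = "uniform: every sample gets the same reward"
    elif avg < low:
        verdict = "sparse: almost no sample earns reward"
    elif avg > high:
        verdict = "saturated: almost every sample already earns full reward"
    else:
        verdict = "ok"
    return {"mean": avg, "stdev": spread, "frac_zero": frac_zero, "verdict": verdict}


print(reward_summary([0.0, 0.5, 1.0, 0.5]))
```

If the verdict is sparse, consider adding partial credit; if it is uniform or saturated, the task may be too easy or the grader too coarse to separate good outputs from bad ones.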

Step 4: Start Small and Iterate

Begin with small datasets (10–100 samples), simple graders, and low epoch counts. A practical workflow:

Start with o4-mini RFT to validate the end-to-end setup and grader behavior.
Graduate to larger models once the reward signal and training loop look healthy.
Change one variable at a time so gains or regressions can be clearly attributed.

Step 5: Tune Training Parameters

Expect epoch count and compute_multiplier to have the most impact on quality. Adjust one at a time and monitor the reward trend and variance throughout training.
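
A minimal sketch of the one-variable-at-a-time approach: starting from a baseline configuration, generate runs that each change exactly one hyperparameter. The one_at_a_time helper is hypothetical and the candidate values are illustrative; check the valid ranges for your model before submitting jobs.

```python
def one_at_a_time(baseline, candidates):
    """Yield configs that differ from `baseline` in exactly one hyperparameter."""
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                yield {**baseline, name: value}


baseline = {"n_epochs": 2, "compute_multiplier": 1.0}
candidates = {"n_epochs": [1, 2, 3], "compute_multiplier": [0.5, 1.0, 2.0]}

for config in one_at_a_time(baseline, candidates):
    print(config)
```

Because every run differs from the baseline in a single setting, any change in the reward trend can be attributed to that setting.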

RFT Data Format

RFT requires a different data format from SFT. The final message in each row must have the user or developer role, not the assistant role.

SFT format (answer in the assistant message):

{
  "messages": [
    { "role": "system", "content": "Reply to the user's question as accurately as possible." },
    { "role": "user", "content": "Question: What is the capital of France?" },
    { "role": "assistant", "content": "Paris" }
  ]
}

RFT format (answer moved to a top-level key for the grader):

{
  "messages": [
    { "role": "developer", "content": "Reply to the user's question as accurately as possible." },
    { "role": "user", "content": "Question: What is the capital of France?" }
  ],
  "reference_answer": "Paris"
}

The grader can reference reference_answer as item.reference_answer. Any other custom top-level key is available the same way, as item.<key>.
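
If you already have SFT data, the conversion can be automated. The sketch below follows the two examples above: it moves the final assistant message to a top-level key and renames the system role to developer. The sft_to_rft helper is hypothetical, and it assumes each SFT row ends with exactly one assistant answer.

```python
import json


def sft_to_rft(row, answer_key="reference_answer"):
    """Convert one SFT chat row into the RFT layout shown above."""
    messages = list(row["messages"])
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the SFT row to end with an assistant message")
    answer = messages.pop()["content"]
    converted = [
        {"role": "developer" if m["role"] == "system" else m["role"],
         "content": m["content"]}
        for m in messages
    ]
    if not converted or converted[-1]["role"] not in ("user", "developer"):
        raise ValueError("RFT rows must end with a user or developer message")
    return {"messages": converted, answer_key: answer}


sft_row = {
    "messages": [
        {"role": "system", "content": "Reply to the user's question as accurately as possible."},
        {"role": "user", "content": "Question: What is the capital of France?"},
        {"role": "assistant", "content": "Paris"},
    ]
}
print(json.dumps(sft_to_rft(sft_row)))
```

Write one converted object per line to produce the JSONL training file.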

Common Pitfalls

Data and Grader Mismatch

Every key referenced in your grader (e.g., item.reference_answer) must exist in all data rows. If your grader references item.capital but your data uses reference_answer, the job will fail silently or score incorrectly.

Example of a mismatched grader:

{
  "type": "string_check",
  "name": "answer-check",
  "input": "{{sample.output_text}}",
  "reference": "{{item.capital}}",
  "operation": "eq"
}

If your data uses reference_answer instead of capital, update the grader reference to {{item.reference_answer}}.
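
A quick pre-flight check can catch this mismatch before you submit a job. The sketch below scans a grader definition for item.<key> template references and reports which JSONL rows lack a referenced key. Both helper names are hypothetical, and the regular expression only handles simple top-level keys.

```python
import json
import re

# Matches the key name in templates such as {{item.reference_answer}}.
ITEM_REF = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)")


def referenced_item_keys(grader):
    """Collect every top-level item key referenced anywhere in the grader."""
    return set(ITEM_REF.findall(json.dumps(grader)))


def rows_missing_keys(grader, jsonl_lines):
    """Return {line_number: missing_keys} for rows lacking a referenced key."""
    needed = referenced_item_keys(grader)
    problems = {}
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue
        missing = needed - set(json.loads(line))
        if missing:
            problems[number] = sorted(missing)
    return problems


grader = {
    "type": "string_check",
    "name": "answer-check",
    "input": "{{sample.output_text}}",
    "reference": "{{item.capital}}",
    "operation": "eq",
}
data = ['{"messages": [], "reference_answer": "Paris"}']
print(rows_missing_keys(grader, data))
```

An empty result means every row provides every key the grader references.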

Missing Response Format

To reference sample.output_json in your grader, you must provide a response format in the job definition. Without it, the model outputs free-form text and JSON grader references will fail.

{
  "type": "json_schema",
  "json_schema": {
    "name": "response",
    "strict": true,
    "schema": {
      "properties": {
        "capital": { "title": "Capital", "type": "string" },
        "population": { "title": "Population", "type": "string" }
      },
      "title": "CapitalData",
      "type": "object",
      "additionalProperties": false
    }
  }
}
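
The same kind of pre-flight check works here. The sketch below raises an error when a grader mentions sample.output_json but no json_schema response format has been supplied. The check_response_format helper is hypothetical, and the {{sample.output_json.capital}} template path is an assumption about how a field of the JSON output is addressed.

```python
import json


def check_response_format(grader, response_format):
    """Raise if the grader reads sample.output_json but no JSON schema is set."""
    uses_json = "sample.output_json" in json.dumps(grader)
    has_schema = bool(response_format) and response_format.get("type") == "json_schema"
    if uses_json and not has_schema:
        raise ValueError(
            "grader references sample.output_json but no json_schema "
            "response format is set"
        )
    return uses_json


json_grader = {
    "type": "string_check",
    "name": "capital-check",
    "input": "{{sample.output_json.capital}}",  # assumed template path
    "reference": "{{item.reference_answer}}",
    "operation": "eq",
}
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "response", "strict": True, "schema": {"type": "object"}},
}
print(check_response_format(json_grader, response_format))
```

Running this alongside the key check from the previous section covers both pitfalls before any training compute is spent.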

Advanced: Agentic RFT Scenarios

Tool Design

Treat tools as part of the environment, not passive helpers. Build tools that reflect the full decision-making cycle your task requires—not just the final action. For example, an automatic escalation workflow shouldn’t only have a tool to trigger escalation; it also needs a tool to check recipient availability first. Without that step, the model never learns when escalation is appropriate.

Design for training-scale traffic: set timeouts and rate limits, add tracing (latency + error codes), and plan retry behavior so that slow calls don’t cascade into a retry storm.
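
To make the retry guidance concrete, here is a minimal sketch of a retry wrapper with capped exponential backoff, jitter, and an overall deadline. The call_with_retries helper, its defaults, and the flaky_lookup tool are hypothetical; per-call timeouts and rate limiting would still need to be enforced by the tool or its HTTP client.

```python
import random
import time


def call_with_retries(tool, *, attempts=3, base_delay=0.5, max_delay=8.0,
                      deadline_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call tool() with capped exponential backoff, jitter, and an overall deadline."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = clock()
    for attempt in range(attempts):
        try:
            return tool()
        except Exception:  # in real code, catch only transient errors
            # Full jitter spreads retries out so concurrent rollouts do not
            # hit a slow tool in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            out_of_attempts = attempt == attempts - 1
            out_of_time = (clock() - start) + delay > deadline_s
            if out_of_attempts or out_of_time:
                raise
            sleep(delay)


# Demo: a hypothetical tool that fails twice before succeeding.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated slow backend")
    return {"available": True}


waits = []  # record delays instead of sleeping so the demo runs instantly
result = call_with_retries(flaky_lookup, sleep=waits.append)
print(result, "after", calls["count"], "calls")
```

Bounding both the number of attempts and the total time spent keeps one slow dependency from multiplying traffic across many parallel training rollouts.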

MCP Server Integration

RFT supports tool use through function-calling, but MCP is the preferred approach for production agentic systems. Implement each tool once, then expose it two ways—via an MCP interface for MCP-native clients and via a function-calling-compatible interface for fine-tuning. This lets you seamlessly integrate with Agents, Evaluations, and Reinforcement Fine-Tuning on the Foundry platform.

Explore agentic RFT samples

Monitor for Reward Hacking

Don’t wait for final scores—inspect outputs and evaluation metrics throughout training using the Metrics tab on the fine-tuning job detail page in Foundry.

Signs of reward hacking:

Eval scores improve while visible output quality degrades.
The model produces responses that “match” the grader without performing the intended behavior (e.g., a semantically incorrect tool call that still passes pattern checks).

Mitigations:

Use held-out evaluation sets with diverse, real-world inputs.
Give partial credit across multiple dimensions (outcome, tool use, safety).
Explicitly require critical intermediate steps (e.g., lookups before writes).
Keep grading deterministic so improvements reflect policy changes, not grader noise.
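
Several of these mitigations can live in one deterministic Python grader. The sketch below assumes the grade(sample, item) entry point used by Python graders and assumes that tool calls appear in sample["output_tools"] as objects with a function.name field. The tool names (lookup_record, update_record, delete_record) and the weights are hypothetical.

```python
WEIGHTS = {"outcome": 0.5, "tool_use": 0.3, "safety": 0.2}


def grade(sample, item):
    """Return a reward in [0, 1] built from outcome, tool-use, and safety scores."""
    calls = [call["function"]["name"] for call in sample.get("output_tools") or []]

    # Outcome: does the final answer match the reference?
    answer = (sample.get("output_text") or "").strip()
    outcome = 1.0 if answer == item["reference_answer"] else 0.0

    # Tool use: a write only earns credit if a lookup happened before it.
    if "update_record" in calls:
        first_write = calls.index("update_record")
        tool_use = 1.0 if "lookup_record" in calls[:first_write] else 0.0
    elif "lookup_record" in calls:
        tool_use = 0.5  # partial credit: looked up but never wrote
    else:
        tool_use = 0.0

    # Safety: any call to a forbidden tool zeroes this dimension.
    safety = 0.0 if "delete_record" in calls else 1.0

    scores = {"outcome": outcome, "tool_use": tool_use, "safety": safety}
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


sample = {
    "output_text": "done",
    "output_tools": [
        {"function": {"name": "lookup_record"}},
        {"function": {"name": "update_record"}},
    ],
}
print(grade(sample, {"reference_answer": "done"}))
```

Because a correct final answer alone earns only part of the reward, a response that skips the lookup step scores visibly lower than one that follows the intended procedure.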

What’s Next

Read the full RFT Best Practices guide on GitHub.
Explore the fine-tuning code samples for end-to-end workflows.
Review the Reinforcement Fine-Tuning how-to in Microsoft Learn.

Conclusion

This month’s RFT updates work together: Global Training for o4-mini lowers your training costs across regions, new GPT-4.1 model graders give you richer reward signals for complex evaluation tasks, and the best practices guide helps you avoid common pitfalls from day one. Start small with a handful of scored examples and a simple grader, validate your setup, and scale from there.
