{"id":1947,"date":"2026-02-13T15:13:44","date_gmt":"2026-02-13T23:13:44","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/foundry\/?p=1947"},"modified":"2026-02-13T15:13:44","modified_gmt":"2026-02-13T23:13:44","slug":"dpo-fine-tuning-using-microsoft-foundry-sdk","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/foundry\/dpo-fine-tuning-using-microsoft-foundry-sdk\/","title":{"rendered":"DPO Fine-Tuning Using Microsoft Foundry SDK"},"content":{"rendered":"<p>In the rapidly evolving landscape of large language models (LLMs), achieving precise control over model behavior while maintaining quality has become a critical challenge. While models like GPT-4 demonstrate impressive capabilities, ensuring their outputs align with human preferences\u2014whether for safety, helpfulness, or style\u2014requires sophisticated fine-tuning techniques. Direct Preference Optimization (DPO) represents a breakthrough approach that simplifies this alignment process while delivering exceptional results.<\/p>\n<p>This comprehensive guide explores DPO fine-tuning, explaining what it is, how it works, when to use it, and how to implement it using <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/how-to\/develop\/sdk-overview?view=foundry&amp;pivots=programming-language-python\">Microsoft Foundry SDK<\/a>. Whether you&#8217;re building a customer service chatbot that needs to be consistently helpful, a content generation system that should avoid harmful outputs, or any AI application where response quality matters, understanding DPO will empower you to create better-aligned models.<\/p>\n<h2 id=\"community-4484411-toc-hId-1339443320\">What is Direct Preference Optimization (DPO)?<\/h2>\n<p>Direct Preference Optimization is an innovative technique for training language models to align with human preferences without the complexity of traditional Reinforcement Learning from Human Feedback (RLHF). Introduced in the groundbreaking paper\u00a0<a class=\"lia-external-url\" href=\"https:\/\/arxiv.org\/abs\/2305.18290\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Direct Preference Optimization: Your Language Model is Secretly a Reward Model&#8221; by Rafailov, Sharma,&#8230;<\/a>, DPO fundamentally reimagines how we teach models to generate preferred outputs.<\/p>\n<p>Unlike traditional supervised fine-tuning where you show a model &#8220;what to say,&#8221; DPO teaches models by showing them comparative examples: &#8220;this response is better than that one.&#8221; For each prompt, you provide:<\/p>\n<ul>\n<li><strong>Preferred response<\/strong>: A high-quality, desirable output<\/li>\n<li><strong>Non-preferred response<\/strong>: A lower-quality or undesirable output<\/li>\n<\/ul>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/01\/dpo-image.webp\"><img decoding=\"async\" class=\"alignnone size-full wp-image-1948\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/01\/dpo-image.webp\" alt=\"dpo image image\" width=\"936\" height=\"573\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/01\/dpo-image.webp 936w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/01\/dpo-image-300x184.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/01\/dpo-image-768x470.webp 768w\" sizes=\"(max-width: 936px) 100vw, 936px\" \/><\/a><\/p>\n<p>The model learns to increase the likelihood of generating preferred responses while decreasing the probability of non-preferred ones, all without requiring explicit reward modeling or complex reinforcement learning pipelines.<\/p>\n<p id=\"community-4484411-toc-hId-222550331\"><strong>Best Use Cases for DPO:<\/strong><\/p>\n<ul>\n<li>Response Quality &amp; Accuracy Improvement<\/li>\n<li>Reading Comprehension &amp; Summarization<\/li>\n<li>Safety &amp; Harmfulness Reduction<\/li>\n<li>Style, Tone, &amp; Brand Voice Alignment<\/li>\n<li>Helpfulness &amp; User Preference Optimization<\/li>\n<\/ul>\n<h2 id=\"community-4484411-toc-hId-212047227\">How Direct Preference Optimization Works<\/h2>\n<p>The following code demonstrates DPO fine-tuning using the\u00a0<a class=\"lia-external-url\" href=\"https:\/\/pypi.org\/project\/azure-ai-projects\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Microsoft Foundry Projects SDK<\/a>:<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">import os\r\nfrom dotenv import load_dotenv\r\nfrom azure.identity import DefaultAzureCredential\r\nfrom azure.ai.projects import AIProjectClient\r\n\r\n# Load environment variables\r\nload_dotenv()\r\n\r\nendpoint = os.environ.get(\"AZURE_AI_PROJECT_ENDPOINT\")\r\nmodel_name = os.environ.get(\"MODEL_NAME\")\r\n\r\n# Define dataset file paths\r\ntraining_file_path = \"training.jsonl\"\r\nvalidation_file_path = \"validation.jsonl\"\r\n\r\ncredential = DefaultAzureCredential()\r\nproject_client = AIProjectClient(endpoint=endpoint, credential=credential)\r\nopenai_client = project_client.get_openai_client()\r\n\r\n# Upload training and validation files\r\nwith open(training_file_path, \"rb\") as f:\r\n    train_file = openai_client.files.create(file=f, purpose=\"fine-tune\")\r\n\r\nwith open(validation_file_path, \"rb\") as f:\r\n    validation_file = openai_client.files.create(file=f, purpose=\"fine-tune\")\r\n\r\nopenai_client.files.wait_for_processing(train_file.id)\r\nopenai_client.files.wait_for_processing(validation_file.id)\r\n\r\n# Create DPO Fine Tuning job\r\nfine_tuning_job = openai_client.fine_tuning.jobs.create(\r\n    training_file=train_file.id,\r\n    validation_file=validation_file.id,\r\n    model=model_name,\r\n    method={\r\n        \"type\": \"dpo\",\r\n        \"dpo\": {\r\n            \"hyperparameters\": {\r\n                \"n_epochs\": 3,\r\n                \"batch_size\": 1,\r\n                \"learning_rate_multiplier\": 1.0\r\n            }\r\n        }\r\n    },\r\n    extra_body={\"trainingType\": \"GlobalStandard\"}\r\n)<\/code><\/pre>\n<p id=\"community-4484411-toc-hId--904845762\">DPO Fine-Tuning Results:<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">print(f\"Testing fine-tuned model via deployment: {deployment_name}\")\r\n\r\nresponse = openai_client.responses.create(\r\n    model=deployment_name,\r\n    input=[{\"role\": \"user\", \"content\": \"Explain machine learning in simple terms.\"}]\r\n)\r\n\r\nprint(f\"Model response: {response.output_text}\")<\/code><\/pre>\n<p>Inference result:<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">Model response: Machine learning is like teaching a computer to learn from experience, similar to how people do. Instead of programming specific instructions for every task, we give the computer a lot of data and it figures out patterns on its own. Then, it can use what it learned to make decisions or predictions. For example, if you show a machine learning system lots of pictures of cats and dogs, it will learn to recognize which is which by itself.<\/code><\/pre>\n<p id=\"community-4484411-toc-hId-1582667071\">Data format example:<\/p>\n<pre class=\"prettyprint language-json\"><code class=\"language-json\">{\r\n  \"input\": {\r\n    \"messages\": [\r\n      {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\r\n      {\"role\": \"user\", \"content\": \"What is the capital of France?\"}\r\n    ]\r\n  },\r\n  \"preferred_output\": [\r\n    {\"role\": \"assistant\", \"content\": \"The capital of France is Paris.\"}\r\n  ],\r\n  \"non_preferred_output\": [\r\n    {\"role\": \"assistant\", \"content\": \"I think it's London.\"}\r\n  ]\r\n}<\/code><\/pre>\n<h2 id=\"community-4484411-toc-hId-1572163967\">Comparing DPO to Other Methods<\/h2>\n<table border=\"1\">\n<tbody>\n<tr>\n<td><strong>Aspect<\/strong><\/td>\n<td><strong>DPO<\/strong><\/td>\n<td><strong>SFT<\/strong><\/td>\n<td><strong>RFT<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Learning signal<\/td>\n<td>Comparative preferences<\/td>\n<td>Input-output pairs<\/td>\n<td>Graded exploration<\/td>\n<\/tr>\n<tr>\n<td>Data requirement<\/td>\n<td>Preference pairs<\/td>\n<td>Example demonstrations<\/td>\n<td>Problems + grader<\/td>\n<\/tr>\n<tr>\n<td>Best for<\/td>\n<td>Quality alignment<\/td>\n<td>Task learning<\/td>\n<td>Complex reasoning<\/td>\n<\/tr>\n<tr>\n<td>Computational cost<\/td>\n<td>Moderate<\/td>\n<td>Low<\/td>\n<td>High<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 id=\"community-4484411-toc-hId--235290496\">Learn more<\/h2>\n<ul>\n<li role=\"presentation\"><span data-olk-copy-source=\"MessageBody\">Watch\u00a0on-demand:\u00a0<a id=\"OWA75711f6e-8ec1-572a-f983-a329eaccc0f9\" title=\"Original URL: https:\/\/ignite.microsoft.com\/en-US\/sessions\/BRK188?source=sessions. Click or tap if you trust this link.\" href=\"https:\/\/nam06.safelinks.protection.outlook.com\/?url=https%3A%2F%2Fignite.microsoft.com%2Fen-US%2Fsessions%2FBRK188%3Fsource%3Dsessions&amp;data=05%7C02%7Ckingernupur%40microsoft.com%7Cb6ae5ecd53ee4d3bf8e008de6989e2a6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C639064238111810769%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;sdata=gydnYWKyZiI22zsjFGcGwvuPv6KGV%2FmfT8Wheck8Suk%3D&amp;reserved=0\" data-auth=\"NotApplicable\" data-linkindex=\"7\">AI fine-tuning in Microsoft Foundry to make your agents unstoppable<\/a><\/span><\/li>\n<li role=\"presentation\">Join the next Model Mondays Livestream\u00a0<a id=\"OWAf38077b2-2d9d-f045-fe9c-f73cbbca27e2\" title=\"Original URL: https:\/\/developer.microsoft.com\/en-us\/reactor\/series\/S-1485\/. Click or tap if you trust this link.\" href=\"https:\/\/nam06.safelinks.protection.outlook.com\/?url=https%3A%2F%2Fdeveloper.microsoft.com%2Fen-us%2Freactor%2Fseries%2FS-1485%2F&amp;data=05%7C02%7Ckingernupur%40microsoft.com%7Cb6ae5ecd53ee4d3bf8e008de6989e2a6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C639064238111821245%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;sdata=nuaivTskhS7t4xe7cFp17kdN8lWBN3MEAP1uy5XO19M%3D&amp;reserved=0\" data-auth=\"NotApplicable\" data-linkindex=\"8\">Model Mondays | Microsoft Reactor<\/a><\/li>\n<li role=\"presentation\">Learn more about fine-tuning on Microsoft Foundry\u00a0<a id=\"OWAd24441c5-3a0a-1cd9-1ce7-c6176687c94a\" title=\"Original URL: https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/concepts\/fine-tuning-overview?view=foundry-classic. Click or tap if you trust this link.\" href=\"https:\/\/nam06.safelinks.protection.outlook.com\/?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Fazure%2Fai-foundry%2Fconcepts%2Ffine-tuning-overview%3Fview%3Dfoundry-classic&amp;data=05%7C02%7Ckingernupur%40microsoft.com%7Cb6ae5ecd53ee4d3bf8e008de6989e2a6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C639064238111831782%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;sdata=L%2B96SWkHeG5t%2BZ%2FcbRQadiF91wrI7hfoGZoKhsUdZCE%3D&amp;reserved=0\" data-auth=\"NotApplicable\" data-linkindex=\"9\">Fine-tune models with Microsoft Foundry &#8211; Microsoft Foundry | Microsoft Learn<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In the rapidly evolving landscape of large language models (LLMs), achieving precise control over model behavior while maintaining quality has become a critical challenge. While models like GPT-4 demonstrate impressive capabilities, ensuring their outputs align with human preferences\u2014whether for safety, helpfulness, or style\u2014requires sophisticated fine-tuning techniques. Direct Preference Optimization (DPO) represents a breakthrough approach that [&hellip;]<\/p>\n","protected":false},"author":206591,"featured_media":1563,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1947","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-foundry"],"acf":[],"blog_post_summary":"<p>In the rapidly evolving landscape of large language models (LLMs), achieving precise control over model behavior while maintaining quality has become a critical challenge. While models like GPT-4 demonstrate impressive capabilities, ensuring their outputs align with human preferences\u2014whether for safety, helpfulness, or style\u2014requires sophisticated fine-tuning techniques. Direct Preference Optimization (DPO) represents a breakthrough approach that [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/1947","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/users\/206591"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/comments?post=1947"}],"version-history":[{"count":2,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/1947\/revisions"}],"predecessor-version":[{"id":2007,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/1947\/revisions\/2007"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media\/1563"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media?parent=1947"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/categories?post=1947"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/tags?post=1947"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}