{"id":1941,"date":"2026-03-23T19:26:27","date_gmt":"2026-03-23T19:26:27","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/all-things-azure\/?p=1941"},"modified":"2026-03-23T19:26:27","modified_gmt":"2026-03-23T19:26:27","slug":"agentic-platform-engineering-with-github-copilot","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/all-things-azure\/agentic-platform-engineering-with-github-copilot\/","title":{"rendered":"Agentic Platform Engineering with GitHub Copilot"},"content":{"rendered":"<p>We&#8217;ve talked about the <a class=\"\" href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/the-human-scale-problem-in-platform-engineering\/\">human scale problem<\/a>\u00a0and what happens\u00a0<a class=\"\" href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/when-infrastructure-scales-but-understanding-doesnt\/\">when infrastructure scales but understanding doesn&#8217;t<\/a>. If you&#8217;ve been following along, you know the thesis: our tools have outpaced our ability to operate them, and platform engineering is how we&#8217;re fighting back.<\/p>\n<p>But here&#8217;s the thing &#8211; we&#8217;ve been fighting with one hand tied behind our backs. We&#8217;ve been encoding knowledge into runbooks that go stale, documentation that drifts, and tribal expertise that walks out the door when someone takes a new job. 
What if the platform itself could think alongside us?<\/p>\n<p>That&#8217;s what we mean by\u00a0<strong>agentic platform engineering<\/strong>: not replacing the humans, but giving the platform the ability to observe, reason, and act &#8211; with humans still firmly in the pilot seat.<\/p>\n<p>Everything we cover in this post has a companion repository you can clone, fork, and run yourself:\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\" target=\"_blank\" rel=\"noopener noreferrer\">microsoftgbb\/agentic-platform-engineering<\/a>.<\/p>\n<p>You can also follow along with Ray&#8217;s walkthrough on YouTube:<\/p>\n<p><iframe title=\"YouTube video player\" src=\"\/\/www.youtube.com\/embed\/M_YX74ATz0I?si=rzu1PHyYz53KHVnR\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<div id=\"title\" class=\"style-scope ytd-watch-metadata\">\n<p class=\"style-scope ytd-watch-metadata\"><iframe title=\"YouTube video player\" src=\"\/\/www.youtube.com\/embed\/sYM_X6tOgDw?si=DJrVx-2b-CWOmpns\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<\/div>\n<p>The companion repository is organized into three acts that mirror the evolution we&#8217;ll walk through below, complete with agent definitions, GitHub Actions workflows, MCP configurations, and sample Argo CD manifests.<\/p>\n<h2 id=\"the-paradox-of-choice\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">The Paradox of Choice<\/h2>\n<h2 class=\"anchor anchorTargetStickyNavbar_Vzrq\"><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-landscape.webp\"><img decoding=\"async\" class=\"aligncenter wp-image-1968 size-full\" 
src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-landscape.webp\" alt=\"pe landscape image\" width=\"721\" height=\"391\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-landscape.webp 721w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-landscape-300x163.webp 300w\" sizes=\"(max-width: 721px) 100vw, 721px\" \/><\/a><\/h2>\n<p>No platform engineering conversation is complete without the eye chart. It&#8217;s the paradox of choice made real. You have a thousand different tools. You might know they exist, but the granular details of how to use each one, when to reach for it, how to compose them together? <strong>There&#8217;s no way the human mind can hold all of that.<\/strong><\/p>\n<p>And yet, that&#8217;s exactly what we&#8217;re asking platform engineers to do. Case in point: Platform Engineering on AKS with GitOps, CAPZ and ASOv2.<\/p>\n<p>Build a GitOps-Driven Platform on AKS with the App of Apps Pattern | AKS LABS<a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/app-of-apps.webp\"><img decoding=\"async\" class=\"aligncenter wp-image-1969 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/app-of-apps.webp\" alt=\"app of apps image\" width=\"943\" height=\"491\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/app-of-apps.webp 943w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/app-of-apps-300x156.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/app-of-apps-768x400.webp 768w\" sizes=\"(max-width: 943px) 100vw, 943px\" \/><\/a><\/p>\n<p>What we&#8217;ve found working with customers is that there are common patterns. 
Not the only patterns, but ones that keep showing up. Developer self-service through an internal developer portal. A management cluster running Cluster API Provider for Azure (CAPZ) and Azure Service Operator (ASO). GitOps with Argo CD syncing application state from config repos. GitHub Actions handling CI\/CD. It&#8217;s a well-trodden path &#8211; and it works.<\/p>\n<p>If you want a concrete AKS-focused example of that pattern, the\u00a0<a class=\"\" href=\"https:\/\/azure-samples.github.io\/aks-labs\/docs\/platform-engineering\/aks-capz-aso\/\" target=\"_blank\" rel=\"noopener noreferrer\">AKS platform engineering lab for CAPZ and ASO<\/a>\u00a0is a good reference architecture to study alongside this post.<\/p>\n<p>But getting there isn&#8217;t easy. And operating it on day two? That&#8217;s where things get interesting.<\/p>\n<h2 id=\"a-story-in-three-acts\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">A Story in Three Acts<\/h2>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-engineering-acts.webp\"><img decoding=\"async\" class=\"wp-image-1955 size-full aligncenter\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-engineering-acts.webp\" alt=\"pe engineering acts image\" width=\"943\" height=\"525\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-engineering-acts.webp 943w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-engineering-acts-300x167.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pe-engineering-acts-768x428.webp 768w\" sizes=\"(max-width: 943px) 100vw, 943px\" \/><\/a><\/p>\n<p>We think about the evolution of agentic platform engineering in three acts, each building on the last. 
They also roughly map to the waves of GitHub Copilot itself &#8211; from autocomplete on steroids, to contextual enforcement, to autonomous agents that can take meaningful action.<\/p>\n<h2 id=\"act-one-the-plaform-is-growing-faster-than-the-team\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">Act One: The Platform Is Growing Faster Than the Team<\/h2>\n<p>The first problem is deceptively simple:\u00a0<strong>knowledge lives in people, and people don&#8217;t scale<\/strong>.<\/p>\n<p>You&#8217;ve got tribal knowledge scattered across the team. Someone knows the migration path. Someone else knows the networking quirks. A third person wrote the Terraform modules three years ago and sort of remembers how they work. When you&#8217;re a small team, this is fine &#8211; you shout across the desk and get your answer. As the team grows, you no longer know who knows what. Documentation exists in a Word doc somewhere, maybe, if someone remembered to write it, and if it hasn&#8217;t drifted into irrelevance.<\/p>\n<p>The result? The platform team becomes a bottleneck. Every question routes through the same few experts. Every onboarding is a manual knowledge transfer that never quite covers everything. 
It&#8217;s the high-toil, rinse-and-repeat cycle that the Phoenix Project book by Gene Kim, Kevin Behr and George Spafford warned us about.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-platform-aware.webp\"><img decoding=\"async\" class=\"aligncenter wp-image-1972 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-platform-aware.webp\" alt=\"pattern platform aware image\" width=\"942\" height=\"529\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-platform-aware.webp 942w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-platform-aware-300x168.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-platform-aware-768x431.webp 768w\" sizes=\"(max-width: 942px) 100vw, 942px\" \/><\/a><\/p>\n<p><strong>The shift<\/strong>: we can now embed that knowledge directly into the platform. GitHub Copilot, aware of the source code, the infrastructure, and the conventions encoded in the repository itself, becomes the experienced colleague who&#8217;s always available. New developer needs to understand how the deployment pipeline works? Ask Copilot. Need to compose infrastructure from the service catalog? Copilot can navigate your Terraform module repository, understand what&#8217;s been vetted, and help you assemble what you need.<\/p>\n<p>This extends to brownfield environments too. Already deployed infrastructure through the portal without templatizing it? An AI assistant can reverse-engineer that infrastructure, examine the resource group, catalog the deployed services, and generate the Terraform or Bicep templates you should have written in the first place. It&#8217;s not magic. 
It&#8217;s making the knowledge that already exists in your environment accessible through conversation.<\/p>\n<p>For hands-on examples, the\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/tree\/main\/Act-1\" target=\"_blank\" rel=\"noopener noreferrer\">Act 1 workshop<\/a>\u00a0walks you through building your first platform agent &#8211; from defining a persona and codifying workflow rules to grounding the agent in your organization&#8217;s documentation. It also includes a\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/Act-1\/starter-prompt.md\" target=\"_blank\" rel=\"noopener noreferrer\">starter prompt template<\/a>\u00a0you can adapt immediately. For reference implementations of IaC-aware agents, check out the\u00a0<a class=\"\" href=\"https:\/\/github.com\/ricardocovo\/iac-module-catalog\" target=\"_blank\" rel=\"noopener noreferrer\">IaC Module Catalog Agent<\/a>\u00a0and the\u00a0<a class=\"\" href=\"https:\/\/github.com\/ricardocovo\/ghcp-infra-reverse-engineer\" target=\"_blank\" rel=\"noopener noreferrer\">Infrastructure Reverse Engineer Agent<\/a>.<\/p>\n<h2 id=\"act-two-standards-exist-but-theyre-not-enforced\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">Act Two: Standards Exist, but They\u2019re Not Enforced<\/h2>\n<p>Awareness gets you far, but only so far. The next question is:\u00a0<strong>how do we enforce standards consistently without creating friction?<\/strong><\/p>\n<p>We rely on each other to know exactly what to do, when to do it, and how to go about it. That&#8217;s a brittle process. People forget. People copy-paste from Stack Overflow without fully understanding what they&#8217;re deploying. 
People unknowingly violate compliance rules &#8211; especially in regulated industries where the number of requirements exceeds what anyone can memorize.<\/p>\n<p>The pattern here is straightforward: every push to the repository triggers a GitHub Actions workflow. That workflow runs GitHub Copilot in the background with a standardized prompt &#8211; a template that tells it exactly what to check and what the expected outcome should be. Does documentation need to be updated? Were unit tests generated for the new code? Does the infrastructure configuration comply with your organization&#8217;s security policies?<\/p>\n<p>This catches problems early, when the cost of fixing them is low, rather than surfacing violations during a security review weeks later. And here&#8217;s what makes it fundamentally different from a static linting rule or a brittle if-statement:\u00a0<strong>the AI assistant adapts<\/strong>. Update the rules in a markdown file and the guardrails update with them. New compliance requirement from your governance team? Add it to the instructions. No pipeline changes needed. No code rewrite. The enforcement mechanisms flex to new rules in a way that manual processes never could.<\/p>\n<p>For organizations with knowledge scattered across SharePoint, databases, or external compliance providers, this is where Microsoft Foundry comes in. You can host custom models for security or anomaly detection inside Foundry, connect to data sources through Foundry IQ, and have GitHub Copilot pull that information in via MCP servers. The rules don&#8217;t have to live in the repo &#8211; they just have to be reachable.<\/p>\n<p>The guardrails stop feeling like hurdles. 
They become the thing that frees you up &#8211; something else is carrying the burden of remembering all the rules so you don&#8217;t have to.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agentic-enforcement.webp\"><img decoding=\"async\" class=\"aligncenter wp-image-1971 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agentic-enforcement.webp\" alt=\"pattern agentic enforcement image\" width=\"940\" height=\"532\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agentic-enforcement.webp 940w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agentic-enforcement-300x170.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agentic-enforcement-768x435.webp 768w\" sizes=\"(max-width: 940px) 100vw, 940px\" \/><\/a><\/p>\n<p>The\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/tree\/main\/Act-2\" target=\"_blank\" rel=\"noopener noreferrer\">Act 2 workshop<\/a>\u00a0walks through a crawl-walk-run progression in detail. It starts with reusable team prompts &#8211; stored as\u00a0<code>.prompt.md<\/code>\u00a0files in your\u00a0<code>.github\/prompts\/<\/code>\u00a0directory &#8211; that any team member can invoke on demand. 
The repo includes production-ready examples for AKS operations:\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/prompts\/aks-check-pods.prompt.md\" target=\"_blank\" rel=\"noopener noreferrer\">aks-check-pods.prompt.md<\/a>\u00a0for diagnosing unhealthy pods,\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/prompts\/aks-check-nodes.prompt.md\" target=\"_blank\" rel=\"noopener noreferrer\">aks-check-nodes.prompt.md<\/a>\u00a0for node-level issues, and\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/prompts\/aks-remediation.prompt.md\" target=\"_blank\" rel=\"noopener noreferrer\">aks-remediation.prompt.md<\/a>\u00a0for generating specific fix steps. From there, it shows how to wire these into CI\/CD with a\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/workflows\/copilot.generate-docs.yml\" target=\"_blank\" rel=\"noopener noreferrer\">documentation generator workflow<\/a>\u00a0that runs GitHub Copilot CLI on every push.<\/p>\n<h2 id=\"act-three-kubernetes-operations-dont-scale-linearly\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">Act Three: Kubernetes Operations Don\u2019t Scale Linearly<\/h2>\n<p>This is where it gets exciting. We&#8217;ve moved past the assistant that helps you write code and the enforcer that catches mistakes. Now we&#8217;re talking about\u00a0<strong>agents that can observe, diagnose, and propose remediation autonomously<\/strong>.<\/p>\n<p>The core issue on day two of platform operations is this: platform engineers spend their time firefighting instead of improving the platform. Misconfigurations, degraded services, mysterious latency spikes &#8211; these pull you into reactive mode. And the expertise to diagnose them doesn&#8217;t scale across teams. Runbooks are static. 
As much as we love them, they don&#8217;t map to every scenario. You need something\u00a0<strong>softer on the edges<\/strong>, something that adapts to the specific failure in front of you.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agent-assisted-ops.webp\"><img decoding=\"async\" class=\"aligncenter wp-image-1970 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agent-assisted-ops.webp\" alt=\"pattern agent assisted ops image\" width=\"941\" height=\"530\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agent-assisted-ops.webp 941w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agent-assisted-ops-300x169.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/pattern-agent-assisted-ops-768x433.webp 768w\" sizes=\"(max-width: 941px) 100vw, 941px\" \/><\/a><\/p>\n<h3 id=\"the-cluster-doctor\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">The Cluster Doctor<\/h3>\n<p>We built what we&#8217;re calling the Cluster Doctor &#8211; a custom GitHub Copilot agent configured with the diagnostic knowledge of an experienced platform engineer. Think of it as codifying the troubleshooting instincts of your best SRE into a system that&#8217;s always on. 
The full\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/tree\/main\/Act-3\" target=\"_blank\" rel=\"noopener noreferrer\">Act 3 workshop<\/a>\u00a0covers setup, configuration, and a live failure simulation you can run yourself.<\/p>\n<p>The agent is defined in a single markdown file (<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/agents\/cluster-doctor.agent.md\" target=\"_blank\" rel=\"noopener noreferrer\">cluster-doctor.agent.md<\/a>)\u00a0that gives it a persona (senior Kubernetes administrator and SRE), a systematic diagnostic workflow (collect, verify, diagnose, triage, remediate), and critical safety constraints (never attempt destructive changes without authorization, verify cluster identity before any write action).<\/p>\n<p>Here&#8217;s how it works in practice:<\/p>\n<p><strong>Crawl<\/strong>: Start with prompt engineering. Your experienced engineers document their diagnostic steps &#8211; the <strong>kubectl<\/strong> commands they&#8217;d run, the things they&#8217;d check, the order in which they&#8217;d investigate. This lives in the repository as markdown files: agent definitions, instructions, and prompts that GitHub Copilot can follow.<\/p>\n<p><strong>Walk<\/strong>: Wire it into your operational workflow. Argo CD monitors application health in the cluster. When a deployment degrades, Argo CD fires a webhook to GitHub, and a GitHub Actions workflow creates a GitHub issue with the failure details &#8211; cluster name, resource group, and the initial telemetry. A human sees the issue, tags it with a label (say,\u00a0<code>cluster-doctor<\/code>), and the agent spins up. 
It reads the issue, authenticates to Azure via Workload Identity Federation, runs <strong>kubectl<\/strong> commands against the affected cluster, queries the <a href=\"https:\/\/github.com\/Azure\/aks-mcp\">AKS MCP<\/a> server for deeper telemetry, and even leverages eBPF tooling through <a href=\"https:\/\/inspektor-gadget.io\/\">Inspektor Gadget<\/a> for hard-to-diagnose <a href=\"https:\/\/blog.aks.azure.com\/2025\/07\/23\/dns-debugging-build\">problems like latency or CoreDNS issues<\/a>. Then it opens a pull request with the proposed fix.<\/p>\n<p><strong>Run<\/strong>: Remove the human trigger. When the issue lands in GitHub, the label is applied automatically. The Cluster Doctor starts its investigation immediately, walks through the diagnostic steps, and presents its findings &#8211; complete with a PR for the remediation and a summary of the root cause analysis. A human reviews and approves. The agent did the detective work; you make the call.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe.webp\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-2019\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe.webp\" alt=\"agentic pe image\" width=\"2176\" height=\"1224\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe.webp 2176w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe-300x169.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe-1024x576.webp 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe-768x432.webp 768w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe-1536x864.webp 1536w, 
https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-pe-2048x1152.webp 2048w\" sizes=\"(max-width: 2176px) 100vw, 2176px\" \/><\/a><\/p>\n<p>What used to take hours of expert time &#8211; connecting to the cluster, running commands, correlating logs, understanding the failure chain &#8211; now happens in the background while you&#8217;re doing something else.<\/p>\n<h3 id=\"the-wiring\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">The Wiring<\/h3>\n<p>The implementation is less exotic than it sounds. The first handoff is simply an event pipeline. Argo CD detects that a deployment has failed or become unhealthy. Argo CD Notifications then turns that signal into a fully customizable message, packages the app and cluster context into a payload, and sends it to GitHub using\u00a0<code>repository_dispatch<\/code>. That GitHub event starts a workflow whose job is to create or update a GitHub issue with the right labels, troubleshooting context, and metadata for the agent.<\/p>\n<p>Visually, the flow looks like this:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram.webp\"><img decoding=\"async\" class=\"wp-image-1952 size-full aligncenter\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram.webp\" alt=\"diagram image\" width=\"1101\" height=\"406\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram.webp 1101w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram-300x111.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram-1024x378.webp 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/diagram-768x283.webp 768w\" sizes=\"(max-width: 1101px) 100vw, 1101px\" \/><\/a><\/p>\n<p>In plain terms: Argo 
CD detects the problem, Argo CD Notifications shapes that problem into a useful message, GitHub receives it as an event, and GitHub Actions turns that event into a durable issue that both humans and agents can work from.<\/p>\n<p>That customization point in Argo CD Notifications is important. You are not limited to a generic alert. You can decide exactly what the downstream automation receives: application name, cluster name, Azure resource group, region, failure reason, links, suggested commands, and any other context that will help the next workflow or the responding engineer.<\/p>\n<p>Two GitHub Actions workflows do the heavy lifting:<\/p>\n<ol>\n<li><strong><a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/workflows\/argocd-deployment-failure.yml\" target=\"_blank\" rel=\"noopener noreferrer\">argocd-deployment-failure.yml<\/a><\/strong>\u00a0&#8211; receives the Argo CD webhook via\u00a0<code>repository_dispatch<\/code>, parses the payload, and creates a structured GitHub issue with labels, troubleshooting commands, and all the context the agent will need. It also deduplicates: if an issue already exists for the same app, it adds a comment instead of creating a new one.<\/li>\n<li><strong><a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.github\/workflows\/copilot.trigger-cluster-doctor.yml\" target=\"_blank\" rel=\"noopener noreferrer\">copilot.trigger-cluster-doctor.yml<\/a><\/strong>\u00a0&#8211; fires when the\u00a0<code>cluster-doctor<\/code>\u00a0label is applied to an issue. 
It checks out the repo, installs GitHub Copilot CLI, authenticates to Azure via Workload Identity Federation, and invokes the Cluster Doctor agent with a prompt that points it at the triggering issue.<\/li>\n<\/ol>\n<p>The complete Argo CD notification setup &#8211; including webhook service definitions, payload templates, and triggers &#8211; is documented in the <a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/Act-3\/SETUP.md\" target=\"_blank\" rel=\"noopener noreferrer\">ArgoCD GitHub Issue Creation Setup Guide<\/a>.<\/p>\n<h3><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification.webp\"><img decoding=\"async\" class=\"alignnone wp-image-1951 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification.webp\" alt=\"argocd notification image\" width=\"1575\" height=\"849\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification.webp 1575w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification-300x162.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification-1024x552.webp 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification-768x414.webp 768w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/argocd-notification-1536x828.webp 1536w\" sizes=\"(max-width: 1575px) 100vw, 1575px\" \/><\/a><\/h3>\n<p>&nbsp;<\/p>\n<h3>Setting up permissions and tokens<\/h3>\n<p>In the agentic-platform-engineering repo, we need to configure the\u00a0<strong>copilot<\/strong> environment and a personal access token (<strong>PAT<\/strong>) that is used to perform a few actions against the repo itself. 
Before you proceed, make sure you complete the AKS side of this setup as described in <a class=\"\" href=\"https:\/\/blog.aks.azure.com\/2025\/10\/22\/deploy-mcp-server-aks-workload-identity\" target=\"_blank\" rel=\"noopener noreferrer\">Deploy MCP Server on AKS with Workload Identity<\/a>. From the AKS setup, save the following values, which are used here: <strong>ARM_CLIENT_ID<\/strong>, <strong>ARM_SUBSCRIPTION_ID<\/strong>, and <strong>ARM_TENANT_ID<\/strong>.<\/p>\n<p>Back on GitHub, set up the <strong>copilot<\/strong> environment: go to <strong>Settings<\/strong>, then <strong>Environments<\/strong>, and select <strong>New Environment<\/strong>.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env.webp\"><img decoding=\"async\" class=\"alignnone wp-image-1948 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env.webp\" alt=\"create copilot env image\" width=\"1172\" height=\"895\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env.webp 1172w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env-300x229.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env-1024x782.webp 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/create-copilot-env-768x586.webp 768w\" sizes=\"(max-width: 1172px) 100vw, 1172px\" \/><\/a><\/p>\n<p>Name the environment\u00a0<strong>copilot<\/strong> and add the following secrets: <strong>ARM_CLIENT_ID<\/strong>, <strong>ARM_SUBSCRIPTION_ID<\/strong>, and <strong>ARM_TENANT_ID<\/strong>.<\/p>\n<p><a 
href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/copilot-env-secrets.webp\"><img decoding=\"async\" class=\"alignnone wp-image-1949 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/copilot-env-secrets.webp\" alt=\"copilot env secrets image\" width=\"805\" height=\"363\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/copilot-env-secrets.webp 805w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/copilot-env-secrets-300x135.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/copilot-env-secrets-768x346.webp 768w\" sizes=\"(max-width: 805px) 100vw, 805px\" \/><\/a><\/p>\n<h3><strong>Using MCP with GitHub Copilot<\/strong><\/h3>\n<p>An <a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\/blob\/main\/.copilot\/mcp-config.json\" target=\"_blank\" rel=\"noopener noreferrer\">MCP configuration file<\/a>\u00a0tells Copilot how to reach both the GitHub MCP server (for reading issues and creating PRs) and the AKS MCP server running inside the cluster (for <strong>kubectl<\/strong> and deeper diagnostics). The cluster itself embeds metadata in its Argo CD config map: resource group name, cluster name, region. When the agent picks up an issue, it knows exactly where to look. 
It can run <strong>kubectl<\/strong>, query MCP endpoints, and even use eBPF-based tooling for deep packet-level diagnostics-all from within a GitHub Actions runner.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config.webp\"><img decoding=\"async\" class=\"alignnone wp-image-1947 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config.webp\" alt=\"mcp config image\" width=\"1146\" height=\"851\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config.webp 1146w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config-300x223.webp 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config-1024x760.webp 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/mcp-config-768x570.webp 768w\" sizes=\"(max-width: 1146px) 100vw, 1146px\" \/><\/a><\/p>\n<h2 id=\"from-reactive-to-adaptive\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">From Reactive to Adaptive<\/h2>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-platform-engineering.webp\"><img decoding=\"async\" class=\"alignnone size-full wp-image-1943\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-platform-engineering.webp\" alt=\"agentic platform engineering image\" width=\"939\" height=\"513\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-platform-engineering.webp 939w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-platform-engineering-300x164.webp 300w, 
https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2026\/03\/agentic-platform-engineering-768x420.webp 768w\" sizes=\"(max-width: 939px) 100vw, 939px\" \/><\/a><\/p>\n<p>The shift we&#8217;re describing isn&#8217;t just about automation. We&#8217;ve had automation for years. It&#8217;s about moving from\u00a0<strong>brittle, static processes to adaptive, reasoning-capable systems<\/strong>.<\/p>\n<table style=\"height: 144px; width: 67.9522%; border-collapse: collapse; border-style: solid; border-color: #000000; background-color: #ffffff;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<th style=\"width: 8.80812%; height: 24px;\"><\/th>\n<th style=\"width: 18.1916%; height: 24px;\">Before<\/th>\n<th style=\"width: 15.128%; height: 24px;\">After<\/th>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 8.80812%; height: 24px;\"><strong>Knowledge<\/strong><\/td>\n<td style=\"width: 18.1916%; height: 24px;\">Tribal, in people&#8217;s heads<\/td>\n<td style=\"width: 15.128%; height: 24px;\">Encoded in repos, accessible via conversation<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 8.80812%; height: 24px;\"><strong>Standards<\/strong><\/td>\n<td style=\"width: 18.1916%; height: 24px;\">Manual enforcement, easily forgotten<\/td>\n<td style=\"width: 15.128%; height: 24px;\">Automatically applied on every push<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 8.80812%; height: 24px;\"><strong>Incident response<\/strong><\/td>\n<td style=\"width: 18.1916%; height: 24px;\">Reactive, expert-dependent<\/td>\n<td style=\"width: 15.128%; height: 24px;\">Agent-initiated, human-approved<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 8.80812%; height: 24px;\"><strong>Runbooks<\/strong><\/td>\n<td style=\"width: 18.1916%; height: 24px;\">Static documents<\/td>\n<td style=\"width: 15.128%; height: 24px;\">Dynamic agents that adapt to context<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td 
style=\"width: 8.80812%; height: 24px;\"><strong>Onboarding<\/strong><\/td>\n<td style=\"width: 18.1916%; height: 24px;\">Weeks of knowledge transfer<\/td>\n<td style=\"width: 15.128%; height: 24px;\">Ask the platform, get answers immediately<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The tools underneath haven&#8217;t changed: source control, GitHub Actions, infrastructure as code, <strong>kubectl<\/strong>. These are battle-tested and they&#8217;re not going anywhere. What&#8217;s changed is the layer on top: <strong>AI agents that can reason across all of these tools simultaneously<\/strong>, connecting dots that would take a human hours to trace.<\/p>\n<h2 id=\"what-this-means-for-your-platform\" class=\"anchor anchorTargetStickyNavbar_Vzrq\">What This Means for Your Platform<\/h2>\n<p>If you&#8217;re already doing platform engineering &#8211; even if it feels incomplete &#8211; you have a foundation to build on. The patterns we&#8217;ve described layer onto what you already have:<\/p>\n<ol>\n<li><strong>Start with awareness<\/strong>: Give GitHub Copilot access to your repos, your service catalogs, your infrastructure definitions. Let it become the knowledgeable colleague that&#8217;s always available.<\/li>\n<li><strong>Add enforcement<\/strong>: Set up GitHub Actions that trigger on code pushes and run Copilot-powered checks against your standards. Start with documentation generation &#8211; it&#8217;s low risk and high impact.<\/li>\n<li><strong>Enable agent operations<\/strong>: Wire Argo CD (or your monitoring tool of choice) to create GitHub issues on failures. Build a custom agent that can authenticate to your clusters and diagnose problems. Keep humans in the approval loop.<\/li>\n<\/ol>\n<p>You don&#8217;t have to boil the ocean. Pick one act, implement it, and iterate. 
The crawl-walk-run model applies here as much as anywhere else &#8211; each step delivers value on its own while building toward something greater.<\/p>\n<p>The full repository is at\u00a0<a class=\"\" href=\"https:\/\/github.com\/microsoftgbb\/agentic-platform-engineering\" target=\"_blank\" rel=\"noopener noreferrer\">microsoftgbb\/agentic-platform-engineering<\/a>. Clone it, walk through the acts, break the sample app on purpose, and watch the Cluster Doctor figure out what went wrong.<\/p>\n<p>For a shorter companion overview, watch <a class=\"\" href=\"https:\/\/www.youtube.com\/watch?v=mGq442iwAF0\" target=\"_blank\" rel=\"noopener noreferrer\">Platform Engineering: Creating Scalable and Resilient Systems | BRK188<\/a> on YouTube. Also check out the <a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/platform-engineering-for-the-agentic-ai-era\/\">Platform Engineering for the Agentic AI era | All things Azure<\/a> blog post, which provides a solid walkthrough of where platform engineering stands today and where it is headed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We&#8217;ve talked about the human scale problem\u00a0and what happens\u00a0when infrastructure scales but understanding doesn&#8217;t. If you&#8217;ve been following along, you know the thesis: our tools have outpaced our ability to operate them, and platform engineering is how we&#8217;re fighting back. But here&#8217;s the thing &#8211; we&#8217;ve been fighting with one hand tied behind our backs. 
[&hellip;]<\/p>\n","protected":false},"author":172655,"featured_media":1943,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[112,1,87,20,19,109],"tags":[123,125,124,122],"class_list":["post-1941","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-agentic-devops","category-azure","category-containers","category-developer-productivity","category-github-copilot","category-platform-engineering","tag-agenticdevops","tag-azureglobalblackbelts","tag-developerexperience","tag-platformengineering"],"acf":[],"blog_post_summary":"<p>We&#8217;ve talked about the human scale problem\u00a0and what happens\u00a0when infrastructure scales but understanding doesn&#8217;t. If you&#8217;ve been following along, you know the thesis: our tools have outpaced our ability to operate them, and platform engineering is how we&#8217;re fighting back. But here&#8217;s the thing &#8211; we&#8217;ve been fighting with one hand tied behind our backs. 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts\/1941","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/users\/172655"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/comments?post=1941"}],"version-history":[{"count":1,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts\/1941\/revisions"}],"predecessor-version":[{"id":2020,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts\/1941\/revisions\/2020"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/media\/1943"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/media?parent=1941"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/categories?post=1941"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/tags?post=1941"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}