{"id":2707,"date":"2026-06-18T13:37:18","date_gmt":"2026-06-18T20:37:18","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/foundry\/?p=2707"},"modified":"2026-06-18T15:15:30","modified_gmt":"2026-06-18T22:15:30","slug":"outcome-driven-learning-systems-enterprise-rl-with-openenv-and-foundry","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/foundry\/outcome-driven-learning-systems-enterprise-rl-with-openenv-and-foundry\/","title":{"rendered":"Outcome-driven learning systems: Enterprise RL with OpenEnv and Foundry"},"content":{"rendered":"<p class=\"intro-lede wp-block-paragraph\">We shipped a lot at <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/whats-new-in-microsoft-foundry-build-2026\/\">Build 2026<\/a>: hosted agents, Toolboxes, Foundry IQ, Memory, Managed Compute, fine\u2011tuning, Frontier Tuning, and a new evaluation and optimization stack. Read as a feature list, it is a lot to hold in your head. So here is a simpler way to see it: these are the parts you need to build a <strong>learning system<\/strong>, with agents that get measurably better at your work over time, not a chatbot that answers once and forgets. This post is about assembling those parts into one loop you own, and the science that makes a small, owned model worth training.<\/p>\n<p class=\"wp-block-paragraph\">It builds on two pieces worth reading first. Jay Parikh\u2019s <a href=\"https:\/\/blogs.microsoft.com\/blog\/2026\/06\/02\/ai-alone-wont-change-your-business-the-system-running-it-will\/\">\u201cAI alone won\u2019t change your business, the system running it will\u201d<\/a> argues that the winners do not just adopt a model; they stand up a governed system that improves the longer it runs. Satya\u2019s framing in <a href=\"https:\/\/snscratchpad.com\/posts\/frontier-ecosystem\/\">\u201ca frontier ecosystem, not just a frontier model\u201d<\/a> sharpens it: the durable asset is not the model you rent, it is the <strong>learning loop you own<\/strong>. 
Same idea from two angles. Build a system that improves against your outcomes, and make sure you own it.<\/p>\n<p class=\"wp-block-paragraph\">Foundry makes that loop something you can build in an <strong>open, interoperable, and modular<\/strong> way, so you can swap any piece (the model, the trainer, the tools) without rebuilding the whole thing. Two ingredients sit under every such system: a place for the agent to <em>practice<\/em> (an <strong>environment<\/strong>) and a way to <em>judge<\/em> it (an <strong>eval<\/strong>). To keep both open, Microsoft joined the <a href=\"https:\/\/huggingface.co\/blog\/openenv-agentic-rl\"><strong>OpenEnv<\/strong><\/a> community.<\/p>\n<figure class=\"wp-block-image aligncenter diagram-img\"><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-scaled.webp\"><img decoding=\"async\" class=\"alignnone size-full wp-image-2724\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-scaled.webp\" alt=\"The hill-climbing loop. A hosted agent, your harness (the Microsoft Agent Framework) plus a swappable model, runs in a per-session ACA Sandbox; tracing captures every run and your rubric judges the outcome. Then it learns two ways: non-parametric learning keeps the weights frozen and tunes the prompt, skills, tools, and model choice with the Agent Optimizer and SkillOpt, while parametric learning changes the weights with Foundry post-training (Tinker) and ECHO on OpenEnv. Keep what wins on held-out tasks; the better agent is versioned and ready to ship to Teams, Microsoft 365, and any other channel, then re-enters the loop. Grounded by Foundry IQ, Memory, and Toolbox. Build in GitHub, deploy and operate in Foundry. 
The model is swappable; the learning stays yours.\" width=\"2500\" height=\"1613\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-scaled.webp 2500w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-300x194.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-1024x661.webp 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-768x495.webp 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-1536x991.webp 1536w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/img1_system-2048x1321.webp 2048w\" sizes=\"(max-width: 2500px) 100vw, 2500px\" \/><\/a><br \/><figcaption class=\"wp-element-caption\"><strong>The hill-climbing loop.<\/strong> A <strong>hosted agent<\/strong> (your harness, the Microsoft Agent Framework, plus a swappable model) runs in a per-session <strong>ACA Sandbox<\/strong>; <strong>tracing<\/strong> captures every run and your <strong>rubric<\/strong> judges it. Then it learns two ways: <strong>non-parametric<\/strong> (Agent Optimizer + SkillOpt) and <strong>parametric<\/strong> (Foundry post-training with Tinker + ECHO on OpenEnv). Keep what wins, then ship the better agent to <strong>Teams, Microsoft 365, and any other channel<\/strong>. Grounded by <strong>Foundry IQ + Memory + Toolbox<\/strong>. The model is swappable; <strong>the learning stays yours<\/strong>.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">&nbsp;<\/p>\n<p class=\"audience-note\"><em>However you come to this, whether PM, IT admin, developer, or AI engineer, the first half is the plain-language <strong>what<\/strong> and <strong>why<\/strong>. The second half is the deep science. 
Skim or dive as suits you.<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">Environments, evals, and rubrics, in plain language<\/h2>\n<p class=\"wp-block-paragraph\">An <strong>environment<\/strong> (an RLE, or reinforcement-learning environment) is a <em>practice space<\/em> for your agent (harness + model). It is your real workflow and your standard operating procedure, codified so an agent can act inside it: the steps, the tools it is allowed to use, the data it sees. Think of it as a flight simulator for one of your business processes, close enough to the real thing that getting good in the simulator means getting good at the job.<\/p>\n<p class=\"wp-block-paragraph\">An <strong>eval<\/strong> is how you <em>judge<\/em> a result, and the heart of an eval is a <strong>rubric<\/strong>: a clear, scored definition of \u201cdone right\u201d for <em>your<\/em> outcome, not a public leaderboard. A rubric asks questions such as: \u201cDid it reconcile the invoice to the contract? Did it cite a real clause? Did it stay inside policy?\u201d Foundry ships <a href=\"https:\/\/learn.microsoft.com\/azure\/foundry\/observability\/how-to\/evaluate-agent\"><strong>agent evaluation<\/strong><\/a> for writing exactly these judgments, and an optimizer (below) for acting on the scores.<\/p>\n<p class=\"wp-block-paragraph\">Here is the move that ties it together: <strong>an environment already contains its eval.<\/strong> Codify your workflow <em>plus<\/em> your outcome rubric, and you have not just written a test, you have built a <strong>hill-climbing space<\/strong>. The agent practices, the rubric scores, and the system climbs toward your outcome. That is why an RLE is also an eval: it is one artifact that both exercises the agent and grades it.<\/p>\n<blockquote class=\"wp-block-quote pullquote-brand is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Codify your workflow and your outcome into an environment, and the model becomes a part you can swap. 
The expertise lives in the loop you own, so the learning stays yours.<\/p>\n<\/blockquote>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">A system learns in two ways<\/h2>\n<p class=\"wp-block-paragraph\">Before any science, the single most useful idea in this post: a system can get better in two different ways, and you should reach for them in order.<\/p>\n<h3 class=\"wp-block-heading\">1. Non-parametric learning (the weights stay frozen)<\/h3>\n<p class=\"wp-block-paragraph\">The first kind leaves the model\u2019s weights untouched and improves the <strong>harness<\/strong> around them: the system prompt, the <strong>skills<\/strong> (named, reusable procedures), the tool descriptions, and the context the agent retrieves through <strong>Foundry IQ and Memory<\/strong>, plus which model it runs on. No GPUs, no training run, results in minutes. Foundry ships this as the <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/agent-optimizer-build2026\/\"><strong>agent optimizer<\/strong><\/a>: it runs a closed loop (evaluate your agent against your criteria, generate better configurations, score them, deploy the winner) and will rewrite your instructions, synthesize skills, sharpen tool descriptions, or pick a better model for your quality and cost trade-off. In our example, a bare support agent climbs from <strong>0.60 to 0.92<\/strong> on its rubric with no retraining and no code changes, just a better prompt, skills, and tools.<\/p>\n<p class=\"wp-block-paragraph\">Microsoft Research is pushing harness optimization even further. 
<a href=\"https:\/\/microsoft.github.io\/SkillOpt\/\"><strong>SkillOpt<\/strong><\/a> treats the skill document itself as the trainable thing: it edits a single Markdown skill from scored rollouts and accepts an edit <em>only<\/em> when a held-out validation score strictly improves, the same discipline that makes weight training reproducible, but with zero weight changes and zero extra calls at inference time. The deployable artifact is a compact <code>best_skill.md<\/code> that runs against an unchanged model, and it lifts no-skill accuracy by more than <strong>20 points<\/strong> on a frontier model across six benchmarks (<a href=\"https:\/\/arxiv.org\/abs\/2605.23904\">paper<\/a>). Start here. Non-parametric learning is cheap, fast, and often enough on its own.<\/p>\n<h3 class=\"wp-block-heading\">2. Parametric learning (the weights change)<\/h3>\n<p class=\"wp-block-paragraph\">The second kind is for when you want the behavior <em>in the model itself<\/em>: faster and cheaper to serve, and fully owned (sovereign). You change the weights with <strong>post-training<\/strong>. This is where a small open model can quietly overtake a frontier API on your task, and where the deepest new science lives, because you can teach the model not just <em>what to do<\/em> but <em>how its world responds<\/em>. That technique is <strong>ECHO<\/strong>, and most of the rest of this post is about it. 
The point to carry forward: do the cheap non-parametric learning first, and turn to parametric learning when the outcome, and the economics, justify owning the model.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">One agent, one sandbox, both kinds of learning<\/h2>\n<p class=\"wp-block-paragraph\">Both kinds of learning run against the same thing: a <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/introducing-the-new-hosted-agents-in-foundry-agent-service-secure-scalable-compute-built-for-agents\/\"><strong>hosted agent<\/strong><\/a> (your model plus prompt plus tools and skills) executing in an isolated, project-owned sandbox. Foundry runs hosted agents on <a href=\"https:\/\/techcommunity.microsoft.com\/blog\/appsonazureblog\/introducing-azure-container-apps-sandboxes-secure-infrastructure-for-agentic-wor\/4524131\"><strong>Azure Container Apps (ACA) sandboxes<\/strong><\/a>: each gets its own filesystem, session, and state, with default-deny networking so a tool call cannot quietly exfiltrate a secret. The agent optimizer drives that hosted agent through its evaluation loop inside exactly this sandbox.<\/p>\n<p class=\"wp-block-paragraph\">And Microsoft\u2019s contribution to OpenEnv (more in a moment) makes the very same ACA sandbox an <em>OpenEnv environment<\/em>, so the agent you optimize non-parametrically is the agent you post-train parametrically, in the same secure box. One agent, one sandbox, two ways to climb. 
The diagram at the top of this post shows the whole loop on a page.<\/p>\n<p class=\"wp-block-paragraph\">Mapped to what shipped: <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/introducing-toolboxes-in-foundry\/\"><strong>Toolboxes<\/strong><\/a> give the agent one governed set of tools; <strong>ACA sandboxes<\/strong> give it an isolated place to run; <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/build-smarter-agents-faster-with-foundry-iq\/\"><strong>Foundry IQ<\/strong><\/a> is the knowledge plane that grounds it; <strong>agent evaluation<\/strong> is the rubric; the <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/agent-optimizer-build2026\/\"><strong>agent optimizer<\/strong><\/a> and <strong>post-training<\/strong> are the two improve steps; <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/announcing-foundry-managed-compute\/\"><strong>Managed Compute<\/strong><\/a> serves the result. One open standard underneath keeps the parts from locking you in: OpenEnv.<\/p>\n<h3 class=\"wp-block-heading\">We joined OpenEnv, with contributions for enterprise agentic learning loops<\/h3>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/huggingface.co\/blog\/openenv-agentic-rl\">OpenEnv<\/a> is a <em>protocol<\/em> for environments: a small, shared contract (<code>reset<\/code>, <code>step<\/code>, <code>state<\/code>) carried over MCP and packaged with Docker, with the promise that the training environment matches production. It is not a reward framework and not a trainer; it is the thin interoperable layer that lets any model, harness, trainer, and environment compose. 
That is the \u201cinteroperable\u201d in open-and-interoperable, and it is why Microsoft joined the community alongside Meta\u2019s PyTorch team, NVIDIA, Hugging Face, Unsloth, Prime Intellect, and others.<\/p>\n<p class=\"wp-block-paragraph\">Two of our contributions are already merged into <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\">OpenEnv<\/a>: a hosted <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\/pull\/793\"><strong>Azure Container Apps sandbox provider<\/strong><\/a>, so an RLE can run rollouts in the isolated, project-owned Azure sandbox above, with default-deny egress that blocks token theft, the enterprise-grade isolation an RLE needs; and <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\/pull\/819\"><strong>ECHO env-token world-modeling as RFC 010<\/strong><\/a>, which teaches trainers to learn from the environment\u2019s own tokens. Private RLEs and private evals, kept open and interoperable on purpose, with more to come.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv.webp\"><img decoding=\"async\" class=\"alignnone wp-image-2712\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv.webp\" alt=\"OpenEnv standard\" width=\"896\" height=\"448\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv.webp 1300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv-300x150.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv-1024x512.webp 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/banner-opeenv-768x384.webp 768w\" sizes=\"(max-width: 896px) 100vw, 896px\" \/><\/a><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">Post-training, without the heavy lifting<\/h2>\n<p 
class=\"wp-block-paragraph\">Before the frontier technique, the basics, because owning the weights is far more approachable than it used to be. Fine-tuning a small model on your task and serving it used to mean standing up GPU clusters and a training stack. Foundry\u2019s managed post-training removes that: it exposes a <strong>Tinker-style training loop<\/strong>, the low-level primitives <code>sample<\/code>, <code>forward_backward<\/code>, and <code>optim_step<\/code>, running server-side on Foundry\u2019s GPUs while you keep the data and the loop. You write the loop; the service owns the hardware. No client GPUs, no cluster to babysit. Two Build sessions walk through it end to end: <a href=\"https:\/\/www.youtube.com\/watch?v=uyxSyo7PJ7k\"><strong>BRK231, Deploy. Observe. Learn<\/strong><\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=ISurXM76eXI\"><strong>BRK232, Post-Training and Deploying Open-Source Reasoning Models in Foundry<\/strong><\/a>.<\/p>\n<p class=\"wp-block-paragraph\">That same managed loop is where Microsoft is pushing the frontier: not just consuming the stack but advancing it and contributing the pieces back to OpenEnv (the ACA sandbox provider, and the world-modeling work in the next section), so the whole ecosystem benefits. The clearest example is next. It turns the normally discarded part of every rollout into a free training signal, and it lands as a one-line change on exactly this loop.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">Pushing OpenEnv environments to the frontier<\/h2>\n<p class=\"wp-block-paragraph\">Joining OpenEnv is not a logo exercise. An open standard stays relevant only if it keeps absorbing the research frontier, so part of our work is diffusing that frontier into it. 
The clearest example is a contribution we landed as a pull request, <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\/pull\/819\"><strong>ECHO world-modeling (RFC 010)<\/strong><\/a>, which brings a Microsoft Research result (<a href=\"https:\/\/arxiv.org\/abs\/2605.24517\">\u201cTerminal Agents Learn World Models for Free\u201d<\/a>) into OpenEnv, where any team can pick it up. That is how a lab technique becomes a shared capability and the learning loop gets democratized.<\/p>\n<p class=\"wp-block-paragraph\">Here is what it does. An agent transcript is half <strong>actions<\/strong> (what the model writes) and half <strong>observations<\/strong> (what the environment writes back). Standard agent-RL trains the actions and masks the observations away. ECHO keeps them: a small cross-entropy term that makes the policy predict the environment\u2019s own tokens, a world model, from logits it already computed in the same forward pass. No extra rollouts, no teacher, no labels.<\/p>\n<pre class=\"wp-block-preformatted\">  L  =  L_GRPO(action tokens)   +   \u03bb \u00b7 CrossEntropy(observation tokens)<\/pre>\n<figure class=\"wp-block-image aligncenter diagram-img\"><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-scaled.webp\"><img decoding=\"async\" class=\"alignnone wp-image-2713\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-scaled.webp\" alt=\"The ECHO mechanism. One rollout shares its logits across action tokens and environment-observation tokens. Action tokens go to the GRPO policy-gradient loss (sparse, reward-gated, standard RL). Observation tokens go to a lambda-weighted cross-entropy loss (dense, about free, the world model). 
Both gradients are summed into a single optimizer step, and that step is ECHO.\" width=\"843\" height=\"337\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-scaled.webp 2500w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-300x120.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-1024x410.webp 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-768x307.webp 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-1536x614.webp 1536w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/echo-2048x819.webp 2048w\" sizes=\"(max-width: 843px) 100vw, 843px\" \/><\/a><figcaption class=\"wp-element-caption\"><strong>ECHO in one step.<\/strong> One rollout, split by per-token role: actions get the RL loss, observations get a \u03bb-weighted cross-entropy loss, summed into a single optimizer step. \u03bb = 0 is vanilla RL, so it is safe to adopt incrementally.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The free signal is large and real: on a captured agent episode, <strong>4,659 of 5,247 learnable tokens (89%) are environment observations<\/strong>, 7.9\u00d7 the action tokens, exactly the half standard agent-RL discards. Prime Intellect reaches the same conclusion in <a href=\"https:\/\/www.primeintellect.ai\/blog\/true-agents-model-the-world\">\u201cTrue Agents Model the World\u201d<\/a>, restating supervised learning on tool-response tokens as RL with a constant positive advantage, foldable in at no extra cost. Two groups, one direction: world-modeling belongs inside the RL loop, not bolted on afterward.<\/p>\n<p class=\"wp-block-paragraph\">On the honest ablation (\u03bb on versus off), the training reward barely moves; the gain is <strong>generalization<\/strong>. 
ECHO\u2019s published results: held-out <strong>pass@1 roughly doubles<\/strong> on TerminalBench-2.0, RL reaches its target about <strong>2.3\u00d7 faster<\/strong>, and it recovers <strong>50 to 104% of expert-SFT with no teacher<\/strong>. Even verifier-free (reward off), held-out tasks improve. Keep \u03bb small and sweep it; the dense signal overfits if pushed (one open model collapsed at 0.05 and was stable at 0.005).<\/p>\n<p>&nbsp;<\/p>\n<figure class=\"wp-block-image aligncenter diagram-img\"><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update.webp\"><img decoding=\"async\" class=\"alignnone wp-image-2714\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update.webp\" alt=\"Bar chart comparing ECHO against standard RL on two groups of tasks. On training tasks both reach a relative pass@1 of about one, so the training reward is about equal. On held-out tasks, standard RL stays at one while ECHO reaches about two, roughly doubling held-out pass@1. The gain is generalization at about the same training reward.\" width=\"833\" height=\"469\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update.webp 1185w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update-300x169.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update-1024x576.webp 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/parametric_update-768x432.webp 768w\" sizes=\"(max-width: 833px) 100vw, 833px\" \/><\/a><figcaption class=\"wp-element-caption\"><strong>What the weight update buys.<\/strong> Same training reward; held-out pass@1 roughly doubles. 
ECHO also reports about 2.3\u00d7 faster RL and 50 to 104% of expert-SFT recovered with no teacher (<a href=\"https:\/\/arxiv.org\/abs\/2605.24517\">arXiv 2605.24517<\/a>, microsoft\/echo-rl on SkyRL; corroborated by Prime Intellect).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You can watch the mechanism on a laptop in about 40 seconds: a small model on a deterministic toy terminal env drives held-out env-token cross-entropy toward zero. It gets that close to zero only because that world is fully predictable; a real environment keeps its irreducible entropy (near 4.4 nats), so ECHO sharpens predictions rather than perfecting them. Reproduce it: <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\/tree\/main\/examples\/echo_world_model\">OpenEnv <code>examples\/echo_world_model<\/code><\/a>, <code>python train_echo.py --steps 60 --seed 0<\/code>.<\/p>\n<figure class=\"wp-block-image aligncenter diagram-img\"><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy.webp\"><img decoding=\"async\" class=\"alignnone wp-image-2715\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy.webp\" alt=\"Reproduction curve on a tiny deterministic terminal environment. Two lines, train and held-out environment-token cross-entropy, both fall from about 5 to 6 nats per token toward zero over 60 steps. The held-out line reaches its best near 0.25 nats around step 40, then drifts mildly upward, which is the overfitting. Lower is better. 
Runs on a laptop CPU in about 40 seconds.\" width=\"836\" height=\"451\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy.webp 1215w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy-300x162.webp 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy-1024x553.webp 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2026\/06\/cross-entropy-768x415.webp 768w\" sizes=\"(max-width: 836px) 100vw, 836px\" \/><\/a><figcaption class=\"wp-element-caption\"><strong>Reproduce it on CPU.<\/strong> A toy, fully deterministic terminal env, so cross-entropy can approach zero; a real env keeps its irreducible entropy instead. The held-out line bottoms near step 40 and then mildly overfits, which is why \u03bb stays small. Run it: <a href=\"https:\/\/github.com\/huggingface\/OpenEnv\/tree\/main\/examples\/echo_world_model\">OpenEnv <code>examples\/echo_world_model<\/code><\/a>, <code>python train_echo.py --steps 60 --seed 0<\/code>.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">And it holds on the managed path. Because supervised learning on the observation tokens is just RL with a constant positive advantage, there is no second loss function: you reuse the same <code>forward_backward<\/code> and add a small positive advantage \u03bb on the environment tokens. One vector changes, and the same one-line config runs on the open SkyRL reference, Tinker, and Foundry post-training unchanged. 
We ran it live with a small Qwen model, and it also runs with <strong>MAI-Reasoning-1-Flash<\/strong>; the backend metrics came back namespaced <code>skyrl.ai<\/code>, the open reference stack running underneath the managed service.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">The loop that improves itself: RSI<\/h2>\n<p class=\"wp-block-paragraph\">One last reason to own the environment, not just the model: the gym is where compounding begins. Once your workflow, tools, and rubric live in an OpenEnv RLE, the same trace data that post-trains the model can also improve the <em>environment<\/em> itself. The OpenEnv roadmap points squarely at this, a family of self-improving-gym designs: curricula that generate harder tasks as the agent gets better, harness optimizers, and new environments built automatically from captured production traces. That is recursive self-improvement (RSI) in action. The system writes its own next set of exercises, and each cycle sharpens the next. The learning does not only accrue in the weights; it accrues in the gym, which is the part you own.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">Start building the loop that is yours<\/h2>\n<p class=\"wp-block-paragraph\">Pull it back up to the top. Codify your workflow and your outcome into an OpenEnv-compatible RLE, and you have a hill-climbing learning system that is genuinely yours: open, interoperable, and outcome-driven. Improve it the cheap way first (tune the prompt, skills, tools, and model choice with the <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/agent-optimizer-build2026\/\">agent optimizer<\/a> and ideas from <a href=\"https:\/\/microsoft.github.io\/SkillOpt\/\">SkillOpt<\/a>), then the deep way when the economics justify it (post-train the weights, with ECHO turning the discarded 89% of your trajectories into a free world model). 
The model in the middle is a part you can swap; the loop around it is the asset that compounds the longer it runs.<\/p>\n<p class=\"wp-block-paragraph\">The managed on-ramp is <a href=\"https:\/\/microsoft.ai\/models\/microsoft-frontier-tuning\/\"><strong>Frontier Tuning<\/strong><\/a>: frontier-grade performance with superior token efficiency, improved through real usage in Foundry and Copilot, and secured inside your own environment. Early adopters including <strong>McKinsey<\/strong>, <strong>Bristol Myers Squibb<\/strong>, and <strong>Land O\u2019Lakes<\/strong> are already building with it.<\/p>\n<p class=\"cta-final wp-block-paragraph\">The fastest way in is a partnership. <strong>Ask your Microsoft Forward-Deployed Engineer (FDE) or your Microsoft account team to engage<\/strong>, and build the OpenEnv-compatible RLEs and outcome-driven learning systems where the model is swappable and the learning stays yours.<\/p>\n<p class=\"links-footer wp-block-paragraph\"><strong>Build it on Foundry:<\/strong> <a href=\"https:\/\/microsoft.ai\/models\/microsoft-frontier-tuning\/\">Frontier Tuning<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/agent-optimizer-build2026\/\">Agent Optimizer<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/introducing-the-new-hosted-agents-in-foundry-agent-service-secure-scalable-compute-built-for-agents\/\">Hosted Agents<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/build-smarter-agents-faster-with-foundry-iq\/\">Foundry IQ<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/introducing-toolboxes-in-foundry\/\">Toolboxes<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/announcing-foundry-managed-compute\/\">Managed Compute<\/a> \u00b7 <a href=\"https:\/\/devblogs.microsoft.com\/foundry\/whats-new-in-microsoft-foundry-build-2026\/\">What\u2019s new at Build 2026<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We shipped a lot at 
Build 2026: hosted agents, Toolboxes, Foundry IQ, Memory, Managed Compute, fine\u2011tuning, Frontier Tuning, and a new evaluation and optimization stack. Read as a feature list, it is a lot to hold in your head. So here is a simpler way to see it: these are the parts you need to [&hellip;]<\/p>\n","protected":false},"author":172657,"featured_media":2712,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[141,112,163,1,27],"tags":[],"class_list":["post-2707","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-agent-optimizer","category-foundry-agent-service","category-microsoft-build","category-microsoft-foundry","category-whats-new"],"acf":[],"blog_post_summary":"<p>We shipped a lot at Build 2026: hosted agents, Toolboxes, Foundry IQ, Memory, Managed Compute, fine\u2011tuning, Frontier Tuning, and a new evaluation and optimization stack. Read as a feature list, it is a lot to hold in your head. 
So here is a simpler way to see it: these are the parts you need to [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2707","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/users\/172657"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/comments?post=2707"}],"version-history":[{"count":2,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2707\/revisions"}],"predecessor-version":[{"id":2725,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2707\/revisions\/2725"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media\/2712"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media?parent=2707"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/categories?post=2707"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/tags?post=2707"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}