{"id":12577,"date":"2026-06-22T06:21:18","date_gmt":"2026-06-22T13:21:18","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cosmosdb\/?p=12577"},"modified":"2026-06-22T06:21:18","modified_gmt":"2026-06-22T13:21:18","slug":"deep-agents-to-plan-act-verify-against-operational-data","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/deep-agents-to-plan-act-verify-against-operational-data\/","title":{"rendered":"How to Use Deep Agents with Azure Cosmos DB \u2013 Plan, act, and verify against operational data"},"content":{"rendered":"<p><a href=\"https:\/\/docs.langchain.com\/oss\/python\/deepagents\/overview\">Deep Agents<\/a> is an agent harness built on <a href=\"http:\/\/docs.langchain.com\/oss\/python\/langgraph\/overview\">LangGraph<\/a>, for agents that need to work through a task over many steps instead of a single LLM call. The agent runs tools, looks at the results, and uses that to pick the next one, keeping a todo list as it goes. On top of that loop the harness brings what a longer-running agent needs. It can load instructions on demand instead of holding everything in the prompt (skills), offload large tool outputs so they don\u2019t fill the context window, and pause for human approval in apps that need an approval gate before data changes.<\/p>\n<p>Support Ops Agent is a sample app that puts this to work on a customer-support ticket queue. We can ask it which tickets are at risk, who\u2019s overloaded, or whether a run of similar complaints is really one outage. When a ticket needs to change, it updates the ticket and reads it back to confirm. Most requests become a handful of reads against the queue. Requests that change a ticket add a patch and a verification read.<\/p>\n<p>That queue lives in <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/nosql\/\">Azure Cosmos DB<\/a>, the operational database the support team already runs on. 
The agent reads and writes that same store through the Azure Cosmos DB SDK, so it works on the live tickets, with no side index to keep in sync. Each ticket is an Azure Cosmos DB item, with its tags and history kept right inside it, and the agent updates that item directly. With the partition key doing its job, point reads and customer-scoped queries stay cheap. Queue-wide investigations spend RUs based on the cross-partition work they do, which is why the tools project only the fields they need. The schema is flexible, so the agent can add a tag or append to a history array without a migration.<\/p>\n<p><div class=\"alert alert-primary\">The code is on <a href=\"https:\/\/github.com\/abhirockzz\/deepagents-cosmosdb-support-ops\">GitHub<\/a> with instructions to run it against your own Azure Cosmos DB account.<\/div><\/p>\n<p>In this post, I\u2019ll go through:<\/p>\n<ul>\n<li>what the agent can do, and the Azure Cosmos DB operation behind each kind of request<\/li>\n<li>why Deep Agents and Azure Cosmos DB fit this problem<\/li>\n<li>the tools it uses to work on the ticket queue<\/li>\n<li>practical examples of how the agent works: morning triage, resolving a ticket, and spotting an incident<\/li>\n<\/ul>\n<h2>Agent capabilities<\/h2>\n<p>The requests in this sample all come down to a few Azure Cosmos DB operations. Some questions only need reads. 
Others need the agent to read first, decide what changed, and then patch the ticket.<\/p>\n<table>\n<thead>\n<tr>\n<th>Ask it to\u2026<\/th>\n<th>What the agent does<\/th>\n<th>Cosmos DB operations<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Triage the queue<\/strong><\/td>\n<td>Finds the at-risk tickets (high priority, still active, gone stale) and reports the handful that actually matter<\/td>\n<td>cross-partition query, filter, ORDER BY<\/td>\n<\/tr>\n<tr>\n<td><strong>Resolve a ticket<\/strong><\/td>\n<td>Point-reads the ticket, checks related ones from the same customer, updates status, owner, and history, then re-reads to confirm<\/td>\n<td>point read, related-item query, update, verify<\/td>\n<\/tr>\n<tr>\n<td><strong>Spot an incident<\/strong><\/td>\n<td>Searches for a cluster across customers, including symptoms filed under the wrong area, and can tag the group as a known issue<\/td>\n<td>multi-step query, repeated patches<\/td>\n<\/tr>\n<tr>\n<td><strong>Check queue health<\/strong><\/td>\n<td>Summarizes the queue by status, by area, and by who is carrying the load<\/td>\n<td>grouped counts<\/td>\n<\/tr>\n<tr>\n<td><strong>Cover for someone<\/strong><\/td>\n<td>Takes an absent agent\u2019s active tickets and moves them to whoever has the lightest load, then confirms the rebalance<\/td>\n<td>grouped counts, repeated patches<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I\u2019ll walk through the first three below. The other two use the same tools, so they are useful checks when you run the sample yourself.<\/p>\n<h2>Approach: Agentic vs Static<\/h2>\n<p>Most ticket questions don\u2019t have a one-query answer. Take \u201cis something breaking across customers.\u201d We run a query, look at what comes back, and only then know whether a second, narrower query is worth running. <a href=\"https:\/\/docs.langchain.com\/oss\/python\/deepagents\/overview\">Deep Agents<\/a> handles exactly that kind of back-and-forth. 
It plans the work as a short todo list, calls tools, reads results, and decides the next step, instead of trying to answer in a single pass. It also keeps the agent\u2019s instructions lean: the role and the ticket schema stay loaded at all times, while the longer how-to guides load only when a task needs them.<\/p>\n<p>Every ticket is partitioned by its customer (partition key <code>\/customerId<\/code>), so anything scoped to one customer, like reading a single ticket or pulling everything for ACME, stays inside one partition, which keeps the query cost-effective. Queue-wide questions like triage or incident detection read across partitions instead, which is the right call when we\u2019re asking about every customer at once. The agent picks single-partition or cross-partition to match the question.<\/p>\n<h3>How it works<\/h3>\n<p>Everything the agent does to the queue goes through the tools, each a thin wrapper over a single Azure Cosmos DB operation: a query, a point read, a grouped count, and a write. The agent never gets a raw database connection. 
It works the queue with the same handful of operations a support lead would, and decides which one each request calls for.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow.png\"><img decoding=\"async\" class=\"aligncenter wp-image-12580\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-300x294.png\" alt=\"Diagram showing a support request flowing to a Support Ops Agent that plans, acts, and verifies one tool call at a time using query, point-read, aggregation, and ticket-update tools connected to an Azure Cosmos DB for NoSQL support-ticket container partitioned by customer ID.\" width=\"554\" height=\"543\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-300x294.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-1024x1002.png 1024w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-768x752.png 768w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-24x24.png 24w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow-48x48.png 48w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2026\/06\/req-flow.png 1498w\" sizes=\"(max-width: 554px) 100vw, 554px\" \/><\/a><\/p>\n<p><code>run_query<\/code> is the one the agent reaches for most. It takes a <code>SELECT<\/code> and runs it cross-partition, which is what lets the agent search the whole queue. It\u2019s read-only: anything that isn\u2019t a <code>SELECT<\/code> is refused, and so is a cross-partition <code>GROUP BY<\/code> (more on that below). 
Writes have their own tool.<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">@tool\r\ndef run_query(query: str, parameters: str = \"[]\") -&gt; str:\r\n    \"\"\"Run a read-only Cosmos DB NoSQL SELECT over the tickets container.\"\"\"\r\n    stripped = query.strip()\r\n    if not stripped.upper().startswith(\"SELECT\"):\r\n        return \"Error: only SELECT queries are allowed. Use update_ticket for writes.\"\r\n    items = _get_container().query_items(\r\n        query=stripped,\r\n        parameters=json.loads(parameters) or None,\r\n        enable_cross_partition_query=True,\r\n    )\r\n ...<\/code><\/pre>\n<p><code>read_ticket<\/code> is the cheap path. When the agent already knows the ticket id and the customer, it does a point read by id and partition key, which costs around 1 RU for a small item, instead of running a query.<\/p>\n<p><code>update_ticket<\/code> is the only way the agent writes. It patches a ticket in place, always refreshes <code>updatedAt<\/code>, and appends an entry to the ticket\u2019s history array, so every change it makes stays traceable.<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">ops = [{\"op\": \"set\", \"path\": f\"\/{k}\", \"value\": v} for k, v in fields.items()]\r\nops.append({\"op\": \"set\", \"path\": \"\/updatedAt\", \"value\": now})\r\nif history_note:\r\n    ops.append({\r\n        \"op\": \"add\", \"path\": \"\/history\/-\",\r\n        \"value\": {\"at\": now, \"by\": history_by, \"note\": history_note},\r\n    })\r\n_get_container().patch_item(\r\n    item=ticket_id, partition_key=customer_id, patch_operations=ops,\r\n)<\/code><\/pre>\n<p><code>aggregate_tickets<\/code> answers the queue-health questions: how many tickets sit in each status, which area is busiest, who is carrying the most load. It counts tickets across the whole queue, grouped by a single field.<\/p>\n<p>You might expect that to be a plain <code>GROUP BY<\/code>, and in Azure Cosmos DB\u2019s query language it is. 
The catch is in the SDK. The <code>azure-cosmos<\/code> Python SDK runs a <code>GROUP BY<\/code> fine within a single partition, but refuses one that spans partitions, returning <code>\u201cCross partition query only supports \u2018VALUE &lt;AggregateFunc&gt;\u2019 for aggregates.\u201d<\/code><\/p>\n<p>The support queue spans every customer, so the grouped counts have to come some other way. <code>aggregate_tickets<\/code> projects the one field across partitions and counts the values in Python instead, and <code>run_query<\/code> points the agent here whenever it reaches for a <code>GROUP BY<\/code>.<\/p>\n<h2>Support Ops agent in action<\/h2>\n<p>I\u2019ll use three requests from the sample runs to show what that looks like. They start the way a support lead would ask them, and the agent has to turn each one into the right mix of queries, reads, and updates.<\/p>\n<h3>Morning triage<\/h3>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">I just got in, what should I look at first?<\/code><\/pre>\n<p>There\u2019s no single query for \u201cwhat should I look at first.\u201d The agent reads it as a question about risk: which active tickets are closest to slipping. A ticket has no \u201cat risk\u201d field to sort by, so the agent has to decide what at-risk means and turn that into a query. It settles on the tickets that are high priority, still active, and have gone untouched for a few days, and asks Azure Cosmos DB for those, oldest first.<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">SELECT c.id, c.customerId, c.priority, c.status, c.area, c.assignee, c.updatedAt\r\nFROM c\r\nWHERE c.priority IN ('P1','P2')\r\n  AND c.status IN ('open','in-progress')\r\n  AND c.updatedAt &lt; @staleBefore\r\nORDER BY c.updatedAt ASC<\/code><\/pre>\n<p>It doesn\u2019t stop at the list. 
It also runs a few grouped counts to size up the rest of the queue behind those tickets, then reports the handful that matter with a one-line reason each and a suggested order to work them:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">Start with the stale high-priority active tickets:\r\n\r\n- TICKET-1004 (UMBRELLA): P1 open, unassigned, data issue, untouched since\r\n  Jun 5. Oldest urgent ticket in the queue.\r\n- TICKET-1003 (INITECH): P2 open, unassigned, login lockout, untouched since\r\n  Jun 7. Customer access issue and stale.\r\n- TICKET-1001 (ACME): P1 open, unassigned, billing, untouched since Jun 8.\r\n  Revenue-impacting and still idle.\r\n- TICKET-1002 (GLOBEX): P1 in-progress, performance, with agent.lee, untouched\r\n  since Jun 9. Already owned, but stale enough to need a status check.\r\n- TICKET-1010 (INITECH): P2 open, unassigned, performance, untouched since\r\n  Jun 11.\r\n\r\nQueue health in brief: 93 active tickets (57 open, 36 in-progress), and 72 of\r\nthem are unassigned. Hottest active area is api (22), then billing (21).<\/code><\/pre>\n<p>Note how the user didn\u2019t define \u201cat risk\u201d up front \u2013 the agent had to turn that into something Azure Cosmos DB could query.<\/p>\n<h3>Resolve a ticket<\/h3>\n<p>The next request names a specific ticket and expects something done about it:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">GLOBEX is unhappy about TICKET-1050, can you pick it up and move it forward?<\/code><\/pre>\n<p>The agent starts with a point read on the ticket. It knows the id and the customer, so this stays inside GLOBEX\u2019s partition. The ticket comes back as a login issue, open and unassigned. Before touching it, the agent looks for context: it pulls GLOBEX\u2019s other tickets, then checks the active login tickets across the rest of the queue, and notices this isn\u2019t an isolated report. 
Several customers have similar authentication problems open right now.<\/p>\n<p>With that picture, it does the obvious thing: it gives the ticket an owner and a status update. It applies the change through <code>update_ticket<\/code>, which patches the fields, refreshes <code>updatedAt<\/code>, and appends a history note in one write so the trail is clear.<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">status \u2192 in-progress\r\nassignee \u2192 agent.dana\r\ntags += login, auth, needs-investigation\r\nhistory += \"Picked up after customer escalation; flagged for auth investigation.\"<\/code><\/pre>\n<p>Then it reads the ticket back to confirm the change landed, rather than trusting the write went through, and reports what it did and why:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">TICKET-1050 (GLOBEX) is now in progress with agent.dana assigned.\r\n\r\nWhy this move:\r\n- It was a stale P2 login issue sitting open and unassigned.\r\n- There are similar active login\/auth tickets across other customers, so this\r\nmay be part of a broader authentication pattern rather than an isolated report.\r\n\r\nVerified state: status in-progress, assignee agent.dana, tags\r\nlogin \/ auth \/ needs-investigation, history note added.<\/code><\/pre>\n<p>For <code>TICKET-1050<\/code>, the agent didn\u2019t just assign an owner. It checked the customer context, looked for similar login issues, made the update, and then read the ticket back. The agent should not report a write until it has read the updated ticket and seen the new status, assignee, tags, and history note.<\/p>\n<h3>Spot an incident<\/h3>\n<p>Now try a login-related request that starts broad:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">Logins feel shaky this week, dig in and flag anything related.<\/code><\/pre>\n<p>The agent pulls in its incident-search guide before it starts querying. 
Its first pass is the obvious one: active tickets in the login area. That catches part of the problem, but not all of it. People describe the same symptom in different words, and some login tickets land under another area. So the agent runs a second query against the ticket text, looking for authentication symptoms that landed somewhere else.<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">SELECT c.id, c.area, c.title FROM c\r\nWHERE c.area != 'login' AND c.status IN ('open','in-progress')\r\n  AND (CONTAINS(LOWER(c.description), 'authentication')\r\n       OR CONTAINS(LOWER(c.description), 'sign in')\r\n       OR CONTAINS(LOWER(c.description), 'log in'))<\/code><\/pre>\n<p>That second query turns up <strong>TICKET-1052 (STARK)<\/strong>, filed under <code>other<\/code> as \u201cAuthentication errors on mobile app\u201d, which an <code>area = 'login'<\/code> filter would have missed entirely. A few more queries (grouped counts by customer, and a check on how recent these are) confirm the issue:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">There is a real login\/auth cluster.\r\n\r\n7 active tickets across 6 customers, 6 of them created this week and 4 landed on\r\nJun 13 alone. 
The most urgent:\r\n\r\n- TICKET-1051 (HOOLI): P1 open, \"Multiple users can't log in this morning\"\r\n- TICKET-1053 (UMBRELLA): P2 open, \"SSO login broken after update\"\r\n- TICKET-1052 (STARK): P2 open, area `other`, \"Authentication errors on mobile app\"\r\n- TICKET-1050 (GLOBEX): P2 open, \"Cannot sign in - authentication failing\"\r\n- TICKET-1042 (ACME): P2 open, \"Unable to log in after MFA enabled\"\r\n\r\nOne relevant ticket is misfiled under `other`, so a pure area = 'login' view\r\nundercounts the issue.\r\n\r\nIf you want, I can tag these 7 with a shared marker like known-issue:login-surge\r\nso the cluster is easier to track.<\/code><\/pre>\n<p>Tagging seven tickets is different from updating one, so the agent stops and asks first. If the user confirms, it runs the <code>update_ticket<\/code> patch on each one, appending the tag and a history note. The login surge only becomes visible after the active-login query, the text search, the customer counts, and the dates are looked at together.<\/p>\n<h2>Try it, and build your own<\/h2>\n<p>The repo has everything to run this against your own Azure Cosmos DB account: the tools, the seed data, and a CLI that streams each step as the agent works. The <a href=\"https:\/\/github.com\/abhirockzz\/deepagents-cosmosdb-support-ops\/blob\/main\/README.md\">README<\/a> walks through setup and <code>az login<\/code> authentication. Run <code>python seed.py<\/code> to load the support queue data, then replay the runs above or ask the agent your own questions.<\/p>\n<p>Once you have the sample running, try the same idea with data from one of your own workflows. Start with read-only questions and watch how the agent breaks them into Azure Cosmos DB operations. Then add scoped writes when the boundary is clear: what the agent can change, what history it should leave, and how it verifies the result. 
That could be support tickets, incidents, orders, devices, or any other operational data where a multi-step agent can help.<\/p>\n<h2>Learn more<\/h2>\n<p>\ud83d\udcd8 For the agent framework, start with the <a href=\"https:\/\/docs.langchain.com\/oss\/python\/deepagents\/overview\">Deep Agents docs<\/a>.<\/p>\n<p>\ud83d\udcd8 <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/ai-agents\">AI agents in Azure Cosmos DB<\/a> is a good place to step back and review the broader agent concepts: planning, tool use, memory, copilots, autonomous agents, and multi-agent systems.<\/p>\n<p>\ud83d\udcd8 <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/gen-ai\/agentic-retrieval\">Agentic Retrieval Toolkit<\/a> shows how to ground answers with multi-step retrieval over Cosmos DB data.<\/p>\n<p>\ud83d\udcd8 <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/gen-ai\/agent-memory-toolkit\">Agent Memory Toolkit<\/a> covers durable agent memory backed by Cosmos DB.<\/p>\n<p>\ud83d\udcd8 <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/gen-ai\/model-context-protocol-toolkit\">MCP Toolkit for Azure Cosmos DB<\/a> shows another way to expose Cosmos DB capabilities to agentic applications.<\/p>\n<div class=\"markdown-heading\" dir=\"auto\">\n<h2 class=\"heading-element\" dir=\"auto\" tabindex=\"-1\">About Azure Cosmos DB<\/h2>\n<\/div>\n<p dir=\"auto\">Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. 
With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.<\/p>\n<p dir=\"auto\">To stay in the loop on Azure Cosmos DB updates, follow us on\u00a0<a href=\"https:\/\/twitter.com\/AzureCosmosDB\" rel=\"nofollow\">X<\/a>,\u00a0<a href=\"https:\/\/aka.ms\/AzureCosmosDBYouTube\" rel=\"nofollow\">YouTube<\/a>, and\u00a0<a href=\"https:\/\/www.linkedin.com\/company\/azure-cosmos-db\/\" rel=\"nofollow\">LinkedIn<\/a>. Join the discussion with other developers on the\u00a0<a href=\"https:\/\/discord.gg\/pczdC2SU\" rel=\"nofollow\">#nosql channel on the Microsoft Open Source Discord<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep Agents is an agent harness built on LangGraph, for agents that need to work through a task over many steps instead of a single LLM call. The agent runs tools, looks at the results, and uses that to pick the next one, keeping a todo list as it goes. On top of that loop [&hellip;]<\/p>\n","protected":false},"author":181737,"featured_media":12615,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14],"tags":[],"class_list":["post-12577","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-core-sql-api"],"acf":[],"blog_post_summary":"<p>Deep Agents is an agent harness built on LangGraph, for agents that need to work through a task over many steps instead of a single LLM call. The agent runs tools, looks at the results, and uses that to pick the next one, keeping a todo list as it goes. 
On top of that loop [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/12577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/181737"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=12577"}],"version-history":[{"count":2,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/12577\/revisions"}],"predecessor-version":[{"id":12616,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/12577\/revisions\/12616"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/12615"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=12577"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=12577"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=12577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}