June 22nd, 2026

How to Use Deep Agents with Azure Cosmos DB – Plan, act, and verify against operational data

Principal Product Manager

Deep Agents is an agent harness built on LangGraph, for agents that need to work through a task over many steps instead of a single LLM call. The agent runs tools, looks at the results, and uses that to pick the next one, keeping a todo list as it goes. On top of that loop the harness brings what a longer-running agent needs. It can load instructions on demand instead of holding everything in the prompt (skills), offload large tool outputs so they don’t fill the context window, and pause for human approval in apps that need an approval gate before data changes.

Support Ops Agent is a sample app that puts this to work on a customer-support ticket queue. We can ask it which tickets are at risk, who’s overloaded, or whether a run of similar complaints is really one outage. When a ticket needs to change, it updates the ticket and reads it back to confirm. Most requests become a handful of reads against the queue. Requests that change a ticket add a patch and a verification read.

That queue lives in Azure Cosmos DB, the operational database the support team already runs on. The agent reads and writes that same store through the Azure Cosmos DB SDK, so it works on the live tickets, with no side index to keep in sync. Each ticket is an Azure Cosmos DB item, with its tags and history kept right inside it, and the agent updates that item directly. With the partition key doing its job, point reads and customer-scoped queries stay cheap. Queue-wide investigations spend RUs based on the cross-partition work they do, which is why the tools project only the fields they need. The schema is flexible, so the agent can add a tag or append to a history array without a migration.
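For a concrete picture, a ticket item might look roughly like the sketch below. The field names are the ones used by the queries and updates later in this post. The values, the null assignee, and the history entry are illustrative assumptions, not the sample’s actual seed data.

```python
# Hypothetical ticket item, shown as the Python dict the SDK would return.
# Field names follow the ones used later in this post; values are made up.
ticket = {
    "id": "TICKET-0000",
    "customerId": "EXAMPLE",      # partition key value (/customerId)
    "title": "Cannot sign in",
    "description": "Users see an authentication error after entering credentials.",
    "priority": "P2",
    "status": "open",
    "area": "login",
    "assignee": None,             # assumption: unassigned is stored as null
    "tags": ["login"],
    "updatedAt": "2026-06-01T09:00:00Z",
    "history": [
        {"at": "2026-06-01T09:00:00Z", "by": "system", "note": "Ticket created."},
    ],
}

# Tags and history live inside the item, so adding one is an in-place change
# to the same document rather than a write to a separate table.
ticket["tags"].append("needs-investigation")
```

Because the tags and history arrays are part of the item, a new tag or history entry needs no schema change.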

The code is on GitHub with instructions to run it against your own Azure Cosmos DB account.

In this post, I’ll go through:

  • what the agent can do, and the Azure Cosmos DB operation behind each kind of request
  • why Deep Agents and Azure Cosmos DB fit this problem
  • the tools it uses to work on the ticket queue
  • practical examples of how the agent works: morning triage, resolving a ticket, and spotting an incident

Agent capabilities

The requests in this sample all come down to a few Azure Cosmos DB operations. Some questions only need reads. Others need the agent to read first, decide what changed, and then patch the ticket.

Ask it to… | What the agent does | Cosmos DB operations
Triage the queue | Finds the at-risk tickets (high priority, still active, gone stale) and reports the handful that actually matter | cross-partition query, filter, ORDER BY
Resolve a ticket | Point-reads the ticket, checks related ones from the same customer, updates status, owner, and history, then re-reads to confirm | point read, related-item query, update, verify
Spot an incident | Searches for a cluster across customers, including symptoms filed under the wrong area, and can tag the group as a known issue | multi-step query, repeated patches
Check queue health | Summarizes the queue by status, by area, and by who is carrying the load | grouped counts
Cover for someone | Takes an absent agent’s active tickets and moves them to whoever has the lightest load, then confirms the rebalance | grouped counts, repeated patches

I’ll walk through the first three below. The other two use the same tools, so they are useful checks when you run the sample yourself.

Approach: Agentic vs Static

Most ticket questions don’t have a one-query answer. Take “is something breaking across customers.” We run a query, look at what comes back, and only then know whether a second, narrower query is worth running. Deep Agents handles exactly that kind of back-and-forth. It plans the work as a short todo list, calls tools, reads results, and decides the next step, instead of trying to answer in a single pass. It also keeps the agent’s instructions lean: the role and the ticket schema stay loaded at all times, while the longer how-to guides load only when a task needs them.

Every ticket is stored under its customer (/customerId), so anything scoped to one customer, like reading a single ticket or pulling everything for ACME, stays inside one partition, which keeps it cost-effective. Queue-wide questions like triage or incident detection read across partitions instead, which is the right call when we’re asking about every customer at once. The agent picks single-partition or cross-partition to match the question.
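As a rough sketch of that difference, the two helpers below scope a query either to one customer or to the whole queue. They assume `container` behaves like an azure-cosmos `ContainerProxy`; the function names and the projected fields are hypothetical and not taken from the sample.

```python
from typing import Any


def customer_tickets(container: Any, customer_id: str) -> list[dict]:
    """Single-partition query: everything for one customer."""
    return list(container.query_items(
        query="SELECT c.id, c.status, c.priority FROM c",
        partition_key=customer_id,  # scopes the query to one logical partition
    ))


def active_tickets_everywhere(container: Any) -> list[dict]:
    """Cross-partition query: active tickets for every customer."""
    return list(container.query_items(
        query="SELECT c.id, c.customerId FROM c "
              "WHERE c.status IN ('open','in-progress')",
        enable_cross_partition_query=True,  # fan out across partitions
    ))
```

Passing `partition_key` keeps the first query inside one customer’s partition, while the second pays for the fan-out because the question really is about every customer.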

How it works

Everything the agent does to the queue goes through the tools, each a thin wrapper over a single Azure Cosmos DB operation: a query, a point read, a grouped count, and a write. The agent never gets a raw database connection. It works the queue with the same handful of operations a support lead would, and decides which one each request calls for.

Diagram showing a support request flowing to a Support Ops Agent that plans, acts, and verifies one tool call at a time using query, point-read, aggregation, and ticket-update tools connected to an Azure Cosmos DB for NoSQL support-ticket container partitioned by customer ID.

run_query is the one the agent reaches for most. It takes a SELECT and runs it cross-partition, which is what lets the agent search the whole queue. It’s read-only: anything that isn’t a SELECT is refused, and so is a cross-partition GROUP BY (more on that below). Writes have their own tool.

@tool
def run_query(query: str, parameters: str = "[]") -> str:
    """Run a read-only Cosmos DB NoSQL SELECT over the tickets container."""
    stripped = query.strip()
    if not stripped.upper().startswith("SELECT"):
        return "Error: only SELECT queries are allowed. Use update_ticket for writes."
    items = _get_container().query_items(
        query=stripped,
        parameters=json.loads(parameters) or None,
        enable_cross_partition_query=True,
    )
    ...
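The snippet elides the rest of the function. One way the two refusal rules described above could be expressed is sketched below; `check_read_only` and the GROUP BY error wording are assumptions for illustration, not the sample’s code.

```python
import re
from typing import Optional

# Matches GROUP BY as whole words, regardless of case or spacing.
_GROUP_BY = re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE)


def check_read_only(query: str) -> Optional[str]:
    """Return an error message if the query should be refused, else None."""
    stripped = query.strip()
    if not stripped.upper().startswith("SELECT"):
        return "Error: only SELECT queries are allowed. Use update_ticket for writes."
    if _GROUP_BY.search(stripped):
        # Hypothetical wording; the point is to redirect the agent to the right tool.
        return "Error: cross-partition GROUP BY is not supported. Use aggregate_tickets."
    return None
```

A regex check like this is deliberately blunt: it would also reject a query whose string literal happens to contain the words GROUP BY, which is probably an acceptable trade for a read-only guard.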

read_ticket is the cheap path. When the agent already knows the ticket id and the customer, it does a point read by id and partition key, which costs around 1 RU for a small item, instead of running a query.
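A point read takes very little code. The sketch below assumes `container` behaves like an azure-cosmos `ContainerProxy`; the signature is simplified for illustration and is not the sample’s tool definition.

```python
from typing import Any


def read_ticket(container: Any, ticket_id: str, customer_id: str) -> dict:
    """Point read: fetch one item by id and partition key, no query involved."""
    return container.read_item(item=ticket_id, partition_key=customer_id)
```

Both values are required. Without the customer id the agent would have to fall back to a query to find the ticket.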

update_ticket is the only way the agent writes. It patches a ticket in place, always refreshes updatedAt, and appends an entry to the ticket’s history array, so every change it makes stays traceable.

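# One "set" operation per changed field, such as status or assignee.
ops = [{"op": "set", "path": f"/{k}", "value": v} for k, v in fields.items()]
# Always refresh updatedAt so staleness checks see the change.
ops.append({"op": "set", "path": "/updatedAt", "value": now})
if history_note:
    # "/history/-" appends to the end of the history array.
    ops.append({
        "op": "add", "path": "/history/-",
        "value": {"at": now, "by": history_by, "note": history_note},
    })
# Apply every operation in a single patch within the customer's partition.
_get_container().patch_item(
    item=ticket_id, partition_key=customer_id, patch_operations=ops,
)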

aggregate_tickets answers the queue-health questions: how many tickets sit in each status, which area is busiest, who is carrying the most load. It counts tickets across the whole queue, grouped by a single field.

You might expect that to be a plain GROUP BY, and in Azure Cosmos DB’s query language it is. The catch is in the SDK. The azure-cosmos Python SDK runs a GROUP BY fine within a single partition, but refuses one that spans partitions, returning “Cross partition query only supports ‘VALUE ’ for aggregates.”

The support queue spans every customer, so the grouped counts have to come some other way. aggregate_tickets projects the one field across partitions and counts the values in Python instead, and run_query points the agent here whenever it reaches for a GROUP BY.
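The counting approach described above can be sketched as follows. It assumes `container` behaves like an azure-cosmos `ContainerProxy`; `count_by`, the allow-list of fields, and the "(none)" label are hypothetical choices, not the sample’s implementation.

```python
from collections import Counter
from typing import Any

# Assumption: the fields the agent may group by; the sample's list may differ.
GROUPABLE_FIELDS = {"status", "area", "assignee", "priority", "customerId"}


def count_by(container: Any, field: str) -> dict[str, int]:
    """Count tickets per value of one field, without a cross-partition GROUP BY."""
    if field not in GROUPABLE_FIELDS:
        raise ValueError(f"cannot group by {field!r}")
    # Project only the one field; it is validated above, so the f-string is safe.
    values = container.query_items(
        query=f"SELECT VALUE c.{field} FROM c",
        enable_cross_partition_query=True,
    )
    # Null values (for example an unassigned ticket) are counted under "(none)".
    counts = Counter("(none)" if v is None else str(v) for v in values)
    return dict(counts.most_common())
```

Projecting a single field keeps the cross-partition read as small as it can be, and the allow-list matters because a field name cannot be passed as a query parameter.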

Support Ops agent in action

I’ll use three requests from the sample runs to show what that looks like. They start the way a support lead would ask them, and the agent has to turn each one into the right mix of queries, reads, and updates.

Morning triage

I just got in, what should I look at first?

There’s no single query for “what should I look at first.” The agent reads it as a question about risk: which active tickets are closest to slipping. A ticket has no “at risk” field to sort by, so the agent has to decide what at-risk means and turn that into a query. It settles on the tickets that are high priority, still active, and have gone untouched for a few days, and asks Azure Cosmos DB for those, oldest first.

SELECT c.id, c.customerId, c.priority, c.status, c.area, c.assignee, c.updatedAt
FROM c
WHERE c.priority IN ('P1','P2')
  AND c.status IN ('open','in-progress')
  AND c.updatedAt < @staleBefore
ORDER BY c.updatedAt ASC
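The @staleBefore parameter has to come from somewhere. A minimal sketch, assuming updatedAt is stored as an ISO 8601 UTC string (so string comparison matches time order) and that "a few days" means three; both are assumptions, and the helper names are hypothetical:

```python
import json
from datetime import datetime, timedelta, timezone


def stale_before(now: datetime, days: int = 3) -> str:
    """ISO 8601 UTC timestamp `days` before `now` (pass a timezone-aware value)."""
    cutoff = now.astimezone(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def triage_parameters(now: datetime, days: int = 3) -> str:
    """JSON string in the shape run_query's `parameters` argument expects."""
    return json.dumps([{"name": "@staleBefore", "value": stale_before(now, days)}])
```

Passing the cutoff as a parameter rather than pasting it into the SQL text keeps the query string identical from run to run.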

It doesn’t stop at the list. It also runs a few grouped counts to size up the rest of the queue behind those tickets, then reports the handful that matter with a one-line reason each and a suggested order to work them:

Start with the stale high-priority active tickets:

- TICKET-1004 (UMBRELLA): P1 open, unassigned, data issue, untouched since
  Jun 5. Oldest urgent ticket in the queue.
- TICKET-1003 (INITECH): P2 open, unassigned, login lockout, untouched since
  Jun 7. Customer access issue and stale.
- TICKET-1001 (ACME): P1 open, unassigned, billing, untouched since Jun 8.
  Revenue-impacting and still idle.
- TICKET-1002 (GLOBEX): P1 in-progress, performance, with agent.lee, untouched
  since Jun 9. Already owned, but stale enough to need a status check.
- TICKET-1010 (INITECH): P2 open, unassigned, performance, untouched since
  Jun 11.

Queue health in brief: 93 active tickets (57 open, 36 in-progress), and 72 of
them are unassigned. Hottest active area is api (22), then billing (21).

Note how the user didn’t define “at risk” up front – the agent had to turn that into something Azure Cosmos DB could query.

Resolve a ticket

The next request names a specific ticket and expects something done about it:

GLOBEX is unhappy about TICKET-1050, can you pick it up and move it forward?

The agent starts with a point read on the ticket. It knows the id and the customer, so this stays inside GLOBEX’s partition. The ticket comes back as a login issue, open and unassigned. Before touching it, the agent looks for context: it pulls GLOBEX’s other tickets, then checks the active login tickets across the rest of the queue, and notices this isn’t an isolated report. Several customers have similar authentication problems open right now.

With that picture, it does the obvious thing and gives the ticket an owner and a status update. It applies the change through update_ticket, which patches the fields, refreshes updatedAt, and appends a history note in one write so the trail is clear.

status → in-progress
assignee → agent.dana
tags += login, auth, needs-investigation
history += "Picked up after customer escalation; flagged for auth investigation."

Then it reads the ticket back to confirm the change landed, rather than trusting the write went through, and reports what it did and why:

TICKET-1050 (GLOBEX) is now in progress with agent.dana assigned.

Why this move:
- It was a stale P2 login issue sitting open and unassigned.
- There are similar active login/auth tickets across other customers, so this
may be part of a broader authentication pattern rather than an isolated report.

Verified state: status in-progress, assignee agent.dana, tags
login / auth / needs-investigation, history note added.

For TICKET-1050, the agent didn’t just assign an owner. It checked the customer context, looked for similar login issues, made the update, and then read the ticket back. The agent should not report a write until it has read the updated ticket and seen the new status, assignee, tags, and history note.
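That read-back check can be sketched as a small helper. It assumes `container` behaves like an azure-cosmos `ContainerProxy`; `verify_update` is a hypothetical name, and in the sample the agent does this by calling its read tool rather than a dedicated function.

```python
from typing import Any


def verify_update(container: Any, ticket_id: str, customer_id: str,
                  expected: dict[str, Any]) -> list[str]:
    """Re-read the ticket and list every field that does not match `expected`."""
    current = container.read_item(item=ticket_id, partition_key=customer_id)
    problems = []
    for field, want in expected.items():
        got = current.get(field)
        if got != want:
            problems.append(f"{field}: expected {want!r}, found {got!r}")
    return problems  # an empty list means the write is confirmed
```

The verification is a point read, so confirming a change adds very little to the cost of making it.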

Spot an incident

Now try a login-related request that starts broad:

Logins feel shaky this week, dig in and flag anything related.

The agent pulls in its incident-search guide before it starts querying. Its first pass is the obvious one: active tickets in the login area. That catches part of the problem, but not all of it. People describe the same symptom in different words, and some login tickets land under another area. So the agent runs a second query against the ticket text, looking for authentication symptoms that landed somewhere else.

SELECT c.id, c.area, c.title FROM c
WHERE c.area != 'login' AND c.status IN ('open','in-progress')
  AND (CONTAINS(LOWER(c.description), 'authentication')
       OR CONTAINS(LOWER(c.description), 'sign in')
       OR CONTAINS(LOWER(c.description), 'log in'))
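The list of symptom phrases will vary from one investigation to the next, so a query like the one above could be assembled from a list. This is a sketch only: `build_symptom_query` is a hypothetical helper, and in the sample the agent writes the SQL itself.

```python
def build_symptom_query(phrases: list[str],
                        exclude_area: str = "login") -> tuple[str, list[dict]]:
    """Build a parameterized text search for symptoms filed outside one area."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    parameters = [{"name": "@area", "value": exclude_area}]
    clauses = []
    for i, phrase in enumerate(phrases):
        name = f"@p{i}"
        # Lowercase the phrase to match LOWER(c.description) in the query.
        parameters.append({"name": name, "value": phrase.lower()})
        clauses.append(f"CONTAINS(LOWER(c.description), {name})")
    query = (
        "SELECT c.id, c.area, c.title FROM c "
        "WHERE c.area != @area AND c.status IN ('open','in-progress') "
        f"AND ({' OR '.join(clauses)})"
    )
    return query, parameters
```

Text predicates like these are evaluated against every active ticket outside the excluded area, so this kind of search suits an occasional investigation better than a hot path.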

That second query turns up TICKET-1052 (STARK), filed under other as “Authentication errors on mobile app”, which an area = 'login' filter would have missed entirely. A few more queries (grouped counts by customer, and a check on how recent these are) confirm the issue:

There is a real login/auth cluster.

7 active tickets across 6 customers, 6 of them created this week and 4 landed on
Jun 13 alone. The most urgent:

- TICKET-1051 (HOOLI): P1 open, "Multiple users can't log in this morning"
- TICKET-1053 (UMBRELLA): P2 open, "SSO login broken after update"
- TICKET-1052 (STARK): P2 open, area `other`, "Authentication errors on mobile app"
- TICKET-1050 (GLOBEX): P2 open, "Cannot sign in - authentication failing"
- TICKET-1042 (ACME): P2 open, "Unable to log in after MFA enabled"

One relevant ticket is misfiled under `other`, so a pure area = 'login' view
undercounts the issue.

If you want, I can tag these 7 with a shared marker like known-issue:login-surge
so the cluster is easier to track.

Tagging seven tickets is different from updating one, so the agent stops and asks first. If the user confirms, it makes an update_ticket patch on each one, appending the tag and a history note. The login surge only becomes visible after the active-login query, the text search, the customer counts, and the dates are looked at together.
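A bulk tag with an approval gate might look like the sketch below. It assumes `container` behaves like an azure-cosmos `ContainerProxy` and that every ticket already has a tags array and a history array; `tag_cluster`, the `confirmed` flag, and the note text are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any


def tag_cluster(container: Any, tickets: list[tuple[str, str]], tag: str,
                confirmed: bool, by: str = "support-ops-agent") -> list[str]:
    """Append `tag` and a history note to each (ticket_id, customer_id) pair."""
    if not confirmed:
        return []  # nothing is written without explicit approval
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ops = [
        {"op": "add", "path": "/tags/-", "value": tag},  # append to tags array
        {"op": "set", "path": "/updatedAt", "value": now},
        {"op": "add", "path": "/history/-",
         "value": {"at": now, "by": by, "note": f"Tagged {tag}."}},
    ]
    updated = []
    for ticket_id, customer_id in tickets:
        # One patch per ticket; each ticket lives in its own customer partition.
        container.patch_item(item=ticket_id, partition_key=customer_id,
                             patch_operations=ops)
        updated.append(ticket_id)
    return updated
```

This sketch does not check whether a ticket already carries the tag, so a repeat run would append it twice. Reading each ticket back afterwards, as in the single-ticket case, would catch that.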

Try it, and build your own

The repo has everything to run this against your own Azure Cosmos DB account: the tools, the seed data, and a CLI that streams each step as the agent works. The README walks through setup and authentication with az login. Run python seed.py to load the support queue data, then replay the runs above or ask the agent your own questions.

Once you have the sample running, try the same idea with data from one of your own workflows. Start with read-only questions and watch how the agent breaks them into Azure Cosmos DB operations. Then add scoped writes when the boundary is clear: what the agent can change, what history it should leave, and how it verifies the result. That could be support tickets, incidents, orders, devices, or any other operational data where a multi-step agent can help.

Learn more

📘 For the agent framework, start with the Deep Agents docs

📘 AI agents in Azure Cosmos DB is a good place to step back and review the broader agent concepts: planning, tool use, memory, copilots, autonomous agents, and multi-agent systems.

📘 Agentic Retrieval Toolkit shows how to ground answers with multi-step retrieval over Cosmos DB data

📘 Agent Memory Toolkit covers durable agent memory backed by Cosmos DB.

📘 MCP Toolkit for Azure Cosmos DB shows another way to expose Cosmos DB capabilities to agentic applications.

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.

To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn. Join the discussion with other developers on the #nosql channel on the Microsoft Open Source Discord.

 

Author

Abhishek Gupta
Principal Product Manager

Principal Product Manager in the Azure Cosmos DB team.
