Introducing a new system function and optimized query operators

The query engine in Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index. In this post we’ll cover the new DateTimeBin function as well as improvements to GROUP BY, DISTINCT, OFFSET LIMIT, and JOIN.

These example scenarios are for Contoso, a fictional online retailer that stores a product catalog of 1 million items in Azure Cosmos DB. They have 1.5 GB of data stored in a container partitioned by the id property. Here’s an example document:

{ 
    "id": "4e487804-a9a8-463c-b767-5fe307a86c47",
    "Name": "Embroidered Silk Shirt", 
    "Price": 35.0, 
    "Category": "Clothing", 
    "Description": "Luxurious silk shirt with hand embroidered flowers.",
    "FirstAvailable": "2021-06-21T00:00:00.0000000Z",
    "CustomerRatings": [ 
        { 
            "Username": "User1", 
            "Stars": 5 
        }, 
        { 
            "Username": "User2", 
            "Stars": 4 
        }, 
        { 
            "Username": "User3", 
            "Stars": 5 
        } 
    ] 
}

DateTimeBin system function

Contoso wants to know how many products were made available each week in June so far. They can use DateTimeBin, a new system function that makes aggregation queries with dates even more useful for analyzing data.

Contoso wrote the query below to achieve this:

SELECT Count(1) as NewProduts, DateTimeBin(c.FirstAvailable, 'd', 7) AS DayAvailable 
FROM c 
WHERE c.FirstAvailable > "2022-06-01T00:00:00.0000000Z" 
GROUP BY DateTimeBin(c.FirstAvailable, 'd', 7)

When using the DateTimeBin function, specify a UTC DateTime string in the format `YYYY-MM-DDThh:mm:ss. fffffffZ`, a unit of measurement to bin by, and optionally the size of the bins. A full list of supported parameters can be found in the documentation.

GROUP BY and DISTINCT optimizations

Contoso can also take advantage of new improvements to GROUP BY and DISTINCT that allow them to better utilize the index. Previously, queries with these operators would use the index while evaluating filters, but not for evaluating the operator itself. With these new improvements these operators can also benefit from either single or composite indices, leading to both increased performance and reduced RUs.

GROUP BY

In the query below, Contoso is using GROUP BY to get the average price of products in each category. Because there is an aggregation on Price in addition to grouping by Category, this query would benefit from a composite index on those two properties. If there were filter properties outside of the aggregation or grouping, a single index on those properties would also be evaluated.

SELECT AVG(c.Price), c.Category  
FROM c  
GROUP BY c.Category

Previous query charge: 23,885.49 RUs Previous execution time: 30.72 seconds

New query charge: 371.51 RUs New execution time: 6.35 seconds

DISTINCT

Contoso is using DISTINCT to get the name of all of their products that are over $500. Because this query can now take advantage of a single index on Name, it is much more efficient.

SELECT DISTINCT c.Name  
FROM c 
WHERE c.Price > 500

Previous query charge: 16,545.58 RUs Previous execution time: 47.38 seconds

New query charge: 1,005.8 RUs New execution time: 11.43 seconds

OFFSET LIMIT optimizations

The query engine will now be able to avoid document loading time and parsing time for documents outside of the offset range. Queries with large documents and a high offset will see proportional benefits. These improvements can be seen in queries with some combination of ORDER BY and filter clauses like the query below from Contoso which gets all products in a given price range by their FirstAvailable date. Queries with other operators such as aggregations, joins, or subqueries cannot take advantage of these improvements.

SELECT c.Name, c.Price, c.FirstAvailable 
FROM c 
WHERE c.Price > 15 AND c.Price < 50 
ORDER BY c.FirstAvailable DESC 
OFFSET 1000 LIMIT 100

Previous query charge: 332.92 RUs Previous execution time: 8.63 seconds

New query charge: 311.2 RUs New execution time: 7.01 seconds

JOIN optimizations

Contoso lets users rate the products sold on their site from 1 – 5 stars. To make sure their product quality is high, they want to check which of their products that cost over $100 have a rating below 3 stars. To achieve this Contoso can use JOIN to create a cross product of customer ratings and products in their catalog.

SELECT c.Name, c.Price, r as Rating 
FROM c 
JOIN r IN c.CustomerRatings 
WHERE c.Price > 100 and r.Stars < 3

Previously, Contoso had re-written this query to use a subquery which is more efficient when there is a large array and a filter on a property from that array. With a subquery, the filter is applied before the JOIN to ignore irrelevant rows before creating the cross product instead of after the join is applied. In this case, Contoso wants to only look at customer ratings below 3 stars.

SELECT c.Name, c.Price, Rating 
FROM c 
JOIN (SELECT VALUE r FROM r IN c.CustomerRatings WHERE r.Stars < 3) AS Rating 
WHERE c.Price > 100

Now, with new improvements to JOIN, Contoso no longer has to explicitly write this subquery themselves. The query engine will now automatically push down the filters while evaluating the cross product for you. This means that Contoso can now write this query in its original, simpler form to achieve the same performance.

Learn more

7 comments

Himal Patel July 7, 2022

Great Improvement!! Is there any similar improvement on Gremlin API ?

Kasper Nagel Nielsen June 29, 2022

Hi @Justine

We are running Cosmos DB in UK South but can’t see any gained performance on our GROUP BY queries .

Is this available in all regions ?

If yes – Can you give a hint on how to optimise/index below example on ItemId

select count(d.itemId) as ‘count’
from (
select distinct count(1) as num, c.itemId
from c join cc in c.images
where (
not IS_DEFINED(c.isDeleted) or c.isDeleted = false
)
and cc.isTraining = true
group by c.itemId
) d
where d.num > 1

Kasper Nagel Nielsen June 29, 2022

@Justine

Also can you help explain why the result differs on each query run

1st run Count = 14744
2nd run Count = 14738
3rd run Count = 14741

RU/s around 7-8K
- Justine Cocchi Author July 1, 2022
  Hi Kasper! Yes, these improvements have been deployed to all regions including UK South.
  
  This query doesn't take advantage of the improvements to GROUP BY because after the JOIN and filter on cc.isTraining = true, c.itemId is no longer precise predicate.
  
  You are seeing different results on various runs of the query because it can't be properly distributed across partitions, leading to incorrect results. There is a limitation on properly distributing nested subqueries with DISTINCT or GROUP BY when there is extra processing required after distribution (in this case, the WHERE d.num > 1 after your subquery). You can find more...
  Read more
  Hi Kasper! Yes, these improvements have been deployed to all regions including UK South.
  
  This query doesn’t take advantage of the improvements to GROUP BY because after the JOIN and filter on cc.isTraining = true, c.itemId is no longer precise predicate.
  
  You are seeing different results on various runs of the query because it can’t be properly distributed across partitions, leading to incorrect results. There is a limitation on properly distributing nested subqueries with DISTINCT or GROUP BY when there is extra processing required after distribution (in this case, the WHERE d.num > 1 after your subquery). You can find more information and a comprehensive list of limitations here: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-keywords#distinct
  
  Without knowing exactly what this query is trying to achieve it is hard to give definitive advice on how to improve it, but you can try something like this which will allow GROUP BY to use the index:
  
  SELECT c.itemId, Count(1) AS count FROM c WHERE (NOT IS_DEFINED(c.isDeleted) OR c.isDeleted = false) AND EXISTS(SELECT VALUE cc FROM cc IN c.images WHERE cc.isTraining) GROUP BY c.itemId
  Read less
- Kasper Nagel Nielsen July 4, 2022 · Edited
  
  Thanks alot Justine
  
  We wasn’t aware of this limitation and the query helps using the index 😉
  
  Will look into the right query behavior – thanks for the hints

Erik O'Leary June 22, 2022

This is really exciting!

I’m trying out the group by on my own database now, and I see that a simple COUNT(c) on a GROUP BY’d set of hundreds of thousands of docs is only 14 RU now, but if I add a SUM or AVG to that selection it jumps up to 7000 RUs. Can the index be used to improve on that as well?

Examples:

-- 14ru
select c.Stage, count(c) Count
from c
group by c.Stage

-- 7000 + ru
select c.Stage, SUM(c.Amount) Amount
from c
group by c.Stage

Justine Cocchi Author June 28, 2022 · Edited

Hi Erik! Yes, a composite index can be used to optimize queries with an aggregation and group by. Your query above can benefit from a composite index on Stage and Amount. You can learn more about adding a composite index here: https://docs.microsoft.com/azure/cosmos-db/sql/how-to-manage-indexing-policy#composite-index

Introducing a new system function and optimized query operators

DateTimeBin system function

GROUP BY and DISTINCT optimizations

GROUP BY

DISTINCT

OFFSET LIMIT optimizations

JOIN optimizations

Learn more

Category

Author

7 comments

Read next

Benchmarking Data Migration from Cassandra to Azure Cosmos DB Cassandra API

Run parallel CRUD operations with Python Async IO for Azure Cosmos DB

DateTimeBin system function

GROUP BY and DISTINCT optimizations

GROUP BY

DISTINCT

OFFSET LIMIT optimizations

JOIN optimizations

Learn more

Category

Share

Author

7 comments

Read next

Benchmarking Data Migration from Cassandra to Azure Cosmos DB Cassandra API

Run parallel CRUD operations with Python Async IO for Azure Cosmos DB

Stay informed