Introducing a new system function and optimized query operators

Justine Cocchi

The query engine in Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index. In this post we’ll cover the new DateTimeBin function as well as improvements to GROUP BY, DISTINCT, OFFSET LIMIT, and JOIN.

These example scenarios are for Contoso, a fictional online retailer that stores a product catalog of 1 million items in Azure Cosmos DB. They have 1.5 GB of data stored in a container partitioned by the id property. Here’s an example document: 

{ 
    "id": "4e487804-a9a8-463c-b767-5fe307a86c47",
    "Name": "Embroidered Silk Shirt", 
    "Price": 35.0, 
    "Category": "Clothing", 
    "Description": "Luxurious silk shirt with hand embroidered flowers.",
    "FirstAvailable": "2021-06-21T00:00:00.0000000Z",
    "CustomerRatings": [ 
        { 
            "Username": "User1", 
            "Stars": 5 
        }, 
        { 
            "Username": "User2", 
            "Stars": 4 
        }, 
        { 
            "Username": "User3", 
            "Stars": 5 
        } 
    ] 
}

DateTimeBin system function 

Contoso wants to know how many products were made available each week in June so far. They can use DateTimeBin, a new system function that makes aggregation queries with dates even more useful for analyzing data. 

Contoso wrote the query below to achieve this: 

SELECT Count(1) as NewProduts, DateTimeBin(c.FirstAvailable, 'd', 7) AS DayAvailable 
FROM c 
WHERE c.FirstAvailable > "2022-06-01T00:00:00.0000000Z" 
GROUP BY DateTimeBin(c.FirstAvailable, 'd', 7)

When using the DateTimeBin function, specify a UTC DateTime string in the format `YYYY-MM-DDThh:mm:ss. fffffffZ`, a unit of measurement to bin by, and optionally the size of the bins. A full list of supported parameters can be found in the documentation. 

 

GROUP BY and DISTINCT optimizations

Contoso can also take advantage of new improvements to GROUP BY and DISTINCT that allow them to better utilize the index. Previously, queries with these operators would use the index while evaluating filters, but not for evaluating the operator itself. With these new improvements these operators can also benefit from either single or composite indices, leading to both increased performance and reduced RUs. 

 

GROUP BY

In the query below, Contoso is using GROUP BY to get the average price of products in each category. Because there is an aggregation on Price in addition to grouping by Category, this query would benefit from a composite index on those two properties. If there were filter properties outside of the aggregation or grouping, a single index on those properties would also be evaluated. 

SELECT AVG(c.Price), c.Category  
FROM c  
GROUP BY c.Category

Previous query charge: 23,885.49 RUs               Previous execution time: 30.72 seconds 

New query charge: 371.51 RUs                           New execution time: 6.35 seconds 

 

DISTINCT

Contoso is using DISTINCT to get the name of all of their products that are over $500. Because this query can now take advantage of a single index on Name, it is much more efficient. 

SELECT DISTINCT c.Name  
FROM c 
WHERE c.Price > 500

Previous query charge: 16,545.58 RUs                Previous execution time: 47.38 seconds 

New query charge: 1,005.8 RUs                           New execution time: 11.43 seconds 

 

OFFSET LIMIT optimizations

The query engine will now be able to avoid document loading time and parsing time for documents outside of the offset range. Queries with large documents and a high offset will see proportional benefits. These improvements can be seen in queries with some combination of ORDER BY and filter clauses like the query below from Contoso which gets all products in a given price range by their FirstAvailable date. Queries with other operators such as aggregations, joins, or subqueries cannot take advantage of these improvements. 

SELECT c.Name, c.Price, c.FirstAvailable 
FROM c 
WHERE c.Price > 15 AND c.Price < 50 
ORDER BY c.FirstAvailable DESC 
OFFSET 1000 LIMIT 100

Previous query charge: 332.92 RUs                     Previous execution time: 8.63 seconds 

New query charge: 311.2 RUs                              New execution time: 7.01 seconds 

 

JOIN optimizations

Contoso lets users rate the products sold on their site from 1 – 5 stars. To make sure their product quality is high, they want to check which of their products that cost over $100 have a rating below 3 stars. To achieve this Contoso can use JOIN to create a cross product of customer ratings and products in their catalog.  

SELECT c.Name, c.Price, r as Rating 
FROM c 
JOIN r IN c.CustomerRatings 
WHERE c.Price > 100 and r.Stars < 3 

Previously, Contoso had re-written this query to use a subquery which is more efficient when there is a large array and a filter on a property from that array. With a subquery, the filter is applied before the JOIN to ignore irrelevant rows before creating the cross product instead of after the join is applied. In this case, Contoso wants to only look at customer ratings below 3 stars. 

SELECT c.Name, c.Price, Rating 
FROM c 
JOIN (SELECT VALUE r FROM r IN c.CustomerRatings WHERE r.Stars < 3) AS Rating 
WHERE c.Price > 100 

Now, with new improvements to JOIN, Contoso no longer has to explicitly write this subquery themselves. The query engine will now automatically push down the filters while evaluating the cross product for you. This means that Contoso can now write this query in its original, simpler form to achieve the same performance. 

 

Learn more

7 comments

Discussion is closed. Login to edit/delete existing comments.

  • Erik O'Leary 0

    This is really exciting!

    I’m trying out the group by on my own database now, and I see that a simple COUNT(c) on a GROUP BY’d set of hundreds of thousands of docs is only 14 RU now, but if I add a SUM or AVG to that selection it jumps up to 7000 RUs. Can the index be used to improve on that as well?

    Examples:

    -- 14ru
    select c.Stage, count(c) Count
    from c
    group by c.Stage
    
    -- 7000 + ru
    select c.Stage, SUM(c.Amount) Amount
    from c
    group by c.Stage
  • Kasper Nagel Nielsen 0

    Hi @Justine

    We are running Cosmos DB in UK South but can’t see any gained performance on our GROUP BY queries .

    Is this available in all regions ?

    If yes – Can you give a hint on how to optimise/index below example on ItemId

    select count(d.itemId) as ‘count’
    from (
    select distinct count(1) as num, c.itemId
    from c join cc in c.images
    where (
    not IS_DEFINED(c.isDeleted) or c.isDeleted = false
    )
    and cc.isTraining = true
    group by c.itemId
    ) d
    where d.num > 1

    • Kasper Nagel Nielsen 0

      @Justine

      Also can you help explain why the result differs on each query run

      1st run Count = 14744
      2nd run Count = 14738
      3rd run Count = 14741

      RU/s around 7-8K

      • Justine CocchiMicrosoft employee 0

        Hi Kasper! Yes, these improvements have been deployed to all regions including UK South.

        This query doesn’t take advantage of the improvements to GROUP BY because after the JOIN and filter on cc.isTraining = true, c.itemId is no longer precise predicate.

        You are seeing different results on various runs of the query because it can’t be properly distributed across partitions, leading to incorrect results. There is a limitation on properly distributing nested subqueries with DISTINCT or GROUP BY when there is extra processing required after distribution (in this case, the WHERE d.num > 1 after your subquery). You can find more information and a comprehensive list of limitations here: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-keywords#distinct

        Without knowing exactly what this query is trying to achieve it is hard to give definitive advice on how to improve it, but you can try something like this which will allow GROUP BY to use the index:

        SELECT c.itemId, Count(1) AS count
        FROM c
        WHERE (NOT IS_DEFINED(c.isDeleted) OR c.isDeleted = false) AND
            EXISTS(SELECT VALUE cc FROM cc IN c.images WHERE cc.isTraining)
        GROUP BY c.itemId
        • Kasper Nagel Nielsen 0

          Thanks alot Justine

          We wasn’t aware of this limitation and the query helps using the index 😉

          Will look into the right query behavior – thanks for the hints

  • Himal Patel 0

    Great Improvement!! Is there any similar improvement on Gremlin API ?

Feedback usabilla icon