Introducing a new system function and optimized query operators

Justine Cocchi

The query engine in Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index. In this post we’ll cover the new DateTimeBin function as well as improvements to GROUP BY, DISTINCT, OFFSET LIMIT, and JOIN.

These example scenarios are for Contoso, a fictional online retailer that stores a product catalog of 1 million items in Azure Cosmos DB. They have 1.5 GB of data stored in a container partitioned by the id property. Here’s an example document: 

{ 
    "id": "4e487804-a9a8-463c-b767-5fe307a86c47",
    "Name": "Embroidered Silk Shirt", 
    "Price": 35.0, 
    "Category": "Clothing", 
    "Description": "Luxurious silk shirt with hand embroidered flowers.",
    "FirstAvailable": "2021-06-21T00:00:00.0000000Z",
    "CustomerRatings": [ 
        { 
            "Username": "User1", 
            "Stars": 5 
        }, 
        { 
            "Username": "User2", 
            "Stars": 4 
        }, 
        { 
            "Username": "User3", 
            "Stars": 5 
        } 
    ] 
}

DateTimeBin system function 

Contoso wants to know how many products were made available each week in June so far. They can use DateTimeBin, a new system function that makes aggregation queries with dates even more useful for analyzing data. 

Contoso wrote the query below to achieve this: 

SELECT Count(1) as NewProduts, DateTimeBin(c.FirstAvailable, 'd', 7) AS DayAvailable 
FROM c 
WHERE c.FirstAvailable > "2022-06-01T00:00:00.0000000Z" 
GROUP BY DateTimeBin(c.FirstAvailable, 'd', 7)

When using the DateTimeBin function, specify a UTC DateTime string in the format `YYYY-MM-DDThh:mm:ss. fffffffZ`, a unit of measurement to bin by, and optionally the size of the bins. A full list of supported parameters can be found in the documentation. 

 

GROUP BY and DISTINCT optimizations

Contoso can also take advantage of new improvements to GROUP BY and DISTINCT that allow them to better utilize the index. Previously, queries with these operators would use the index while evaluating filters, but not for evaluating the operator itself. With these new improvements these operators can also benefit from either single or composite indices, leading to both increased performance and reduced RUs. 

 

GROUP BY

In the query below, Contoso is using GROUP BY to get the average price of products in each category. Because there is an aggregation on Price in addition to grouping by Category, this query would benefit from a composite index on those two properties. If there were filter properties outside of the aggregation or grouping, a single index on those properties would also be evaluated. 

SELECT AVG(c.Price), c.Category  
FROM c  
GROUP BY c.Category

Previous query charge: 23,885.49 RUs               Previous execution time: 30.72 seconds 

New query charge: 371.51 RUs                           New execution time: 6.35 seconds 

 

DISTINCT

Contoso is using DISTINCT to get the name of all of their products that are over $500. Because this query can now take advantage of a single index on Name, it is much more efficient. 

SELECT DISTINCT c.Name  
FROM c 
WHERE c.Price > 500

Previous query charge: 16,545.58 RUs                Previous execution time: 47.38 seconds 

New query charge: 1,005.8 RUs                           New execution time: 11.43 seconds 

 

OFFSET LIMIT optimizations

The query engine will now be able to avoid document loading time and parsing time for documents outside of the offset range. Queries with large documents and a high offset will see proportional benefits. These improvements can be seen in queries with some combination of ORDER BY and filter clauses like the query below from Contoso which gets all products in a given price range by their FirstAvailable date. Queries with other operators such as aggregations, joins, or subqueries cannot take advantage of these improvements. 

SELECT c.Name, c.Price, c.FirstAvailable 
FROM c 
WHERE c.Price > 15 AND c.Price < 50 
ORDER BY c.FirstAvailable DESC 
OFFSET 1000 LIMIT 100

Previous query charge: 332.92 RUs                     Previous execution time: 8.63 seconds 

New query charge: 311.2 RUs                              New execution time: 7.01 seconds 

 

JOIN optimizations

Contoso lets users rate the products sold on their site from 1 – 5 stars. To make sure their product quality is high, they want to check which of their products that cost over $100 have a rating below 3 stars. To achieve this Contoso can use JOIN to create a cross product of customer ratings and products in their catalog.  

SELECT c.Name, c.Price, r as Rating 
FROM c 
JOIN r IN c.CustomerRatings 
WHERE c.Price > 100 and r.Stars < 3 

Previously, Contoso had re-written this query to use a subquery which is more efficient when there is a large array and a filter on a property from that array. With a subquery, the filter is applied before the JOIN to ignore irrelevant rows before creating the cross product instead of after the join is applied. In this case, Contoso wants to only look at customer ratings below 3 stars. 

SELECT c.Name, c.Price, Rating 
FROM c 
JOIN (SELECT VALUE r FROM r IN c.CustomerRatings WHERE r.Stars < 3) AS Rating 
WHERE c.Price > 100 

Now, with new improvements to JOIN, Contoso no longer has to explicitly write this subquery themselves. The query engine will now automatically push down the filters while evaluating the cross product for you. This means that Contoso can now write this query in its original, simpler form to achieve the same performance. 

 

Learn more