The query engine in Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index. In this post we’ll cover the new DateTimeBin function as well as improvements to GROUP BY, DISTINCT, OFFSET LIMIT, and JOIN.
These example scenarios are for Contoso, a fictional online retailer that stores a product catalog of 1 million items in Azure Cosmos DB. They have 1.5 GB of data stored in a container partitioned by the id property. Here’s an example document:
{ "id": "4e487804-a9a8-463c-b767-5fe307a86c47", "Name": "Embroidered Silk Shirt", "Price": 35.0, "Category": "Clothing", "Description": "Luxurious silk shirt with hand embroidered flowers.", "FirstAvailable": "2021-06-21T00:00:00.0000000Z", "CustomerRatings": [ { "Username": "User1", "Stars": 5 }, { "Username": "User2", "Stars": 4 }, { "Username": "User3", "Stars": 5 } ] }
DateTimeBin system function
Contoso wants to know how many products were made available each week in June so far. They can use DateTimeBin, a new system function that makes aggregation queries with dates even more useful for analyzing data.
Contoso wrote the query below to achieve this:
SELECT Count(1) as NewProduts, DateTimeBin(c.FirstAvailable, 'd', 7) AS DayAvailable FROM c WHERE c.FirstAvailable > "2022-06-01T00:00:00.0000000Z" GROUP BY DateTimeBin(c.FirstAvailable, 'd', 7)
When using the DateTimeBin function, specify a UTC DateTime string in the format `YYYY-MM-DDThh:mm:ss. fffffffZ`, a unit of measurement to bin by, and optionally the size of the bins. A full list of supported parameters can be found in the documentation.
GROUP BY and DISTINCT optimizations
Contoso can also take advantage of new improvements to GROUP BY and DISTINCT that allow them to better utilize the index. Previously, queries with these operators would use the index while evaluating filters, but not for evaluating the operator itself. With these new improvements these operators can also benefit from either single or composite indices, leading to both increased performance and reduced RUs.
GROUP BY
In the query below, Contoso is using GROUP BY to get the average price of products in each category. Because there is an aggregation on Price in addition to grouping by Category, this query would benefit from a composite index on those two properties. If there were filter properties outside of the aggregation or grouping, a single index on those properties would also be evaluated.
SELECT AVG(c.Price), c.Category FROM c GROUP BY c.Category
Previous query charge: 23,885.49 RUs Previous execution time: 30.72 seconds
New query charge: 371.51 RUs New execution time: 6.35 seconds
DISTINCT
Contoso is using DISTINCT to get the name of all of their products that are over $500. Because this query can now take advantage of a single index on Name, it is much more efficient.
SELECT DISTINCT c.Name FROM c WHERE c.Price > 500
Previous query charge: 16,545.58 RUs Previous execution time: 47.38 seconds
New query charge: 1,005.8 RUs New execution time: 11.43 seconds
OFFSET LIMIT optimizations
The query engine will now be able to avoid document loading time and parsing time for documents outside of the offset range. Queries with large documents and a high offset will see proportional benefits. These improvements can be seen in queries with some combination of ORDER BY and filter clauses like the query below from Contoso which gets all products in a given price range by their FirstAvailable date. Queries with other operators such as aggregations, joins, or subqueries cannot take advantage of these improvements.
SELECT c.Name, c.Price, c.FirstAvailable FROM c WHERE c.Price > 15 AND c.Price < 50 ORDER BY c.FirstAvailable DESC OFFSET 1000 LIMIT 100
Previous query charge: 332.92 RUs Previous execution time: 8.63 seconds
New query charge: 311.2 RUs New execution time: 7.01 seconds
JOIN optimizations
Contoso lets users rate the products sold on their site from 1 – 5 stars. To make sure their product quality is high, they want to check which of their products that cost over $100 have a rating below 3 stars. To achieve this Contoso can use JOIN to create a cross product of customer ratings and products in their catalog.
SELECT c.Name, c.Price, r as Rating FROM c JOIN r IN c.CustomerRatings WHERE c.Price > 100 and r.Stars < 3
Previously, Contoso had re-written this query to use a subquery which is more efficient when there is a large array and a filter on a property from that array. With a subquery, the filter is applied before the JOIN to ignore irrelevant rows before creating the cross product instead of after the join is applied. In this case, Contoso wants to only look at customer ratings below 3 stars.
SELECT c.Name, c.Price, Rating FROM c JOIN (SELECT VALUE r FROM r IN c.CustomerRatings WHERE r.Stars < 3) AS Rating WHERE c.Price > 100
Now, with new improvements to JOIN, Contoso no longer has to explicitly write this subquery themselves. The query engine will now automatically push down the filters while evaluating the cross product for you. This means that Contoso can now write this query in its original, simpler form to achieve the same performance.
Great Improvement!! Is there any similar improvement on Gremlin API ?
Hi @Justine
We are running Cosmos DB in UK South but can't see any gained performance on our GROUP BY queries .
Is this available in all regions ?
If yes - Can you give a hint on how to optimise/index below example on ItemId
select count(d.itemId) as 'count'
from (
select distinct count(1) as num, c.itemId
from c join cc in c.images
where (
not IS_DEFINED(c.isDeleted)...
@Justine
Also can you help explain why the result differs on each query run
1st run Count = 14744
2nd run Count = 14738
3rd run Count = 14741
RU/s around 7-8K
Hi Kasper! Yes, these improvements have been deployed to all regions including UK South.
This query doesn't take advantage of the improvements to GROUP BY because after the JOIN and filter on cc.isTraining = true, c.itemId is no longer precise predicate.
You are seeing different results on various runs of the query because it can't be properly distributed across partitions, leading to incorrect results. There is a limitation on properly distributing nested subqueries with DISTINCT or...
Thanks alot Justine
We wasn’t aware of this limitation and the query helps using the index 😉
Will look into the right query behavior – thanks for the hints
This is really exciting!
I'm trying out the group by on my own database now, and I see that a simple COUNT(c) on a GROUP BY'd set of hundreds of thousands of docs is only 14 RU now, but if I add a SUM or AVG to that selection it jumps up to 7000 RUs. Can the index be used to improve on that as well?
Examples:
<code>
Hi Erik! Yes, a composite index can be used to optimize queries with an aggregation and group by. Your query above can benefit from a composite index on Stage and Amount. You can learn more about adding a composite index here: https://docs.microsoft.com/azure/cosmos-db/sql/how-to-manage-indexing-policy#composite-index