New string function performance improvements and case-insensitive search

Avatar

Tim

Querying in Azure Cosmos DB just got even better! You now have an option for case-insensitive queries with the following string search system functions:

  • Contains
  • EndsWith
  • StartsWith
  • StringEquals

Additionally, both Contains and EndsWith also have significant performance improvements. This update was rolled out to Azure Cosmos DB core (SQL) API accounts in our most recent service update. Customers using Azure Cosmos DB’s API for MongoDB accounts can leverage the case-insensitive support and performance improvements through the $regex query evaluation operator.

Testing performance

Let’s test the performance improvements by running some queries on a sample dataset with 8.5 million documents. I’ve uploaded the dataset to a Cosmos container with 30,000 provisioned RU’s.

The dataset was generated using Bogus and contains a unique id identifier, a name, an address, a company, and a job:

{    
"id": "ym570004cj",    
"name": "Evans Kilback",    
"address": "4561 Abernathy Station, Hahnmouth, Ecuador",
"company": "Terry, Lind and Bogan",
"job": "Administrator" 
}

Contains

Let’s first do a search for all the people that live on a street named “Brooks Street”. We’ll run the following query to check if the address property contains the word “Brooks Street”:

SELECT * FROM c 
WHERE CONTAINS(c.address, "Brooks Street")

The query returns 17 results with the following RU charges:

Original RU charge: 221,566.68 RUs

New RU charge: 224.99 RUs

The performance improvement for Contains, gave this query over a 99% decrease in RU charge! Aside from indexing the property in the Contains system function, there are no other changes you need to make to see these improvements.

 

EndsWith

Let’s look at an example with EndsWith. Here’s a query that finds all the people that have a name that ends in “Lee”:

SELECT * FROM c 
WHERE EndsWith (c.name, "Lee")

This query returns 0 results with the following RU charges:

Original RU charge: 198,649.32 RUs

New RU charge: 122.76 RUs

 

StartsWith

Additionally, you can take advantage of a new parameter in each of these system functions to get case-insensitive support. This parameter is optional and defaults to false when unspecified. The RU charge for Contains and EndsWith is the same regardless of whether they are case-insensitive.

Here’s an example with StartsWith to find the TOP 100 job titles that start with “developer” whether the case matches or not:

SELECT TOP 100 * FROM c
WHERE StartsWith(c.job, "developer", true)

RU charge: 38.91 RUs

 

StringEquals

If you wanted to check for a full match, you could use StringEquals:

SELECT TOP 100 * FROM c
WHERE StringEquals(c.job, "developer", true)

RU charge: 38.89 RUs

The RU charge for StartsWith and StringEquals is slightly higher with the case-insensitive option than without it. The case-insensitive option is available in the latest version of all Azure Cosmos DB SDK’s. For the .NET SDK, this is version 3.10 or later.

In general, Contains and EndsWith will consume more RUs than StartsWith or StringEquals. The RU charge of Contains and EndsWith will increase as the cardinality of the property in the system function increases. Learn more about Contains and EndsWith index utilization.

Next steps

We hope you try out these new query features! Here’s a query lab with some sample data to get started. Beyond adding an index for properties in the system functions, there is nothing else you need to do to leverage these significant improvements! If you are interested in additional text search functionality, you can leverage Azure Cosmos DB’s integration with Azure Cognitive Search.

 

If you have existing containers with these system functions, check out your Request Unit (RU) consumption. With these recent optimizations, your total RU consumption may have decreased and you may be able to decrease the amount of provisioned throughput on these containers.

4 comments

Leave a comment

      • Wah Yuen
        Wah Yuen

        I think where the frustration lies is that developers are often very eager to use these new features, however, they seem to come out without support or even a roadmap of when SDKs are going to support these new features. While yes, you say these features are not tied to an SDK, one would reasonable assume that developers are already using an SDK for their platform and would want to continue to do so. As it stands, developers either need to perform workarounds (effectively write raw SQL syntax, and even this had a bug in the .Net SDK) or wait for feature parity to be reached (LINQ support) without really ever knowing when this might arrive. While I can’t imagine this post getting much visibility internally, the utopian hope would be that there is better alignment between the backend and SDK teams to release these features in parallel.

        • Mark Brown
          Mark BrownMicrosoft logo

          I should correct the record here in that these features are indeed tied to SDK capabilities. And you have a valid point in that we need to do a better job providing support across our SDK’s when we release new features. As a team we are endeavoring to do just this to ensure that we have a unified surface area for customers across released features and that we keep it that way going forward.