Designing a Cosmos DB database
Premier Developer consultant Julien Oudot brings us this blog with some considerations when designing a Cosmos DB database.
The intent of this article is to summarize some of the key aspects to keep in mind while designing a database using Cosmos DB.
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Sometimes referred to as a server-less database, the promise is the ability to transparently and indefinitely scale your data with high throughput, low latency and high reliability.
To achieve this goal, a few best practices need to be applied. There is no need to have advanced DBA skills to design your storage layer in Cosmos DB, but a deep understanding of the data and how it will grow is important.
Choosing the right partition
The partition key may be the most important thing to choose when creating a Cosmos DB collection. It is a boundary for the transaction capabilities and drives the elastic scale capabilities provided by Cosmos DB. Changing its format basically involves migrating the existing data from the previous partition schema to the new one, which is time consuming and might involve down time.
The partition key has to be chosen when creating the collection and this is a mandatory field for multi-partition collections. When choosing the partition key, a few things have to be considered:
- Read should minimize cross-partition look ups
- Write should be evenly distributed across different partition key values
More generally, for better performance, the partition key should be a known field when querying the database.
Single Collection versus Multi-Collections
In terms of collection granularity, three strategies are usually observed:
- One document type per collection
- One tenant per collection. When Cosmos DB users need to store data for different customers, there would be one collection per customer.
- All document types and all tenants in a single collection
The first option allows faster querying of a single document type because users know which collection they need to target depending on the query. It also gives the ability to better isolate the throughput consumption per type of queries and to do more fine-grained throughput provisioning. Because server-side functions (stored procedure and trigger) are tied to a collection, this option prevents users from running cross document type transactions, which is probably something needed in the case of a hierarchical document structure (see next section for more details about hierarchical documents).
The second option would be ideal to be able to bill throughput consumption to the tenants hosted on the platform. However, depending on the domain, it might not be relevant to assign some of the document types to a single tenant.
Finally, the third option is to have all tenants and all document types in a single collection. In this case, cross document type transactions are easy and there is a single stored procedure, trigger and user defined function repository, while they would need to be duplicated in the different collections for the previous options. One down side of this approach is the isolation between queries and tenants which is not as good as in the previous solutions. It is also more challenging to monitor the resource consumption per tenant.
Document types – flat structure versus hierarchical
When working in Cosmos DB, making sure that the document size is bounded is important. Indeed, every write request will be performed on the entire document, so its size has to be controlled for better performance.
When defining the document structure, there are two opposing strategies:
- A flat structure with only 1 level of document containing all sublevels. Also known as data modeling with denormalization This approach is good for:
- A hierarchical structure where top level documents reference lower level documents. Also known as data modeling with normalization This approach is good for:
A flat structure is more aligned with the NoSQL approach and allows to read or write a single document per query. However, because of the denormalization it involves, the quantity of data in such structure can become problematic. Due to the data duplication, there are also some challenges in terms of consistency. In practice, a mix of the two approaches is often chosen. For each document type, it is recommended to estimate:
- How big the document can be?
- For array types – how many items will there be?
- What type of query will need to be performed on this type of document? What does it mean from a performance and reliability standpoint (multiple partition access, transactions, …)?
Here is an example of how the hybrid approach would look:
Here, the Author contains a list of book ids represented as independent documents. In the same way, the book documents reference a list of authors. On the contrary, images are embedded inside the author document.
Trigger versus stored procedure
Along with automatic indexing, another great feature of Cosmos DB is the ability to define server-side functions called triggers and stored procedures. Although they have the same name as in the SQL world, triggers are slightly different since they cannot be enforced at the database level. The name of the invoked trigger has to be used when working with documents.
This is actually the same when invoking a stored procedure except that users have to provide a URI or link to the procedure.
In essence, triggers and stored procedures are close concepts in Cosmos DB. With triggers, application code is usually simpler while stored procedures require arguments to be prepared prior to the call, which can also give more flexibility.
Real world use case for trigger
Recently, some customers reached out to us with a simple constraint. They wanted to be able to detect when a child document type (meaning that it depended on a parent document) was created pointing to a parent that did not exist. This was a perfect use case for pre-trigger because they already provided all parameters necessary to perform this check on the server side.
The trigger logic would query the existing database looking for other documents with the same parent id (this field being the partition key for the parent, it should be a fast query). If at least one document is found, then the document can be created. Otherwise an exception would be thrown from the trigger code, which would abort the operation.
Users would need to explicitly set the trigger every time they want to use it (as shown below).
The few design choices described in this article can drastically change the performance and scalability of the database. Keeping this guidance in mind will help you leverage the power of Cosmos DB to build low latency and massively scalable applications anywhere in the world.
as far as I know, there is no support for cross-document join in cosmos DB, so in your author/books example, how can I query the book titles of those authors with countOfBook > 3?