{"id":7320,"date":"2023-12-19T07:00:06","date_gmt":"2023-12-19T15:00:06","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cosmosdb\/?p=7320"},"modified":"2024-01-12T21:42:18","modified_gmt":"2024-01-13T05:42:18","slug":"scaling-iot-time-series-metering-workloads-with-azure-cosmos-db-for-postgresql","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/scaling-iot-time-series-metering-workloads-with-azure-cosmos-db-for-postgresql\/","title":{"rendered":"Scaling IoT\/Time Series\/Metering Workloads with Azure Cosmos DB for PostgreSQL"},"content":{"rendered":"<h3>Introduction:<\/h3>\n<p>For many organizations around the world, the capture of consumption data is a common yet critical requirement. Some of these organizations struggle to manage the sheer volumes of data that they must capture and process. No wonder! We conservatively estimate that there would be petabytes of such data globally, and it is likely growing at a rapid rate.<\/p>\n<p>Today, many of the systems tasked with handling this data were designed decades ago, and many are likely hosted \u2018on-premises\u2019. We have seen, firsthand, that such systems can be incredibly difficult to scale, costly to operate and complex to maintain. This all has the potential to simultaneously increase business risk and limit business growth.<\/p>\n<p>Two of the most common industries where we see large scale consumption data workloads are;<\/p>\n<ul>\n<li>Telecommunications : Voice and Data usage<\/li>\n<li>Utilities : Energy or resource consumption such as electricity, gas or water<\/li>\n<\/ul>\n<p>In these industries, once the data has been captured it is often used for a variety of business purposes including;<\/p>\n<ol>\n<li>Revenue Collection or Billing<\/li>\n<li>Analysis e.g. Seasonal Trends, Demand Forecasting, Peak periods etc.<\/li>\n<li>Reporting e.g. 
Ad-hoc Customer Enquiries, Supporting dispute resolution etc.<\/li>\n<li>Auditing and compliance<\/li>\n<\/ol>\n<p>Downstream of this data there is typically a billing system of some description that transforms it into customer invoices or statements, and quite often an analytics system alongside it.<\/p>\n<p>While the subject of this blog is specific to the Energy\/Electricity industry, the technology we will showcase is equally capable of handling workloads across many industries, including Healthcare, Manufacturing, Monitoring, FSI and others.<\/p>\n<p>If you want to see how we helped a number of Energy industry companies modernize their data platforms to provide performance, scalability and simplified data management for these ever-growing datasets, then please read on.<\/p>\n<h3>What is Azure Cosmos DB for PostgreSQL?<\/h3>\n<p>Even today we often hear \u201crelational databases can\u2019t scale\u201d and \u201cyou can\u2019t run OLTP and Analytics on the same database\u201d. While those statements may have come from an era where CPUs were few and storage was painfully slow, today, with the Azure Cosmos DB for PostgreSQL service, they no longer hold true.<\/p>\n<p>Azure Cosmos DB for PostgreSQL is a fully managed Azure Database as a Service (DBaaS) where you leave the heavy lifting to us and focus on Applications and Data. 
More details are in the following link <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/introduction\">Introduction\/Overview &#8211; Azure Cosmos DB for PostgreSQL<\/a>, but here is a summary;<\/p>\n<ul>\n<li>Is built with Open Source PostgreSQL and is powered by the Open Source Citus extension<\/li>\n<li>Scales to 1000s of vCores and Petabytes of data for a single PostgreSQL \u2018cluster\u2019 that can handle the most demanding of workloads<\/li>\n<li>Is ideal for a number of use cases such as high throughput transactional, real-time analytics, multi-tenant SaaS and others<\/li>\n<\/ul>\n<p>We have seen some great articles showcasing the power of this offering when it comes to large scale data requirements such as Petabyte scale analytics and IoT. In particular, these two blogs explain how we <a href=\"https:\/\/techcommunity.microsoft.com\/t5\/azure-database-for-postgresql\/architecting-petabyte-scale-analytics-by-scaling-out-postgres-on\/ba-p\/969685\">Architected petabyte-scale analytics<\/a> internally at Microsoft;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7321\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-3-of-blog-post-titled.png\" alt=\"Figure 1 : Architecting petabyte-scale analytics by scaling out Postgres on Azure with the Citus extension \" width=\"998\" height=\"488\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-3-of-blog-post-titled.png 998w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-3-of-blog-post-titled-300x147.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-3-of-blog-post-titled-768x376.png 768w\" sizes=\"(max-width: 998px) 100vw, 998px\" \/><\/p>\n<p>and how building IoT applications with Azure Cosmos DB for PostgreSQL, formerly Hyperscale (Citus), 
<a href=\"https:\/\/techcommunity.microsoft.com\/t5\/azure-database-for-postgresql\/building-iot-apps-using-postgres-citus-and-azure\/ba-p\/3501175#query\">Building IoT apps using Postgres, Citus, and Azure<\/a> is a popular choice given the data involved and the typical volumes;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7322\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-2-captioned-figure-1-a-reference.png\" alt=\"Figure 2: A reference architecture to build scalable IoT apps using Postgres, Citus and Azure. \" width=\"836\" height=\"500\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-2-captioned-figure-1-a-reference.png 836w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-2-captioned-figure-1-a-reference-300x179.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/thumbnail-image-2-captioned-figure-1-a-reference-768x459.png 768w\" sizes=\"(max-width: 836px) 100vw, 836px\" \/><\/p>\n<p>In terms of benchmarks, Azure Cosmos DB for PostgreSQL is capable of some very impressive performance. We have seen that it also costs significantly less than other distributed PostgreSQL offerings. 
The full details are here <a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/distributed-postgresql-benchmarks-using-hammerdb-by-gigaom\/\">Distributed PostgreSQL benchmarks using HammerDB, by GigaOM &#8211; Azure Cosmos DB Blog (microsoft.com)<\/a>, but to summarize, it was able to perform many times better than the competition at a fraction of the cost;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7323\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-graph-description-automatically.png\" alt=\"Figure 3 : New Order Per Minute Graph\" width=\"937\" height=\"576\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-graph-description-automatically.png 937w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-graph-description-automatically-300x184.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-graph-description-automatically-768x472.png 768w\" sizes=\"(max-width: 937px) 100vw, 937px\" \/><\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7324\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-graph-of-a-graph-description-automatically-gene.png\" alt=\"Figure 4 : Cost Comparison Graph\" width=\"951\" height=\"585\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-graph-of-a-graph-description-automatically-gene.png 951w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-graph-of-a-graph-description-automatically-gene-300x185.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-graph-of-a-graph-description-automatically-gene-768x472.png 768w\" sizes=\"(max-width: 951px) 100vw, 951px\" \/><\/p>\n<p>Hopefully you are hanging in 
there, and you can see that we have a service that is capable of amazing performance, scale and cost effectiveness.<\/p>\n<h3>Background:<\/h3>\n<p>Billing data is considered business critical by most organisations, as it is the final step in revenue realization. It makes sense that the data that was used to generate the billing information should also be categorized as such. This is where the concept of \u2018Single Source of Truth\u2019 is often heard, and this data is typically stored in a transactional database given its criticality.<\/p>\n<p>In the Australian Electricity Market, metering data is mainly distributed as CSV files and the format of those files is specified by the Australian Energy Market Operator (AEMO). The specification is called \u2018NEM12\u2019 and is a structured format. Here is a snippet of such a file. Notice that the \u2018300\u2019 lines (\u2018Interval Data Record\u2019) are the readings for a given day;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7325\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica.png\" alt=\"Figure 5 : NEM 12 File\" width=\"1235\" height=\"675\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica.png 1582w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica-300x164.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica-1024x560.png 1024w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica-768x420.png 768w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-screenshot-of-a-computer-description-automatica-1536x840.png 
1536w\" sizes=\"(max-width: 1235px) 100vw, 1235px\" \/><\/p>\n<p>The above screenshot was taken from this document <a href=\"https:\/\/www.yurika.com.au\/wp-content\/uploads\/2022\/01\/NEM12-Fact-Sheet.pdf\">NEM12 Fact Sheet (yurika.com.au)<\/a>. For a more detailed understanding of the file format, you can check out this document <a href=\"https:\/\/www.aemo.com.au\/-\/media\/Files\/Electricity\/NEM\/Retail_and_Metering\/Metering-Procedures\/2018\/MDFF-Specification-NEM12--NEM13-v106.pdf\">Meter data file format specification nem12 &amp; nem13 (aemo.com.au)<\/a>.<\/p>\n<p>What typically happens is that this data is read from these CSV files, often by some sort of scripting language such as Python, and uploaded to a database for storage and further processing. Many organisations rely on this metering data daily. For example, retailers, distributors and generators often store this data in a relational database, often MS SQL or Oracle. This \u2018metering\u2019 database is typically seen as the \u2018Single Source of Truth\u2019 for this data and as such it often underpins most billing systems and downstream revenue processes.<\/p>\n<p>This type of metering data is not strictly IoT or Time Series, but almost a hybrid of the two. As we will see throughout this article, Azure Cosmos DB for PostgreSQL has some pretty compelling features that make operations and management of this data very straightforward.<\/p>\n<p>One of the key requirements for metering data in this context is handling data that is \u2018restated\u2019, that is, updated over time. For example, it is possible that an attempt to read a meter fails for any one of a variety of reasons, such as a communication error. In this situation it is common that \u2018Substitute\u2019 data is provided instead. After some time, the meter will eventually be read and the \u2018Actual\u2019 data will be supplied. 
There are two likely options to handle any data updates;<\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li>Modify the records in the database for the updated data<\/li>\n<li>Insert the updated records and handle the \u201cduplicates\u201d in the application or through database management tasks such as stored procedures<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>Typical requirements for such metering systems are;<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>Scalable ingestion<\/li>\n<li>Low latency ad-hoc queries<\/li>\n<li>Analytics such as aggregations<\/li>\n<li>Individual meter reporting<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Azure Cosmos DB for PostgreSQL to the Rescue:<\/h3>\n<p>Back to our \u2018NEM12\u2019 data. I have created the following standard PostgreSQL table as an example of one way to store it;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">create table meter_data (<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">nmi varchar(10) not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">uom varchar(5) not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">interval_date date not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">interval_val numeric[],<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">quality varchar(3) not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">nmi_num bigint not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">update_datetime timestamp not null,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span 
style=\"font-family: 'courier new', courier, monospace;\">primary key (nmi, interval_date, quality, update_datetime)<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">) partition by range (interval_date);<\/span><\/p>\n<p>Table details;<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>\u2018nmi\u2019 is the meter device ID.<\/li>\n<li>\u2018interval_val\u2019 is a native PostgreSQL numeric array type, which caters for any number of values such as 30 minute intervals (48 values) or 5 minute intervals (288 values) without the need to modify the table structure to cater for a changing number of intervals.<\/li>\n<li>The primary key includes the \u2018update_datetime\u2019 column to support \u2018restated\u2019 data for an already existing \u2018nmi\u2019 and \u2018interval_date\u2019.<\/li>\n<li>PostgreSQL native partitioning has been used to partition the underlying table by \u2018interval_date\u2019 on a monthly basis.<\/li>\n<li>It is also an option to store the interval values as individual rows, but that would lead to significantly more rows in your database for no real gain and perhaps an increased management overhead.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>With this standard PostgreSQL table, we can also make use of the Azure Cosmos DB for PostgreSQL User Defined Function (UDF) to simplify partition creation. 
Here is an example of a call to the UDF \u2018create_time_partitions\u2019 <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/postgresql\/reference-functions#create_time_partitions\">https:\/\/learn.microsoft.com\/azure\/cosmos-db\/postgresql\/reference-functions#create_time_partitions<\/a> that creates monthly partitions on the meter_data table from 2010 to the end of 2023;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">SELECT create_time_partitions(table_name:= &#8216;meter_data&#8217;,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">partition_interval:= &#8216;1 month&#8217;,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">end_at:= &#8216;2023-12-31&#8217;,<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">start_from:= &#8216;2010-01-01&#8217;);<\/span><\/p>\n<p>In addition, the open source \u2018Citus\u2019 extension provides PostgreSQL with the super-power of distributed tables. It can be used to further split, or \u2018shard\u2019, your data across multiple compute nodes. 
For example, with a 4 node cluster, if you run this command to \u2018shard\u2019 your meter_data table by the \u2018nmi\u2019 column;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">select create_distributed_table(&#8216;meter_data&#8217;, &#8216;nmi&#8217;);<\/span><\/p>\n<p>You would end up with something like this;<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/shardpart.png\"><img decoding=\"async\" class=\"alignnone wp-image-7330 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/shardpart.png\" alt=\"Figure 6 : Examples of sharded and partitioned table\" width=\"871\" height=\"553\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/shardpart.png 871w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/shardpart-300x190.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/shardpart-768x488.png 768w\" sizes=\"(max-width: 871px) 100vw, 871px\" \/><\/a><\/p>\n<p>Not only have we partitioned by \u2018interval_date\u2019, we have also \u2018sharded\u2019 by \u2018nmi\u2019, and our meter_data table is now spread across 4 nodes. Each of these nodes has its own compute and storage and can operate independently of the other nodes in the cluster. Here is a link to a great article detailing the similarities and differences between partitioning and sharding in PostgreSQL: <a href=\"https:\/\/techcommunity.microsoft.com\/t5\/azure-database-for-postgresql\/understanding-partitioning-and-sharding-in-postgres-and-citus\/ba-p\/3891629\">Understanding partitioning and sharding in Postgres and Citus &#8211; Microsoft Community Hub<\/a><\/p>\n<p>It is possible to have 100s of nodes in a cluster, allowing for truly massive scale to handle the largest of relational datasets and workloads. 
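<\/p>\n<p>As a quick way to see the result (a sketch, assuming the example \u2018meter_data\u2019 table above), the \u2018citus_shards\u2019 metadata view shows which worker node each shard landed on and how big it is;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">select table_name, shardid, nodename, pg_size_pretty(shard_size)<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">from citus_shards<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">where table_name::text like 'meter_data%'<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">order by shardid;<\/span><\/p>\n<p>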
Amazingly, your application does not even have to know that your table has been sharded across multiple nodes. You still use standard PostgreSQL SQL commands, and if they filter on the shard column, \u2018nmi\u2019 in this example, then they will be directed to the nodes that hold the required data. Once a node receives the query, it still has all the power of partitioning at its disposal. All you have to do is run the \u2018create_distributed_table\u2019 UDF and the sharding is applied automatically as shown in this illustration;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7327\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7.png\" alt=\"Figure 7 : Shard Creation\" width=\"1008\" height=\"567\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7.png 2500w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7-300x169.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7-1024x576.png 1024w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7-768x432.png 768w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7-1536x864.png 1536w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-7-2048x1152.png 2048w\" sizes=\"(max-width: 1008px) 100vw, 1008px\" \/><\/p>\n<p>The co-ordinator node of the cluster stores metadata that identifies which distribution column ranges are stored on which underlying worker node. 
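<\/p>\n<p>For example (a sketch, using a hypothetical \u2018nmi\u2019 value), running EXPLAIN on a query that filters on the distribution column should show a single Citus task routed to just one worker node, while queries without an \u2018nmi\u2019 filter are instead parallelised across all nodes;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">explain select interval_date, interval_val<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">from meter_data<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">where nmi = '6123456789';<\/span><\/p>\n<p>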
As you can see, the transparent sharding and partitioning of tables across multiple cluster nodes sets your database up to take advantage of parallelism.<\/p>\n<h3>Ingestion Scalability:<\/h3>\n<p>To demonstrate the scalability of the service, we created and ingested a small dataset into the meter_data table. We have 47GB of \u2018NEM12\u2019-like CSV files that consist of 10,000 unique NMI with 10 years of daily data (5 minute intervals with 288 interval values per day). The resultant PostgreSQL table size is 46GB and contains 36.52 million rows (10,000 * 3652).<\/p>\n<p>What we are about to illustrate here is an example of one approach where files are ingested into the database in a batch fashion; however, there are many alternate approaches that could be considered, such as <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/howto-ingest-azure-stream-analytics\">Real-time data ingestion with Azure Stream Analytics &#8211; Azure Cosmos DB for PostgreSQL | Microsoft Learn<\/a>.<\/p>\n<p>For the tests below, the psql command was run from an Azure VM in the same Australia East region as the database server itself. Specs for the VM were as follows;<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>Standard D16as v5<\/li>\n<li>16 vcpus<\/li>\n<li>64GiB RAM<\/li>\n<li>1TB SSD with max 5000 IOPS<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Data was loaded in a multi-threaded fashion using the Linux command line and the PostgreSQL \\COPY instruction (<a href=\"https:\/\/www.postgresql.org\/docs\/current\/sql-copy.html\">PostgreSQL: Documentation: 16: COPY<\/a>) via the standard PostgreSQL \u2018psql\u2019 utility;<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">find . 
-type f | time xargs -n1 -P32 sh -c &quot;psql -U citus -c \\&quot;\\\\copy demo.meter_data from '\\$0' with csv\\&quot;&quot;;<\/span><\/p>\n<p>The following table shows 4 tests, each with a different cluster configuration, and the time taken to load the dataset;<\/p>\n<table>\n<tbody>\n<tr>\n<td style=\"width: 32.3438px;\">Test<\/td>\n<td style=\"width: 150.609px;\">Co-Ordinator vCores<\/td>\n<td style=\"width: 118.312px;\"># Worker Nodes<\/td>\n<td style=\"width: 152.094px;\">Worker Node vCores<\/td>\n<td style=\"width: 157.453px;\">Worker Node Storage<\/td>\n<td style=\"width: 147.047px;\">Total Cluster vCores<\/td>\n<td style=\"width: 61.7031px;\">Threads<\/td>\n<td style=\"width: 134.609px;\">Load Time MM:SS<\/td>\n<td style=\"width: 37.7031px;\">TPS<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 32.3438px;\">1<\/td>\n<td style=\"width: 150.609px;\">32<\/td>\n<td style=\"width: 118.312px;\">4<\/td>\n<td style=\"width: 152.094px;\">4<\/td>\n<td style=\"width: 157.453px;\">2TB<\/td>\n<td style=\"width: 147.047px;\">48<\/td>\n<td style=\"width: 61.7031px;\">32<\/td>\n<td style=\"width: 134.609px;\">5:52<\/td>\n<td style=\"width: 37.7031px;\">103k<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 32.3438px;\">2<\/td>\n<td style=\"width: 150.609px;\">96<\/td>\n<td style=\"width: 118.312px;\">8<\/td>\n<td style=\"width: 152.094px;\">4<\/td>\n<td style=\"width: 157.453px;\">16TB<\/td>\n<td style=\"width: 147.047px;\">128<\/td>\n<td style=\"width: 61.7031px;\">96<\/td>\n<td style=\"width: 134.609px;\">2:21<\/td>\n<td style=\"width: 37.7031px;\">259k<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 32.3438px;\">3<\/td>\n<td style=\"width: 150.609px;\">64<\/td>\n<td style=\"width: 118.312px;\">8<\/td>\n<td style=\"width: 152.094px;\">16<\/td>\n<td style=\"width: 157.453px;\">16TB<\/td>\n<td style=\"width: 147.047px;\">192<\/td>\n<td style=\"width: 61.7031px;\">64<\/td>\n<td style=\"width: 134.609px;\">1:35<\/td>\n<td style=\"width: 37.7031px;\">384k<\/td>\n<\/tr>\n<tr>\n<td 
style=\"width: 32.3438px;\">4<\/td>\n<td style=\"width: 150.609px;\">96<\/td>\n<td style=\"width: 118.312px;\">8<\/td>\n<td style=\"width: 152.094px;\">8<\/td>\n<td style=\"width: 157.453px;\">16TB<\/td>\n<td style=\"width: 147.047px;\">160<\/td>\n<td style=\"width: 61.7031px;\">96<\/td>\n<td style=\"width: 134.609px;\">1:21<\/td>\n<td style=\"width: 37.7031px;\">450k<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>The fastest was test #4, which is interesting as it didn\u2019t have the most total cluster compute at 160 vCores. Test #3 had 192 total vCores but was beaten by the more appropriately configured test #4, which had more co-ordinator compute; in these ingest heavy scenarios, co-ordinator compute resources can sometimes affect ingestion throughput more than worker node compute resources.<\/p>\n<p>The key things to highlight with the tests are;<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>16 vcpus\/vcores on the client could be a bottleneck as the thread count increases<\/li>\n<li>From 100k TPS for a relatively small cluster up to 450k TPS for a still moderately sized cluster is very impressive for a relational database, rivalling some NoSQL databases, and with the beauty of ANSI SQL support, ACID compliance, referential integrity, primary keys and joins!<\/li>\n<li>We used an 8 node cluster in most of our tests; bear in mind that 20 nodes are available through the portal and it is possible to request even more than 20 nodes by raising a support request.<\/li>\n<li>There are many ways to configure a cluster for optimal performance, whether your workload is write heavy, read heavy or perhaps somewhere in the middle with an even split between reads and writes. You are free to choose any permutation to cater for IO or CPU intensive workloads.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>As a side note, what we saw in our test above was that the co-ordinator node could be a bottleneck with an ingest heavy workload. 
To alleviate this, it is possible to use a special feature that spreads an ingest heavy workload directly across all worker nodes, bypassing any co-ordinator bottleneck. Every node in the cluster has an up-to-date copy of the cluster metadata, and by extension, knows what data lives on which worker node. The details are available here <a href=\"https:\/\/www.citusdata.com\/blog\/2022\/06\/17\/citus-11-goes-fully-open-source\/#any-node\">Citus 11 for Postgres goes fully open source, with query from any node (citusdata.com)<\/a>, but hopefully the following picture \u201cpaints a thousand words\u201d;<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7328\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-diagram-of-a-workflow-description-automatically.png\" alt=\"Figure 8 : DML From any Node\" width=\"1063\" height=\"564\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-diagram-of-a-workflow-description-automatically.png 1063w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-diagram-of-a-workflow-description-automatically-300x159.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-diagram-of-a-workflow-description-automatically-1024x543.png 1024w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/a-diagram-of-a-workflow-description-automatically-768x407.png 768w\" sizes=\"(max-width: 1063px) 100vw, 1063px\" \/><\/p>\n<p>Note that COPY and all DML statements are supported, including INSERT, UPDATE, DELETE and of course SELECT. 
This makes Azure Cosmos DB for PostgreSQL a scalable solution not just for reads but <em>also<\/em> for writes, unlike many cloud services where you can typically only scale your database for reads.<\/p>\n<h3>Query Performance:<\/h3>\n<p>Now that we have demonstrated the ability to scale ingestion, we will move on to querying the data. For the purposes of this test, I have used a dataset with 2.5x more data. This larger dataset consists of 25,000 NMI with 10 years of daily data. The table size is now 116GB and contains ~91 million rows (25,000 * 3652).<\/p>\n<p>We will run four queries as part of these tests.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Query<\/td>\n<td>Description<\/td>\n<td>Notes<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>Aggregate of the entire table using \u2018count\u2019<\/td>\n<td>Scans the entire table<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>Aggregate of the entire table using \u2018sum\u2019<\/td>\n<td>Scans the entire table<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>Aggregate 2 columns for 10 random NMI<\/td>\n<td>Using random to avoid caching<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Monthly rollup using latest interval_date data<\/td>\n<td>Manages \u2018duplicate\u2019 data for \u2018restated\u2019 intervals<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here are the actual queries used;<\/p>\n<p>Query 1<\/p>\n<p style=\"padding-left: 40px;\">select count(1) from meter_data<\/p>\n<p>Query 2<\/p>\n<p style=\"padding-left: 40px;\">select sum(nmi_num) from meter_data<\/p>\n<p>Query 3<\/p>\n<p style=\"padding-left: 40px;\">select nmi, date_trunc(&#8216;month&#8217;, interval_date) as ym, sum(nmi_num), avg(nmi_num)<\/p>\n<p style=\"padding-left: 40px;\">from meter_data<\/p>\n<p style=\"padding-left: 40px;\">where nmi in (select nmi from meter_data order by random() limit 10)<\/p>\n<p style=\"padding-left: 40px;\">group by nmi, ym<\/p>\n<p style=\"padding-left: 40px;\">order by nmi, ym<\/p>\n<p>Query 4<\/p>\n<p style=\"padding-left: 40px;\">select nmi,<\/p>\n<p 
style=\"padding-left: 40px;\">date_trunc(&#8216;month&#8217;, interval_date) AS year_month,<\/p>\n<p style=\"padding-left: 40px;\">sum(single_val) AS aggregated_value,<\/p>\n<p style=\"padding-left: 40px;\">current_timestamp AS update_datetime<\/p>\n<p style=\"padding-left: 40px;\">from (<\/p>\n<p style=\"padding-left: 40px;\">select distinct on (nmi, interval_date)<\/p>\n<p style=\"padding-left: 40px;\">nmi,<\/p>\n<p style=\"padding-left: 40px;\">interval_date,<\/p>\n<p style=\"padding-left: 40px;\">unnest(interval_val) AS single_val<\/p>\n<p style=\"padding-left: 40px;\">from<\/p>\n<p style=\"padding-left: 40px;\">meter_data<\/p>\n<p style=\"padding-left: 40px;\">where<\/p>\n<p style=\"padding-left: 40px;\">nmi in (select distinct nmi from meter_data)<\/p>\n<p style=\"padding-left: 40px;\">and<\/p>\n<p style=\"padding-left: 40px;\">interval_date between &#8216;2023-09-01&#8217; and &#8216;2023-09-30&#8217;<\/p>\n<p style=\"padding-left: 40px;\">order by<\/p>\n<p style=\"padding-left: 40px;\">nmi, interval_date DESC, update_datetime DESC<\/p>\n<p style=\"padding-left: 40px;\">) subquery<\/p>\n<p style=\"padding-left: 40px;\">group by<\/p>\n<p style=\"padding-left: 40px;\">nmi,<\/p>\n<p style=\"padding-left: 40px;\">year_month<\/p>\n<p style=\"padding-left: 40px;\">order by<\/p>\n<p style=\"padding-left: 40px;\">nmi,<\/p>\n<p style=\"padding-left: 40px;\">year_month<\/p>\n<p>For the tests, we\u2019ll use 2 clusters or configurations and run the queries from pgadmin running on a laptop;<\/p>\n<table>\n<tbody>\n<tr>\n<td>Config<\/td>\n<td>Nodes<\/td>\n<td>vCores<\/td>\n<td>RAM GB<\/td>\n<td>Effective DB Cache<\/td>\n<td>Storage TB<\/td>\n<td>Storage Max IOPS<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>1<\/td>\n<td>8<\/td>\n<td>32<\/td>\n<td>8GB<\/td>\n<td>2<\/td>\n<td>7500<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>1 + 4<\/td>\n<td>32+32<\/td>\n<td>384<\/td>\n<td>96GB<\/td>\n<td>4<\/td>\n<td>15000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>This will allow us to 
test two scenarios: one where the data set is much larger than the cache, and one where the cache is almost the same size as the table. For databases, it is a well-known fact that for read workloads, more cache generally results in better performance, simply because RAM is faster than disk.<\/p>\n<p>With config #1, we\u2019re making use of the single node Azure Cosmos DB for PostgreSQL configuration, where the single node acts as both a coordinator and worker. This configuration is great for testing and non-production environments. Details here: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/resources-compute#single-node-cluster\">https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/resources-compute#single-node-cluster<\/a><\/p>\n<p>At most, 8GB of table data will be cached in RAM, so when we run queries over this larger dataset it will be forced to perform IO to retrieve un-cached data. This is more representative of a real-world scenario where database sizes typically far exceed available system RAM.<\/p>\n<p>With config #2, we have a relatively small 5 node cluster with double the IOPS. Due to the RAM increase, up to 96GB of RAM can be used to cache table data.<\/p>\n<h3>Heap and Columnar Storage:<\/h3>\n<p>One of the nifty features of Azure Cosmos DB for PostgreSQL is the ability to store tables\/partitions in a columnar format. The details are here: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/concepts-columnar\">Columnar table storage &#8211; Azure Cosmos DB for PostgreSQL<\/a>. 
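<\/p>\n<p>As a rough sketch (the table and column names below are hypothetical, not from this article), a table can be created directly with the columnar access method, or an existing heap table can be converted with the \u2018alter_table_set_access_method\u2019 UDF:<\/p>

```sql
-- Create a new table stored in columnar format from the start.
-- Table and column names are illustrative only.
CREATE TABLE sample_readings (
    device_id    bigint,
    reading_date date,
    reading_val  numeric
) USING columnar;

-- Convert an existing heap table to columnar storage in one step.
SELECT alter_table_set_access_method('sample_readings_heap', 'columnar');
```

<p>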
It is also trivial to manage, as we have included UDFs that do exactly that.<\/p>\n<p>This provides 3 main benefits:<\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li>Table sizes can be significantly reduced where data is compressible<\/li>\n<li>IO can be significantly reduced as a result of the reduction in storage for tables<\/li>\n<li>Aggregate queries can see significant performance improvements where only a subset of table columns are queried<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>For the tests, I have created 2 tables. The first, \u2018meter_data\u2019, is stored using regular PostgreSQL \u2018Heap\u2019 based storage, that is, a row-based format. The second table, \u2018col_meter_data\u2019, contains exactly the same data as the first but stores it in \u2018Columnar\u2019 format. The included UDFs make this simple to manage, for example by converting specific time partitions from \u2018Heap\u2019 to \u2018Columnar\u2019:<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">call alter_old_partitions_set_access_method(<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">'col_meter_data',<\/span><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">'2023-01-01', 'columnar');<\/span><\/p>\n<p>When I ran this command on the \u2018col_meter_data\u2019 table, it converted all partitions older than the specified date \u20182023-01-01\u2019 to be stored as \u2018columnar\u2019. The more recent partitions are left as \u2018heap\u2019 and can be operated on as normal. Once partitions have been converted to \u2018columnar\u2019 there are a few <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/concepts-columnar#limitations\">limitations<\/a>, such as being append-only. 
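<\/p>\n<p>To verify which partitions ended up as \u2018heap\u2019 and which as \u2018columnar\u2019, the standard PostgreSQL catalogs can be queried. This is a sketch against the table used in this article:<\/p>

```sql
-- List each partition of 'col_meter_data' with its access method and size.
SELECT c.relname                                      AS partition_name,
       am.amname                                      AS access_method,
       pg_size_pretty(pg_total_relation_size(c.oid))  AS total_size
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_am    am ON am.oid = c.relam
WHERE i.inhparent = 'col_meter_data'::regclass
ORDER BY c.relname;
```

<p>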
Here is a snippet of the output from running the UDF as detailed above:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-7329\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-9.png\" alt=\"Figure 9 : Converting Partitions to Columnar\" width=\"1318\" height=\"537\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-9.png 1318w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-9-300x122.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-9-1024x417.png 1024w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/word-image-7320-9-768x313.png 768w\" sizes=\"(max-width: 1318px) 100vw, 1318px\" \/><\/p>\n<p>After the function has finished, the table size is now 30GB, down from 117GB, almost \u00bc the size of the \u2018heap\u2019 stored data. Pretty impressive. As mentioned above, there are some <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/postgresql\/concepts-columnar#limitations\">limitations<\/a> that exist today with the columnar storage feature. However, for this use case, or any use case where data receives no further updates after a specified period of time, it is a great solution for reducing storage requirements and improving query performance, both in general and for the specific analytical aggregate queries we will see in the testing.<\/p>\n<p>Another useful UDF we have provided that simplifies partition management is \u2018drop_old_time_partitions\u2019. 
It is called like this:<\/p>\n<p style=\"padding-left: 40px;\"><span style=\"font-family: 'courier new', courier, monospace;\">call drop_old_time_partitions('meter_data', now() - interval '3 years');<\/span><\/p>\n<p>This drops any\/all partitions older than 3 years. The nice thing is that it doesn\u2019t generate any transaction load on the server; it is simply a metadata operation.<\/p>\n<p>For reference, all the utility functions mentioned in this article are detailed here:<\/p>\n<p><a href=\"https:\/\/docs.citusdata.com\/en\/stable\/develop\/api_udf.html?highlight=shard_count#citus-utility-functions\">https:\/\/docs.citusdata.com\/en\/stable\/develop\/api_udf.html?highlight=shard_count#citus-utility-functions<\/a><\/p>\n<p>Ok, back to the testing. We ran all 4 queries against both tables, the heap table \u2018meter_data\u2019 and the columnar table \u2018col_meter_data\u2019, on both configurations. The results are in the following table:<\/p>\n<table>\n<tbody>\n<tr>\n<td><strong>Query<\/strong><\/td>\n<td><strong>Storage<\/strong><\/td>\n<td><strong>Config<\/strong><\/td>\n<td><strong>Run 1<\/strong><\/td>\n<td><strong>Run 2<\/strong><\/td>\n<td><strong>Run 3<\/strong><\/td>\n<td><strong>Run 4<\/strong><\/td>\n<td><strong>Run 5<\/strong><\/td>\n<td><strong>Average (Seconds)<\/strong><\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>Heap 1<\/td>\n<td>1<\/td>\n<td>29.9<\/td>\n<td>30.7<\/td>\n<td>31.3<\/td>\n<td>26.8<\/td>\n<td>51.8<\/td>\n<td>34.1<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 1<\/td>\n<td>1<\/td>\n<td>3.2<\/td>\n<td>2.6<\/td>\n<td>2.4<\/td>\n<td>2.4<\/td>\n<td>9.1<\/td>\n<td>3.94<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Heap 2<\/td>\n<td>2<\/td>\n<td>7.3<\/td>\n<td>1.2<\/td>\n<td>1.2<\/td>\n<td>1.2<\/td>\n<td>1.2<\/td>\n<td>2.42<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 
2<\/td>\n<td>2<\/td>\n<td>0.9<\/td>\n<td>0.5<\/td>\n<td>0.6<\/td>\n<td>0.6<\/td>\n<td>0.8<\/td>\n<td>0.68<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>Heap 1<\/td>\n<td>1<\/td>\n<td>201<\/td>\n<td>155<\/td>\n<td>156<\/td>\n<td>155<\/td>\n<td>156<\/td>\n<td>164.6<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 1<\/td>\n<td>1<\/td>\n<td>14<\/td>\n<td>5.5<\/td>\n<td>5.3<\/td>\n<td>5.1<\/td>\n<td>5<\/td>\n<td>6.98<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Heap 2<\/td>\n<td>2<\/td>\n<td>131<\/td>\n<td>23<\/td>\n<td>7.7<\/td>\n<td>6.4<\/td>\n<td>6.9<\/td>\n<td>35<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 2<\/td>\n<td>2<\/td>\n<td>2.6<\/td>\n<td>0.8<\/td>\n<td>0.6<\/td>\n<td>0.7<\/td>\n<td>0.9<\/td>\n<td>1.12<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>Heap 1<\/td>\n<td>1<\/td>\n<td>76<\/td>\n<td>106<\/td>\n<td>112<\/td>\n<td>132<\/td>\n<td>121<\/td>\n<td>109.4<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 1<\/td>\n<td>1<\/td>\n<td>34<\/td>\n<td>33<\/td>\n<td>18<\/td>\n<td>18<\/td>\n<td>40<\/td>\n<td>28.6<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Heap 2<\/td>\n<td>2<\/td>\n<td>2.1<\/td>\n<td>1.7<\/td>\n<td>1.7<\/td>\n<td>1.7<\/td>\n<td>1.7<\/td>\n<td>1.78<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 2<\/td>\n<td>2<\/td>\n<td>2.5<\/td>\n<td>2.2<\/td>\n<td>2.2<\/td>\n<td>1.9<\/td>\n<td>2.2<\/td>\n<td>2.2<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Heap 1<\/td>\n<td>1<\/td>\n<td>117<\/td>\n<td>108<\/td>\n<td>137<\/td>\n<td>137<\/td>\n<td>121<\/td>\n<td>124<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 1<\/td>\n<td>1<\/td>\n<td>91<\/td>\n<td>86<\/td>\n<td>90<\/td>\n<td>83<\/td>\n<td>83<\/td>\n<td>86.6<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Heap 
2<\/td>\n<td>2<\/td>\n<td>7.7<\/td>\n<td>7.7<\/td>\n<td>8.1<\/td>\n<td>7.8<\/td>\n<td>7.8<\/td>\n<td>7.82<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Columnar 2<\/td>\n<td>2<\/td>\n<td>7.5<\/td>\n<td>7.4<\/td>\n<td>7.3<\/td>\n<td>7.3<\/td>\n<td>7.2<\/td>\n<td>7.34<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>Here is a chart that summarizes the data from the table above, using the average column:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/query-perf-columnar.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-7354\" src=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/query-perf-columnar.png\" alt=\"Image query perf columnar\" width=\"954\" height=\"580\" srcset=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/query-perf-columnar.png 954w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/query-perf-columnar-300x182.png 300w, https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2023\/12\/query-perf-columnar-768x467.png 768w\" sizes=\"(max-width: 954px) 100vw, 954px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<h3>Test Summary:<\/h3>\n<p>Two things are clear from these tests:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>Columnar storage outperforms heap storage by a large margin in certain scenarios<\/li>\n<li>Distributing such workloads leads to impressive performance gains<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Given the enormous scalability on offer with Azure Cosmos DB for PostgreSQL, we believe that it is no longer necessary to send or copy your data to a downstream analytics solution to perform the typical aggregations associated with \u2018Analytics\u2019. You have the option of running these aggregates in real time, on a schedule, or even via a database trigger. 
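<\/p>\n<p>For instance, such a rollup can be scheduled with the pg_cron extension (available on Azure Cosmos DB for PostgreSQL). This is a sketch only; the summary table \u2018meter_data_monthly\u2019 is hypothetical:<\/p>

```sql
-- Schedule a rollup of the previous month's readings into a
-- (hypothetical) summary table at 02:00 on the 1st of each month.
SELECT cron.schedule(
    'monthly-meter-rollup',
    '0 2 1 * *',
    $$
    INSERT INTO meter_data_monthly (nmi, year_month, aggregated_value)
    SELECT nmi,
           date_trunc('month', interval_date) AS year_month,
           sum(nmi_num)
    FROM meter_data
    WHERE interval_date >= date_trunc('month', now() - interval '1 month')
      AND interval_date <  date_trunc('month', now())
    GROUP BY nmi, year_month
    $$
);
```

<p>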
Not only will this solution be capable of scaling to handle truly massive amounts of data, but it can also be considerably less complex and costly than a \u2018split\u2019 system handling transactions and analytics in separate databases.<\/p>\n<h3>Article Summary:<\/h3>\n<p>Ok, there you have it! A managed, open source, distributed PostgreSQL service that scales beyond practically any other relational database and can handle:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>Transactions (OLTP)<\/li>\n<li>Timeseries<\/li>\n<li>IoT<\/li>\n<li>Near real-time analytics<\/li>\n<li>Columnar storage<\/li>\n<li>Geospatial<\/li>\n<li>Schema flexibility with JSON datatypes<\/li>\n<li>ANSI SQL<\/li>\n<li>And even the often touted, less frequently delivered, Hybrid Transactional and Analytical Processing (HTAP)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p style=\"padding-left: 40px;\">\u2026and you don\u2019t have to give up the features you want from your database, like ACID Compliance, Foreign Keys, Joins, Stored Procedures etc.<\/p>\n<h3>About Azure Cosmos DB<\/h3>\n<p>Azure Cosmos DB is a fully managed and serverless distributed database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and support for open-source PostgreSQL, MongoDB, and Apache Cassandra.\u00a0<a href=\"https:\/\/cosmos.azure.com\/try\/\" target=\"_blank\" rel=\"noopener\">Try Azure Cosmos DB for free here.<\/a>\u00a0To stay in the loop on Azure Cosmos DB updates, follow us on\u00a0<a href=\"https:\/\/twitter.com\/AzureCosmosDB\" target=\"_blank\" rel=\"noopener\">Twitter<\/a>,\u00a0<a href=\"https:\/\/www.youtube.com\/AzureCosmosDB\" target=\"_blank\" rel=\"noopener\">YouTube<\/a>, and\u00a0<a href=\"https:\/\/www.linkedin.com\/company\/azure-cosmos-db\/\" target=\"_blank\" rel=\"noopener\">LinkedIn<\/a>.<\/p>\n<h3 id=\"try-azure-cosmos-db-free-with-azure-ai-advantage\">Try Azure Cosmos DB free with Azure AI Advantage<\/h3>\n<p>Sign up for the 
<a href=\"https:\/\/aka.ms\/AzureAIAdvantageSignupForm\" target=\"_blank\" rel=\"noopener\">Azure AI Advantage!\u00a0<\/a>The Azure AI Advantage offer is for existing Azure AI and GitHub Copilot customers who want to use Azure Cosmos DB as part of their solution stack. With this offer, you get <span class=\"ui-provider a b c d e f g h i j k l m n o p q r s t u v w x y z ab ac ae af ag ah ai aj ak\" dir=\"ltr\">40,000 free RUs, equivalent of up to $6,000 in savings.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction : For many organizations around the world, the capture of consumption data is a common yet critical requirement. Some of these organizations struggle to manage the sheer volumes of data that they must capture and process. No wonder! We conservatively estimate that there would petabytes of such data globally and it is likely growing [&hellip;]<\/p>\n","protected":false},"author":133363,"featured_media":7330,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1838],"tags":[],"class_list":["post-7320","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-postgresql"],"acf":[],"blog_post_summary":"<p>Introduction : For many organizations around the world, the capture of consumption data is a common yet critical requirement. Some of these organizations struggle to manage the sheer volumes of data that they must capture and process. No wonder! 
We conservatively estimate that there would be petabytes of such data globally and it is likely growing [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/7320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/133363"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=7320"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/7320\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/7330"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=7320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=7320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=7320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}