I’m preparing a series of post and samples on how to properly load data into Azure SQL using Azure Databricks / Apache Spark that I will start to publish very soon, but I realized today that there is a pre-requisite that in many cases, especially by developers new to the data space, is overlooked: good table design.
Wait! If you’re not a Apache Spark user you might think this post is not for you. Please read on, it will be just a couple of minutes, and you will find something help also for you, I promise.
By good table design, I don’t mean, in this case, normalization, research of the best data type or any other well-known technique…no, nothing like that. They are still absolutely useful and encouraged, but let’s leave them aside for now, and let’s focus on something much simpler.
Simpler but that, in the case I used to build the aforementioned samples, had an impact of 300%. Right, 300%. By changing a very simple thing I could improve (or worsen, depending on where you are starting from) performance by 3 times.
Handling Strings the Right Way
The focal point to understand is that if we overlook something, maybe very small, when data has grown that little mistake or less-than-good decision may have its overhead applied to all the rows in your table. As a result of this it may happen that something negligible in terms of performance impact will instead become very bad once the data amount start to increase. This is the main challenge with database. Data grows and changes.
In the specific sample I want to discuss now, it turns out that using a correct data type for storing string data matters a lot. If you are a .NET developer you probably are already used to take advantage of StringBuilder
instead of simpler String
objects if you need to create or manipulate a string for several thousand or millions of times. StringBuilder
is much more optimized to do this, even if you could obtain the same result just by using String
objects.
In Azure SQL you can choose to use varchar(max)
or varchar(n)
. With varchar(max)
you can store up to 2GB of data. With varchar(n)
you can store up to n
bytes and anyway no more than 8000. The same logic applies to nvarchar
(with limit now set to max 4000 chars as they use 2 bytes per char), but in this case strings will use UTF-16 encoding.
Modeling, who cares?
If you are a developer approaching the data world, you might have the impression that specifying the size of a string is something that only in ancient times made sense. More or less like a nibble, if you ever heard of this curious half-byte thing.
Now that we have gigabytes almost for free, specifying the string size should not matter. With this mindset, a table like the following would look absolutely adequate (for simplicity I’m using a table from TPC-H benchmark, so pardon the weird columns names):
create table [dbo].[LINEITEM] ( [L_ORDERKEY] [int] not null, [L_PARTKEY] [int] not null, [L_SUPPKEY] [int] not null, [L_LINENUMBER] [int] not null, [L_QUANTITY] [decimal](15, 2) not null, [L_EXTENDEDPRICE] [decimal](15, 2) not null, [L_DISCOUNT] [decimal](15, 2) not null, [L_TAX] [decimal](15, 2) not null, [L_RETURNFLAG] [nvarchar](max) not null, [L_LINESTATUS] [nvarchar](max) not null, [L_SHIPDATE] [date] not null, [L_COMMITDATE] [date] not null, [L_RECEIPTDATE] [date] not null, [L_SHIPINSTRUCT] [nvarchar](max) not null, [L_SHIPMODE] [nvarchar](max) not null, [L_COMMENT] [nvarchar](max) not null, [L_PARTITION_KEY] [int] not null )
All the strings in the table are treated like strings in any other language (C#, Python and similar) where you assume strings can have an almost infinite length.
In the database space things are quite different. To reduce I/O to a minimum (as accessing data outside RAM is still the slowest operation you can think of), several optimizations are done, so that with just one I/O you can read many rows at once. That’s why, in fact, the concept of data page exists.
Well, you should!
If a string is known not to be potentially infinite, some other optimizations can be done. So, if we know what could be the potential maximum – or even the exact size – of a string, we could specify it, helping the database to optimize things a bit. For example, like in the following table, which is exactly the table shown before but with more precise string type definition:
create table [dbo].[LINEITEM] ( [L_ORDERKEY] [int] not null, [L_PARTKEY] [int] not null, [L_SUPPKEY] [int] not null, [L_LINENUMBER] [int] not null, [L_QUANTITY] [decimal](15, 2) not null, [L_EXTENDEDPRICE] [decimal](15, 2) not null, [L_DISCOUNT] [decimal](15, 2) not null, [L_TAX] [decimal](15, 2) not null, [L_RETURNFLAG] [char](1) not null, [L_LINESTATUS] [char](1) not null, [L_SHIPDATE] [date] not null, [L_COMMITDATE] [date] not null, [L_RECEIPTDATE] [date] not null, [L_SHIPINSTRUCT] [varchar](25) not null, [L_SHIPMODE] [varchar](10) not null, [L_COMMENT] [varchar](44) not null, [L_PARTITION_KEY] [int] not null )
This last table is much better from a physical design point of view. And you can see this yourself by loading data using Azure Databricks, for example.
Loading data into the first table will require something like 7 minutes for 9 GB of data, while on the second, exactly the same data will require only 2.5 minutes.
Loading data into a table that has been better designed (from a physical modeling point of view) is 3 time faster then loading data in a table not so well optimized.
Given that we only had to correctly set maximum string length, I’d say the optimization is totally worth the effort!
Conclusion
I did my experiments on Azure SQL Hyperscale Gen5 8vCore and with Azure Databricks 6.6 (Spark 2.4.5, Scala 2.11), 4 Workers each with 4 nodes, for a total of 16 workers that were loading data in parallel into Azure SQL.
So, the first step to make sure you can load your data as fast as possible is to create the table using the most suitable data type, especially when you are dealing with strings. In the past we used nibbles to spare memory as much as possible, as it was scarce.
Now we live in the time of plenty, but we generate and operate on huge amounts of data, so every byte wasted means wasting CPU cycles, network bandwidth, I/O bandwidth, cache and memory. Wasting something tiny – in this specific case the bad design impacted only for 0.000005 seconds on a single row – for 59,692,838 times (that’s the number of rows in my sample) still result in a huge impact, not matter what. So, let’s start by not doing that: you’ll never know when your table will reach the size where even a single byte will be critical, so better be prepared.
There is also another nice aspect of this little optimization we just have done. As a result of a better physical design, reads will also be much faster too, exactly for the same reasons explained before. So spending just a few minutes to think about our model instead of just throwing some code here and there is a win-win.
Keep this in mind for your next table!
Photo by panumas nikhomkhai from Pexels
I have always put the proper size for varchar. Unless I need a text datatype but with ability to do a LIKE on it. So I would use varchar (max) instead of TEXT or MEDIUMTEXT etc.
Spark in override mode makes most of the columns as ncharvar max, that needs to be optimized
Uh, that definitely needs to be fixed. I guess it happens only when you don’t specify “truncate” to “true”, is that right? I’m also assuming you are using the latest connector (https://github.com/microsoft/sql-spark-connector). Please make sure you open a GitHub Issue so that the team can track and fix it. Thanks!
Wanted to use that, spark 3 and maven were not supported, so using vanilla with jdbc
I see… so the problem is in the JDBC connector not in the SQL Spark one I posted? Just trying to understand where is the code that needs to be fixed