CTO Choices – What happens when you need to choose your database?
Building a robust technology stack from scratch is a challenging task. CTO Choices is a series of articles aimed at CTOs, tech founders, and project creators. They help you ask yourself the right questions to pick your tools. We try hard to give you honest, enduring, and agnostic advice. That is why – to let our true selves speak – the last section of each article presents a heavily opinionated choice. And you will have the right questions to assess whether this choice is good for you. After an exploration of the criteria to pick a frontend framework, let's look at another important component of your architecture: the database.
Almost any application needs to keep track of some sort of state. For persistent state, you will probably have a database. Now, you have one of the most complicated choices of your architecture to make: which database to pick?
You may not see it at first glance, but choosing a database is a complex task for two main reasons. First, there are dozens of dimensions to think about – we will see them in a minute. Second, it is one of the choices that is most complicated to change afterwards. Changing your main database while having a product in production is an extremely complex task. You'd better make the right choice the first time 😊.
What are your constraints?
Before going into a list of criteria, you need to understand what you're building. Do you have performance constraints? Do you have the slightest idea of the volume of data and the volume growth over the next 3 to 5 years (in your biggest dreams)? Applications are born with quite different data requirements and/or constraints. While I will give you my personal choice later in this article, some projects may need special attention.
That’s why I recommend doing a “ballpark” exercise of estimating the amount of data per user/customer and its growth over time. You may also want to understand if you need specific features within your database – like manipulating geospatial data or doing AI right within the DB – or later in the architecture – like multi-region deployments – or if you have specific requirements – like performance constraints. We will dig into more constraint considerations later; however, this first exercise should give you some hints.
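As a hypothetical illustration of this ballpark exercise – every figure below is invented, so plug in your own numbers – a few lines of Python are enough:

```python
# Back-of-envelope storage estimate over a 3-year horizon.
# Every figure here is hypothetical – replace with your own numbers.
rows_per_user_per_month = 500    # events, orders, messages...
bytes_per_row = 1_000            # ~1 KB per row, indexes included
users = 2_000                    # users today
monthly_user_growth = 1.05       # +5% per month (your biggest dreams)

total_bytes = 0
for month in range(36):
    total_bytes += users * rows_per_user_per_month * bytes_per_row
    users *= monthly_user_growth

print(f"~{total_bytes / 1e9:.0f} GB of data after 3 years")
```

Even a crude model like this tells you whether you are heading towards gigabytes or petabytes, and those two answers lead to very different database conversations.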
A lot of criteria to consider.
So, how do you make your own choice when picking a database? It may not be your first criterion, but as a developer, I look at developer experience: Can I execute the database in a container or locally? Can I query the DB both from a CLI and from a UI? I’m also looking at the DevOps side of the database: can we automate all the deployment – from creating a database to updating a schema or creating new users? Can we use the database in our automated test pipeline? There is a stark difference between databases on these topics.
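As a sketch of what this developer experience can look like, here is a hypothetical docker-compose fragment – PostgreSQL is only a placeholder engine, and every value is invented – that spins up a disposable database for local development or a CI pipeline:

```yaml
# docker-compose.yml – disposable database for local dev / CI (hypothetical values)
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: local-only-password
      POSTGRES_DB: app_test
    ports:
      - "5432:5432"
    tmpfs:
      - /var/lib/postgresql/data   # keep data in RAM: fast and disposable
```

If your candidate database cannot be scripted away like this, your automated test pipeline will feel it.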
There is also a related question to ask: is there a client for the programming language you will be using? You may have a great DB choice, but if it only works with a few languages, it may be complicated to pick it. You can still create a service to expose DB features through an API, or wrap a native library, but that effort may be better spent elsewhere.
Read and write patterns
Most data stores in use have a read/write ratio largely in favor of reads. When the number of reads is substantial, we tend to add a cache layer in front of our database – like Redis. However, some constraints like consistency or heavy analytical queries on the transactional database may prevent this addition. You then need to evaluate the maximum capacity of your database: is there a maximum number of concurrent connections? Do you have read-only replication easily available?
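To make the cache-layer idea concrete, here is a minimal sketch of the cache-aside pattern, with a plain dict standing in for Redis – the function names and TTL are my invention, not any library's API:

```python
import time

cache = {}  # stand-in for Redis; maps key -> (value, expiry timestamp)
CACHE_TTL_SECONDS = 60

def query_database(user_id):
    # Placeholder for the real (slow) database round trip.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: try the cache first, fall back to the database."""
    entry = cache.get(user_id)
    if entry and entry[1] > time.monotonic():
        return entry[0]                      # cache hit
    value = query_database(user_id)          # cache miss: hit the DB
    cache[user_id] = (value, time.monotonic() + CACHE_TTL_SECONDS)
    return value
```

Note how consistency constraints show up immediately: until the TTL expires, readers may see stale data, which is exactly why a cache is not always an option.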
Write scalability is an interesting and complex problem area. From sequential IDs to sharding, there are a lot of techniques that you can leverage in both your database engine choice and your data architecture. Most (relational) databases are deployed on a single node, but you need to think about what you’ll need 1, 2 or 3 years from now. Will your hosting provider be able to grow this server up to what you need in 3 years?
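To give the sharding idea some shape, here is a toy sketch of hash-based routing of writes across nodes – the shard count and node names are invented:

```python
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(key: str) -> str:
    """Route a key to a shard via a stable hash (not Python's built-in
    hash(), which is randomized between processes)."""
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]
```

Beware: with this naive modulo scheme, adding a fifth node reshuffles most keys; real deployments use techniques like consistent hashing to limit that movement, which is part of why write scaling is worth thinking about early.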
The answer here may be pragmatic: choose something now and accept a redesign or a redeployment later. But ensure you have the proper analytics in your DB in place, so you don’t hit the ceiling too late for this redesign.
A few years ago, my team had to move a huge database from one datacenter to another (over 200 TB). The fastest solution was to buy hundreds of hard drives, copy data to them, ship them to the new location, and plug them one by one into a “load server”. Even with shipping and handling times, the transfer rate was better than transferring the data over the Internet.
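You can reproduce that back-of-envelope comparison yourself – the ~200 TB comes from the anecdote, but the link speed, shipping time, and copy time below are hypothetical:

```python
# "Never underestimate the bandwidth of a truck full of hard drives."
data_tb = 200
data_bits = data_tb * 1e12 * 8

# Option 1: transfer over a dedicated 1 Gb/s link, fully saturated.
link_bps = 1e9
network_days = data_bits / link_bps / 86_400

# Option 2: copy to drives in parallel (~1 day on each side, bounded
# by local I/O) and ship them (~3 days door to door).
sneakernet_days = 1 + 3 + 1

print(f"network: ~{network_days:.0f} days, shipping: ~{sneakernet_days} days")
```

Under these assumptions the network option takes more than two weeks, so the hard drives win comfortably – and the gap only grows with the data volume.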
I’ve already touched a bit on the scalability criteria in the previous section. Scalability is also about deploying your database across regions – for high-availability or regional performance reasons. For example, some open-source databases are great in a single-node deployment but require paid add-ons when you need to scale to several nodes.
When you pick a database, you may not know where you’ll host it now or in 2 years. Hence, having multiple hosting capabilities is probably a clever idea. We can evaluate a database hosting solution along several categories:
- Developer and Continuous Integration hosting: Can you run the database locally, on your development machine? Can it be spun up easily and in an automated way if you want to run integration tests in a DevOps pipeline?
- Cloud hosting: As a developer, I’ll favor a PaaS hosting over deploying my database on my own server.
- Local deployment: While I’m mostly using cloud hosting, there are still a lot of scenarios where local deployment makes sense. Some innovative databases are only available as a cloud service. For databases supporting local deployment, you may want to have a detailed look at the options, like support for Kubernetes deployment.
- IoT/Edge deployment: Not all projects need an Edge deployment. If you (may) need it in the future, it’s interesting to learn about the options there. Having different database engines for Edge and cloud deployment can be a solid choice, and sometimes you can have the same engine running both in the cloud and at the edge.
At some point in the future, I can bet that you’ll need to run some analytics on your database. There is enough to say about this topic for several articles. And yet, there are simple things to take into consideration. For example, is your database compatible with the data ingestion and analytics tools you’re planning to use? Are there any analytics capabilities embedded in the database? Some databases – or hosting providers – also leverage AI to give you recommendations about your database (regarding performance, security, …)!
You need to take loving care of your database. That means doing proper backups (and testing them!), but also monitoring – for operational reasons but also for debugging purposes – and database engine version upgrades. Even if you’re not an Ops/DBA person, there are important points to take into consideration. I’ve seen some horror stories when some open-source NoSQL databases went wrong in production. Don’t get me wrong: there was nothing inherently wrong with these databases. The team was simply not prepared to handle a production incident.
The scope of “operations” is also covering some “good” incidents. Let’s say you have a traffic spike. You may need to scale up your DB for a few hours. Can you do it with your database (and hosting)?
Security & Compliance
Every project is concerned by these two criteria. While the database you pick needs to have some features around them, the minimum acceptable level will vary considerably depending on your business, your industry, and the type of data you’re handling. Being exhaustive is therefore quite complicated. Here are a few questions to help you:
- Does your database support robust user and RBAC management?
- Does your database support encryption at rest? In transit? While processing the data?
- Are there special features for personally identifiable information (PII)?
- Are there data exploration/labeling features to identify and manage sensitive data?
- Can you trace everything that is happening to a server? To a specific database object?
Some database engines are known to cost the price of a small apartment. Other ones are open source, and so they are free, right?
As for many other components in your architecture, calculating the true cost (or “TCO”) is a bit more complex. In my previous startup, we were using MySQL (in the pre-MariaDB era), but we were paying for an add-on to get replication capabilities. Elasticsearch is a great open-source product that you can start using right away. If you require security and observability features, then you need to switch to a paid plan (or get it through the Azure Marketplace). It’s easy to lean towards free offerings – open source or not. However, it’s your duty to understand the costs across the lifetime of your usage to make an informed decision.
How many database solutions do you need?
Sometimes, having only one database in your architecture is not the best option. And with managed databases in the cloud, adding a second storage option in your architecture is easier than before.
When I’m designing an architecture, I start with one data store. Adding a second one is only done when there is a clear use case, or when some criteria need to be met. For example, I almost never store files directly in the database; I use blob storage for images, videos, documents, etc. For some of my apps, I’ve even used only blob storage as a database, due to read-write performance constraints, scale, and zero need for querying data. When adding a new data store, you need to go through all your criteria to validate this store. Sometimes optimizing for a few milliseconds of query time will hurt several important criteria areas, making the addition of this data store a bad decision.
Opinionated choice: Azure SQL and Cosmos DB
Disclaimer 1: I work for Microsoft, and I’ve been in this ecosystem for more than 15 years. Yes, it’s an opinionated choice 😉. Disclaimer 2: Azure also offers fully managed PostgreSQL, MySQL and MariaDB solutions. These databases can be a great choice for your project. As always in this series of articles, we invite you to take the criteria detailed above to make your own decision.
I know, “Azure SQL”. You think you know this product, which is coming from an old world with SQL Server and .NET. However, I bet you’ll learn a few things if you keep reading!
Azure SQL can be used in a wide range of languages through ODBC, and natively in Java, Node.js, PHP, Python, Ruby and .NET, and there is even an Apache Spark connector. With the free and open-source Azure Data Studio, you have an editor to work on your Database projects, even on Linux while running SQL Server locally in a container.
Doing so not only gives you an editor with autocompletion, but also support for Database projects, which allow you to package your app, support schema comparison and updates in a CI/CD system, and even create notebooks to explore your data with SQL (or Spark, Python and PowerShell). Oh, and did I mention the SandDance extension for data exploration?
You can of course use SQL CLI if you want to automate things or are in love with your Windows Terminal (or bash on Linux or WSL2). Did I mention that you can run, manage and develop with SQL Server on a Linux machine?
You’re looking for a NoSQL or a graph database? Try Azure SQL! With its JSON and graph capabilities, it may suit your needs, even for non-relational storage.
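The exact Azure SQL syntax is out of scope here, but to illustrate the general idea of querying JSON inside a relational engine, here is the same pattern using SQLite's built-in JSON functions from Python – the table, column, and payload are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"type": "signup", "user": {"country": "FR"}}',),
)

# Filter and project on JSON fields as if they were regular columns.
row = conn.execute(
    "SELECT json_extract(payload, '$.user.country') "
    "FROM events WHERE json_extract(payload, '$.type') = 'signup'"
).fetchone()
print(row[0])  # FR
```

The appeal is that semi-structured documents and your relational data live in one engine, under one backup, security, and operations story.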
From memory-optimized tables to multi-region highly available deployments, SQL Server and Azure SQL offer you a wide range of scalability features that should support most workloads.
Do you have more than 4 TB in a single database, or do you need very high performance or instant backups? Switch to Azure SQL Hyperscale to unlock your database with independently scalable compute and storage resources, while retaining your knowledge and code. Just need a database for a few hours a month? Leverage the Azure SQL serverless offering, which automatically pauses your database when idle and resumes it when you open a connection.
I’m surprised by the analytics platform of most of the startups I know. The data is scattered across a wide range of data stores, and the reporting tools are either quite expensive or very low-tech. While not part of Azure SQL, Power BI is a great reporting tool – and the desktop version is free. Combined with an Azure SQL database, you’re ready for a lot of analytics needs, while maintaining all the security features.
If you need machine learning capabilities, run your models directly within the Azure SQL engine! And if you land into the Big Data world, you can reuse your knowledge with Azure Synapse Analytics.
Security & Operations
I’m not an expert in this domain; I’ve just stumbled upon a lot of cool features – at least for me 😊 – during my career. For example, dynamic data masking allows you to mask some columns only for specific users. A credit card number column can reveal only the last 4 digits to a call center operator while the payment processing service has access to the full number, all of this being enforced – and thus audited – in a single place: the database. Row-level security allows you to hide some rows from some users, while secure enclaves protect the data even from other processes running on the machine.
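The whole point of doing masking in the database is that no application can forget it, but just to illustrate the behavior dynamic data masking gives you, here is the equivalent logic written as plain application code – the function and role names are my invention:

```python
def mask_card_number(card_number: str, role: str) -> str:
    """Reveal the full number only to roles that need it;
    everyone else sees only the last 4 digits."""
    if role == "payment-processor":
        return card_number
    return "X" * (len(card_number) - 4) + card_number[-4:]

print(mask_card_number("4111111111111111", "call-center"))
print(mask_card_number("4111111111111111", "payment-processor"))
```

When the database enforces this instead, every client – reporting tools included – gets the masked view for free, and there is a single place to audit.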
Do you love blockchain or use blockchain in your architecture? With Azure SQL Database ledger, you can create an auditable and immutable log of your data contained in a SQL table… or offer a trusted off-chain storage for your blockchain operations.
Pricing for Azure SQL and SQL Server can be complicated, and I’m not a licensing specialist. However, there are a few things to know about it. First: SQL Server can be free! Yes, totally free. All developer tiers are free, and if you need a development or test database hosted on Azure, you can benefit from the Dev & Test pricing. Express Edition is also free, including for production workloads, with a limit of 10 GB of data per database. Azure-based pricing can also be quite competitive: a replicated database – with the secondary replica accessible and up – starts around $14 USD (€11) per month (at the time of writing). Yes, that’s the pricing for a database deployed in two Azure regions with automatic region failover. Oh, and if you’re deploying Azure SQL on an Edge device, it can be as low as $60 USD (€51) per year.
Cosmos DB as a specialized secondary store
Cosmos DB is an Azure-only multi-model data store created for performance – with an SLA on read and write times – and multi-master cross-region deployments. With its support for multiple APIs, you can use native MongoDB, Gremlin, or Cassandra SDKs to query and store data within a Cosmos DB collection.
While Cosmos DB doesn’t check all my previous criteria – you can’t deploy Cosmos DB on premises – it can be a great solution for some scenarios thanks to a set of features: a true serverless offering (pay per query), auto-scaling, change feed, integration with other Azure services (like Azure Functions), and more recently a Linux emulator.
Some startups are already using Cosmos DB in production. You can read about Corlife using CosmosDB for analytical purposes in this article.
What criteria are YOU using to pick your database?
Creating Startups DevBlog is all about sharing real-world experiences, so we want to hear from you! Do you have another approach for choosing your tech stack? Do you use other criteria to refine your choices? Share them in the comments section!