Including Data in a SQL Server Database Project

Article
02/02/2012

In many cases a database needs to be deployed with data pre-populated in some tables. Loading static data into tables that are referenced from other tables is a common case – examples might be customer types, address types, data states, etc. In some cases these data values will never change, in other cases they may be initial values that can be added to by applications. This class of data is variously called reference data, seed data, domain data or static data. Regardless of what you call it, this data must be present before the database can be used. We can think of this as extending the contract between your database project and the database to include deployment of data as well as schema.

SSDT uses a declarative approach to defining schema, which has the advantage that it is idempotent - you don’t need to concern yourself with the state of the target database when you deploy your design. Ideally, the same approach would apply to deploying data, which in this context can be thought of as part of the schema – rather like enum values that are defined within application code. Ideally, SSDT would allow you to supply data values for any table as part of the definition of the table. These data values would be part of the schema model so would be validated against the table definition, included in refactoring and schema comparisons, and cause appropriate data deployment scripts to be generated. While SSDT does not support defining data in the schema yet it’s certainly on our radar.

So what to do in the meantime? Well, if you use SQL Server Database Projects you can augment the schema deployment with pre- and post-deployment scripts. While these scripts are not declarative and don’t participate in the build, they can be included in the scope of refactoring. And while these scripts are imperative there are ways to write data population scripts so that they are idempotent. This is important as you need these scripts to deploy their data regardless of the content of the tables at the time they are executed and with minimum impact on the integrity of surrounding data.

You can write a composite INSERT, UPDATE, and DELETE script operating over a temp table to do this but SQL Server 2008 added MERGE support which can be used to merge data into a target table in a more compact manner. By including an appropriate MERGE script for each reference data table in your post-deployment script you can deploy data to any number of tables declaratively. And for smaller data sets, with less than a thousand rows per table (the sweet spot for most reference data), MERGE allows you to define the data inline in the script without needing to load it into a temp table. Let’s look at an example…

Let's populate an AddressType table to be used by reference to describe the purpose of addresses held in other referencing tables (not defined).

First let's create a new SQL database project in SSDT and add the reference data table definition below to it. (You can add a new Table using the item template or write this into a build script).

CREATE TABLE [AddressType] (
[AddressTypeID] INT NOT NULL PRIMARY KEY,
[Name] NVARCHAR (50) NOT NULL,
);

Next add a post-deployment script and copy the data population script below into it. Note the difference between the build action property value on the object definition script and the post-deployment script. You’re allowed only one active post-deployment script per project.

-- Reference Data for AddressType
MERGE INTO AddressType AS Target
USING (VALUES
(0, N'Undefined'),
(1, N'Billing'),
(2, N'Home'),
(3, N'Main Office'),
(4, N'Primary'),
(5, N'Shipping'),
(6, N'Archive')
)
AS Source (AddressTypeID, Name)
ON Target.AddressTypeID = Source.AddressTypeID
-- update matched rows
WHEN MATCHED THEN
UPDATE SET Name = Source.Name
-- insert new rows
WHEN NOT MATCHED BY TARGET THEN
INSERT (AddressTypeID, Name)
VALUES (AddressTypeID, Name)
-- delete rows that are in the target but not the source
WHEN NOT MATCHED BY SOURCE THEN
DELETE

As you can see it’s a compact syntax, with the data nicely encapsulated in one part of the statement.

The example above assumes a closed set of values, so it includes a delete clause that removes any extra rows that may have been added to the table. You might feel this is unnecessary if the table is suitably protected, but it does tighten the contract. Alternatively, you could modify this behavior if deletion is an issue or you need to cater for the possibility that extra rows may already have been referenced.

Here’s what it looks like in SSDT. The screen shot below was taken after publishing the project to the database. The post-deployment script is selected and visible in the upper tab and properties window, and the deployed data is visible in the Data Editor in the lower tab, opened by selecting View Data on the AddressType table in the SQL Server Object Explorer.

Some variations of this approach include:

Spiking initial values into a table to which applications can add additional entries.
In this case exclude the delete clause, then the MERGE will guarantee the defined values exist but ignore others already added.
Use with a table that uses the IDENTITY clause to assign key values for application-added data.
In this case you will want to reserve a range of values for your initial data and define a starting value for the IDENTITY clause beyond that range. The example below assigns values for AddressTypeID starting at 100.
[AddressTypeID] INT IDENTITY (100, 1) NOT NULL PRIMARY KEY,
You will also need to enable identity insert as follows:
SET IDENTITY_INSERT AddressType ON
GO
-- your MERGE code goes here
SET IDENTITY_INSERT AddressType OFF
GO
You need to be aware that IDENTITY INSERT has global scope, so a database should either be taken offline or otherwise protected for the duration of the operation.
Use with reference tables that cross-reference each other.
In this case assign explicit key values from the referenced tables as foreign key values in referencing tables – just make sure to order the script so that the referenced tables are populated first.
Wrap the scripts into one or more procedures
A best-practice packaging variation is to place the script within the body of a stored procedure defined as part of the normal database schema and then execute the procedure from the post-deployment script. This has the advantage of providing additional build validation for the script – for example, if a column populated by the script is deleted from the table an error will be reported. You can then choose to either leave the procedure in the database enabling it to be run periodically to ‘refresh’ the data, or drop the procedure immediately after executing it post-deployment.

While SSDT has no built-in solution for including data as part of the schema yet, using MERGE from a post-deployment script works with all SQL Server and SQL Azure databases since SQL Server 2008, is straightforward to implement and gets the job done.

Including Data in a SQL Server Database Project

Additional resources