Use Linked server to Synapse SQL to implement Polybase-like scenarios in Managed Instance
Azure Synapse SQL endpoint is a serverless query engine that enables you to read the files on Azure storage. You can leverage Synapse SQL in Azure SQL Managed Instance with linked services to implement Polybase-like scenarios.
Polybase technology enables SQL Server and SQL DW to directly query a large amount of big data stored on the external storage. You can run standard SQL queries over different files (such as CSV or parquet) that are placed in Azure Storage or Hadoop without need to load the data in SQL database tables. Polybase is currently not available in Azure SQL (database or managed instance). You can vote for this feature request on Azure feedback site. If you need to load data from Azure storage you need to use OPENROWSET(BULK) over Azure storage that works only with CSV and can read a single file. As an alternative, you can load external data into a table using some external process such as Azure Data Factory.
Azure Synapse Analytics is a big data analytics service that enables you to query and analyze data on Azure storage. With Synapse Analytics workspace, you are getting one SQL serverless query endpoint that enables you analyze your data using T-SQL language. There is no upfront cost for this default serverless endpoint. You are charged per each query depending on the amount of data that the query process (few dollars per TB processed). This looks like a perfect match for Polybase-like scenarios because it can queries the files on Azure Data Lake. However, this functionality is not available in Azure SQL.
Although you cannot use Polybase on Azure SQL, you have a workaround if you are using Azure SQL Managed Instance and you need to implement Polybase-like scenario on Azure storage. Managed Instance enables you to run SQL queries on Synapse SQL query endpoint using Linked servers.
Since Synapse SQL query endpoint is T-SQL compliant SQL endpoint, you can create Linked service to Synapse SQL and run remote queries from Managed Instance. In this article you will see how to implement that kind of solution.
Azure Synapse Analytics – SQL endpoint
Synapse SQL endpoint in Synapse is serverless and stateless SQL query engine that allows you to query your data in query-as-a-service model. You just need to send a T-SQL query and get the results.
If you don’t have Synapse Analytics workspace, you can easily deploy it using Azure portal or this Deployment template. Synapse workspace automatically deploys one serverless Synapse SQL endpoint that is everything we need for this kind of analysis. With SQL endpoint you are getting the functionally that looks like standalone Polybase service, but it is much more.
You are not reserving resources or paying for default Synapse Workspace if you are not using it. Synapse SQL endpoint in the workspace is charged per query. If you are not executing queries on default endpoint you are not paying anything. There is no additional fixed cost of default Synapse Workspace.
In Synapse SQL we can query the files on Azure Data Lake storage using OPENROWSET function:
select * from openrowset(bulk 'https://****.blob.core.windows.net/year=*/*/*.parquet', format='parquet') as cases
OPENROWSET function very similar to the OPENROWSET function that you use in SQL Server and Azure SQL with some enhancements:
- It can read both csv and parquet files on Azure storage
- It can reference set of files using wildcards (*.parquet or /year=*/month=*/*.parquet)
- It can automatically get the schema from underlying parquet files without need to specify format file
In addition to OPENROWSET, you can create external table on top of set of files on azure storage and query them via standard external table interface. The following script creates external table on a set of CSV files placed on the paths that match the pattern csv/population/year=*/month=* on Azure storage referenced via DATA_SOURCE:
CREATE EXTERNAL TABLE csv.population ( country_code VARCHAR (5), country_name VARCHAR (100), year smallint, population bigint ) WITH ( LOCATION = 'csv/population/year=*/month=*/*.csv', DATA_SOURCE = AzureDataSource, FILE_FORMAT = QuotedCsvWithHeader );
External tables in Synapse SQL are very similar to Polybase external tables. You can see how to create objects in SynapseSQL database in this setup script.
This is very useful set of features and in the next few sections we will see how to use them in Azure SQL Managed Instance.
Linked servers in Managed Instance
Linked servers in Managed Instance enable you to link remote SQL Server or Azure SQL endpoint and send remote queries or join remote tables with the local ones. Since default SQL Serverless query endpoint in Azure Synapse Analytics is T-SQL compatible query engine, you can create linked server pointing to Synapse SQL in the workspace created.
EXEC master.dbo.sp_addlinkedserver @server = N'SynapseSQL',@srvproduct=N'', @provider=N'SQLNCLI11', @datasrc=N'mysynapseworkspace-ondemand.sql.azuresynapse.net', @catalog=N'master'; EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'SynapseSQL', @useself=N'False', @locallogin=NULL, @rmtuser=N'<your user name here>',@rmtpassword='<your password>' GO EXEC master.dbo.sp_serveroption @server=N'SynapseSQL', @optname=N'remote proc transaction promotion', @optvalue=N'false' GO EXEC master.dbo.sp_serveroption @server=N'SynapseSQL', @optname=N'rpc', @optvalue=N'true' GO EXEC master.dbo.sp_serveroption @server=N'SynapseSQL', @optname=N'rpc out', @optvalue=N'true' GO
You can also use UI in SSMS to create this linked server and just these parameters. Once you setup your linked server to Synapse SQL default query endpoint, you can start reading external data.
Executing queries on Azure storage data
Managed Instance has EXEC function that enables you to execute T-SQL query on remote linked server. We can use this function to send a query that will be executed on SynapseSQL endpoint and return results of the scan. This following query returns results of the query that is reading the file content from Azure Data Lake storage and return the result set to client via Managed Instance endpoint:
SET QUOTED_IDENTIFIER OFF EXEC(" select * from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet', format='parquet') as cases ") AT SynapseSQL
NOTE: Make sure that you have configured RPC options in the Linked server configuration using `sp_serveroption` procedure in the setup script above, or you will get the messages like: Server ‘SynapseSQL’ is not configured for RPC. You need to set ‘rpc’ and ‘rpc out’ options to value true.
You can create complex ad-hoc reports and fetch the data via SynapseSQL to Managed Instance as shown in the following example:
SET QUOTED_IDENTIFIER OFF EXEC(" use SampleDB; SELECT nyc.filename() AS [filename],COUNT_BIG(*) AS [rows] FROM OPENROWSET( BULK 'parquet/taxi/year=2017/month=9/*.parquet', DATA_SOURCE = 'SqlOnDemandDemo', FORMAT='PARQUET' ) nyc GROUP BY nyc.filename(); ") AT SynapseSQL;
This query will read a set of parquet files and return the results to the client. Computation will be done on remote SynapseSQL endpoint without affecting the workload on your Managed Instance.
NOTE: If you want to run this query, make sure that you have created database SampleDB on your Synapse SQL endpoint and that you have created data source using this setup script.
Querying remote external tables
If you have created external table in SynapseSQL that references the files on Azure storage, you can use 4-part names in Managed Instance.
NOTE: External tables in Azure SQL cannot be used to query Synapse SQL tables.
The following query uses 4-part name reference to read data from external table placed in SampleDb query in SynapseSQL linked server:
select * from SynapseSQL.SampleDB.csv.population
You can join this remote external table with database tables. The following example shows how to join Countries table that contains the latest recorded population with an external table containing yearly population data to find the countries in Asia that has the highest population increase since 2014:
select TOP 5 FormalName, Population2020 = LatestRecordedPopulation, Population2014 = p.population, [change %] = 1 - 1.0 * p.population/LatestRecordedPopulation from Application.Countries c join SynapseSQL.SampleDB.csv.population p on c.CountryName = p.country_name where p.year = 2015 and continent = 'Asia' order by 1 - 1.0 * p.population/LatestRecordedPopulation desc
Managed Instance will join results from database table Application.Countries with the content of CSV files referenced via Synapse SQL external table csv.population. You can also use 4-part name syntax to ingest external data into some local table. In the following example I’m loading data into a temp table:
select * into #t1 from SynapseSQL.SampleDB.csv.population; select * from #t1
4-part names with linked servers to SynapseSQL enable you to implement many Polybase like scenarios where you need to query external data, join them with local database tables, or load into local tables.
You can remove linked server to Synapse SQL endpoint using the following script:
EXEC sp_dropserver @server = N'SynapseSQL', @droplogins = 'droplogins'
Default SQL serverless endpoint in Azure Synapse Analytics might be nice workaround if you have Polybase-like scenario where you need to query and analyze large amount of externally stored data and combine it with your local table. Synapse SQL endpoint is not replacement for Polybase and do not have exactly the same features, but it can help you to easily implement solution in many cases where you need to access external data.