In my last blog post, I introduced U-SQL as the new Big Data query language for the Azure Data Lake that unifies the benefits of SQL with the expressive power of your own code. Today we are announcing the availability of the Azure Data Lake in public preview. You can now try U-SQL in Visual Studio and run it over massive amounts of data in the store and across relational store such as Azure SQL Database. In this blog post, I will dive a bit deeper into how you can extend U-SQL queries with custom code.
One of the major values of U-SQL is how easy it is to add user-specific code written in C#. Because U-SQL’s type system is based on C# and the U-SQL scalar expression language on the instances of these types is the C# expression language, it is very easy to use the power of the C# language and the .NET Framework and assemblies in U-SQL.
U-SQL’s C# integration
U-SQL’s core reliance on C# for its type system and scalar expression language provides the query writer access to the wealth of the C# and CLR libraries of classes, methods, functions, operators and types
This starts with that all C# operators except for the assignment operators (=, += etc.) are valid in U-SQL scalar expressions. In particular, all comparison operators such as ==, !=, <, >, the ternary comparison cond ? true-expression : false-expression, the null coalesce operator ?? are supported. Even lambda expressions using => can be used inside U-SQL expressions.
This very tight integration and seamless programming experience is syntactically also enabled by U-SQL being integrated with the C# Roslyn compiler. In fact, the U-SQL integration is probably the most complex application of the C# Roslyn compiler platform.
There are several ways how C# code can be used to extend U-SQL expressions:
- Provide inline C# expressions in your U-SQL script: This often makes sense if a small set of C# methods need to be applied to process one of the scalar values. E.g., a string type method or a math function.
- Write user-defined functions in a C# assembly and reference them in the U-SQL script: This is preferred for more complex functions, if the logic of the function requires the full power of C# beyond its expression language, such as procedural logic or recursion.
- Write user-defined aggregators in a C# assembly and reference them in the U-SQL script: By providing user-defined aggregators, custom aggregation logic can be plugged into U-SQL’s processing of aggregation with a GROUP BY clause.
- Write user-defined operators in a C# assembly and reference them in the U-SQL script: User-defined Operators (UDO) are U-SQL’s custom-coded rowset operators. They are written in C# and provide the ability to generate, process and consume rowsets.
For the user-defined functions, aggregators and operators, the C# assembly will have to be loaded into the U-SQL metadata catalog with CREATE ASSEMBLY (U-SQL) and then referenced in the script with REFERENCE ASSEMBLY. The Azure Data Lake Tools for Visual Studio make the registration process a breeze and even provides a so-called code behind experience where you just write the code and submit the script and the tool takes care of all the plumbing.
In today’s blog post, let’s look at the example I introduced a month ago and dive a bit deeper into the inline C# expressions and the experience of writing user-defined C# expressions and functions with U-SQL.
U-SQL inline C# expressions
Let’s quickly recall the sample: The data is twitter history of all my tweets, retweets and mentions in a CSV file that lives in a file in my Azure Data Lake Store:
Going back to our example, I now want to add additional information about the people mentioned in the tweets and extend my aggregation to return how often people in my tweet network are authoring tweets and how often they are being mentioned. In the previous post showed you a simplified version of the following U-SQL script (please read up on the philosophy and statement processing of U-SQL in the previous post):
The highlighted parts of the script above show some of the places where U-SQL expects and accepts C# expressions. As you can see, you can use the C# expressions in the EXTRACT’s and OUPUT’s USING clause, in the SELECT clause and WHERE clause, as well as in the GROUP BY clause, ORDER BY clause and EXPLODE function, although in this example we just refer to a column name in the two later cases.
An important aspect of the integration is that the C# expressions have full access to the scalar values inside the query expression in a seamless way due to the fact that they are typed with C# types.
The EXTRACT’s and OUPUT’s USING clauses in line 6 and 32 take C# expressions resulting in a user-defined operator instance. The two built-in expressions above are calls to factory methods that return an extractor and outputter instance respectively. We will discuss user-defined operators in a future blog post in more details.
Let’s look at the C# expression from line 8 in a bit more details:
new SQL.ARRAY<string>(tweet.Split(‘ ‘).Where(x => x.StartsWith(“@”)))
This is a great example of the use of C#: The U-SQL built-in type SQL.ARRAY<T> is actually a C# object type that provides the expected SQL/Hive capabilities without the side-effect update functions of the existing C# Array type. You just use the C#’s new operator to create a new instance. The instance gets created by simply applying one of the many string operations on the column tweet that has been defined with the string type to break it into words. No more wondering where you get a certain string-type functionality as in normal SQL dialects: you have the full power of the CLR at your fingertips!
It gets even better. The Split method returns an IEnumerable<string>. So any further processing, such as filtering only the mentions from the tweet words can be done with a LINQ expression and using a lambda expression as the predicate. Now try that with your standard SQL language!
Let’s also look at the WHERE clause in line 21. Again, I can simply provide the C# expression referring to the columns in the rowset. The expression in this context has to result in a value of type bool. Now C# has two-valued logic and not three-valued logic as SQL. Thus the comparison r != null will return true if r is not null or false if it is null. Furthermore, by using the C# logical operator &&, I get the guarantee of C#’s execution order being preserved and more importantly, the short-cutting that will not execute the right comparison if the first will evaluate to false.
U-SQL and Visual Studio’s Code-Behind Capabilities
In the previous blog post I showed the code behind capabilities that will automatically deploy the code in the associated .cs file to the service on submission. In order to be able to refer to the methods and types and functions in U-SQL, the classes have to be defined as public and the objects need to be defined as static public.
What the tool does is actually the following steps:
- The .cs file will be compiled into an assembly file.
- The user’s U-SQL Script gets augmented with a header that adds a CREATE ASSEMBLY statement that will create the assembly files binary content in your U-SQL metadata catalog.
- It also adds a cleanup at the end of the script that will remove the registered assembly with a DROP ASSEMBLY statement.
The following shows the modified script that the tool submits with the code behind header:
U-SQL Assemblies
I can also deploy and register the code as an assembly in my U-SQL metadata catalog myself explicitly. This allows me and other people to use the code in future scripts. It is also the preferred way to manage your user-defined functions, aggregators and operators, if you have more complex code that you want to maintain separately, where you may want to include existing code that may have been written in other contexts (like your XML or JSON libraries) or even call out to external executables (again a topic for a later blog post).
Since I have shown the code of how to use the assembly in the previous post, let me talk a bit more about the assembly management in U-SQL.
Similar to relational databases like SQL Server, U-SQL provides a metadata catalog and supports the standard database objects. One of the objects, is an assembly metadata object. By using the CREATE ASSEMBLY statement, you can register an assembly in the database. Assemblies are objects scoped to a database and the assembly DLL file gets placed into the assembly folder inside the relevant database folder inside the catalog folder in your primary Azure Data Lake Storage account.
In addition to storing your assembly, you can specify additional files that will be stored together with your assembly and will be included when you reference the assemblies. The CREATE ASSEMBLY statement syntax grammar looks like this (see the U-SQL language reference documentation for more details):
Create_Assembly_Statement :=
‘CREATE’ ‘ASSEMBLY’ [‘IF’ ‘NOT’ ‘EXISTS’] Assembly_Name
‘FROM’ Assembly_Source
[‘WITH’ ‘ADDITIONAL_FILES’ ‘=’
‘(‘ Assembly_Additional_File_List ‘)’].Assembly_Name := Quoted_or_Unquoted_Identifier.
Assembly_Source :=
Static_String_Expression | lexical_binary_value.
Speaking of referencing an assembly, U-SQL has a small set of pre-loaded System assemblies and namespaces, including System and System.Linq. The set is kept small to keep the compilation time and the job’s resource utilization lower. If you want to refer to a different system assembly, you can just include them with the following statement that adds System.Xml:
REFERENCE SYSTEM ASSEMBLY [System.Xml];
What’s next?
The public preview of the Azure Data Lake has been opened today. So please sign up for joining the Azure Data Lake Analytics Preview, and give us your feedback!
In a future blog post I will dive deeper into user-defined operators. Please let me know in the comments or via the feedback link above, what other U-SQL topics you want to learn about, such as file sets, the metadata objects, federated queries etc.
Also, head over to our Azure Data Lake documentation center to read more and watch our Channel 9 video series on the Azure Data Lake.
Michael Rys (@MikeDoesBigData), Principal Program Manager, Microsoft Big Data Michael has been doing data processing and query languages since the 1980s. Among other things he has been representing Microsoft on the XQuery and SQL design committees and has taken SQL Server beyond relational with XML, Geospatial and Semantic Search. Currently he is working on Big Data query languages such as SCOPE and U-SQL when he is not enjoying time with his family under water or at autocross. |
0 comments