{"id":7025,"date":"2017-03-24T19:35:45","date_gmt":"2017-03-24T19:35:45","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/premier_developer\/?p=7025"},"modified":"2019-03-07T09:30:03","modified_gmt":"2019-03-07T16:30:03","slug":"getting-started-with-azure-data-lake","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/premier-developer\/getting-started-with-azure-data-lake\/","title":{"rendered":"Getting Started with Azure Data Lake"},"content":{"rendered":"<p>Application Development Manager, <a href=\"https:\/\/www.linkedin.com\/in\/jasonvenema\/\"><strong>Jason Venema<\/strong><\/a>, takes a plunge into Azure Data Lake, Microsoft\u2019s hyperscale repository for big data analytic workloads in the cloud.\u00a0 Data Lake makes it easy to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.<\/p>\n<hr align=\"center\" size=\"3\" width=\"100%\" \/>\n<p>I\u2019m not a data guy. Truth be told, I\u2019d take writing C# or Javascript over SQL any day of the week. When the <a href=\"https:\/\/azure.microsoft.com\/en-us\/solutions\/data-lake\/\">Azure Data Lake<\/a> service was <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/introducing-azure-data-lake\/\">announced at Build 2015<\/a>, it didn\u2019t have much of an impact on me. Recently, though, I had the opportunity to spend some hands-on time with Azure Data Lake and discovered that you don\u2019t have to be a data expert to get started analyzing large datasets.<\/p>\n<h3>What\u2019s a Data Lake?<\/h3>\n<p>For most non-data experts (like me) it\u2019s probably not obvious what a data lake is. At a high level, it\u2019s reasonable to think of a data lake as akin to a data warehouse, but without any formal definition of requirements or schema. Data of any kind can be stored in the data lake regardless of structure, volume or velocity of ingestion. 
For a more in-depth definition, you can check out <a href=\"https:\/\/www.blue-granite.com\/blog\/bid\/402596\/top-five-differences-between-data-lakes-and-data-warehouses\">Top Five Differences between Data Lakes and Data Warehouses<\/a>, which is a great article written by a colleague of mine.<\/p>\n<p>In Azure, the Data Lake service is actually composed of two pieces: Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). No, the Data Lake Store is not a store where you can buy data lakes. The word <i>store<\/i> in this case is used in the sense of <i>storage<\/i>. In essence, ADLS is a storage service that is <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-store\/data-lake-store-comparison-with-blob-storage\">optimized for big data analytics workloads<\/a>. You can store massive amounts of data in the lake as-is without having to worry about transforming it all to a pre-defined schema.<\/p>\n<p>After the data is in ADLS, you use ADLA to run analytics queries against it. One of the most compelling features of ADLA is that you don\u2019t need to spin up (and pay for) a cluster of servers in order to run queries against your data. Instead, there is a bank of servers already provisioned that you can simply take advantage of to execute your query. <a href=\"https:\/\/azure.microsoft.com\/en-us\/pricing\/details\/data-lake-analytics\/\">You only pay for the time it takes for your query to complete<\/a>.<\/p>\n<p>I decided to try out the service by uploading a single large data file into ADLS and running a few simple queries against it.<\/p>\n<h3>Getting Started: Creating the Data Lake<\/h3>\n<p>Creating a data lake is straightforward through the Azure portal. 
I first created the data lake store by opening the Marketplace and searching for it.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image00213.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image002\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image002_thumb11.png\" alt=\"clip_image002\" width=\"1028\" height=\"395\" border=\"0\" \/><\/a><\/p>\n<p>The only information you need to provide is the name of the store, the resource group, location (ADLS is currently supported in the Central US and East US 2 regions) and the encryption settings. I chose not to encrypt my data, but in a production setting you would want to encrypt.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0044.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image004\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image004_thumb2.png\" alt=\"clip_image004\" width=\"331\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<p>Next, I created the Data Lake Analytics service. The process is similar, with the exception that you are also required to select an ADLS store to associate it to. 
I chose the store that I had just created.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0063.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image006\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image006_thumb2.png\" alt=\"clip_image006\" width=\"563\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<p>Now that I had my data lake, it was time to load some data into it.<\/p>\n<h3>Putting Data into the Lake<\/h3>\n<p>There are a few different options for getting data into ADLS. If the data is already in an <a href=\"https:\/\/wiki.apache.org\/hadoop\/HDFS\/\">HDFS (Hadoop Distributed File System)<\/a> store, you can use tools like <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-store\/data-lake-store-data-transfer-sql-sqoop\">Sqoop<\/a> or <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-store\/data-lake-store-copy-data-wasb-distcp\">DistCp<\/a>. If you want to move data on a schedule, another option is Azure Data Factory. For data that\u2019s in Azure blob storage, you can use a CLI tool called <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-store\/data-lake-store-copy-data-azure-storage-blob\">AdlCopy<\/a>.<\/p>\n<p>In my case, I had downloaded a single 1.4GB CSV file containing <a href=\"https:\/\/catalog.data.gov\/dataset\/crimes-2001-to-present-398a4\">Chicago crime data from 2001 to the present from data.gov<\/a> for my data set. Now, I realize this is not \u201cbig data\u201d by anyone\u2019s standards, and ADLS is capable of handling petabyte-sized files (in fact, there\u2019s no upper limit on the amount of data you can store there), but for my purposes it would suffice. 
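The blob-to-ADLS path deserves a quick illustration. Assuming a file already sitting in a blob container, a single AdlCopy invocation along these lines copies it into the store (the storage account, container, and key below are placeholders, not real resources):

```
AdlCopy /Source https://mystorage.blob.core.windows.net/crimedata/ /Dest swebhdfs://crimedata.azuredatalakestore.net/2017/ /SourceKey <storage-account-key>
```

The swebhdfs:// scheme is how AdlCopy addresses a path inside a Data Lake Store account.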
Since the file was on my laptop, it was simplest just to upload it to the store using the Azure portal.<\/p>\n<p>From the ADLS management blade, click on the <i>Data Explorer <\/i>button.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35851\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake1.png\" alt=\"\" width=\"644\" height=\"335\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake1.png 644w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake1-300x156.png 300w\" sizes=\"(max-width: 644px) 100vw, 644px\" \/><\/p>\n<p>The Data Explorer blade lets you quickly see what\u2019s in the store. You can create folders and modify access lists just like in a regular file system. There is also a convenient <i>Upload <\/i>button that you can use to upload new data. I used it to upload my CSV file to the store.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35852\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake2.jpg\" alt=\"\" width=\"644\" height=\"231\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake2.jpg 644w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake2-300x108.jpg 300w\" sizes=\"(max-width: 644px) 100vw, 644px\" \/><\/p>\n<p>Once complete, I could see my file in the Azure portal in a folder that I created and named \u201c2017\u201d.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0123.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image012\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image012_thumb2.png\" 
alt=\"clip_image012\" width=\"644\" height=\"171\" border=\"0\" \/><\/a><\/p>\n<h3>Querying the Data using U-SQL<\/h3>\n<p>The crime data file that I used contained 22 columns of data. I could easily figure out the schema of the data by using the File Preview feature of the ADLS service. Simply clicking on a file in the store opens the preview window.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35854\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake3.jpg\" alt=\"\" width=\"1028\" height=\"236\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake3.jpg 1028w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake3-300x69.jpg 300w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake3-768x176.jpg 768w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake3-1024x235.jpg 1024w\" sizes=\"(max-width: 1028px) 100vw, 1028px\" \/><\/p>\n<p>I decided that I wanted to write queries to answer 4 questions about this data:<\/p>\n<ol>\n<li>How many rows are there in the dataset?<\/li>\n<li>What are the most\/least frequent locations that crimes occur?<\/li>\n<li>What are the most\/least frequent types of crime that occur?<\/li>\n<li>How has the crime rate changed from 2001 to the present?<\/li>\n<\/ol>\n<p>The Azure Data Lake team has created a language called U-SQL that makes it easy to write queries against your data. The U-SQL language is similar to the familiar SQL syntax, but allows you to intermix C# statements to make extracting, transforming and writing the data more flexible.<\/p>\n<p>Everything I needed to do in my queries was possible using the built-in extractors but if my data had been in a more exotic format, I could have written my own extractor in C#. 
Incidentally, a quick tip is that U-SQL keywords need to be all caps, whereas C# uses normal casing rules.<\/p>\n<p>Before I could create any U-SQL scripts, I first needed to install the <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-analytics\/data-lake-analytics-data-lake-tools-get-started\">Data Lake Tools for Visual Studio<\/a>. Once installed, I opened Visual Studio and from the <i>File -&gt; New Project<\/i> menu, I chose <i>U-SQL Project<\/i>.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0162.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image016\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image016_thumb2.png\" alt=\"clip_image016\" width=\"644\" height=\"412\" border=\"0\" \/><\/a><\/p>\n<p>The result was a solution containing a single project and an empty U-SQL file. I used the <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-analytics\/data-lake-analytics-data-lake-tools-get-started#develop-u-sql-scripts\">documentation site<\/a> to get an idea for what the structure of a basic U-SQL script looks like. I pasted the code into Visual Studio and made a few modifications.<\/p>\n<p>First, I updated the list of columns to extract using the schema I discovered from previewing my file earlier. Then I updated the source location in the EXTRACT statement to point to my crime data file in ADLS, and the destination location in the OUTPUT statement to point to a folder in the same ADLS store where I wanted the query results file to be written. Finally, I updated the SELECT statement to simply count the number of rows in the file. 
This would then provide the answer to the first question that I posed above.<\/p>\n<pre>@crimes =\n    EXTRACT Id int,\n            CaseNumber string,\n            Date DateTime,\n            Block string,\n            Iucr string,\n            PrimaryType string,\n            Description string,\n            LocationDescription string,\n            Arrest bool,\n            Domestic bool,\n            Beat int?,\n            District int?,\n            Ward int?,\n            CommunityArea int?,\n            FBICode string,\n            XCoordinate int?,\n            YCoordinate int?,\n            Year int?,\n            UpdatedOn DateTime,\n            Latitude float?,\n            Longitude float?,\n            Location string\n    FROM &quot;adl:\/\/crimedata.azuredatalakestore.net\/2017\/Crimes_-_2001_to_present.csv&quot;\n    USING Extractors.Csv(skipFirstNRows:1);\n\n@res =\n    SELECT COUNT(*) AS Count\n    FROM @crimes;\n\nOUTPUT @res\n    TO &quot;adl:\/\/crimedata.azuredatalakestore.net\/2017\/Results\/Count.csv&quot;\n    USING Outputters.Csv();<\/pre>\n<p>I used the <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/azure\/mt764098.aspx\">built-in CSV extractor<\/a> and passed the <i>skipFirstNRows<\/i> parameter to skip the header row that was present in my dataset. I then built the solution to make sure my syntax was correct. This opened a new tab in Visual Studio that displayed the amount of time it took to compile the query.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0183.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image018\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image018_thumb2.png\" alt=\"clip_image018\" width=\"554\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<h3>Running the Query<\/h3>\n<p>With my query written and compiled, I was now ready to execute it against my data. 
For testing against a scaled-down version of your data, it\u2019s possible to run a U-SQL query locally and avoid the cost associated with running it in Azure. It\u2019s simple to go back and forth using the dropdown at the top of the query authoring window. I chose to go straight to running it in the cloud.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35855\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake4.png\" alt=\"\" width=\"644\" height=\"206\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake4.png 644w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake4-300x96.png 300w\" sizes=\"(max-width: 644px) 100vw, 644px\" \/><\/p>\n<p>To run the query, I simply clicked the <i>Submit <\/i>button next to the dropdown. You can also click the down arrow next to the button to open the Advanced options screen, which allows you to select the amount of parallelism (i.e. the number of servers) you want to use when executing your query. There is a cost vs. speed trade-off here, and the value you choose depends on a number of factors including how parallelizable your data set is. 
I went with the default of 5, although this turned out to be overkill for my dataset.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35856\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake5.png\" alt=\"\" width=\"644\" height=\"351\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake5.png 644w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake5-300x164.png 300w\" sizes=\"(max-width: 644px) 100vw, 644px\" \/><\/p>\n<p>After you submit the job, ADLA takes care of running it on a pool of servers that are available to you through the service. This is a nice feature, because there\u2019s no need for you to spin up and run a set of servers yourself. Once you submit the job, it\u2019s placed in a queue until the necessary compute resources become available (usually only a few seconds).<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image024.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image024\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image024_thumb.png\" alt=\"clip_image024\" width=\"463\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<p>The first time I did this, I got an error message because my data had some empty columns that I hadn\u2019t <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/azure\/mt764123.aspx\">declared as nullable types<\/a> in the EXTRACT statement. 
The error message made it very clear what I had done wrong.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image026.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image026\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image026_thumb.png\" alt=\"clip_image026\" width=\"644\" height=\"259\" border=\"0\" \/><\/a><\/p>\n<p>See that <i>Download <\/i>button in the top-right corner? This is a great feature that allows you to download the entire runtime of your query on the failed vertex (a.k.a. server) for debugging locally. In my case, the error message was enough for me to figure out what the problem was without doing that, but for debugging complex problems it can be a real life-saver.<\/p>\n<p>After I made all of my numeric types nullable in my U-SQL query, I submitted the job again and this time it completed successfully.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-35857\" src=\"http:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake6.png\" alt=\"\" width=\"1028\" height=\"441\" srcset=\"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake6.png 1028w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake6-300x129.png 300w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake6-768x329.png 768w, https:\/\/devblogs.microsoft.com\/premier-developer\/wp-content\/uploads\/sites\/31\/2017\/03\/lake6-1024x439.png 1024w\" sizes=\"(max-width: 1028px) 100vw, 1028px\" \/><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0301.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" 
title=\"clip_image030\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image030_thumb1.png\" alt=\"clip_image030\" width=\"1028\" height=\"491\" border=\"0\" \/><\/a><\/p>\n<h3>Analyzing the Results<\/h3>\n<p>The simplest way to analyze the results of my query was to use the ADLS File Preview feature in the Azure portal. Remember, the results of my query are simply being written as a new file in the store. From the portal, I could see that my output file existed (success!) and that the total number of rows in the file was 6,270,269.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image032.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image032\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image032_thumb.png\" alt=\"clip_image032\" width=\"644\" height=\"164\" border=\"0\" \/><\/a><\/p>\n<p>Emboldened by this small success, I went on to create similar U-SQL queries to answer the other 3 questions. For the most part, this was a simple matter of updating the SELECT and OUTPUT statements. 
For example, here\u2019s the relevant portion of the query that retrieves crime incident counts by type:<\/p>\n<pre>@res =\n    SELECT PrimaryType, COUNT(*) AS PrimaryTypeCount\n    FROM @crimes\n    GROUP BY PrimaryType;\n\nOUTPUT @res\n    TO &quot;adl:\/\/crimedata.azuredatalakestore.net\/2017\/TopCrimeTypes.csv&quot;\n    ORDER BY PrimaryTypeCount DESC\n    USING Outputters.Csv();<\/pre>\n<p>The U-SQL job execution visualization lets you see basic statistics about your query\u2019s performance. In my case, I could see that the vast majority of the time was spent in the extract phase, which was spread across 2 vertices and took less than a minute to complete. This visualization also made it clear that 5 vertices were overkill for my data.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image034.png\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image034\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image034_thumb.png\" alt=\"clip_image034\" width=\"644\" height=\"472\" border=\"0\" \/><\/a><\/p>\n<h3>Visualizing with Power BI<\/h3>\n<p>For small result sets, it\u2019s easy enough to look at the results of your queries by opening the CSV output files in ADLS. 
I decided to take it a step further by importing those files into Power BI and creating a report. I won\u2019t go into detail on how to do that in this post, but <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-store\/data-lake-store-power-bi\">the process is well documented<\/a>. The result was the dashboard below.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image0361.jpg\"><img decoding=\"async\" style=\"padding-top: 0px; padding-left: 0px; padding-right: 0px; border: 0px;\" title=\"clip_image036\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/clip_image036_thumb1.jpg\" alt=\"clip_image036\" width=\"1028\" height=\"525\" border=\"0\" \/><\/a><\/p>\n<p>Sweet success. Maybe I\u2019m more of a data guy than I thought?<\/p>\n<h3>Conclusion<\/h3>\n<p>The Azure Data Lake service made it easy for me (a self-professed non-data-guy) to quickly perform analysis on large amounts of data without having to worry about managing (and paying for) my own cluster of machines. I stored a large data file as-is in Data Lake Store, and created a few simple U-SQL queries to extract and process that data across a pool of available compute resources. 
I then viewed the results of my queries by examining the output files and by quickly connecting the Data Lake Store to Power BI so I could create a great looking report.<\/p>\n<hr align=\"center\" size=\"3\" width=\"100%\" \/>\n<p><a href=\"https:\/\/blogs.msdn.com\/b\/premier_developer\/archive\/2014\/09\/15\/welcome.aspx\"><strong>Premier Support for Developers<\/strong><\/a> provides strategic technology guidance, critical support coverage, and a range of essential services to help teams optimize development lifecycles and improve software quality.\u00a0 Contact your Application Development Manager (ADM) or <a href=\"https:\/\/blogs.msdn.microsoft.com\/premier_developer\/contact-us\/\">email us<\/a> to learn more about what we can do for you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Application Development Manager, Jason Venema, takes a plunge into Azure Data Lake, Microsoft\u2019s hyperscale repository for big data analytic workloads in the cloud.\u00a0 Data Lake makes it easy to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. I\u2019m not a data guy. Truth [&hellip;]<\/p>\n","protected":false},"author":582,"featured_media":37840,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[25],"tags":[24,155,3],"class_list":["post-7025","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure","tag-azure","tag-azure-data-lake","tag-team"],"acf":[],"blog_post_summary":"<p>Application Development Manager, Jason Venema, takes a plunge into Azure Data Lake, Microsoft\u2019s hyperscale repository for big data analytic workloads in the cloud.\u00a0 Data Lake makes it easy to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. I\u2019m not a data guy. 
Truth [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts\/7025","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/users\/582"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/comments?post=7025"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts\/7025\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/media\/37840"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/media?parent=7025"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/categories?post=7025"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/tags?post=7025"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}