{"id":22840,"date":"2019-04-24T09:55:23","date_gmt":"2019-04-24T16:55:23","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=22840"},"modified":"2021-09-29T12:16:59","modified_gmt":"2021-09-29T19:16:59","slug":"introducing-net-for-apache-spark","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/introducing-net-for-apache-spark\/","title":{"rendered":"Introducing .NET for Apache\u00ae Spark\u2122 Preview"},"content":{"rendered":"<p><!-- Place this tag in your head or just before your close body tag. -->\n<script async defer src=\"https:\/\/buttons.github.io\/buttons.js\"><\/script><\/p>\n<p>Today at <a href=\"https:\/\/databricks.com\/sparkaisummit\/north-america\">Spark + AI Summit<\/a> we are excited to announce <a href=\"http:\/\/aka.ms\/dotnetsparkpreview\">.NET for Apache Spark<\/a>. Spark is a popular open source distributed processing engine for analytics over large data sets. Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc queries.<\/p>\n<p><a href=\"http:\/\/aka.ms\/dotnetsparkpreview\">.NET for Apache Spark<\/a> is aimed at making Apache\u00ae Spark\u2122 accessible to .NET developers across all Spark APIs. So far, Spark has been accessible through Scala, Java, Python, and R, but not .NET.<\/p>\n<p>We plan to develop .NET for Apache Spark in the open (as a <a tabindex=\"-1\" title=\"https:\/\/dotnetfoundation.org\" href=\"https:\/\/dotnetfoundation.org\" target=\"_blank\" rel=\"noreferrer noopener\">.NET Foundation<\/a> member project) along with the Spark and .NET community to ensure that 
developers get the best of both worlds.<\/p>\n<p><img decoding=\"async\" class=\"\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/04\/dotnetsparklogo-6.png\" alt=\".NET Spark Logo\" width=\"906\" height=\"334\" border=\"2\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 14pt;\"><a href=\"https:\/\/github.com\/dotnet\/spark\">https:\/\/github.com\/dotnet\/spark\u00a0<\/a><a class=\"github-button\" href=\"https:\/\/github.com\/dotnet\/spark\" aria-label=\"Star dotnet\/spark on GitHub\" data-size=\"large\" data-show-count=\"false\">Star<\/a><\/span><\/p>\n<p>The remainder of this post provides more specifics on the following topics:<\/p>\n<ul>\n<li><a href=\"#dotnetspark\">What is .NET for Apache Spark?<\/a><\/li>\n<li><a href=\"#getstarted\">Getting Started with .NET for Apache Spark<\/a><\/li>\n<li><a href=\"#performance\">.NET for Apache Spark performance<\/a><\/li>\n<li><a href=\"#whatnext\">What\u2019s next with .NET for Apache Spark<\/a><\/li>\n<li><a href=\"#wrapup\">Wrap Up<\/a><\/li>\n<\/ul>\n<h3><a id=\"dotnetspark\"><\/a>What is .NET for Apache Spark?<\/h3>\n<p><a href=\"http:\/\/aka.ms\/dotnetsparkpreview\">.NET for Apache Spark<\/a> provides high-performance APIs for using Spark from C# and F#. With these .NET APIs, you can access all aspects of Apache Spark, including Spark SQL, DataFrames, Streaming, MLlib, and more. .NET for Apache Spark lets you reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.<\/p>\n<p>The C#\/F# language bindings to Spark will be written on a new Spark interop layer, which offers easier extensibility. This new interop layer was written with best practices for language extension in mind and is optimized for interop and performance. 
Long term, this extensibility can be used to add support for other languages in Spark.<\/p>\n<p>You can learn more about this <a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-26257\">work through this proposal<\/a>.<\/p>\n<p><img decoding=\"async\" class=\"\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/04\/dotnetsparkarchitecture-1.png\" alt=\".NET Spark Architecture\" width=\"377\" height=\"316\" \/><\/p>\n<p>.NET for Apache Spark is compliant with .NET Standard 2.0 and can be used on Linux, macOS, and Windows, just like the rest of .NET. .NET for Apache Spark is available by default in Azure HDInsight, and can be installed in Azure Databricks and more.<\/p>\n<h3><a id=\"getstarted\"><\/a>Getting Started with .NET for Apache Spark<\/h3>\n<div class=\"highlight highlight-source-cs\">\n<p>Before you can get started with .NET for Apache Spark, you need to install a few things. Follow <a href=\"https:\/\/github.com\/dotnet\/spark#get-started\">these steps<\/a> to get started with <a href=\"http:\/\/aka.ms\/dotnetsparkpreview\">.NET for Apache Spark<\/a>.<\/p>\n<p>Once set up, you can start programming Spark applications in .NET in three easy steps.<\/p>\n<p>In our first .NET Spark application, we will write a basic Spark pipeline that counts the occurrences of each word in a text segment.<\/p>\n<pre class=\"\">\/\/ Namespaces assumed from the Microsoft.Spark package\r\nusing Microsoft.Spark.Sql;\r\nusing static Microsoft.Spark.Sql.Functions; \/\/ Split, Explode\r\n\r\n\/\/ 1. Create a Spark session\r\nvar spark = SparkSession\r\n    .Builder()\r\n    .AppName(\"word_count_sample\")\r\n    .GetOrCreate();\r\n\r\n\/\/ 2. Create a DataFrame\r\nDataFrame dataFrame = spark.Read().Text(\"input.txt\");\r\n\r\n\/\/ 3. 
Manipulate and view data\r\nvar words = dataFrame.Select(Split(dataFrame[\"value\"], \" \").Alias(\"words\"));\r\n\r\nwords.Select(Explode(words[\"words\"])\r\n    .Alias(\"word\"))\r\n    .GroupBy(\"word\")\r\n    .Count()\r\n    .Show();\r\n<\/pre>\n<h3><a id=\"performance\"><\/a>.NET for Apache Spark performance<\/h3>\n<p>We are pleased to say that the first preview version of .NET for Apache Spark performs well on the popular <a href=\"http:\/\/www.tpc.org\/tpch\/\">TPC-H benchmark<\/a>. The TPC-H benchmark consists of a suite of business-oriented queries. The chart below illustrates the performance of .NET Core versus Python and Scala on the TPC-H query set.<\/p>\n<p><img decoding=\"async\" class=\"\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/04\/performance.png\" alt=\".NET Spark Performance\" width=\"820\" height=\"615\" \/><\/p>\n<p>The chart above shows the per-query performance of .NET for Apache Spark versus Python and Scala. .NET for Apache Spark performs well against Python and Scala. Furthermore, in cases where UDF performance is critical, such as query 1, where 3B rows of non-string data are passed between the JVM and the CLR, .NET for Apache Spark is 2x faster than Python.<\/p>\n<p>It\u2019s also important to call out that this is our first preview of .NET for Apache Spark and we aim to further invest in improving and benchmarking performance (e.g. Arrow optimizations). You can follow our instructions to benchmark this on our GitHub repo.<\/p>\n<h3><a id=\"whatnext\"><\/a>What\u2019s next with .NET for Apache Spark<\/h3>\n<p>Today marks the first step in our journey. The following are some features on our near-term roadmap. 
Please follow the <a href=\"https:\/\/github.com\/dotnet\/spark\/blob\/master\/ROADMAP.md\">full roadmap<\/a> on our <a href=\"https:\/\/github.com\/dotnet\/spark\">GitHub repo<\/a>.<\/p>\n<ul>\n<li>Simplified getting-started experience, documentation, and samples<\/li>\n<li>Native integration with developer tools such as Visual Studio, Visual Studio Code, and Jupyter notebooks<\/li>\n<li>.NET support for user-defined aggregate functions<\/li>\n<li>.NET-idiomatic APIs for C# and F# (e.g., using LINQ for writing queries)<\/li>\n<li>Out-of-the-box support for Azure Databricks, Kubernetes, etc.<\/li>\n<li>Make .NET for Apache Spark part of Spark Core. <a style=\"background-color: #f7f7f9; font-size: 1rem;\" href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-27006\">You can follow this progress here.<\/a><\/li>\n<\/ul>\n<p>See something missing on this list? Please drop us a comment below.<\/p>\n<h3><a id=\"wrapup\"><\/a>Wrap Up<\/h3>\n<p>.NET for Apache Spark is our first step in making .NET a great tech stack for building Big Data applications.<\/p>\n<p>We need your help to shape the future of .NET for Apache Spark, and we look forward to seeing what you build with it. You can reach out to us through our GitHub repo.<\/p>\n<p><span style=\"font-size: 14pt;\"><a href=\"https:\/\/github.com\/dotnet\/spark\">https:\/\/github.com\/dotnet\/spark\u00a0<\/a><a class=\"github-button\" href=\"https:\/\/github.com\/dotnet\/spark\" aria-label=\"Star dotnet\/spark on GitHub\" data-size=\"large\" data-show-count=\"true\">Star<\/a><\/span><\/p>\n<p><em>This blog is authored by Rahul Potharaju, Ankit Asthana, Tyson Condie, Terry Kim, Dan Moseley, Michael Rys and the rest of the .NET for Apache Spark team.\u00a0<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Today at Spark + AI summit we are excited to announce .NET for Apache Spark. 
Spark is a popular open source distributed processing engine for analytics over large data sets.\u00a0Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc query. .NET for Apache Spark is aimed at making Apache\u00ae Spark\u2122 [&hellip;]<\/p>\n","protected":false},"author":3194,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685],"tags":[4,2855,2854,93,2853],"class_list":["post-22840","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","tag-net","tag-analytics","tag-big-data","tag-machine-learning","tag-spark"],"acf":[],"blog_post_summary":"<p>Today at Spark + AI summit we are excited to announce .NET for Apache Spark. Spark is a popular open source distributed processing engine for analytics over large data sets.\u00a0Spark can be used for processing batches of data, real-time streams, machine learning, and ad-hoc query. 
.NET for Apache Spark is aimed at making Apache\u00ae Spark\u2122 [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/22840","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/3194"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=22840"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/22840\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=22840"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=22840"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=22840"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}