Today, we announce the release of version 1.0 of .NET for Apache® Spark™, an open source package that brings .NET development to the Apache® Spark™ platform. This release is possible due to the combined efforts of Microsoft and the open source community. Version 1.0 includes support for .NET applications targeting .NET Standard 2.0 or later. Access to the Apache® Spark™ DataFrame APIs (versions 2.3, 2.4 and 3.0) and the ability to write Spark SQL and create user-defined functions (UDFs) are also included in the release.
The following code snippet is an example of using Spark to produce a word count from a document (browse the full sample here):
using Microsoft.Spark.Sql;

// Create (or reuse) the Spark session used by the rest of the snippet.
var spark = SparkSession.Builder().GetOrCreate();
var docs = spark.Read().Option("header", true).Csv("documents.csv");
var fileCol = Functions.Col("file");
var words = docs
    .Select(
        fileCol,
        // "a b c" => ["a", "b", "c"]
        Functions.Split(
            Functions.Col("words"), " ")
        .Alias("wordList"))
    // flatten into one row per word
    .Select(
        fileCol,
        // (file, ["a", "b", "c"]) => (file, "a"), (file, "b"), (file, "c")
        Functions.Explode(
            Functions.Col("wordList"))
        .Alias("word"))
    // count occurrences of each lower-cased word per file
    .GroupBy(fileCol, Functions.Lower(Functions.Col("word")))
    .Count();
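The word count above relies only on built-in column functions. Since the release also covers Spark SQL and user-defined functions, here is a minimal sketch of how the two might be combined. It assumes the same documents.csv input with "file" and "words" columns; the class name UdfSketch, the helper WordLength, the view name "docs" and the output column "chars" are illustrative names, not part of the original sample.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

public static class UdfSketch
{
    // Plain .NET logic (hypothetical helper); null-safe because Spark may pass null values.
    public static int WordLength(string text) => text?.Length ?? 0;

    public static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("udf-sketch").GetOrCreate();

        // Assumed input: the same CSV used by the word-count snippet.
        DataFrame docs = spark.Read().Option("header", true).Csv("documents.csv");

        // Wrap the .NET method as a column-level UDF for the DataFrame API.
        Func<Column, Column> wordLength = Udf<string, int>(WordLength);
        docs.Select(Col("file"), wordLength(Col("words")).Alias("chars")).Show();

        // Register the same method by name so a Spark SQL query can call it.
        spark.Udf().Register<string, int>("wordLength", WordLength);
        docs.CreateOrReplaceTempView("docs");
        spark.Sql("SELECT file, wordLength(words) AS chars FROM docs").Show();

        spark.Stop();
    }
}
```

Keeping the logic in an ordinary static method means it can be unit tested without a cluster, and the same method serves both the DataFrame call and the SQL query.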
Background
.NET for Apache® Spark™ launched two years ago to address increasing demand from the .NET community for an easier way to build big data applications. A recent survey confirmed the biggest motivation to use the package is to take advantage of existing .NET development skills and resources, including the enormous .NET ecosystem of existing libraries and frameworks. The team is committed to the continuous evolution of the product to integrate the latest features and keep the API current with the latest Spark versions. For more about the history of the project and key contributors, read the full announcement.
Get Started
There are several options to get started. First, read the full .NET for Apache Spark 1.0 announcement. Then you can:
- Browse our online .NET for Apache Spark documentation
- Take the tutorial: Get started with .NET for Apache Spark
- Submit jobs to run on Azure and analyze data in real-time notebooks using .NET for Apache Spark with Azure Synapse Analytics
- Visit and consider contributing to our open source repository
Meanwhile, our company moved from Apache Spark (Java) to Flink. Even courses (Pluralsight) compare Hadoop to 3G, Spark to 4G and Flink to 5G. Is there a plan for 5G in .NET? Is there a project trying to port Flink to .NET?
Hi Dusan, if you have good use cases where you prefer Flink, I would suggest filing a feature request at the Azure Synapse UserVoice. Once we see an increase in demand, we can look into it.
Is there support yet for Delta Lake? If not, is that on the roadmap?
Yes, it supports Delta Lake.
The acquisition process is still not easy.
There are a lot of moving parts that need to be stitched together before you can start coding.
@saint4eva: How are you planning on using it? We provide support for it out of the box in Azure HDInsight and Azure Synapse Spark pools. If you would like to see it in Databricks, I suggest reaching out to Databricks (if they see customer demand, I am sure they will consider it).