Announcing Version 1.0 of .NET for Apache Spark

Jeremy Likness

Principal Program Manager - .NET AI experience

Today, we announce the release of version 1.0 of .NET for Apache® Spark™, an open source package that brings .NET development to the Apache® Spark™ platform. This release is possible due to the combined efforts of Microsoft and the open source community. Version 1.0 includes support for .NET applications targeting .NET Standard 2.0 or later. Access to the Apache® Spark™ DataFrame APIs (versions 2.3, 2.4 and 3.0) and the ability to write Spark SQL and create user-defined functions (UDFs) are also included in the release.

The following code snippet is an example of using Spark to produce a word count from a document (browse the full sample here):

var docs = spark.Read().Option("header", true).Csv("documents.csv");
var fileCol = Functions.Col("file");
var words = docs
    .Select(
        fileCol,
        // "a b c" => ["a", "b", "c"]
        Functions.Split(
            Functions.Col("words"), " ")
        .Alias("wordList"))
    // flatten into one row per word
    .Select(
        fileCol,
        // 1: ["a", "b", "c"] => 1: "a", 2: "b", 3: "c"
        Functions.Explode(
            Functions.Col("wordList"))
        .Alias("word"))
    .GroupBy(fileCol, Functions.Lower(Functions.Col("word")))
    .Count();
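
The release also covers Spark SQL and user-defined functions, which the word count above does not exercise. The following is a minimal sketch rather than part of the official sample: it assumes the same `documents.csv` with `file` and `words` columns, and the `NormalizeWord` helper, the `normalize_word` UDF name, and the `words` view name are hypothetical names introduced only for illustration.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

public static class WordUdfSketch
{
    // Hypothetical helper: trims whitespace and common punctuation, then lowercases.
    // Returns null for null input so it is safe on missing values.
    public static string NormalizeWord(string word) =>
        word?.Trim().Trim(',', '.', ';', ':', '!', '?').ToLowerInvariant();

    public static void Main()
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("word-udf-sketch")
            .GetOrCreate();

        // Assumed input: a CSV with "file" and "words" columns, as in the snippet above.
        DataFrame docs = spark.Read().Option("header", true).Csv("documents.csv");

        // One row per word.
        DataFrame words = docs.Select(
            Col("file"),
            Explode(Split(Col("words"), " ")).Alias("word"));

        // Wrap the helper as a column-level UDF for use with the DataFrame API.
        Func<Column, Column> normalize = Udf<string, string>(NormalizeWord);
        words.Select(Col("file"), normalize(Col("word")).Alias("word")).Show();

        // Register the same helper by name so that Spark SQL can call it.
        spark.Udf().Register<string, string>("normalize_word", NormalizeWord);
        words.CreateOrReplaceTempView("words");
        spark.Sql(
            "SELECT file, normalize_word(word) AS word, COUNT(*) AS n " +
            "FROM words GROUP BY file, normalize_word(word)").Show();

        spark.Stop();
    }
}
```

Keeping the normalization logic in an ordinary static method means it can be unit tested without a Spark cluster and then reused from both the DataFrame API and Spark SQL.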

Background

.NET for Apache® Spark™ launched two years ago to address increasing demand from the .NET community for an easier way to build big data applications. A recent survey confirmed the biggest motivation to use the package is to take advantage of existing .NET development skills and resources, including the enormous .NET ecosystem of existing libraries and frameworks. The team is committed to the continuous evolution of the product to integrate the latest features and keep the API current with the latest Spark versions. For more about the history of the project and key contributors, read the full announcement.

Get Started

There are several options to get started. First, read the full .NET for Apache Spark 1.0 announcement. Then you can:

- Browse our online .NET for Apache Spark documentation
- Take the tutorial: Get started with .NET for Apache Spark
- Submit jobs to run on Azure and analyze data in real-time notebooks using .NET for Apache Spark with Azure Synapse Analytics
- Visit and consider contributing to our open source repository

Author

Jeremy is a Principal Product Manager at Microsoft, responsible for the AI experience in .NET. He's also managed minimal APIs, ASP.NET's authentication/authorization capabilities and .NET data products including Entity Framework.

6 comments

Dusan F November 1, 2020

Meanwhile, our company moved from Apache Spark (Java) to Flink. Even courses (Pluralsight) are comparing Hadoop to 3G, Spark to 4G and Flink to 5G. Is there a plan for 5G in .NET? Is there some project trying to port Flink stuff to .NET?
- Michael Rys December 2, 2020
  
  Hi Dusan, if you have good use cases where you prefer Flink, I would suggest filing a feature request at the Azure Synapse UserVoice. Once we see an increase in demand, we can look into it.

kirts October 28, 2020

Is there support yet for Delta Lake? If not, is that on the roadmap?
- Jeremy Likness Author October 29, 2020
  
  Yes, it supports Delta Lake.

saint4eva October 28, 2020

The acquisition process is still not easy.

A lot of moving parts that need to be stitched together before the proper coding.
- Michael Rys December 2, 2020
  
  @saint4eva: How are you planning on using it? We provide support for it out of the box in Azure HDInsight and Azure Synapse Spark pools. If you would like to see it in Databricks, I suggest reaching out to Databricks (if they see customer demand, I am sure they will consider it).