{"id":25758,"date":"2019-12-16T09:00:04","date_gmt":"2019-12-16T16:00:04","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=25758"},"modified":"2020-01-23T12:22:36","modified_gmt":"2020-01-23T19:22:36","slug":"an-introduction-to-dataframe","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/an-introduction-to-dataframe\/","title":{"rendered":"An Introduction to DataFrame"},"content":{"rendered":"<p>Last month, we announced .NET support for Jupyter notebooks, and showed how to use them to work with .NET for Apache Spark and ML.NET. Today, we&#8217;re announcing the preview of a <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.Data.Analysis\/\" rel=\"noopener noreferrer\" target=\"_blank\">DataFrame<\/a> type for .NET to make data exploration easy. If you&#8217;ve used Python to manipulate data in notebooks, you&#8217;ll already be familiar with the concept of a DataFrame. At a high level, it is an in-memory representation of structured data. In this blog post, I&#8217;m going to give an overview of this new type and how you can use it from Jupyter notebooks. To play along, fire up a .NET Jupyter Notebook in a <a href=\"https:\/\/mybinder.org\/v2\/gh\/dotnet\/interactive\/master?urlpath=lab\" rel=\"noopener noreferrer\" target=\"_blank\">browser<\/a>.<\/p>\n<h2>How to use DataFrame?<\/h2>\n<p><code>DataFrame<\/code> stores data as a collection of columns. Let&#8217;s populate a <code>DataFrame<\/code> with some sample data and go over the major features. The full sample can be found on Github(<a href=\"https:\/\/github.com\/dotnet\/interactive\/blob\/467828bae501022734fc5d0d077bef8d36bb0bf8\/NotebookExamples\/csharp\/Samples\/DataFrame-Getting%20Started.ipynb\" rel=\"noopener noreferrer\" target=\"_blank\">C#<\/a> and <a href=\"https:\/\/github.com\/dotnet\/interactive\/blob\/467828bae501022734fc5d0d077bef8d36bb0bf8\/NotebookExamples\/fsharp\/Samples\/DataFrame-Getting%20Started.ipynb\" rel=\"noopener noreferrer\" target=\"_blank\">F#<\/a>). To follow along in your browser, click <a href=\"https:\/\/mybinder.org\/v2\/gh\/dotnet\/interactive\/master?urlpath=lab\" rel=\"noopener noreferrer\" target=\"_blank\">here<\/a> and navigate to <em>csharp\/Samples\/DataFrame-Getting Started.ipynb<\/em>(or <em>fsharp\/Samples\/DataFrame-Getting Started.ipynb<\/em>). To get started, let&#8217;s import the <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.Data.Analysis\/\" rel=\"noopener noreferrer\" target=\"_blank\">Microsoft.Data.Analysis<\/a> package and namespace into our .NET Jupyter Notebook (make sure you&#8217;re using the C# or F# kernel):<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/Microsoft.Data_.Analysis.png\" alt=\"Microsoft.Data.Analysis package\" \/><\/p>\n<p>Let&#8217;s make three columns to hold values of types <code>DateTime<\/code>, <code>int<\/code> and <code>string<\/code>.<\/p>\n<pre><code class=\"csharp\">PrimitiveDataFrameColumn&lt;DateTime&gt; dateTimes = new PrimitiveDataFrameColumn&lt;DateTime&gt;(\"DateTimes\"); \/\/ Default length is 0.\nPrimitiveDataFrameColumn&lt;int&gt; ints = new PrimitiveDataFrameColumn&lt;int&gt;(\"Ints\", 3); \/\/ Makes a column of length 3. Filled with nulls initially\nStringDataFrameColumn strings = new StringDataFrameColumn(\"Strings\", 3); \/\/ Makes a column of length 3. Filled with nulls initially\n<\/code><\/pre>\n<p><code>PrimitiveDataFrameColumn<\/code> is a generic column that can hold primitive types such as <code>int<\/code>, <code>float<\/code>, <code>decimal<\/code> etc. A <code>StringDataFrameColumn<\/code> is a specialized column that holds <code>string<\/code> values. Both the column types can take a <code>length<\/code> parameter in their contructors and are filled with <code>null<\/code> values initially. Before we can add these columns to a <code>DataFrame<\/code> though, we need to append three values to our <code>dateTimes<\/code> column. This is because the <code>DataFrame<\/code> constructor expects all its columns to have the same length.<\/p>\n<pre><code class=\"csharp\">\/\/ Append 3 values to dateTimes\ndateTimes.Append(DateTime.Parse(\"2019\/01\/01\"));\ndateTimes.Append(DateTime.Parse(\"2019\/01\/01\"));\ndateTimes.Append(DateTime.Parse(\"2019\/01\/02\"));\n<\/code><\/pre>\n<p>Now we&#8217;re ready to create a <code>DataFrame<\/code> with three columns.<\/p>\n<pre><code class=\"csharp\">DataFrame df = new DataFrame(dateTimes, ints, strings); \/\/ This will throw if the columns are of different lengths\n<\/code><\/pre>\n<p>One of the benefits of using a notebook for data exploration is the interactive REPL. We can enter <code>df<\/code> into a new cell and run it to see what data it contains. For the rest of this post, we&#8217;ll work in a .NET Jupyter environment. All the sample code will work in a regular console app as well though.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/ArrayPrint.png\" alt=\"Array Print\" \/><\/p>\n<p>We immediately see that the formatting of the output can be improved. Each column is printed as an array of values and we don&#8217;t see the names of the columns. If <code>df<\/code> had more rows and columns, the output would be hard to read. Fortunately, in a Jupyter environment, we can write custom formatters for types. Let&#8217;s write a formatter for <code>DataFrame<\/code>.<\/p>\n<pre><code class=\"csharp\">using Microsoft.AspNetCore.Html;\nFormatter&lt;DataFrame&gt;.Register((df, writer) =&gt;\n{\n    var headers = new List&lt;IHtmlContent&gt;();\n    headers.Add(th(i(\"index\")));\n    headers.AddRange(df.Columns.Select(c =&gt; (IHtmlContent) th(c.Name)));\n    var rows = new List&lt;List&lt;IHtmlContent&gt;&gt;();\n    var take = 20;\n    for (var i = 0; i &lt; Math.Min(take, df.Rows.Count); i++)\n    {\n        var cells = new List&lt;IHtmlContent&gt;();\n        cells.Add(td(i));\n        foreach (var obj in df.Rows[i])\n        {\n            cells.Add(td(obj));\n        }\n        rows.Add(cells);\n    }\n\n    var t = table(\n        thead(\n            headers),\n        tbody(\n            rows.Select(\n                r =&gt; tr(r))));\n\n    writer.Write(t);\n}, \"text\/html\");\n<\/code><\/pre>\n<p>This snippet of code register a new <code>DataFrame<\/code> formatter. All subsequent evaluations of <code>df<\/code> in a notebook will now output the first 20 rows of a <code>DataFrame<\/code> along with the column names. In the future, the <code>DataFrame<\/code> type and other libraries that target Jupyter as one of their environments will be able to ship with their formatters.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/PrintDataFrame.gif\" alt=\"Print DataFrame\" \/><\/p>\n<p>Sure enough, when we re-evaluate <code>df<\/code>, we see that it contains the three columns we created previously. The formatting makes it much easier to inspect our values. There&#8217;s also a helpful <code>index<\/code> column in the output to quickly see which row we&#8217;re looking at. Let&#8217;s modify our data by indexing into <code>df<\/code>:<\/p>\n<pre><code class=\"csharp\">df[0, 1] = 10; \/\/ 0 is the rowIndex, and 1 is the columnIndex. This sets the 0th value in the Ints columns to 10\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/DataFrameIndexing.png\" alt=\"DataFrameIndexing\" \/><\/p>\n<p>We can also modify the values in the columns through indexers defined on <code>PrimitiveDataFrameColumn<\/code> and <code>StringDataFrameColumn<\/code>:<\/p>\n<pre><code class=\"csharp\">\/\/ Modify ints and strings columns by indexing\nints[1] = 100;\nstrings[1] = \"Foo!\";\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/ColumnIndexers.png\" alt=\"ColumnIndexers\" \/><\/p>\n<p>One caveat to keep in mind here is the data type of the value passed in to the indexers. We passed in the right data types to the column indexers in our sample: an integer value of <code>100<\/code> to <code>ints[1]<\/code> and a string <code>\"Foo!\"<\/code> to <code>string[1]<\/code>. If the data types don&#8217;t match, an exception will be thrown. For cases where the type of data in the columns is not obvious, there is a handy <code>DataType<\/code> property defined on each column. The <code>Info<\/code> method displays the <code>DataType<\/code> and <code>Length<\/code> properties of each column:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/DataType.png\" alt=\"Info\" \/><\/p>\n<p>The <code>DataFrame<\/code> and <code>DataFrameColumn<\/code> classes expose a number of useful APIs: binary operations, computations, joins, merges, handling missing values and more. Let&#8217;s look at some of them:<\/p>\n<pre><code class=\"csharp\">\/\/ Add 5 to Ints through the DataFrame\ndf[\"Ints\"].Add(5, inPlace: true);\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/Add.png\" alt=\"Add\" \/><\/p>\n<pre><code class=\"csharp\">\/\/ We can also use binary operators. Binary operators produce a copy, so assign it back to our Ints column \ndf[\"Ints\"] = (ints \/ 5) * 100;\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/BinaryOperations.png\" alt=\"BinaryOperations\" \/><\/p>\n<p>All binary operators are backed by functions that produces a copy by default. The <code>+<\/code> operator, for example, calls the <code>Add<\/code> method and passes in <code>false<\/code> for the <code>inPlace<\/code> parameter. This lets us elegantly manipulate data using operators without worrying about modifying our existing values. For when in place semantics are desired, we can set the <code>inPlace<\/code> parameter to <code>true<\/code> in the binary functions.<\/p>\n<p>In our sample, <code>df<\/code> has <code>null<\/code> values in its columns. <code>DataFrame<\/code> and <code>DataFrameColumn<\/code> offer an API to fill <code>nulls<\/code> with values.<\/p>\n<pre><code class=\"csharp\">df[\"Ints\"].FillNulls(-1, inPlace: true);\ndf[\"Strings\"].FillNulls(\"Bar\", inPlace: true);\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/FillNulls.png\" alt=\"Fill Nulls\" \/><\/p>\n<p><code>DataFrame<\/code> exposes a <code>Columns<\/code> property that we can enumerate over to access our columns and a <code>Rows<\/code> property to access our rows. We can index <code>Rows<\/code> to access each row. Here&#8217;s an example that accesses the first row:<\/p>\n<pre><code class=\"csharp\">DataFrameRow row0 = df.Rows[0];\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/BadRowAccess.png\" alt=\"Access Rows\" \/><\/p>\n<p>To inspect our values better, let&#8217;s write a formatter for <code>DataFrameRow<\/code> that displays values in a single line.<\/p>\n<pre><code class=\"csharp\">using Microsoft.AspNetCore.Html;\nFormatter&lt;DataFrameRow&gt;.Register((dataFrameRow, writer) =&gt;\n{\n    var cells = new List&lt;IHtmlContent&gt;();\n    cells.Add(td(i));\n    foreach (var obj in dataFrameRow)\n    {\n        cells.Add(td(obj));\n    }\n\n    var t = table(\n        tbody(\n            cells));\n\n    writer.Write(t);\n}, \"text\/html\");\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/RowAccess.gif\" alt=\"Access Rows\" \/><\/p>\n<p>To enumerate over all the rows in a <code>DataFrame<\/code>, we can write a simple for loop. <code>DataFrame.Rows.Count<\/code> returns the number of rows in a <code>DataFrame<\/code> and we can use the loop index to access each row.<\/p>\n<pre><code class=\"csharp\">for (long i = 0; i &lt; df.Rows.Count; i++)\n{\n       DataFrameRow row = df.Rows[i];\n}\n<\/code><\/pre>\n<p>Note that each row is a view of the values in the <code>DataFrame<\/code>. Modifying the values in the <code>row<\/code> object modifies the values in the <code>DataFrame<\/code>. We do however lose type information on the returned <code>row<\/code> object. This is a consequence of <code>DataFrame<\/code> being a loosely typed data structure.<\/p>\n<p>Let&#8217;s wrap up our <code>DataFrame<\/code> API tour by looking at the <code>Filter<\/code>, <code>Sort<\/code>, <code>GroupBy<\/code> methods:<\/p>\n<pre><code class=\"csharp\">\/\/ Filter rows based on equality\nPrimitiveDataFrameColumn&lt;bool&gt; boolFilter = df[\"Strings\"].ElementwiseEquals(\"Bar\");\nDataFrame filtered = df.Filter(boolFilter);\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/DataFrameFilter.png\" alt=\"DataFrame Filter\" \/><\/p>\n<p><code>ElementwiseEquals<\/code> returns a <code>PrimitiveDataFrameColumn&lt;bool&gt;<\/code> filled with a <code>true<\/code> for every row that equals <code>\"Bar\"<\/code> in the <code>Strings<\/code> column, and a <code>false<\/code> when it doesn&#8217;t equal <code>\"Bar\"<\/code>. In the <code>df.Filter<\/code> call, each row corresponding to a <code>true<\/code> value in <code>boolFilter<\/code> selects a row out of <code>df<\/code>. The resulting <code>DataFrame<\/code> contains only these rows.<\/p>\n<pre><code class=\"csharp\">\/\/ Sort our dataframe using the Ints column\nDataFrame sorted = df.Sort(\"Ints\");\n\/\/ GroupBy \nGroupBy groupBy = df.GroupBy(\"DateTimes\");\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/SortAndGroupBy.png\" alt=\"Sort And GroupBy\" \/><\/p>\n<p>The <code>GroupBy<\/code> method takes in the name of a column and creates groups based on unique values in the column. In our sample, the <code>DateTimes<\/code> column has two unique values, so we expect one group to be created for <code>2019-01-01 00:00:00Z<\/code> and one for <code>2019-01-02 00:00:00Z<\/code>.<\/p>\n<pre><code class=\"csharp\">\/\/ Count of values in each group\nDataFrame groupCounts = groupBy.Count();\n\/\/ Alternatively find the sum of the values in each group in Ints\nDataFrame intGroupSum = groupBy.Sum(\"Ints\");\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/GroupBySum.png\" alt=\"GroupBy Sum\" \/><\/p>\n<p>The <code>GroupBy<\/code> object exposes a set of methods that can called on each group. Some examples are <code>Max()<\/code>, <code>Min()<\/code>, <code>Count()<\/code> etc. The <code>Count()<\/code> method counts the number of values in each group and return them in a new <code>DataFrame<\/code>. The <code>Sum(\"Ints\")<\/code> method sums up the values in each group.<\/p>\n<p>Finally, when we want to work with existing datasets, <code>DataFrame<\/code> exposes a <code>LoadCsv<\/code> method.<\/p>\n<pre><code class=\"csharp\">DataFrame csvDataFrame = DataFrame.LoadCsv(\"path\/to\/file.csv\");\n<\/code><\/pre>\n<h2>Charting<\/h2>\n<p>Another cool feature of using a <code>DataFrame<\/code> in a .NET Jupyter environment is charting. <a href=\"https:\/\/fslab.org\/XPlot\/\">XPlot.Plotly<\/a> is one option to render charts. We can import the <code>XPlot.Plotly<\/code> namespace into our notebook and create interactive visualizations of the data in our <code>DataFrame<\/code>. Let&#8217;s populate a <code>PrimitiveDataFrameColumn&lt;double&gt;<\/code> with a normal distribution and plot a histogram of the samples:<\/p>\n<pre><code class=\"csharp\">#r \"nuget:MathNet.Numerics,4.9.0\"\nusing XPlot.Plotly;\nusing System.Linq;\nusing MathNet.Numerics.Distributions;\n\ndouble mean = 0;\ndouble stdDev = 0.1;\nMathNet.Numerics.Distributions.Normal normalDist = new Normal(mean, stdDev);\n\nPrimitiveDataFrameColumn&lt;double&gt; doubles = new PrimitiveDataFrameColumn&lt;double&gt;(\"Normal Distribution\", normalDist.Samples().Take(1000));\ndisplay(Chart.Plot(\n    new Graph.Histogram()\n    {\n        x = doubles,\n        nbinsx = 30\n    }\n));\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2019\/12\/Chart.gif\" alt=\"Chart\" \/><\/p>\n<p>We first create a <code>PrimitiveDataFrameColumn&lt;double&gt;<\/code> by drawing 1000 samples from a normal distribution and then plot a histogram with 30 bins. The resulting chart is interactive! Hovering over the chart reveals the underlying data and lets us inspect each value precisely.<\/p>\n<h2>Summary<\/h2>\n<p>We&#8217;ve only explored a subset of the features that <code>DataFrame<\/code> exposes. <code>Append<\/code>, <code>Join<\/code>, <code>Merge<\/code>, and <code>Aggregations<\/code> are supported. Each column also implements <code>IEnumerable&lt;T?&gt;<\/code>, so users can write LINQ queries on columns. The custom <code>DataFrame<\/code> formatting code we wrote has a simple example. The complete source code(and documentation) for <code>Microsoft.Data.Analysis<\/code> <a href=\"https:\/\/github.com\/dotnet\/corefxlab\/tree\/master\/src\/Microsoft.Data.Analysis\">lives on GitHub<\/a>. In a follow up post, I&#8217;ll go over how to use <code>DataFrame<\/code> with ML.NET and .NET for Spark. The decision to use column major backing stores (the Arrow format in particular) allows for zero-copy in .NET for Spark User Defined Functions (UDFs)!<\/p>\n<p>We always welcome the community&#8217;s feedback! In fact, please feel free to contribute to the <a href=\"https:\/\/github.com\/dotnet\/corefxlab\/tree\/master\/src\/Microsoft.Data.Analysis\">source code<\/a>. We&#8217;ve made it easy for users to create new column types that derive from <code>DataFrameColumn<\/code> to add new functionality. Support for structs such as <code>DateTime<\/code> and user defined structs is also not as complete as primitive types such as <code>int<\/code>, <code>float<\/code> etc. We believe this preview package allows the community to do data analysis in .NET. Try out <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.Data.Analysis\/\">DataFrame<\/a> in a <a href=\"https:\/\/mybinder.org\/v2\/gh\/dotnet\/interactive\/master?urlpath=lab\">.NET Jupyter Notebook<\/a> and let us know what you think!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last month, we announced .NET support for Jupyter notebooks, and showed how to use them to work with .NET for Apache Spark and ML.NET. Today, we&#8217;re announcing the preview of a DataFrame type for .NET to make data exploration easy. If you&#8217;ve used Python to manipulate data in notebooks, you&#8217;ll already be familiar with the [&hellip;]<\/p>\n","protected":false},"author":11778,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,196,756,636,688,691],"tags":[2855,30,43,437,107,117],"class_list":["post-25758","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-dotnet-core","category-csharp","category-fsharp","category-machine-learning","category-ml-dotnet","tag-analytics","tag-announcement","tag-bcl","tag-data","tag-open-source","tag-releases"],"acf":[],"blog_post_summary":"<p>Last month, we announced .NET support for Jupyter notebooks, and showed how to use them to work with .NET for Apache Spark and ML.NET. Today, we&#8217;re announcing the preview of a DataFrame type for .NET to make data exploration easy. If you&#8217;ve used Python to manipulate data in notebooks, you&#8217;ll already be familiar with the [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/25758","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/11778"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=25758"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/25758\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=25758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=25758"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=25758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}