Scale-out computing on DevLabs

Today we're launching several new Technical Computing (TC) projects on DevLabs. These projects give you a chance to learn about some of the technologies being developed as part of the Technical Computing initiative, to gain early access to code, and to provide feedback on these innovative TC projects.

Last May, I blogged about the Technical Computing initiative at Microsoft, an initiative that's leading to technologies that will empower the world's most important problem solvers to make the best use of computing resources. These domain specialists often either develop code themselves as a necessary part of their work or rely on other developers to build the software that makes their work possible. The TC initiative gives those developers and domain specialists ground-breaking developer tools and infrastructure to do their best work.

The TC initiative has made some important first steps since its inception. Visual Studio 2010 includes built-in support for developing, debugging, and tuning multi-core and manycore applications and has seen impressive adoption within a wide variety of industries and domains. In November, we announced Service Pack 1 for HPC Server 2008 R2, which integrates Windows Azure compute cycles, allowing massively parallel applications to easily scale from the cluster to the cloud. And this is just the beginning. The teams involved in the TC initiative are working hard on impressive new solutions to bring all that modern and future computing has to offer to developers, domain specialists, and IT professionals alike.

Today's new TC projects take the next steps in this journey.

TPL Dataflow - Enabling parallel and concurrent .NET applications

.NET 4 saw the introduction of the Task Parallel Library (TPL), parallel loops, concurrent data structures, Parallel LINQ (PLINQ), and more, all of which were collectively referred to as Parallel Extensions to the .NET Framework. TPL Dataflow is a new member of that family, layering on top of tasks, concurrent collections, and more to enable the development of powerful and efficient .NET-based concurrent systems built on dataflow concepts. The technology relies on in-process message passing and asynchronous pipelines and is heavily inspired by the Visual C++ 2010 Asynchronous Agents Library and the DevLabs Axum language. TPL Dataflow provides solutions for buffering and processing data, for building systems that need high-throughput and low-latency processing, and for building agent/actor-based systems. TPL Dataflow was also designed to integrate smoothly with the new asynchronous language functionality in C# and Visual Basic that I previously blogged about.

Below, you can see an example of an "agent" using dataflow blocks in C# to safely, asynchronously, and efficiently process incoming requests.
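As a rough illustration of that pattern, here is a minimal sketch built on the ActionBlock<T> primitive from the System.Threading.Tasks.Dataflow namespace; the LogAgent type, its members, and the console output are illustrative assumptions rather than part of the library.

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Illustrative "agent": requests posted to it are buffered and then
// processed asynchronously by an ActionBlock.
public class LogAgent
{
    private readonly ActionBlock<string> _requests;

    public LogAgent()
    {
        // By default an ActionBlock processes one message at a time,
        // so the delegate needs no locking around shared state.
        _requests = new ActionBlock<string>(request =>
            Console.WriteLine("Processing {0}", request));
    }

    // Callers hand off work without blocking; the block buffers it.
    public void Submit(string request)
    {
        _requests.Post(request);
    }

    // Signal that no more requests are coming and wait for the
    // buffered ones to drain.
    public Task ShutdownAsync()
    {
        _requests.Complete();
        return _requests.Completion;
    }
}

Multiple threads can call Submit concurrently; because the block processes one message at a time by default, the agent gets its safety without any explicit locks.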

Dryad - Supporting data-intensive computing applications

Pioneered in Microsoft Research, Dryad, DSC, and DryadLINQ are a set of technologies that support data-intensive computing applications on Windows HPC Server 2008 R2 Service Pack 1. These technologies enable efficient processing of large volumes of data in many types of applications, including data-mining applications, image and stream processing, and various kinds of intense scientific computations. Dryad and DSC run on the cluster to support data-intensive computing and manage data that is partitioned across the cluster, while DryadLINQ allows developers to build data- and compute-intensive .NET applications using the familiar LINQ programming model.

Here you can see the code for loading textual log data using Dryad. That data is joined and processed on the cluster, and the results are then streamed back to the client for display.

public static IEnumerable<string> GeoIp(string logStream, string geoStream)
{
    DistributedData<string> logLinesTable = DistributedData.OpenAsText(logStream);
    DistributedData<string> geoIpTable = DistributedData.OpenAsText(geoStream);

    // Join the two tables on the common key (IP address)
    IEnumerable<string> joined = logLinesTable.Join(geoIpTable,
        l1 => l1.Split(' ').First(),
        l2 => l2.Split(' ').First(),
        (l1, l2) => l2).AsEnumerable();

    return joined;
}

public static void Main()
{
    // Load log and geo data into DSC
    Console.WriteLine("Loading data");
    File.ReadLines("log.txt").AsDistributed().ExecuteAsText("hpcdsc://localhost/Samples/log");
    File.ReadLines("geo.txt").AsDistributed().ExecuteAsText("hpcdsc://localhost/Samples/geo");

    // Run the query
    Console.WriteLine("Running query");
    IEnumerable<string> results =
        GeoIp("hpcdsc://localhost/Samples/log", "hpcdsc://localhost/Samples/geo");

    // Print out the results
    Console.WriteLine("Displaying results");
    foreach (var entry in results) Console.WriteLine(entry);
}

Sho - Putting the power of data analysis and flexible prototyping in your hands

Also begun in Microsoft Research, Sho gives those working on technical computing workloads an interactive environment for data analysis and scientific computing. It lets you seamlessly connect scripts written in IronPython with .NET libraries, enabling fast and flexible prototyping. The environment includes powerful and efficient libraries for linear algebra and data visualization, both of which can be used from any .NET language, as well as a feature-rich interactive shell for rapid development. Sho comes with packages for large-scale parallel computing (via Windows HPC Server and Windows Azure), statistics, and optimization, as well as an extensible package mechanism that makes it easy for you to create and share your own packages.

As you can see in the screenshot below, Sho provides an interactive REPL (read-eval-print loop) that lets you write code and immediately see results both textually and graphically.

Try Them Out

Our goal moving forward is to bring additional Technical Computing projects to DevLabs in their pre-beta states so that we can get your early feedback and insight and help drive these technologies in the right direction. We look forward to hearing from you.

Namaste!