{"id":2878,"date":"2017-04-12T20:03:50","date_gmt":"2017-04-12T20:03:50","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/?p=2878"},"modified":"2020-03-23T16:13:29","modified_gmt":"2020-03-23T23:13:29","slug":"reproducible-data-science-analysis-experimental-data-pachyderm-azure","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/reproducible-data-science-analysis-experimental-data-pachyderm-azure\/","title":{"rendered":"Reproducible Data Science &#8211; Analysis of Experimental Data with Pachyderm and Azure"},"content":{"rendered":"<p>Inside a nondescript building in a Vancouver industrial park, <a href=\"http:\/\/generalfusion.com\/\">General Fusion<\/a> is creating something that could forever change how the world produces energy. Since its founding in 2002, the Canadian cleantech company has risen to the forefront of the race to harness fusion energy which has the potential to supply the world with almost limitless, clean, safe, and on-demand energy. General Fusion\u2019s approach, called magnetized target fusion, involves injecting a ring of superheated hydrogen gas (plasma) into an enclosure, then quickly collapsing the enclosure. The compression of the plasma ignites a reaction that produces a large amount of energy with only helium as a byproduct.<\/p>\n<p>General Fusion operates several plasma injectors which superheat hydrogen gas to millions of degrees, creating a \u201ccompact toroid plasma\u201d \u2014 a magnetized ring of ionized gas which lasts for a few milliseconds and is the fuel for a fusion energy power plant. A single plasma injector fires hundreds of such shots in a day, each shot generating gigabytes of data from sensors measuring things like temperature, density, and magnetic field strength. 
Through an iterative process, plasma physicists at General Fusion analyze this data and adjust their experiments accordingly with the goal of creating the world\u2019s first fusion power plant.<\/p>\n<p>With over one hundred thousand plasma shots spanning seven years, their total data set is approaching 100TB in size. General Fusion\u2019s existing on-premises data infrastructure was limiting their ability to effectively analyze all of this data. To ensure that full scientific use could be made of this data, they needed a data platform that could:<\/p>\n<ol>\n<li>Enable big data analytics across thousands of experiments and terabytes of data<\/li>\n<li>Reliably produce data and analysis when sensor calibrations or algorithms are updated<\/li>\n<li>Preserve different versions of results for different versions of calibrations or algorithms<\/li>\n<li>Enable collaboration with other plasma physicists and the sharing of scientific progress<\/li>\n<li>Scale as the quantity of data increases<\/li>\n<\/ol>\n<p>Engineers at General Fusion were quickly drawn to a data platform that met all their requirements: <a href=\"http:\/\/www.pachyderm.io\/\">Pachyderm<\/a>. Pachyderm is an open source data analytics platform that is deployed on top of <a href=\"https:\/\/github.com\/kubernetes\/kubernetes\">Kubernetes<\/a>, a container orchestration framework. We partnered with General Fusion to develop and deploy their new Pachyderm-based data infrastructure to Microsoft Azure. This post walks through General Fusion\u2019s new data architecture and how we deployed it to Azure.<\/p>\n<h1 id=\"background\">Background<\/h1>\n<p>Pachyderm offers a modern alternative to Hadoop and other distributed data processing platforms by using containers as its core processing primitive. In comparison to Hadoop, where MapReduce jobs are specified as Java classes, Pachyderm users create containers to process the data. 
As a result, users can employ and combine any tools, languages, or libraries (e.g. R, Python, OpenCV, CNTK) to process their data. Pachyderm will take care of injecting data into the container and parallelizing the workload by replicating the container, giving each instance a different subset of the data to process.<\/p>\n<p>Pachyderm also provides provenance and version control of data, allowing users to view diffs and promoting collaboration with other consumers of the data. With provenance, data is tracked through all transformations and analyses. This tracking enables data to be traced as it travels through a dataflow and allows a dataflow to be replayed with its original inputs to each processing step.<\/p>\n<p>Pachyderm itself is built on top of Kubernetes and deploys all of the data processing containers within Kubernetes. Originally developed by Google and hosted by the <a href=\"https:\/\/www.cncf.io\/\">Cloud Native Computing Foundation<\/a>, Kubernetes is an open source platform for managing containerized applications. Kubernetes deploys any number of container replicas across your node cluster and takes care of, among other things, replication, auto-scaling, load balancing, and service discovery.<\/p>\n<h1 id=\"requirements\">Requirements<\/h1>\n<p>In addition to the node resources given by the Kubernetes cluster, Pachyderm requires:<\/p>\n<ul>\n<li>Azure Blob Storage for its backing data store<\/li>\n<li>A data disk for a metadata store<\/li>\n<\/ul>\n<p>Shown below is a diagram of General Fusion\u2019s infrastructure. 
In addition to the resources listed above, deployed within the same virtual network are:<\/p>\n<ul>\n<li>An ingress virtual machine running General Fusion\u2019s custom applications that push plasma injector data into Pachyderm<\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/services\/documentdb\/\">DocumentDB<\/a> used by General Fusion\u2019s custom application to store metadata<\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/services\/load-balancer\/\">Software load balancers (L4)<\/a> exposing the Kubernetes master node and the ingress virtual machine to the public Internet and an internal load balancer for the Kubernetes agent nodes.<\/li>\n<\/ul>\n<p>Instead of the data disk needed for Pachyderm, we have chosen to provision a <a href=\"https:\/\/www.gluster.org\/\">GlusterFS<\/a> cluster for higher availability. As a data disk can only be mounted to a single node at a time, downtime with that node would result in an outage to the metadata store until Kubernetes can move the workload to another node. 
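<\/p>\n<p>For reference, an existing GlusterFS cluster is registered with Kubernetes through an Endpoints object that pods then reference when mounting the volume. The sketch below follows the standard Kubernetes GlusterFS volume pattern; the IP addresses and names are placeholders, not General Fusion\u2019s actual values:<\/p>\n<pre class=\"lang:default decode:true\"># Hypothetical sketch: make an existing GlusterFS cluster visible to\r\n# Kubernetes so pods can mount its replicated volumes.\r\ncat &lt;&lt;'EOF' | kubectl create -f -\r\napiVersion: v1\r\nkind: Endpoints\r\nmetadata:\r\n  name: glusterfs-cluster\r\nsubsets:\r\n  - addresses:\r\n      - ip: 10.0.0.4\r\n      - ip: 10.0.0.5\r\n    ports:\r\n      - port: 1\r\nEOF<\/pre>\n<p>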
For simplicity, this post explains deployment using a data disk for Pachyderm\u2019s metadata store.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/architecture-1.png\" alt=\"Image architecture 1\" width=\"1009\" height=\"426\" class=\"aligncenter size-full wp-image-13060\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/architecture-1.png 1009w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/architecture-1-300x127.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/architecture-1-768x324.png 768w\" sizes=\"(max-width: 1009px) 100vw, 1009px\" \/><\/p>\n<p>To get started with deploying Kubernetes and Pachyderm on Azure, first ensure that you have the following tools:<\/p>\n<ul>\n<li>Microsoft Azure subscription<\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/azure\/xplat-cli-install\">Azure CLI<\/a> &gt; 0.10.7 &#8211; command line interface for accessing Microsoft Azure<\/li>\n<li><a href=\"https:\/\/docs.docker.com\/engine\/installation\/\">Docker CLI<\/a> &gt; 1.12.3 &#8211; command line interface for Docker<\/li>\n<li><a href=\"https:\/\/github.com\/stedolan\/jq\/wiki\/Installation\">jq<\/a> &#8211; lightweight and flexible command-line JSON processor<\/li>\n<li><a href=\"https:\/\/pachyderm.readthedocs.io\/en\/stable\/getting_started\/local_installation.html#pachctl\">pachctl<\/a> &#8211; command line interface for Pachyderm<\/li>\n<\/ul>\n<h1 id=\"setup-and-preparation\">Setup and Preparation<\/h1>\n<h2 id=\"kubernetes\">Kubernetes<\/h2>\n<p>The easiest method to deploy Kubernetes on Microsoft Azure is through the <a href=\"https:\/\/azure.microsoft.com\/services\/container-service\/\">Azure Container Service<\/a>. 
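<\/p>\n<p>For orientation, provisioning a cluster amounts to a few commands. The sketch below uses the newer <code class=\"highlighter-rouge\">az<\/code> CLI rather than the classic <code class=\"highlighter-rouge\">azure<\/code> CLI used elsewhere in this post, and the resource names are placeholders:<\/p>\n<pre class=\"lang:default decode:true\"># Sketch only; see the Azure Container Service docs for details.\r\naz group create --name my-k8s-rg --location westus2\r\naz acs create --orchestrator-type kubernetes --resource-group my-k8s-rg --name my-k8s-cluster --generate-ssh-keys\r\naz acs kubernetes get-credentials --resource-group my-k8s-rg --name my-k8s-cluster<\/pre>\n<p>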
As this is well covered in their <a href=\"https:\/\/docs.microsoft.com\/azure\/container-service\/container-service-deployment\">documentation<\/a>, we\u2019ll leave it to you to provision a Kubernetes cluster.<\/p>\n<h2 id=\"pachyderm\">Pachyderm<\/h2>\n<h3 id=\"azure-storage-account\">Azure Storage Account<\/h3>\n<p>Let\u2019s start by provisioning an Azure Storage Account through the CLI:<\/p>\n<pre class=\"lang:default decode:true\">AZURE_RESOURCE_GROUP=\"pachyderm-rg\"\r\nAZURE_LOCATION=\"westus2\"\r\nAZURE_STORAGE_NAME=\"pachydermstrg\"\r\n\r\nazure storage account create ${AZURE_STORAGE_NAME} --location ${AZURE_LOCATION} --resource-group ${AZURE_RESOURCE_GROUP} --sku-name LRS --kind Storage<\/pre>\n<h3 id=\"data-disk\">Data Disk<\/h3>\n<p>Unfortunately, there is currently no capability to create an empty, unattached data disk; in the meantime, we\u2019ll need a workaround. I\u2019ve created a <a href=\"https:\/\/github.com\/jpoon\/azure-create-vhd\">Docker image<\/a> to help with creating and formatting a data disk and uploading it to Azure Storage; the entire process takes about 3 minutes for a 10GB data disk. The command prints the URI of the uploaded disk, which we capture in <code class=\"highlighter-rouge\">STORAGE_VOLUME_URI<\/code> for the deployment step below.<\/p>\n<pre class=\"lang:default decode:true\">CONTAINER_NAME=\"pach\"\r\nSTORAGE_NAME=\"pach-disk.vhd\"\r\nSTORAGE_SIZE=\"10\"\r\nAZURE_STORAGE_KEY=`azure storage account keys list ${AZURE_STORAGE_NAME} --resource-group ${AZURE_RESOURCE_GROUP} --json | jq .[0].value -r`\r\nSTORAGE_VOLUME_URI=`docker run -it jpoon\/azure-create-vhd ${AZURE_STORAGE_NAME} ${AZURE_STORAGE_KEY} disks ${STORAGE_NAME} ${STORAGE_SIZE}G`<\/pre>\n<h3 id=\"deploy-pachyderm\">Deploy Pachyderm<\/h3>\n<p>Deploying Pachyderm is a one-line command through <code class=\"highlighter-rouge\">pachctl<\/code>:<\/p>\n<div class=\"language-sh highlighter-rouge\">\n<pre class=\"highlight\"><code>pachctl deploy microsoft <span class=\"k\">${<\/span><span class=\"nv\">CONTAINER_NAME<\/span><span class=\"k\">}<\/span> <span 
class=\"k\">${<\/span><span class=\"nv\">AZURE_STORAGE_NAME<\/span><span class=\"k\">}<\/span> <span class=\"k\">${<\/span><span class=\"nv\">AZURE_STORAGE_KEY<\/span><span class=\"k\">}<\/span> <span class=\"s2\">\"<\/span><span class=\"k\">${<\/span><span class=\"nv\">STORAGE_VOLUME_URI<\/span><span class=\"k\">}<\/span><span class=\"s2\">\"<\/span> <span class=\"s2\">\"<\/span><span class=\"k\">${<\/span><span class=\"nv\">STORAGE_SIZE<\/span><span class=\"k\">}<\/span><span class=\"s2\">\"<\/span>\r\n<\/code><\/pre>\n<\/div>\n<p>Altogether, here\u2019s what the script and the output of the script looks like:<\/p>\n<pre><code>#!\/bin\/bash\r\n\r\n# ----------------------\r\n# Deploy Pachyderm on Microsoft Azure\r\n# https:\/\/gist.github.com\/jpoon\/c4b781c5eeb395b9ea8452b42cb993a4\r\n# ----------------------\r\n\r\n# Parameters\r\n# -------\r\n\r\nAZURE_RESOURCE_GROUP=\"my-resource-group\"\r\nAZURE_LOCATION=\"westus2\"\r\nAZURE_STORAGE_NAME=\"mystrg\"\r\nCONTAINER_NAME=\"pach\"\r\nSTORAGE_NAME=\"pach-disk.vhd\"\r\nSTORAGE_SIZE=\"10\"\r\n\r\n# Helpers\r\n# -------\r\n\r\nexitWithMessageOnError () {\r\n if [ ! $? 
-eq 0 ]; then\r\n echo \"An error has occurred during deployment.\"\r\n echo $1\r\n exit 1\r\n fi\r\n}\r\n\r\nhash azure 2&gt;\/dev\/null\r\nexitWithMessageOnError \"Missing azure-cli, please install azure-cli, if already installed make sure it can be reached from current environment.\"\r\n\r\nhash docker 2&gt;\/dev\/null\r\nexitWithMessageOnError \"Missing docker, please install docker, if already installed make sure it can be reached from current environment.\"\r\n\r\nhash pachctl 2&gt;\/dev\/null\r\nexitWithMessageOnError \"Missing pachctl, please install pachctl, if already installed make sure it can be reached from current environment.\"\r\n\r\n# Print Versions\r\n# -----\r\n\r\necho -n \"Using azure-cli \"\r\nazure -v\r\n\r\necho -n \"Using docker \"\r\ndocker -v | awk '{print $3}' | sed '$s\/.$\/\/'\r\n\r\necho -n \"Using pachctl \"\r\npachctl version | awk 'NR==2{print $2}'\r\n\r\n\r\n# Create Storage\r\n# -------\r\n\r\necho --- Provision Microsoft Azure Storage Account\r\nazure config mode arm\r\nazure group create --name ${AZURE_RESOURCE_GROUP} --location ${AZURE_LOCATION}\r\nazure storage account create ${AZURE_STORAGE_NAME} --location ${AZURE_LOCATION} --resource-group ${AZURE_RESOURCE_GROUP} --sku-name LRS --kind Storage\r\n\r\nAZURE_STORAGE_KEY=`azure storage account keys list ${AZURE_STORAGE_NAME} --resource-group ${AZURE_RESOURCE_GROUP} --json | jq .[0].value -r`\r\n\r\n# Create Data Disk\r\n# -------\r\n\r\necho --- Create Data Disk\r\nSTORAGE_VOLUME_URI=`docker run -it jpoon\/azure-create-vhd ${AZURE_STORAGE_NAME} ${AZURE_STORAGE_KEY} disks ${STORAGE_NAME} ${STORAGE_SIZE}G` \r\n\r\necho ${STORAGE_VOLUME_URI}\r\n\r\n# Deploy Pachyderm\r\n# -------\r\n\r\necho --- Deploy Pachyderm\r\npachctl deploy microsoft ${CONTAINER_NAME} ${AZURE_STORAGE_NAME} ${AZURE_STORAGE_KEY} \"${STORAGE_VOLUME_URI}\" \"${STORAGE_SIZE}\"\r\n\r\necho --- Done<\/code><\/pre>\n<p> <img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/deploy_pachyderm-300x240-1.gif\" alt=\"Image deploy pachyderm 300x240\" width=\"300\" height=\"240\" class=\"aligncenter size-full wp-image-11023\" \/><\/p>\n<h3 id=\"azure-resource-manager-arm-templates\">Azure Resource Manager (ARM) Templates<\/h3>\n<p>To further simplify deployment, we worked with General Fusion to build <a href=\"https:\/\/docs.microsoft.com\/azure\/azure-resource-manager\/resource-group-overview#template-deployment\">ARM templates<\/a> capable of provisioning the necessary resources and deploying the required applications through a one-line command.<\/p>\n<h1 id=\"using-pachyderm\">Using Pachyderm<\/h1>\n<p>When deploying a Kubernetes cluster through Azure Container Service, only the master node is exposed by default to the public internet. To access Pachyderm from your local developer machine, you will need either to set up port forwarding with <code class=\"highlighter-rouge\">pachctl portforward &amp;<\/code> or to expose Pachyderm to the outside world through a public load balancer. 
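<\/p>\n<p>With port forwarding, testing the connection looks roughly like this (the <code class=\"highlighter-rouge\">ADDRESS<\/code> variable and port 30650 follow the Pachyderm documentation of this era; treat them as assumptions for your own deployment):<\/p>\n<pre class=\"lang:default decode:true\"># Option 1: forward pachd's port to localhost in the background\r\npachctl portforward &amp;\r\n\r\n# Option 2: point pachctl at a publicly exposed load balancer instead\r\nexport ADDRESS=&lt;load-balancer-ip&gt;:30650\r\npachctl version<\/pre>\n<p>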
In General Fusion\u2019s scenario, as the ingress virtual machine resides in the same virtual network, it can communicate with Pachyderm through the internal load balancer in front of the Kubernetes agent nodes.<\/p>\n<p>Once your newly provisioned Pachyderm cluster is accessible, you can test the connection using:<\/p>\n<pre class=\"lang:default decode:true\">$ pachctl version \r\nCOMPONENT    VERSION \r\npachctl \u00a0 \u00a0 \u00a01.3.4 \r\npachd \u00a0 \u00a0 \u00a0 \u00a01.3.4<\/pre>\n<p><code class=\"highlighter-rouge\">pachctl<\/code> is the version of the Pachyderm CLI tool running on your local machine, and <code class=\"highlighter-rouge\">pachd<\/code> is the version of the Pachyderm server daemon running in the cluster.<\/p>\n<h1 id=\"conclusion\">Conclusion<\/h1>\n<p>Since General Fusion\u2019s on-premises data system was reaching its limits, their move to a Pachyderm-based data platform on Microsoft Azure has accelerated their ability to process and analyze experimental data. General Fusion\u2019s initiative reflects the growing importance of data provenance and versioning across the industry.<\/p>\n<p>Furthermore, the use of a container-based architecture in Pachyderm and Kubernetes has allowed General Fusion to develop their scientific analysis framework independently of the underlying data platform. Native support of Kubernetes in Azure Container Service has simplified their deployment process, allowing General Fusion to focus on producing and analyzing experimental plasma data rather than maintaining the infrastructure. The ability to analyze their data more effectively will help General Fusion to achieve their goal of creating clean energy, everywhere, forever.<\/p>\n<h1 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h1>\n<p>General Fusion\u2019s architecture can serve as an example of how one would deploy and use Pachyderm on Microsoft Azure. 
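<\/p>\n<p>To get a feel for the workflow, a first experiment with version-controlled data might look roughly like the following. The repository and file names here are made up, and command flags vary between pachctl releases, so consult <code class=\"highlighter-rouge\">pachctl help<\/code> for your version:<\/p>\n<pre class=\"lang:default decode:true\"># Create a repository to hold raw shot data\r\npachctl create-repo sensor-data\r\n\r\n# Commit a local file to the master branch of the repo\r\npachctl put-file sensor-data master \/shot-0001.dat -f shot-0001.dat\r\n\r\n# List commits; every change to the data is versioned\r\npachctl list-commit sensor-data<\/pre>\n<p>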
Instructions for deploying Pachyderm to Microsoft Azure are also available through <a href=\"http:\/\/docs.pachyderm.io\/en\/latest\/deployment\/azure.html\">Pachyderm\u2019s documentation<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We partnered with General Fusion to develop and deploy their new Pachyderm-based data infrastructure to Microsoft Azure. This post walks through General Fusion\u2019s new data architecture and how we deployed it to Azure.<\/p>\n","protected":false},"author":21365,"featured_media":11024,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[11,15],"tags":[60,131,229],"class_list":["post-2878","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","category-containers","tag-azure","tag-containers","tag-kubernetes"],"acf":[],"blog_post_summary":"<p>We partnered with General Fusion to develop and deploy their new Pachyderm-based data infrastructure to Microsoft Azure. 
This post walks through General Fusion\u2019s new data architecture and how we deployed it to Azure.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2878","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21365"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2878"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2878\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11024"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2878"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2878"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2878"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}