{"id":9301,"date":"2018-08-20T12:13:11","date_gmt":"2018-08-20T19:13:11","guid":{"rendered":"https:\/\/www.microsoft.com\/developerblog\/?p=9301"},"modified":"2020-03-19T17:05:10","modified_gmt":"2020-03-20T00:05:10","slug":"attaching-and-detaching-an-edge-node-from-a-hdinsight-spark-cluster-when-running-dataiku-data-science-studio-dss","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/attaching-and-detaching-an-edge-node-from-a-hdinsight-spark-cluster-when-running-dataiku-data-science-studio-dss\/","title":{"rendered":"Attaching and Detaching an Edge Node From a HDInsight Spark Cluster when running Dataiku Data Science Studio (DSS)"},"content":{"rendered":"<h2>Background<\/h2>\n<p>Microsoft Global Partner\u00a0<a href=\"https:\/\/www.dataiku.com\/\">Dataiku<\/a>\u00a0is the enterprise behind the Data Science Studio (DSS), a collaborative data science platform that enables companies to build and deliver their analytical solutions more efficiently.<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Microsoft has been working closely with\u00a0Dataiku\u00a0for many years to bring their solutions and integrations to the Microsoft platform. In particular, Microsoft has assisted with bringing Dataiku\u2019s\u00a0Data Science Studio application to Azure HDInsight as an easy-to-install application,\u00a0as well as other data source and visualisation connectors\u00a0such as Power BI and Azure Data Lake Store.\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/hdinsight\/\">Azure HDInsight<\/a>\u00a0is the industry-leading fully-managed cloud Apache Hadoop &amp; Spark offering on Azure which allows customers to do reliable open source analytics with an\u00a0SLA. 
The combined offering of DSS as an\u00a0HDInsight (HDI) application enables\u00a0Dataiku\u00a0and Azure customers to easily use data science to build big data solutions and run them at enterprise grade and scale.<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Earlier this year,\u00a0Dataiku\u00a0and Microsoft joined\u00a0forces\u00a0to add extra flexibility to DSS on HDInsight, and also to\u00a0allow\u00a0Dataiku\u00a0customers to attach a persistent edge node on an\u00a0HDInsight cluster \u2013 something which was previously not a feature supported by the most recent edition of Azure HDInsight.\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Below are two diagrams (Figure 1 and 2) showing\u00a0Dataiku\u2019s\u00a0application in action for a Predictive Maintenance scenario \u2013 both the project creation with team collaboration and also the machine learning workflow and triggers involved to complete the project:<\/p>\n<p>\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download.png\" alt=\"Image download\" width=\"1024\" height=\"485\" class=\"aligncenter size-full wp-image-10636\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-300x142.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-768x364.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><span class=\"TextRun SCXW43214864\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW43214864\">[Figure 1: A screenshot of the\u00a0<\/span><span class=\"SpellingError SCXW43214864\">Dataiku<\/span><span class=\"NormalTextRun SCXW43214864\">\u00a0Data Science Studio 
application for a predictive maintenance scenario]<\/span><\/span><span class=\"EOP SCXW43214864\" data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download-1.png\" alt=\"Image download 1\" width=\"1024\" height=\"482\" class=\"aligncenter size-full wp-image-10631\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-1.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-1-300x141.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-1-768x362.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><span class=\"TextRun SCXW93788603\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW93788603\">[Figure 2: A screenshot of the machine learning workflow and design surface in\u00a0<\/span><span class=\"SpellingError SCXW93788603\">Dataiku\u2019s<\/span><span class=\"NormalTextRun SCXW93788603\">\u00a0Data Science Studio application]<\/span><\/span><span class=\"EOP SCXW93788603\" data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2>Challenges and Solution<\/h2>\n<p>During this project, our joint team aimed to add extra flexibility to DSS on HDInsight\u00a0and allow\u00a0Dataiku\u00a0customers to have a persistent edge node on an\u00a0HDInsight cluster\u00a0so their projects can leverage huge amounts of compute only when they need them, but they are not tied to having the cluster always running.\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Attaching an edge node (virtual machine) to a currently running HDInsight cluster is not a feature supported by Azure HDInsight 
(July 2018). Standard deployment options include:<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<ol>\n<li>Provisioning an application (in this case\u00a0Dataiku\u00a0Data Science Studio) as part of the cluster (a managed edge node)\u00a0&#8211; see Figure 3 below, an architecture diagram for Option A<span data-ccp-props=\"{&quot;134233279&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<li>Provisioning\u00a0Dataiku\u00a0DSS directly on the HDInsight head node\u00a0&#8211; see Figure 3 below, an architecture diagram for Option B<span data-ccp-props=\"{&quot;134233279&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ol>\n<p>\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download-2.png\" alt=\"Image download 2\" width=\"1024\" height=\"375\" class=\"aligncenter size-full wp-image-10632\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-2.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-2-300x110.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-2-768x281.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><i>[Figure 3: Two architecture diagrams showing the current deployment options you have with Azure HDInsight and where a partner application can be hosted and communicate with the HDInsight cluster]\u00a0<\/i><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Note:<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<ol>\n<li data-leveltext=\"(%1)\" data-font=\"Calibri, sans-serif\" 
data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b>Option A<\/b>\u00a0can be set up automatically via the Azure HDInsight Applications tab in the Azure Portal. For more information see\u00a0<a href=\"https:\/\/azure.microsoft.com\/en-gb\/blog\/introducing-dataiku-s-dss-on-microsoft-azure-hdinsight-to-make-data-science-easier\/\">the blog post here<\/a><span data-ccp-props=\"{&quot;134233279&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ol>\n<ol>\n<li data-leveltext=\"(%1)\" data-font=\"Calibri, sans-serif\" data-listid=\"7\" data-aria-posinset=\"2\" data-aria-level=\"1\"><strong>Option B<\/strong> is a manual setup once you have created an\u00a0HDInsight Cluster in Azure.<span data-ccp-props=\"{&quot;134233279&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ol>\n<p>You would first need to\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/hdinsight\/hdinsight-hadoop-linux-use-ssh-unix\">connect to the HDInsight Cluster using SSH<\/a>\u00a0and then perform a\u00a0<a href=\"https:\/\/doc.dataiku.com\/dss\/latest\/installation\/new_instance.html\">manual DSS installation<\/a>\u00a0on the head node.<span data-ccp-props=\"{&quot;134233279&quot;:true,&quot;201341983&quot;:0,&quot;335559685&quot;:720,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>In both cases\u00a0<b>there is a drawback<\/b>: the DSS\u00a0application\u00a0instance is always\u00a0deleted when the HDI cluster is deleted.\u00a0With Azure HDInsight the edge node is always part of the lifecycle of the cluster, as it lives within the same Azure resource boundary as the head and\u00a0all\u00a0worker nodes. 
Therefore, the action to delete the large amounts of compute (to save money when it is not being used) will result in the edge node being deleted as well.\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>There are times when\u00a0Dataiku\u00a0customers may wish to have DSS run as a standalone machine during testing and then attach it to a large amount of compute (cluster) when they wish to submit large queries; and then also detach the edge node &#8212;\u00a0but keep the results &#8212; once finished.\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Another challenge to be tackled\u00a0during this project\u00a0was the ability to attach an edge node to different clusters being run within an organisation, easily \u2013 such as development cluster and production cluster etc.<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3>Attach and detach from the cluster<\/h3>\n<p><span class=\"TextRun SCXW180093656\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW180093656\">Figure 4 below is an architecture diagram showing a desired scenario whereby the partner application is hosted on a separate virtual machine outside the blue HDInsight cluster network<\/span><\/span><span class=\"TextRun SCXW180093656\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW180093656\">\u00a0and therefore the VM<\/span><\/span><span class=\"TextRun SCXW180093656\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW180093656\">\u00a0can outlive the HDInsight cluster if it was to be deleted to save compute costs.<\/span><\/span><span class=\"EOP SCXW180093656\" data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>\n<img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download-3.png\" alt=\"Image download 3\" width=\"1024\" height=\"553\" class=\"aligncenter size-full wp-image-10633\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-3.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-3-300x162.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-3-768x415.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><i>[Figure 4: An architecture diagram showing a possible desired setup for the DSS application to communicate with the HDInsight Cluster without being directly within the HDInsight setup]<\/i><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3>Connect to another cluster<\/h3>\n<p>Figure 6 below is an example architecture diagram for having a single instance of the DSS application on a virtual machine able to connect to many different HDInsight clusters running within the same VNET. 
For example, one can have a developer and production cluster within a single organisation\u2019s subscription and the user of DSS can choose when to connect\/disconnect from each.<\/p>\n<p><span id=\"{5750f706-733d-4d7a-9a0d-b6749bc83709}{36}\" class=\"WACAltTextDescribedBy SCXW49836083\" aria-hidden=\"true\"><\/span>\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download-4.png\" alt=\"Image download 4\" width=\"1024\" height=\"576\" class=\"aligncenter size-full wp-image-10634\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-4.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-4-300x169.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-4-768x432.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><span class=\"TextRun SCXW122844636\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW122844636\">[Figure 6: An architecture diagram to show the desired outcome of the DSS application running on a virtual machine able to choose to connect and disconnect from different HDInsight clusters within the same virtual network depending on the job being run]<\/span><\/span><span class=\"EOP SCXW122844636\" data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2>The Solution<\/h2>\n<p>Our team created a VM and added HDI edge node configuration (packages and libraries) that would allow\u00a0Dataiku\u00a0to submit spark jobs to an\u00a0HDInsight Cluster. 
Because the VM lives outside of the cluster boundary, it can survive the deletion of the HDInsight cluster and retain the information and results it has.<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p>Figure 7 illustrates the configuration changes we designed to allow the VM to talk to the head node of the HDInsight cluster and submit jobs as if it was part of the cluster:\u00a0<span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/download-5.png\" alt=\"Image download 5\" width=\"1024\" height=\"425\" class=\"aligncenter size-full wp-image-10635\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-5.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-5-300x125.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/download-5-768x319.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\n<\/p>\n<p><span class=\"TextRun SCXW81521842\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW81521842\">[Figure 7: An architecture diagram to show the DSS application on a VM outside of the HDInsight cluster that can communicate with the head node of the HDInsight cluster<\/span><\/span><span class=\"TextRun SCXW81521842\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW81521842\">\u00a0because the configuration libraries are matching in both\u00a0<\/span><\/span><span class=\"TextRun SCXW81521842\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun SCXW81521842\">head node and VM to communicate<\/span><\/span><span class=\"TextRun SCXW81521842\" lang=\"EN-GB\" xml:lang=\"EN-GB\"><span class=\"NormalTextRun 
SCXW81521842\">]<\/span><\/span><span class=\"EOP SCXW81521842\" data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2>The Process<\/h2>\n<p>First, our team needed to attach a virtual machine with DSS installed to a running HDInsight cluster and check that queries could be submitted as if we were on a managed edge node within the cluster.<\/p>\n<p>We initially worked through the setup manually, but we also created a helper script so that others can quickly replicate it, which has proven very useful for Dataiku customers. The setup can be run <a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/tree\/master\/microsoft-hdi-dss-non-managed-edge-node\">manually via documentation here<\/a> or <a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/tree\/master\/microsoft-hdi-dss-non-managed-edge-node\">from a shell script.<\/a><\/p>\n<h2>Preparing the DSS environment<\/h2>\n<p>To submit commands to the head-node from a machine outside the cluster, the idea was to give the VM exactly the same setup as an edge node within the HDInsight Cluster.<\/p>\n<p>To do this we first copied the HDP (Hortonworks Data Platform) repo file from the cluster head-node to the Dataiku DSS VM. We found this file at <em>\/etc\/apt\/sources.list.d\/HDP.list\u00a0<\/em>on the cluster head-node and copied it to <em>\/etc\/apt\/sources.list.d\/HDP.list<\/em> on the DSS VM, which sits inside the same virtual network so that the two machines can communicate.<\/p>\n<p>As many of the commands we were using needed administrator privileges, we checked that we had sufficient sudo permissions on the DSS VM by running <em>sudo -v<\/em> and entering the admin password. 
We also created root SSH keys, which are used to authenticate communication between the VM and the edge node within the cluster.<\/p>\n<p>We then installed the APT keys for Hortonworks and Microsoft, installed the Java and Hadoop client packages on the VM, and removed the initial set of Hadoop configuration directories to avoid any conflicts (the key import and clean-up commands are shown below).<\/p>\n<pre class=\"\">apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 07513CAD 417A0893\r\n\r\nrm -rf \/etc\/{hadoop,hive*,pig,spark*,tez*,zookeeper}\/conf<\/pre>\n<p>The final setup steps were to create a directory used by the HDInsight Spark configuration and to define the environment variables for Spark and Python on the DSS machine to match the head-node (in <em>\/etc\/environment<\/em>, or for this setup <em>DSS_HOMEDIR\/.profile<\/em>).<\/p>\n<pre class=\"\">mkdir -p \/var\/log\/sparkapp &amp;&amp; chmod 777 \/var\/log\/sparkapp\r\n\r\nAZURE_SPARK=1\r\n\r\nSPARK_MAJOR_VERSION=2\r\n\r\nPYSPARK_DRIVER_PYTHON=\/usr\/bin\/python<\/pre>\n<h2>Cluster association<\/h2>\n<p>Now that the DSS VM was correctly set up with permissions, libraries and keys (becoming a mirror of the HDInsight head-node), we needed to connect the two machines so that files could be copied between them.<\/p>\n<p>We installed the DSS VM SSH key onto the HDInsight head-node so that it can be used for connections and for copying files. We also needed the <em>\/etc\/hosts<\/em> definitions on the two machines to match. 
To do this, we copied the <em>\/etc\/hosts<\/em> definition for <em>&#8220;headnodehost&#8221;<\/em> from the head node to the DSS VM and then flushed the SSH key cache afterwards.<\/p>\n<p>Finally, to complete the synchronisation of the DSS VM and the HDInsight head-node, we defined the list of Hadoop base services to synchronise, as follows:<\/p>\n<pre class=\"\">declare -a hadoopBaseServices=(\r\n\r\n\u00a0 \"hadoop\"\r\n\r\n\u00a0 \"hive\"\r\n\r\n\u00a0 \"pig\"\r\n\r\n\u00a0 \"spark\"\r\n\r\n\u00a0 \"tez\"\r\n\r\n\u00a0 \"zookeeper\"\r\n\r\n)<\/pre>\n<p>For each service, we then removed the current configuration from the DSS VM and used the rsync command to synchronise it from the HDInsight head-node:<\/p>\n<pre class=\"\">for service in \"${hadoopBaseServices[@]}\"\r\n\r\ndo\r\n\r\n\u00a0 reg=\"\/etc\/$service.*\/conf\"\r\n\r\n\u00a0 echo \"Removing configuration for service $service matching $reg\"\r\n\r\n\u00a0 sudo find \/etc -regex \"$reg\" | xargs sudo rm -rf\r\n\r\n\u00a0 echo \"Synchronizing configuration for service $service\"\r\n\r\n\u00a0 sudo rsync -av -e \"ssh -o StrictHostKeyChecking=no -i $PRIVATEKEYPATH\" $headnode_user@$headnode_ip:\/etc\/${service}\\* \/etc\/\r\n\r\ndone<\/pre>\n<p>Before testing we realized that it was necessary to (re-)run Hadoop and Spark integration on DSS before restarting the DSS VM. You need to re-run Hadoop and\/or Spark integration whenever you modify the Hadoop and Spark configuration local to the VM, typically when you install\/synchronize new Hadoop jars as we did earlier on in the project. 
For more information on this topic see the documentation here: <a href=\"https:\/\/doc.dataiku.com\/dss\/latest\/hadoop\/installation.html#setting-up-dss-hadoop-integration\">https:\/\/doc.dataiku.com\/dss\/latest\/hadoop\/installation.html#setting-up-dss-hadoop-integration<\/a><\/p>\n<h2>Prove Connectivity<\/h2>\n<p>Once the setup and synchronisation were complete, we tested that the connection worked, so that Spark commands run on the DSS VM would be submitted to the HDInsight cluster.<\/p>\n<p>Example commands we used to check connectivity are <em>hdfs dfs -ls \/<\/em> and the\u00a0spark2-shell\u00a0command.<\/p>\n<p>Figures 8 and 9 below are screenshots showing our results once those commands were run for HDFS and Spark respectively:<\/p>\n<p><figure id=\"attachment_9368\" aria-labelledby=\"figcaption_attachment_9368\" class=\"wp-caption alignnone\" ><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/sparklog_1.png\" alt=\"Image sparklog 1\" width=\"1024\" height=\"520\" class=\"aligncenter size-full wp-image-10639\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/sparklog_1.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/sparklog_1-300x152.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/sparklog_1-768x390.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"figcaption_attachment_9368\" class=\"wp-caption-text\"><\/p>\n<p>[Figure 8: A screenshot of the command line interface on the DSS VM showing the connection to the head-node of the cluster was 
successful by running the HDFS command]<\/figcaption><\/figure><figure id=\"attachment_9369\" aria-labelledby=\"figcaption_attachment_9369\" class=\"wp-caption alignnone\" ><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/08\/spark_log2.png\" alt=\"Image spark log2\" width=\"1024\" height=\"514\" class=\"aligncenter size-full wp-image-10638\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/spark_log2.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/spark_log2-300x151.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/08\/spark_log2-768x386.png 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"figcaption_attachment_9369\" class=\"wp-caption-text\"><\/p>\n<p> [Figure 9: A screenshot of the command line interface on the DSS VM showing the connection to the head-node of the cluster was successful by running the spark commands and creating arrays and evaluating them]<\/figcaption><\/figure><\/p>\n<h2>Shell Script for Automation<\/h2>\n<p>After we followed the manual process above \u2013 which has a lot of commands that need to be run in the command line interface and \u00a0often leads to mistakes or missing commands \u2013 it felt natural to create a script for Dataiku\u2019s customers to quickly and easily implement this scenario within the DSS setup. We have made a helper script available\u00a0<a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/blob\/master\/microsoft-hdi-dss-non-managed-edge-node\/scripts\/hdi-edge-node-from-existing-vm.sh\">here<\/a>. 
To run the helper script, submit the command below with the required parameters:<\/p>\n<pre class=\"\">source hdi-edge-node-from-existing-vm.sh -headnode_ip &lt;HEADNODE_IP&gt; -headnode_user &lt;HEADNODE_USER&gt;<\/pre>\n<p>The script automatically checks that all required conditions are fulfilled.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this blog post we demonstrated how to attach and detach edge nodes\/virtual machines from an HDInsight cluster. We went through the process of:<\/p>\n<ol>\n<li>Preparing the DSS VM environment and installing the basic packages<\/li>\n<li>Copying HDI configuration and libraries from HDI head node to the DSS VM<\/li>\n<li>Testing that job submission worked<\/li>\n<li>Creating a helper script for automation<\/li>\n<\/ol>\n<p>The helper script we shared allows users to take advantage of a stand-alone edge node running Dataiku DSS, which can communicate with an HDInsight cluster to submit Spark jobs but also persists outside of the lifecycle of the HDInsight cluster as a VM on its own. This will benefit Dataiku customers going forward as it allows them to run the Dataiku DSS application and have the flexibility to attach to and disconnect from large compute (clusters). This saves money on compute while keeping the results of the big data queries within the DSS environment. This solution could also be leveraged by other companies building big data solutions for their customers, allowing them the flexibility to repeat this scenario.<\/p>\n<p>Dataiku and Microsoft have a long-standing close relationship, which we built upon through this project. 
It is our hope that the work presented here will lead to further collaboration in the future on additional technical projects using the Azure Platform.<\/p>\n<h2>Other Useful Resources<\/h2>\n<p>This solution is designed specifically for Dataiku\u2019s Data Science Studio application; however, a similar process could be relevant to anyone using HDInsight for their Big Data scenarios. If you wish to replicate the process, the repository contains <a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/tree\/master\/microsoft-hdi-dss-non-managed-edge-node\">a step-by-step guide<\/a> we created and <a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/blob\/master\/microsoft-hdi-dss-non-managed-edge-node\/scripts\/hdi-edge-node-from-existing-vm.sh\">a helper script<\/a> that can be used and adapted.<\/p>\n<h4><strong>**<\/strong> <strong>Please note:<\/strong> there are some limitations and constraints to this solution. Please review these <a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/tree\/master\/microsoft-hdi-dss-non-managed-edge-node\">here<\/a> before proceeding. 
<strong>Disclaimer:<\/strong> This contribution is not subject to official support from Dataiku or Microsoft and does not guarantee compatibility with future versions of HDI. <strong>**<\/strong><\/h4>\n<p>To learn more about the solution described above, Azure HDInsight and Dataiku Data Science Studio, see the links below:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dataiku\/dataiku-contrib\/tree\/master\/microsoft-hdi-dss-non-managed-edge-node\">Dataiku Technical Documentation for workaround<\/a><\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/hdinsight\/\">Azure HDInsight<\/a><\/li>\n<li><a href=\"https:\/\/www.dataiku.com\/\">Dataiku Data Science Studio (DSS)<\/a><\/li>\n<li><a href=\"https:\/\/azuremarketplace.microsoft.com\/en-us\/marketplace\/apps\/dataiku.dataiku-data-science-studio\">Dataiku DSS in the Azure Marketplace<\/a><\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/introducing-dataiku-s-dss-on-microsoft-azure-hdinsight-to-make-data-science-easier\/\">Dataiku DSS as a HDInsight Application<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Earlier this year,\u00a0Dataiku\u00a0and Microsoft joined\u00a0forces\u00a0to add extra flexibility to DSS on HDInsight, and also to\u00a0allow\u00a0Dataiku\u00a0customers to attach a persistent edge node on an\u00a0HDInsight cluster \u2013 something which was previously not a feature supported by the most recent edition of Azure HDInsight.\u00a0\u00a0<\/p>\n","protected":false},"author":21469,"featured_media":10637,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[60,145,163,195],"class_list":["post-9301","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","tag-azure","tag-dataiku","tag-edge-node","tag-hdinsight"],"acf":[],"blog_post_summary":"<p>Earlier this year,\u00a0Dataiku\u00a0and 
Microsoft joined\u00a0forces\u00a0to add extra flexibility to DSS on HDInsight, and also to\u00a0allow\u00a0Dataiku\u00a0customers to attach a persistent edge node on an\u00a0HDInsight cluster \u2013 something which was previously not a feature supported by the most recent edition of Azure HDInsight.\u00a0\u00a0<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/9301","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21469"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=9301"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/9301\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/10637"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=9301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=9301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=9301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}