{"id":5225,"date":"2017-03-09T12:27:50","date_gmt":"2017-03-09T17:27:50","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/azuregov\/?p=5225"},"modified":"2017-03-09T12:27:50","modified_gmt":"2017-03-09T17:27:50","slug":"building-a-solution-on-azure-government-part-ii-hdinsight","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azuregov\/building-a-solution-on-azure-government-part-ii-hdinsight\/","title":{"rendered":"Building a solution on Azure Government Part II: HDInsight"},"content":{"rendered":"<p><span>In the <a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/2017\/03\/09\/building-a-solution-on-azure-government-part-i-cognitive-services\/\">previous post<\/a><\/span><a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/2017\/03\/09\/building-a-solution-on-azure-government-part-i-cognitive-services\/\"><span> I went over how to use Cognitive Services <\/span><\/a><span>to get a document translated. Now let\u2019s do some simple text analytics on that document. Specifically, we\u2019ll compute word counts so we can see how frequently the key words are used.<\/span><\/p>\n<p><span>We can easily provision HDInsight clusters in the Microsoft Azure Government portal. The screenshot below shows a 4-node Spark cluster provisioned on Linux.<\/span><\/p>\n<p><span><a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/?attachment_id=5265\"><img decoding=\"async\" width=\"1550\" height=\"736\" class=\"aligncenter size-full wp-image-5265\" alt=\"1\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/43\/2019\/03\/127.png\" \/><\/a><\/span><\/p>\n<p><span>When an HDInsight cluster is provisioned, an Azure storage account is provisioned as its underlying storage. 
Let\u2019s connect to that storage account with the <\/span><a href=\"http:\/\/storageexplorer.com\/\"><span>Microsoft Azure Storage Explorer<\/span><\/a><span> so that we can upload our translated text document. Azure Storage Explorer is an easy-to-use tool that lets developers interact with Azure Storage (blobs, queues, and tables). HDInsight creates an \u201cHdiSamples\u201d directory. Let\u2019s create our own subdirectory inside it called \u201cTheArtOfWar\u201d, where we\u2019ll upload our text file.<a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/?attachment_id=5255\"><img decoding=\"async\" width=\"1422\" height=\"359\" class=\"aligncenter size-full wp-image-5255\" alt=\"2\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/43\/2019\/03\/212.png\" \/><\/a><\/span><\/p>\n<p><span>Now that our document is ready, let\u2019s go back to the Azure portal and click \u201cJupyter Notebook\u201d to open the Jupyter dashboard. We can then click \u201cNew \u2013 PySpark\u201d to open a new PySpark notebook. 
We want to take the following steps:<\/span><\/p>\n<ul>\n<li><span>Load the data from the text file in storage<\/span><\/li>\n<li><span>Get a complete list of words<\/span><\/li>\n<li><span>Get a count of each distinct word<\/span><\/li>\n<li><span>Keep only the top results (i.e., words that occur at least 10 times)<\/span><\/li>\n<li><span>Create a schema to hold our data structure<\/span><\/li>\n<li><span>Create a table in HDInsight with the data based on the schema<\/span><\/li>\n<\/ul>\n<p><span>The complete logic for those steps can be seen here:<\/span><\/p>\n<p><span><a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/?attachment_id=5245\"><img decoding=\"async\" width=\"1151\" height=\"785\" class=\"aligncenter size-full wp-image-5245\" alt=\"3\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/43\/2019\/03\/310.png\" \/><\/a><\/span><\/p>\n<p><span>Now that we have the data in an easy-to-use table, we can run SQL queries on the data and visualize the results in HDInsight:<\/span><\/p>\n<p><span><a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/?attachment_id=5235\"><img decoding=\"async\" width=\"1123\" height=\"698\" class=\"aligncenter size-full wp-image-5235\" alt=\"4\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/43\/2019\/03\/46.png\" \/><\/a><\/span><\/p>\n<p><span>Now that we\u2019ve parsed our data in HDInsight, the next and final part will cover how to use Power BI to provide additional visualizations on the data.<\/span><\/p>\n<p><span><span>We welcome your comments and suggestions to help us continually improve your Azure Government experience. 
To stay up to date on all things Azure Government, be sure to subscribe to our <a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/feed\/\">RSS feed<\/a>. To receive emails, click \u201cSubscribe by Email!\u201d on the <a href=\"https:\/\/blogs.msdn.microsoft.com\/azuregov\/\">Azure Government Blog<\/a>. To experience the power of Azure Government for your organization, sign up for an <a href=\"https:\/\/azuregov.microsoft.com\/trial\/azuregovtrial\">Azure Government Trial<\/a>.<\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous post I went over how to use Cognitive Services to get a document translated. Now let\u2019s do some simple text analytics on that document. Specifically, we\u2019ll compute word counts so we can see how frequently the key words are used. We can easily provision HDInsight clusters in the Microsoft Azure Government [&hellip;]<\/p>\n","protected":false},"author":1789,"featured_media":4456,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[25],"tags":[75,95,328],"class_list":["post-5225","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-portalpreview","tag-azure","tag-azure-government","tag-hdinsight"],"acf":[],"blog_post_summary":"<p>In the previous post I went over how to use Cognitive Services to get a document translated. Now let\u2019s do some simple text analytics on that document. Specifically, we\u2019ll compute word counts so we can see how frequently the key words are used. 
We can easily provision HDInsight clusters in the Microsoft Azure Government [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/posts\/5225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/users\/1789"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/comments?post=5225"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/posts\/5225\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/media?parent=5225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/categories?post=5225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azuregov\/wp-json\/wp\/v2\/tags?post=5225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}