{"id":6667,"date":"2018-08-09T07:58:25","date_gmt":"2018-08-09T14:58:25","guid":{"rendered":"\/developerblog\/?p=6667"},"modified":"2020-03-20T07:25:26","modified_gmt":"2020-03-20T14:25:26","slug":"infrastructure-code-demand-gpu-clusters-terraform-jenkins","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/infrastructure-code-demand-gpu-clusters-terraform-jenkins\/","title":{"rendered":"Infrastructure as Code &#8211; On demand GPU clusters with Terraform &amp; Jenkins"},"content":{"rendered":"<h2><b>Background<\/b><\/h2>\n<p>Developing robust algorithms for self-driving cars requires sourcing event data from over 10 billion hours of recorded driving time. But even with 10 billion-plus hours, it can be challenging to capture rare yet critical edge cases like weather events or collisions at scale. Luckily, these rare events can often be simulated effectively.<\/p>\n<p><a href=\"http:\/\/www.cognata.com\/\">Cognata<\/a>, a startup that develops an autonomous car simulator, uses patented computer vision and deep learning algorithms to automatically generate a city-wide simulation that includes buildings, roads, lane marks, traffic signs, and even trees and bushes.<\/p>\n<p>Simulating driving event data, however, requires expensive GPU and compute resources that vary depending on the event.\u00a0In working with\u00a0<a href=\"http:\/\/www.cognata.com\/\">Cognata<\/a>, the biggest obstacle was automating the creation of these on-demand resources in Azure.<\/p>\n<p>In this code story, we&#8217;ll describe how we were able to provision custom GPU rendering clusters on demand with a Jenkins pipeline and Terraform.<\/p>\n<h2><b>The Challenges<\/b><\/h2>\n<p>Cognata required a scalable architecture to render its simulations across individual customers and events using GPUs. To support this need, we first investigated using Docker containers and Kubernetes. 
However, at the time of this post, nvidia-docker does not officially support running an <a href=\"https:\/\/github.com\/NVIDIA\/nvidia-docker\/wiki\/Frequently-Asked-Questions#do-you-support-running-a-gpu-accelerated-x-server-inside-the-container\">X server<\/a> inside the container, and <a href=\"https:\/\/github.com\/NVIDIA\/nvidia-docker\/wiki\/Frequently-Asked-Questions#is-opengl-supported\">OpenGL<\/a> support is a beta feature.\u00a0 As an alternative to container orchestration, we architected a scalable solution using a GPU\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/virtual-machine-scale-sets\/virtual-machine-scale-sets-overview\">Azure Virtual Machine Scale Sets<\/a>\u00a0(VMSS) cluster and a custom application installation script for on-demand deployment.<\/p>\n<p>These complex changesets must be applied to the infrastructure with minimal human interaction, and the process must support building, changing, and versioning infrastructure safely and efficiently.<\/p>\n<h2><b>The Solution<\/b><\/h2>\n<h2>On-demand Infrastructure as Code<\/h2>\n<p>Cognata&#8217;s applications and services are deployed in a Kubernetes cluster, while the rendering application is deployed in GPU virtual machine clusters.\u00a0 To apply the &#8220;Infrastructure as Code&#8221; methodology, we decided to use <a href=\"https:\/\/www.terraform.io\/\">Terraform<\/a> and <a href=\"https:\/\/jenkins.io\/\">Jenkins<\/a>.<\/p>\n<p>Terraform allows us to provision, deprovision, and orchestrate immutable infrastructure in a declarative manner; meanwhile, Jenkins pipelines offer a flexible delivery process rather than an \u201copinionated\u201d\u00a0one, and allow us to analyze and optimize each step.<\/p>\n<p>Jenkins offered two big advantages:<\/p>\n<ol>\n<li>The pipelines can be triggered via the HTTP API on demand.<\/li>\n<li>Running Jenkins in Kubernetes is very easy. 
We used the Jenkins Helm chart:\n<pre class=\"lang:sh decode:true\">$ helm install --name jenkins stable\/jenkins<\/pre>\n<\/li>\n<\/ol>\n<h2>Deploying and Destroying resources<\/h2>\n<p>Creating on-demand resources required detailed plans and specifications of those resources. These plans are stored in a git repository such as GitHub or Bitbucket to maintain versioning and continuity.<\/p>\n<p>In our case, the plans included a GPU-enabled Virtual Machine Scale Set (VMSS) and an installation script for Cognata&#8217;s application.\u00a0 The VMSS was created from a predefined image that included all the drivers and prerequisites of the application.<\/p>\n<p>Once we have the <a href=\"https:\/\/github.com\/Azure\/terraform-with-jenkins-samples\/tree\/master\/terraform-plans\">plans<\/a> in the git repository, we need to run the workflow in the Jenkins pipeline to deploy the resources.<\/p>\n<p>The workflow can be started with an HTTP request to the Jenkins pipeline:<\/p>\n<pre class=\"lang:sh decode:true\">http:\/\/JENKINS_SERVER_ADDRESS\/job\/YOUR_JOB_NAME\/buildWithParameters<\/pre>\n<p>After the pipeline is triggered, the plans are pulled from the git repository and the deployment process starts using Terraform.<\/p>\n<p>Terraform works in three stages:<\/p>\n<ol>\n<li>Init &#8211;\u00a0The\u00a0<code>terraform init<\/code>\u00a0command is used to initialize a working directory containing Terraform configuration files.<\/li>\n<li>Plan &#8211;\u00a0The\u00a0<code>terraform plan<\/code>\u00a0command is used to create an execution plan.<\/li>\n<li>Apply &#8211;\u00a0The\u00a0<code>terraform apply<\/code>\u00a0command is used to apply the changes required to reach the desired state of the configuration, or the pre-determined set of actions generated by a\u00a0<code>terraform plan<\/code>\u00a0execution plan.<\/li>\n<\/ol>\n<p>Once the plan is executed, Terraform saves the resulting state in files. We needed to upload these state files to persistent storage such as <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/storage\/blobs\/\">Azure Blob storage<\/a>. This step would later enable us to download the state files and destroy the resources we created once the business process was completed, which is what makes truly on-demand clusters possible.<\/p>\n<p>The flow of the solution is described in the following image:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5a49da242c073_34257878-57d1f228-e664-11e7-98bd-4a1e63b3860c.png\" \/><\/p>\n<h3>Terraform client<\/h3>\n<p>To invoke Terraform commands in the Jenkins pipeline, we created a small Docker container with Terraform and the Azure CLI, using the following Dockerfile:<\/p>\n<pre class=\"lang:default decode:true\">FROM azuresdk\/azure-cli-python:hotfix-2.0.41\r\n\r\nARG tf_version=\"0.11.7\"\r\n\r\nRUN apk update &amp;&amp; apk upgrade &amp;&amp; apk add ca-certificates &amp;&amp; update-ca-certificates &amp;&amp; \\\r\n    apk add --no-cache --update curl unzip\r\n\r\nRUN curl https:\/\/releases.hashicorp.com\/terraform\/${tf_version}\/terraform_${tf_version}_linux_amd64.zip -o terraform_${tf_version}_linux_amd64.zip &amp;&amp; \\\r\n    unzip terraform_${tf_version}_linux_amd64.zip -d \/usr\/local\/bin &amp;&amp; \\\r\n    mkdir -p \/opt\/workspace &amp;&amp; \\\r\n    rm \/var\/cache\/apk\/*\r\n\r\nWORKDIR \/opt\/workspace\r\nENV TF_IN_AUTOMATION somevalue<\/pre>\n<h2><b>Conclusion and Reuse<\/b><\/h2>\n<p>As driverless car development scales, relying on 10 billion-plus hours of recorded driving time alone just isn&#8217;t viable. 
Rendering Cognata&#8217;s complex simulations for multiple autonomous car manufacturers can become costly and inefficient. Our Jenkins pipeline and Terraform solution enabled Cognata to dynamically scale GPU resources for their simulations, making it easier to serve their customers while saving significant cost in compute resources. Using these technologies, we were able to automate the deployment and maintenance of Cognata&#8217;s Azure VMSS GPU clusters and simulation logic.<\/p>\n<p>Our joint solution is adaptable to any workload that requires:<\/p>\n<ul>\n<li>Provisioning Azure resources using Terraform.<\/li>\n<li>Using the Jenkins DevOps process to provision on-demand resources in Azure and in Kubernetes.<\/li>\n<\/ul>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/github.com\/Azure\/terraform-with-jenkins-samples\">Terraform plans and the Jenkins pipeline<\/a><\/li>\n<\/ul>\n<p>Cover photo by <a href=\"https:\/\/unsplash.com\/@andresalagon\">Andres Alagon<\/a> on <a href=\"https:\/\/unsplash.com\/photos\/2rkj-I-wbS4\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Developing robust algorithms for self-driving cars requires sourcing event data from over 10 billion hours of recorded driving time. 
CSE worked with Cognata, a startup developing simulation platforms for autonomous vehicles, to build a Jenkins pipeline and Terraform solution that enabled our partner to dynamically scale GPU resources for their simulations.<\/p>\n","protected":false},"author":21405,"featured_media":13040,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[15,16],"tags":[177,188,221,229,351,379],"class_list":["post-6667","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-containers","category-devops","tag-featured","tag-gpu","tag-jenkins-ci","tag-kubernetes","tag-terraform","tag-vmss"],"acf":[],"blog_post_summary":"<p>Developing robust algorithms for self-driving cars requires sourcing event data from over 10 billion hours of recorded driving time. CSE worked with Cognata, a startup developing simulation platforms for autonomous vehicles, to build a Jenkins pipeline and Terraform solution that enabled our partner to dynamically scale GPU resources for their 
simulations.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/6667","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21405"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=6667"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/6667\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13040"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=6667"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=6667"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=6667"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}