{"id":5432,"date":"2017-11-21T08:31:13","date_gmt":"2017-11-21T16:31:13","guid":{"rendered":"\/developerblog\/?p=5432"},"modified":"2020-03-14T18:25:41","modified_gmt":"2020-03-15T01:25:41","slug":"autoscaling-deep-learning-training-kubernetes","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/autoscaling-deep-learning-training-kubernetes\/","title":{"rendered":"Autoscaling Deep Learning Training with Kubernetes"},"content":{"rendered":"<h2>Background<\/h2>\n<p>We recently partnered with <a href=\"http:\/\/www.litbit.com\/\">Litbit<\/a>, a San Jose-based startup, on a project to autoscale deep learning training. Litbit enables its customers to turn their &#8220;Internet of Things\u201d into conscious personas that can learn, think, and do helpful things. In order to accomplish this goal, customers train their AI-empowered personas using sight, sound, and touch sensors (among others) to recognize specific situations.<\/p>\n<p>Since different customers may be training different AI personas at different times, the training load tends to be bursty and unpredictable. Some of these training jobs (e.g., Spark ML) make heavy use of CPUs, while others (e.g., TensorFlow) make heavy use of GPUs. In the latter case, some jobs retrain a single layer of the neural net and finish very quickly, while others need to train an entire new neural net\u00a0and can take several hours to days.<\/p>\n<p>To meet the diverse requirements for training in a cost-efficient manner, Litbit needed a system that could scale different types of VM pools (CPU only, light GPU, heavy GPU) up and down based on demand. 
In this code story, we have generalized lessons learned from this scenario and will explain how to use the <em>acs-engine-autoscaler<\/em> to scale different types of VMs up and down based on demand.<\/p>\n<h2>Overview<\/h2>\n<p>While there are many options for running containerized distributed deep learning at scale, we have selected Kubernetes due to its superior cluster management technology and the huge developer community. To start, we need to create a Kubernetes cluster with GPU support on Azure to run different types of machine learning loads. Then we need to add autoscaling capability to the Kubernetes cluster to meet bursty demands in a cost-efficient manner.<\/p>\n<h2>Creating a Kubernetes cluster with GPU support using ACS-engine<\/h2>\n<p>To create a Kubernetes cluster that supports GPUs, we will use <em>acs-engine<\/em>, an open source tool that will generate the ARM template we need to deploy our cluster with everything already configured.<\/p>\n<p>NOTE: You might be wondering why we are using\u00a0<em>acs-engine<\/em> and not\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/aks\/intro-kubernetes\">AKS<\/a> (Azure Container Service, the managed Kubernetes service on Azure). To use the <em>acs-engine-autoscaler<\/em>, the Kubernetes cluster must be created by\u00a0<em>acs-engine<\/em> as the autoscaler requires metadata information about the agent pools to scale the nodes up and down, which is information only exposed by\u00a0<em>acs-engine.\u00a0<\/em>Therefore, the <em>acs-engine-autoscaler<\/em> does not work with AKS.<\/p>\n<h3>Install<\/h3>\n<p>Binary downloads for the <a href=\"https:\/\/github.com\/Azure\/acs-engine\/releases\/latest\">latest version of <em>acs-engine<\/em><\/a>\u00a0are available. Download <em>acs-engine<\/em> for your operating system. 
Extract the binary and copy it to your <em>$PATH<\/em>.<\/p>\n<h3>Generate Templates<\/h3>\n<p><em>acs-engine<\/em> reads a JSON cluster definition that describes the size, shape, and configuration of your cluster.<\/p>\n<p>First, update <a href=\"https:\/\/github.com\/Azure\/acs-engine\/blob\/master\/examples\/kubernetes.json\">examples\/kubernetes.json<\/a> to create a cluster that satisfies our requirements. To create different types of VM pools (CPU only, light GPU, heavy GPU), we can create different agent pools by adding additional sections under <em>agentPoolProfiles<\/em>. Each pool can have a different VM size and can scale up to 100 nodes (as set by the <em>MaxAgentCount<\/em>\u00a0constant in acs-engine). For our scenario, we want a cluster for training with GPUs and inference with CPUs only. Here we are defining two pools, as we don&#8217;t want to pay for GPUs unless they are needed. The number of agents isn\u2019t really important because we are going to enable autoscaling later, so we will keep everything as 1.<\/p>\n<p>At the time of this writing, Azure has <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/azure-n-series-preview-availability\">6 different VM sizes with GPU support<\/a>.\u00a0For more details on generating templates, refer to the\u00a0<a href=\"https:\/\/github.com\/Azure\/acs-engine\/blob\/master\/docs\/kubernetes\/deploy.md#acs-engine-the-long-way\">ACS engine guide<\/a>.<\/p>\n<p>Here is an example of our Kubernetes cluster definition with multiple agent pools:<\/p>\n<pre class=\"lang:js decode:true\">{\r\n  \"apiVersion\": \"vlabs\",\r\n  \"properties\": {\r\n    \"orchestratorProfile\": {\r\n      \"orchestratorType\": \"Kubernetes\"\r\n    },\r\n    \"masterProfile\": {\r\n      \"count\": 1,\r\n      \"dnsPrefix\": \"mymlcluster\",\r\n      \"vmSize\": \"Standard_D2_v2\"\r\n    },\r\n    \"agentPoolProfiles\": [\r\n      {\r\n        \"name\": \"agentpool1\",\r\n        \"count\": 1,\r\n        \"vmSize\": \"Standard_D2_v2\",\r\n   
     \"availabilityProfile\": \"AvailabilitySet\"\r\n      },     \r\n      {\r\n        \"name\": \"agentpool2\",\r\n        \"count\": 1,\r\n        \"vmSize\": \"Standard_NC6\",\r\n        \"availabilityProfile\": \"AvailabilitySet\"\r\n      }\r\n    ],\r\n    \"linuxProfile\": {\r\n      \"adminUsername\": \"azureuser\",\r\n      \"ssh\": {\r\n        \"publicKeys\": [\r\n          {\r\n            \"keyData\": \"SSH-PUB-KEY\"\r\n          }\r\n        ]\r\n      }\r\n    },\r\n    \"servicePrincipalProfile\": {\r\n      \"clientId\": \"xxxxx\",\r\n      \"secret\": \"xxxxx\"\r\n    }\r\n  }\r\n}<\/pre>\n<p>Now with the cluster definition JSON, let&#8217;s generate the templates by running:<\/p>\n<pre class=\"lang:sh decode:true \">$ acs-engine generate examples\/kubernetes.json<\/pre>\n<p>This step generates a bunch of files under the _output\/mymlcluster directory, including the ARM template and parameters that we want.<\/p>\n<h3>Deploy Templates<\/h3>\n<p>With the new <em>azuredeploy.json<\/em> and <em>azuredeploy.parameters.json<\/em> generated in the previous step, we can now deploy the templates using the <a href=\"https:\/\/docs.microsoft.com\/en-us\/cli\/azure\/install-azure-cli?view=azure-cli-latest\">Azure CLI<\/a>.<\/p>\n<p><strong>Note: <\/strong>make sure you choose a <a href=\"https:\/\/azure.microsoft.com\/en-us\/regions\/services\/\">region that has N-series VM available<\/a>. For example, eastus and southcentralus are two regions with N-series skus available. 
Also, make sure your subscription has enough cores to run those VM types.<\/p>\n<pre class=\"lang:sh decode:true\">$ cd _output\/mymlcluster\r\n$ az login\r\n$ az account set --subscription \"&lt;SUBSCRIPTION NAME OR ID&gt;\"\r\n$ az group create \\\r\n    --name \"&lt;RESOURCE_GROUP_NAME&gt;\" \\\r\n    --location \"&lt;LOCATION&gt;\"\r\n\r\n$ az group deployment create \\\r\n    --resource-group \"&lt;RESOURCE_GROUP_NAME&gt;\" \\\r\n    --template-file azuredeploy.json \\\r\n    --parameters azuredeploy.parameters.json<\/pre>\n<p>This step will take 5 to 10 minutes to deploy. We will keep the generated <em>azuredeploy.json<\/em> and <em>azuredeploy.parameters.json<\/em> around, as we will need them later to set up autoscaling.<\/p>\n<p>Once the deployment is completed, copy the Kubernetes config file of the cluster locally to allow kubectl to communicate with the cluster. If you do not already have the kubectl CLI, follow these <a href=\"https:\/\/kubernetes.io\/docs\/tasks\/tools\/install-kubectl\/\">instructions<\/a> to install kubectl.<\/p>\n<pre class=\"lang:sh decode:true\">$ scp azureuser@&lt;dnsname&gt;.&lt;regionname&gt;.cloudapp.azure.com:.kube\/config ~\/.kube\/config<\/pre>\n<h3>Verifying the Cluster<\/h3>\n<p>To ensure everything is working as intended, run:<\/p>\n<pre class=\"lang:sh decode:true \">$ kubectl describe node &lt;name-of-a-gpu-node&gt;<\/pre>\n<p>You should see the correct number of GPUs reported (in this example, it shows 1 GPU for an NC6 VM):<\/p>\n<pre class=\"lang:sh decode:true\">...\r\nCapacity:\r\n alpha.kubernetes.io\/nvidia-gpu:    1\r\n cpu:                   6\r\n...<\/pre>\n<p>If alpha.kubernetes.io\/nvidia-gpu is shown as 0, wait a bit longer. The driver installation takes about 12 minutes, and the node might join the cluster before the installation is completed. 
After a few minutes, the node should restart and report the correct number of GPUs.<\/p>\n<h3>Scheduling a GPU Container<\/h3>\n<p>Now that we have a GPU-enabled Kubernetes cluster, we can run a container that requires GPU resources. Below is an example GPU container running TensorFlow. To request GPU resources, we have to specify how many GPUs the container needs, then Kubernetes will map the devices into the container. To use the drivers, we need to mount them from the Kubernetes agent host into the container.<\/p>\n<p><strong>Note:<\/strong> the drivers are installed under <em>\/usr\/lib\/nvidia-384\u00a0<\/em>(or another version number depending on the driver&#8217;s version).<\/p>\n<pre class=\"lang:yaml decode:true\">apiVersion: extensions\/v1beta1\r\nkind: Deployment\r\nmetadata:\r\n  labels:\r\n    app: tensorflow\r\n  name: tensorflow\r\nspec:\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: tensorflow\r\n    spec:\r\n      containers:\r\n      - name: tensorflow\r\n        image: tensorflow\/tensorflow:latest-gpu\r\n        command: [\"python\", \"main.py\"]\r\n        imagePullPolicy: IfNotPresent\r\n        env:\r\n        - name: LD_LIBRARY_PATH\r\n          value: \/usr\/lib\/nvidia:\/usr\/lib\/x86_64-linux-gnu\r\n        resources:\r\n          requests:\r\n            alpha.kubernetes.io\/nvidia-gpu: 1\r\n        volumeMounts:\r\n        - mountPath: \/usr\/local\/nvidia\/bin\r\n          name: bin\r\n        - mountPath: \/usr\/lib\/nvidia\r\n          name: lib\r\n        - mountPath: \/usr\/lib\/x86_64-linux-gnu\/libcuda.so.1\r\n          name: libcuda\r\n      volumes:\r\n        - name: bin\r\n          hostPath: \r\n            path: \/usr\/lib\/nvidia-384\/bin\r\n        - name: lib\r\n          hostPath: \r\n            path: \/usr\/lib\/nvidia-384\r\n        - name: libcuda\r\n          hostPath:\r\n            path: \/usr\/lib\/x86_64-linux-gnu\/libcuda.so.1<\/pre>\n<p><strong>Note:<\/strong> we have specified 
<em>alpha.kubernetes.io\/nvidia-gpu: 1<\/em> for the resource requests, and mounted the drivers from the host into the container. We also modified the <em>LD_LIBRARY_PATH<\/em> environment variable to let Python know where to find the driver&#8217;s libraries.<\/p>\n<p>Some libraries, such as <em>libcuda.so<\/em>, are installed under <em>\/usr\/lib\/x86_64-linux-gnu<\/em> on the host. Depending on your requirements, you might need to mount them separately as shown above.<\/p>\n<p>Schedule the deployment with the following command:<\/p>\n<pre class=\"lang:sh decode:true\">$ kubectl create -f tftrain.yaml<\/pre>\n<h2>Autoscaling the Kubernetes Cluster to Meet Bursty Demands<\/h2>\n<p>Now that we have a Kubernetes cluster that can run both CPU and GPU workloads, we need to be able to scale the VM pools up and down based on demand.\u00a0<a href=\"https:\/\/github.com\/wbuchwalter\/kubernetes-acs-autoscaler\">Kubernetes-acs-engine-autoscaler<\/a>, a fork of OpenAI&#8217;s Kubernetes-ec2-autoscaler, can autoscale an acs-engine Kubernetes cluster based on demand.<\/p>\n<p>The <em>Kubernetes<\/em>&#8211;<em>acs-engine-autoscaler<\/em> will run inside the cluster and monitor the different pods that get scheduled. Whenever a pod is pending because of a lack of resources, the autoscaler will create an adequate number of new VMs to support the scheduled pod. When VMs become idle, the autoscaler will delete them. As a result, we can achieve the flexibility we want while still keeping costs down.<\/p>\n<h3>Setting up the Autoscaler<\/h3>\n<p>The\u00a0<em>acs-engine-autoscaler<\/em> can be installed with a Helm chart. Helm is a Kubernetes package manager that helps us package, install, and manage our Kubernetes applications. 
Using the <a href=\"https:\/\/github.com\/kubernetes\/charts\/tree\/master\/stable\/acs-engine-autoscaler\">stable\/acs-engine-autoscaler<\/a> Helm chart, we can install the autoscaler in our cluster.<\/p>\n<p>First, locate your <em>azuredeploy.parameters.json<\/em>\u00a0file generated with <em>acs-engine<\/em>\u00a0from the previous step.<\/p>\n<p>Next, find the\u00a0<a href=\"https:\/\/github.com\/kubernetes\/charts\/blob\/master\/stable\/acs-engine-autoscaler\/values.yaml\">values.yaml<\/a> file from the acs-engine-autoscaler Helm chart. Update the following parameters in the file.<\/p>\n<pre class=\"lang:yaml decode:true \">acsenginecluster:\r\n  resourcegroup:\r\n  azurespappid:\r\n  azurespsecret:\r\n  azuresptenantid:\r\n  kubeconfigprivatekey:\r\n  clientprivatekey:\r\n  caprivatekey:<\/pre>\n<table style=\"height: 315px\" width=\"1330\">\n<tbody>\n<tr>\n<td>Parameter<\/td>\n<td>Description<\/td>\n<\/tr>\n<tr>\n<td>resourcegroup<\/td>\n<td>Name of the resource group containing the cluster<\/td>\n<\/tr>\n<tr>\n<td>azurespappid<\/td>\n<td>An Azure service principal ID<\/td>\n<\/tr>\n<tr>\n<td>azurespsecret<\/td>\n<td>An Azure service principal secret<\/td>\n<\/tr>\n<tr>\n<td>azuresptenantid<\/td>\n<td>An Azure service principal tenant ID<\/td>\n<\/tr>\n<tr>\n<td>kubeconfigprivatekey<\/td>\n<td>The key passed to the <em>kubeConfigPrivateKey<\/em> parameter in your <em>azuredeploy.parameters.json<\/em> generated with <em>acs-engine<\/em><\/td>\n<\/tr>\n<tr>\n<td>clientprivatekey<\/td>\n<td>The key passed to the <em>clientPrivateKey<\/em>\u00a0parameter in your <em>azuredeploy.parameters.json<\/em> generated with <em>acs-engine<\/em><\/td>\n<\/tr>\n<tr>\n<td>caprivatekey<\/td>\n<td>\u00a0The key passed to the <em>caPrivateKey<\/em>\u00a0parameter in your <em>azuredeploy.parameters.json<\/em> generated with <em>acs-engine<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Finally, after you have updated the values.yaml file in the chart, run the following to install 
the chart:<\/p>\n<pre class=\"lang:sh decode:true\">$ helm install -f values.yaml stable\/acs-engine-autoscaler<\/pre>\n<h3>Verifying Installation<\/h3>\n<p>To verify the <em>acs-engine-autoscaler<\/em> is configured properly, find the pod that the deployment created and look at its logs. The result will look similar to the following:<\/p>\n<pre class=\"lang:sh decode:true\">To verify that acs-engine-autoscaler has started, run:\r\n\r\n  kubectl --namespace=default get pods -l \"app=olfactory-bunny-acs-engine-autoscaler\"\r\n\r\nTo verify that acs-engine-autoscaler is running as expected, run:\r\n  kubectl logs $(kubectl --namespace=default get pods -l \"app=olfactory-bunny-acs-engine-autoscaler\" -o jsonpath=\"{.items[0].metadata.name}\")\r\n\r\n$ kubectl --namespace=default get pods -l \"app=olfactory-bunny-acs-engine-autoscaler\"\r\n\r\nNAME                                                     READY     STATUS    RESTARTS   AGE\r\nolfactory-bunny-acs-engine-autoscaler-1715934483-c673v   1\/1       Running   0          10s\r\n\r\n$ kubectl logs $(kubectl --namespace=default get pods -l \"app=olfactory-bunny-acs-engine-autoscaler\" -o jsonpath=\"{.items[0].metadata.name}\")<\/pre>\n<p>You should see something like the following in the logs of the autoscaler pod:<\/p>\n<pre class=\"lang:sh decode:true\">2017-06-11 23:20:59,352 - autoscaler.cluster - DEBUG - Using kube service account\r\n2017-06-11 23:20:59,352 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++\r\n2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Pods to schedule: 0\r\n2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++\r\n2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Nodes: 1\r\n2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - To schedule: 0\r\n2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Pending pods: 0\r\n2017-06-11 23:20:59,422 - autoscaler.cluster - INFO - ++++ Scaling Up Ends 
++++++\r\n2017-06-11 23:20:59,422 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++\r\n2017-06-11 23:20:59,422 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++\r\n2017-06-11 23:20:59,423 - autoscaler.engine_scaler - INFO - node: k8s-agentpool1-29744472-4                                                   state: under-utilized-undrainable\r\n2017-06-11 23:20:59,423 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++\r\n...<\/pre>\n<h2>Autoscaling the Cluster<\/h2>\n<p>Recall that our Kubernetes cluster has two agent pools, each with a single agent. The job we ran in the previous section requires 1 GPU, and only agentpool2 has VMs with GPUs. To test our autoscaler, let&#8217;s schedule a second GPU job in our cluster, similar to the tftrain.yaml deployment we ran earlier.<\/p>\n<p>From the autoscaler pod&#8217;s logs, we should see agentpool2 scaling up to meet the demand:<\/p>\n<pre class=\"lang:sh decode:true \">autoscaler.cluster - INFO - ++++++ Running Scaling Loop ++++++\r\nautoscaler.cluster - INFO - Pods to schedule: 1\r\nautoscaler.cluster - INFO - ++++++ Scaling Up Begins ++++++\r\nautoscaler.cluster - INFO - Nodes: 2\r\nautoscaler.cluster - INFO - To schedule: 1\r\nautoscaler.cluster - INFO - Pending pods: 1\r\nautoscaler.cluster - INFO - ========= Scaling for 1 pods ========\r\n[...]\r\nautoscaler.cluster - INFO - New capacity requested for pool agentpool2: 2 agents (current capacity: 1 agents)\r\nautoscaler.deployments - INFO - Deployment started<\/pre>\n<p id=\"8f11\" class=\"graf graf--p graf-after--pre\">After a few minutes, the new GPU VM will be created and our second job will start running. 
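The scale-up step in the log above amounts to a simple capacity calculation: count the resources the pending pods request and ask the pool for enough additional VMs to fit them. Here is a minimal sketch of that arithmetic; the function name and parameters are illustrative assumptions (not the autoscaler's actual code), and the 100-agent cap mirrors the <em>MaxAgentCount</em> constant mentioned earlier:

```python
import math

def new_pool_capacity(current_agents, pending_gpu_requests, gpus_per_vm, max_agents=100):
    """Agent count to request for a pool so that all pending GPU requests fit.

    Hypothetical illustration of the scaling decision, capped at max_agents.
    """
    extra_vms = math.ceil(pending_gpu_requests / gpus_per_vm)
    return min(current_agents + extra_vms, max_agents)

# One pending pod requesting 1 GPU, on a Standard_NC6 pool (1 GPU per VM)
# that currently has 1 agent -> request a capacity of 2 agents.
print(new_pool_capacity(1, 1, 1))
```

This matches the "New capacity requested for pool agentpool2: 2 agents" line in the log: one agent is running, one more VM is needed for the pending GPU request.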
Once the jobs are completed, the pods are terminated.\u00a0The autoscaler will notice one or more nodes are now idle and will adjust the cluster size accordingly.<\/p>\n<p class=\"graf graf--p graf-after--pre\">First, idle VMs will be cordoned and drained:<\/p>\n<pre class=\"lang:sh decode:true\">autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   \r\nstate: under-utilized-drainable\r\nautoscaler.kube - INFO - cordoned k8s-agentpool1-32238962-1\r\nautoscaler.kube - INFO - Deleting Pod kube-system\/kube-proxy-ghr3z\r\nautoscaler.kube - INFO - drained k8s-agentpool1-32238962-4<\/pre>\n<p>Then after some time, the cordoned node will get deleted:<\/p>\n<pre class=\"lang:sh decode:true\">autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   \r\nstate: idle-unschedulable\r\nautoscaler.container_service - INFO - deleting node k8s-agentpool1-32238962-1\r\nautoscaler.container_service - INFO - Deleting VM\r\nautoscaler.container_service - INFO - Deleting NIC\r\nautoscaler.container_service - INFO - Deleting OS disk<\/pre>\n<p>Voil\u00e0! Now we have a Kubernetes cluster that can autoscale as new pods are scheduled and resources are requested.<\/p>\n<h2 id=\"685f\" class=\"graf graf--h4 graf-after--pre\">Horizontal Pod Autoscaling<\/h2>\n<p>In some scenarios, you might want to\u00a0scale up and down based on some metrics, for example, CPU or memory usage.\u00a0Kubernetes\u00a0<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/kubernetes.io\/docs\/user-guide\/horizontal-pod-autoscaling\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Horizontal Pod Autoscaling<\/a>\u00a0(HPA)\u00a0allows us to specify a metric and target to track on a deployment.<\/p>\n<p>For example, for a given\u00a0deployment, you might want to configure HPA to have a combined average CPU usage not exceeding 50%. 
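That 50% target feeds the scaling rule documented for HPA: desiredReplicas = ceil(currentReplicas &times; currentMetricValue / targetMetricValue). A minimal sketch of the rule (the function name is ours):

```python
import math

def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct):
    """Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

# 4 replicas averaging 75% CPU against a 50% target -> scale out to 6 replicas.
print(hpa_desired_replicas(4, 75, 50))
```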
Once the CPU usage of all running pods exceeds 50%, HPA will increase the number of replicas in the deployment and spread the load across the cluster. Eventually, though, the existing VMs in the cluster will not be able to support more replicas, and new pods created by HPA will be stuck in a &lt;pending&gt; state. At that point, the <em>acs-engine-autoscaler<\/em> will notice the pending pods, create new VMs to support them, and then delete the idle VMs once the jobs are completed.<\/p>\n<p id=\"aeb3\" class=\"graf graf--p graf-after--p\">To understand how to configure Horizontal Pod Autoscaling,\u00a0<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">check out the official documentation<\/a>.<\/p>\n<h2>Conclusion<\/h2>\n<p>With this solution, we were able to help Litbit scale up to 40 nodes at a time and then scale back down as planned; Litbit has been using it successfully for the past four months. This approach is ideal for use cases where you need to scale different types of VMs up and down based on demand. 
To test the Azure autoscaler for your own use case, check out <a href=\"https:\/\/github.com\/wbuchwalter\/Kubernetes-acs-engine-autoscaler\">this GitHub repo<\/a>.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/github.com\/wbuchwalter\/Kubernetes-acs-engine-autoscaler\">Kubernetes acs-engine-autoscaler GitHub repo<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/kubernetes\/charts\/tree\/master\/stable\/acs-engine-autoscaler\">Official acs-engine-autoscaler Helm chart<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Azure\/acs-engine\/blob\/master\/docs\/kubernetes\/deploy.md#acs-engine-the-long-way\">ACS engine guide<\/a><\/li>\n<li><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Official Kubernetes Horizontal pod autoscale documentation<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/aks\/intro-kubernetes\">AKS &#8211; Azure Container Service: Managed Kubernetes Service<\/a><\/li>\n<\/ul>\n<hr \/>\n<p>Cover image by\u00a0<a href=\"https:\/\/unsplash.com\/photos\/vrbZVyX2k4I?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Markus Spiske<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We explore how we worked with a customer to add autoscaling capability to a Kubernetes cluster to meet bursty demands for deep learning training in a cost-efficient 
manner.<\/p>\n","protected":false},"author":21378,"featured_media":10854,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[15,19],"tags":[56,60,147,229,239,350],"class_list":["post-5432","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-containers","category-machine-learning","tag-autoscaling","tag-azure","tag-deep-learning","tag-kubernetes","tag-machine-learning-ml","tag-tensorflow"],"acf":[],"blog_post_summary":"<p>We explore how we worked with a customer to add autoscaling capability to a Kubernetes cluster to meet bursty demands for deep learning training in a cost-efficient manner.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/5432","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21378"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=5432"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/5432\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/10854"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=5432"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=5432"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=5432"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}