Autoscaling Deep Learning Training with Kubernetes

Rita Zhang


We recently partnered with Litbit, a San Jose-based startup, on a project to autoscale deep learning training. Litbit enables its customers to turn their “Internet of Things” into conscious personas that can learn, think, and do helpful things. In order to accomplish this goal, customers train their AI-empowered personas using sight, sound, and touch sensors (among others) to recognize specific situations.

Since different customers may be training different AI personas at different times, the training load tends to be bursty and unpredictable. Some of these training jobs (e.g., Spark ML) make heavy use of CPUs, while others (e.g., TensorFlow) make heavy use of GPUs. In the latter case, some jobs retrain a single layer of the neural net and finish very quickly, while others need to train an entire new neural net and can take several hours to days.

To meet the diverse requirements for training in a cost-efficient manner, Litbit needed a system that could scale different types of VM pools (CPU only, light GPU, heavy GPU) up and down based on demand. In this code story, we have generalized lessons learned from this scenario and will explain how to use the acs-engine-autoscaler to scale different types of VMs up and down based on demand.


While there are many options for running containerized distributed deep learning at scale, we have selected Kubernetes due to its superior cluster management technology and the huge developer community. To start, we need to create a Kubernetes cluster with GPU support on Azure to run different types of machine learning loads. Then we need to add autoscaling capability to the Kubernetes cluster to meet bursty demands in a cost-efficient manner.

Creating a Kubernetes cluster with GPU support using ACS-engine

To create a Kubernetes cluster that supports GPUs, we will use acs-engine, an open source tool that will generate the ARM template we need to deploy our cluster with everything already configured.

NOTE: You might be wondering why we are using acs-engine and not AKS (Azure Container Service, the managed Kubernetes service on Azure). To use the acs-engine-autoscaler, the Kubernetes cluster must be created by acs-engine as the autoscaler requires metadata information about the agent pools to scale the nodes up and down, which is information only exposed by acs-engine. Therefore, the acs-engine-autoscaler does not work with AKS.


Binary downloads for the latest version of acs-engine are available. Download acs-engine for your operating system. Extract the binary and copy it to your $PATH.

Generate Templates

acs-engine reads a JSON cluster definition which describes the size, shape, and configuration of your cluster.

First, update example\kubernetes.json to create a cluster that satisfies our requirements. To create different types of VM pools (CPU only, light GPU, heavy GPU), we can create different agent pools by adding additional sections under agentPoolProfiles. Each pool can have a different VM size and can scale up to 100 nodes (as set by the MaxAgentCount constant in acs-engine). For our scenario, we want a cluster for training with GPUs and inference with CPUs only. Here we are defining two pools as we don’t want to pay for GPU unless needed. The number of agents isn’t really important because we are going to enable autoscaling later, so we will keep everything as 1.

At the time of this writing, Azure has 6 different VM sizes with GPU support. For more details with generating templates, refer to the ACS engine guide.

Here is an example of our Kubernetes cluster definition with multiple agent pools:

  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "mymlcluster",
      "vmSize": "Standard_D2_v2"
    "agentPoolProfiles": [
        "name": "agentpool1",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
        "name": "agentpool2",
        "count": 1,
        "vmSize": "Standard_NC6",
        "availabilityProfile": "AvailabilitySet"
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
            "keyData": "SSH-PUB-KEY"
    "servicePrincipalProfile": {
      "clientId": "xxxxx",
      "secret": "xxxxx"

Now with the cluster definition JSON, let’s generate the templates by running:

$ acs-engine generate examples/kubernetes.json

This step generates a bunch of files under the _output/mymlcluster directory, including the ARM template and parameters that we want.

Deploy Templates

With the new azuredeploy.json and azuredeploy.parameters.json generated in the previous step, we can now deploy the templates using the Azure CLI.

Note: make sure you choose a region that has N-series VM available. For example, eastus and southcentralus are two regions with N-series skus available. Also, make sure your subscription has enough cores to run those VM types.

$ cd _output/mymlcluster
$ az login
$ az account set --subscription "<SUBSCRIPTION NAME OR ID>"
$ az group create \
    --name "<RESOURCE_GROUP_NAME>" \
    --location "<LOCATION>"

$ az group deployment create \
    --resource-group "<RESOURCE_GROUP_NAME>" \
    --template-file azuredeploy.json \
    --parameters azuredeploy.parameters.json

This step will take between 5 to 10 minutes to deploy. We will keep the generated azurdeploy.json and azuredeploy.parameters.json around as we will need them later to setup autoscaling.

Once the deployment is completed, copy the Kubernetes config file of the cluster locally to allow kubectl to communicate with the cluster. If you do not already have the kubectl cli, follow these instructions to install kubectl.

$ scp azureuser@<dnsname>.<regionname> ~/.kube/config

 Verifying the Cluster

To ensure everything is working as intended, run:

$ kubectl describe node <name-of-a-gpu-node>

You should see the correct number of GPUs reported (in this example, it shows 1 GPU for a NC6 VM):

Capacity:    1
 cpu:                   6

If is shown as 0, wait a bit longer. The driver installation takes about 12 minutes, and the node might join the cluster before the installation is completed. After a few minutes, the node should restart, and report the correct number of GPUs.

Scheduling a GPU Container

Now that we have a GPU-enabled Kubernetes cluster, we can run a container that requires GPU resources. Below is an example GPU container running TensorFlow. To request GPU resources, we have to specify how many GPU the container needs, then Kubernetes will map the device into the container. To use the drivers, we need to mount the driver from the Kubernetes agent host into the container.

Note: the drivers are installed under /usr/lib/nvidia-384 (or another version number depending on the driver’s version).

apiVersion: extensions/v1beta1
kind: Deployment
    app: tensorflow
  name: tensorflow
        app: tensorflow
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["python"]      
        imagePullPolicy: IfNotPresent
        - name: LD_LIBRARY_PATH
          value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/lib/nvidia
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/
          name: libcuda
        - name: bin
            path: /usr/lib/nvidia-384/bin
        - name: lib
            path: /usr/lib/nvidia-384
        - name: libcuda
            path: /usr/lib/x86_64-linux-gnu/

Note: we have specified 1 for the resources requests, and mounted the drivers from the host into the container. We also modified the LD_LIBRARY_PATH environment variable to let Python know where to find the driver’s libraries.

Some libraries, such as, are installed under /usr/lib/x86_64-linux-gnu on the host. Depending on your requirements, you might need to mount them separately as shown above.

Schedule the deployment with the following command:

$ kubectl create -f tftrain.yaml

Autoscaling Kubernetes Cluster to Meet Bursty Demands

Now that we have a Kubernetes cluster that can run CPU workloads and GPU workloads, we need to be able to scale the VM pools up and down based on demands.  Kubernetes-acs-engine-autoscaler, a fork of OpenAI’s Kubernetes-ec2-autoscaler, can autoscale an acs-engine Kubernentes cluster based on demand.

The Kubernetesacs-engine-autoscaler will run inside the cluster and monitor the different pods that get scheduled. Whenever a pod is pending because of a lack of resources, the autoscaler will create an adequate number of new VMs to support the scheduled pod. Finally, when VMs are idle, the autoscaler will delete them. As a result, we can achieve the flexibility we want, while still keeping costs down.

Setting up the Autoscaler

The acs-engine-autoscaler can be installed with a Helm chart. Helm is a Kubernetes package manager that helps us package, install, and manage our Kubernetes applications. Using the stable/acs-engine-autoscaler Helm chart, we can install the autoscaler in our cluster.

First, locate your azuredeploy.parameters.json file generated with acs-engine from the previous step.

Next, find the values.yaml file from the acs-engine-autoscaler Helm chart. Update the following parameters in the file.

Parameter Description
resourcegroup Name of the resource group containing the cluster
azurespappid An Azure service principal ID
azurespsecret An Azure service principal secret
azuresptenantid An Azure service principal tenant ID
kubeconfigprivatekey The key passed to the kubeConfigPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine
clientprivatekey The key passed to the clientPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine
caprivatekey  The key passed to the caPrivateKey parameter in your azuredeploy.parameters.json generated with acs-engine

Finally, after you have updated the values.yaml file in the chart, run the following to install the chart with the release name my-release:

$ helm install stable/acs-engine-autoscaler

 Verifying Installation

To verify the acs-engine-autoscaler is configured properly, find the pod that the deployment created and look at its logs. The result will look something similar to the following:

To verify that acs-engine-autoscaler has started, run:

  kubectl --namespace=default get pods -l "app=olfactory-bunny-acs-engine-autoscaler"

To verify that acs-engine-autoscaler is running as expected, run:
  kubectl logs $(kubectl --namespace=default get pods -l "app=olfactory-bunny-acs-engine-autoscaler" -o jsonpath="{.items[0]}")

$ kubectl --namespace=default get pods -l "app=olfactory-bunny-acs-engine-autoscaler"

NAME                                                     READY     STATUS    RESTARTS   AGE
olfactory-bunny-acs-engine-autoscaler-1715934483-c673v   1/1       Running   0          10s

$ kubectl logs $(kubectl --namespace=default get pods -l "app=olfactory-bunny-acs-engine-autoscaler" -o jsonpath="{.items[0]}")

You should see something like the following in the logs of the autoscaler pod.

2017-06-11 23:20:59,352 - autoscaler.cluster - DEBUG - Using kube service account
2017-06-11 23:20:59,352 - autoscaler.cluster - INFO - ++++ Running Scaling Loop ++++++
2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Pods to schedule: 0
2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - ++++ Scaling Up Begins ++++++
2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Nodes: 1
2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - To schedule: 0
2017-06-11 23:20:59,421 - autoscaler.cluster - INFO - Pending pods: 0
2017-06-11 23:20:59,422 - autoscaler.cluster - INFO - ++++ Scaling Up Ends ++++++
2017-06-11 23:20:59,422 - autoscaler.cluster - INFO - ++++ Maintenance Begins ++++++
2017-06-11 23:20:59,422 - autoscaler.engine_scaler - INFO - ++++ Maintaining Nodes ++++++
2017-06-11 23:20:59,423 - autoscaler.engine_scaler - INFO - node: k8s-agentpool1-29744472-4                                                   state: under-utilized-undrainable
2017-06-11 23:20:59,423 - autoscaler.cluster - INFO - ++++ Maintenance Ends ++++++

Autoscaling the Cluster

Recall from the previous section how our Kubernetes cluster has 2 agent pools, each with a single agent. The job we ran in the previous section requires 1 GPU and only agentpool2 has a VM with GPU. Now to test our autoscaler, let’s schedule a second job with GPU request that is similar to the tftrain.yaml deployment we ran earlier in our cluster.

From the autoscaler’s pod’s log, we should see agentpool2 scaling up to meet the demand:

autoscaler.cluster - INFO - ++++++ Running Scaling Loop ++++++
autoscaler.cluster - INFO - Pods to schedule: 1
autoscaler.cluster - INFO - ++++++ Scaling Up Begins ++++++
autoscaler.cluster - INFO - Nodes: 2
autoscaler.cluster - INFO - To schedule: 1
autoscaler.cluster - INFO - Pending pods: 1
autoscaler.cluster - INFO - ========= Scaling for 1 pods ========
autoscaler.cluster - INFO - New capacity requested for pool agentpool2: 2 agents (current capacity: 1 agents)
autoscaler.deployments - INFO - Deployment started

After a few minutes, the new VM with GPU will be created, and our second job starts running. Once the jobs are completed, the pods are terminated. The autoscaler will notice one or more nodes are now idle and will adjust the cluster size accordingly.

First, idle VMs will be cordoned and drained:

autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   
state: under-utilized-drainable
autoscaler.kube - INFO - cordoned k8s-agentpool1-32238962-1
autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-ghr3z
autoscaler.kube - INFO - drained k8s-agentpool1-32238962-4

Then after some time, the cordoned node will get deleted:

autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   
state: idle-unschedulable
autoscaler.container_service - INFO - deleting node k8s-agentpool1-32238962-1
autoscaler.container_service - INFO - Deleting VM
autoscaler.container_service - INFO - Deleting NIC
autoscaler.container_service - INFO - Deleting OS disk

Voilà! Now we have a Kubernetes cluster that can autoscale as new pods are scheduled and resources are requested.

Horizontal Pod Autoscaling

In some scenarios, you might want to scale up and down based on some metrics, for example, CPU or memory usage. Kubernetes Horizontal Pod Autoscaling (HPA) allows us to specify a metric and target to track on a deployment.

For example, for a given deployment, you might want to configure HPA to have a combined average CPU usage not exceeding 50%. Once the CPU usage of all running pods exceeds 50%, HPA will increase the number of replicas in the deployment and spread the load across the cluster. But eventually, the existing VMs in the cluster will not be able to support more replicas, and new pods created by HPA will start hanging in a <pending> state. This is where the acs-engine-autoscaler will notice the pending pods and start to create new VMs to support them, then delete the idle VMs once the jobs are completed.

To understand how to configure Horizontal Pod Autoscaling, check out the official documentation.


With this solution, we were able to help Litbit to scale up to 40 nodes at a time then subsequently downscale as planned. Litbit has been successfully using this for the past 4 months.  This solution is ideal for use cases where you need to scale different types of VMs up and down based on demand. To test the Azure autoscaler for your own use case, check out this GitHub repo.


Cover image by Markus Spiske on Unsplash


Comments are closed. Login to edit/delete your existing comments

Feedback usabilla icon