Common Pitfalls of Using Self-Hosted Build Agents

avbalter

Sharon Hart


What can possibly go wrong?

Moving from a build and release system's hosted agents to self-hosted agents, whether the system is GitHub Actions, Azure Pipelines, or any other, is often a daunting task that takes longer than it should.

Our team at Microsoft builds with our customers, leveraging cloud and edge technologies to engineer their next-generation solutions and helping them mature their development processes and scale operations across the organization. Similar issues tend to emerge across the teams we meet, and we generalize them into patterns anyone can use.

This post summarizes our team’s experience with the common pitfalls of moving to self-hosted agents and includes code samples that mitigate them.

Why Self-Hosted Agents

Vendor-hosted agents are simple and convenient, with everything you need pre-installed and working out of the box (see the lists of pre-installed software and tools on managed agents in Azure Pipelines and GitHub Actions). Still, hosting and maintaining your own set of agents often makes sense in terms of cost savings and added security.

Saving costs

The larger the organization, and the more mature its cloud processes, the more significant saving money on build agents becomes. For instance, parallel jobs on self-hosted agents are as much as four times cheaper on Azure DevOps than the Microsoft-hosted option (source), or even free on GitHub Actions runners (source).

Virtual machines can be paid for in advance or discounted by your cloud/virtualization vendor, so it is easy to understand why IT and DevOps leaders in organizations that are used to maintaining shared infrastructure would opt for self-hosting.

Added Security

Another common factor is security and compliance. Limiting public network access to internal resources improves the security posture and reduces the organization’s attack surface. For instance, access to storage accounts, repositories of code, binary artifacts, containers, or models, and other resources required for a build and release process is usually configured so that only specific networks can reach them. Furthermore, limiting access and authorization to specific networks and identities reduces the impact of potential attacks and promotes a zero-trust posture.

Self-hosted agents live within the corporate network and can be equipped with the required certificates, VPN tools, and identities to connect anywhere within the organization.

More information on choosing hosting plans can be found at Microsoft Learn.

What can possibly go wrong?

Configuring self-hosted agents is well documented, and sample labs exist that will help you set up your first agent, be it on Windows, Linux, macOS, or Docker, and guide you in using cloud offerings such as VM Scale Sets or Kubernetes to support different scales. Likewise, changing your pipeline code to run on self-hosted agents is trivial and requires only a minimal change. But there is still a gap in making everything work well!

Teams often lack a clear picture of how agents work under the hood and how self-hosted agents are set up, which leaves them dealing with recurring issues that can take a few weeks to iron out completely.

The Workspace and Shared Storage

One of the most significant differences between self-hosted and vendor-hosted agents is how the agent workspace is set up. The workspace is the directory in which the sources are downloaded and the pipeline processes run. While vendor-hosted agents provide a new and clean directory for every build and every job, self-hosted agents do not clean all directories by default. This behavior is configurable, and our team usually sets the workspace option to “clean: all” to resemble the setup on vendor-hosted agents, as shown in the snippet below.
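
For example, in Azure Pipelines the clean behavior can be declared per job, as in this minimal sketch (the pool name and step are placeholders):

# job-level workspace cleaning on a self-hosted agent
jobs:
- job: Build
  pool: SelfHostedPool   # hypothetical self-hosted agent pool
  workspace:
    clean: all           # delete and recreate the entire workspace before the job runs
  steps:
  - script: echo "Running in a freshly cleaned workspace"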

However, even when the agent runtime is set to clean the workspace, in some setups that might still not be enough, as some tools use global, or root, path references and persist data in folders that the agent runtime does not clean. Conda, for instance, maintains its virtual environment directories in a folder outside the workspace scope, so they remain undeleted unless explicitly cleaned. To mitigate this, add a pipeline step configured to always run (even if previous steps fail) that destroys and deletes the environment created during the pipeline execution.

When running jobs or pipelines in parallel, additional issues may arise even when persisted data is properly deleted between jobs. Because the agents share a folder at a global/root level, naming collisions may happen: one process can invalidate or delete a parallel process’s persistence folder (e.g., a Conda environment) while the other is still running and depends on it. Expect these issues when using Docker-based agents with mounted shared drives, VM agents running multiple runtime instances, or VM agents with a shared drive.
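
As an illustration (the environment and file names are placeholders, not taken from the samples below), here is roughly what the collision looks like when two jobs on agents sharing the same Conda root use a non-unique environment name:

# Job A (agent 1, shared Conda root)
conda env create -n test-env --file environment.yml
source activate test-env
# ... long-running tests start here ...

# Job B (agent 2, same shared Conda root), running in parallel a few minutes later
conda env create --force -n test-env --file environment.yml   # --force recreates the env Job A is still using
conda env remove --name test-env                               # cleanup deletes it entirely

# Job A's remaining steps now fail with missing-interpreter or missing-package errors.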

Pattern for Uniqueness in Shared Resources

To mitigate the issues described in the previous section, we typically implement a uniqueness pattern when using shared resources in agent jobs. This practice ensures that any entity created in the shared resource (e.g., storage) has a unique name, derived from the build ID, wherever it is referenced. Taking the Conda issues described above as an example, we provide a unique name when creating, operating on, and deleting every Conda environment in our pipeline.

Conda Uniqueness Pattern

The following sample shows two YAML files — a template, and a pipeline that uses the template to create, use and delete a uniquely named Conda environment for each build.

# run-tests-in-conda-environment-template.yml
parameters:
- name: CONDA_ENV_NAME
  default: ''
- name: CONDA_ENV_FILE
  default: ''
- name: TESTS_TO_RUN
  type: stepList
  default: []

steps:
- bash: |
    set -eux

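    # Logging command: prepend the agent's Anaconda install (assumed here at /tools/anaconda3/bin) to PATH for the steps that follow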
    echo "##vso[task.prependpath]/tools/anaconda3/bin"
    conda env create --force -n $CONDA_ENV_NAME --file $CONDA_ENV_FILE
    conda init bash
    source activate $CONDA_ENV_NAME
    conda deactivate
  displayName: Create Anaconda environment
  env:
    CONDA_ENV_NAME: ${{ parameters.CONDA_ENV_NAME }}
    CONDA_ENV_FILE: ${{ parameters.CONDA_ENV_FILE }}

- ${{ parameters.TESTS_TO_RUN }}

- bash: |
    set -eux

    conda env remove --name $CONDA_ENV_NAME
  condition: always()
  displayName: "Delete conda environment"  
  env:
    CONDA_ENV_NAME: ${{ parameters.CONDA_ENV_NAME }}  

# anaconda-test-pipeline.yml
trigger:
  - main

variables:
  - name: CONDA_ENV_NAME
    value: test-env-$(Build.BuildNumber)

jobs:          

- job: Test
  displayName: 'Test Application'
  variables:
    python.version: "3.7"
  steps:
  - template: run-tests-in-conda-environment-template.yml
    parameters:
      CONDA_ENV_NAME: ${{ variables.CONDA_ENV_NAME }}
      CONDA_ENV_FILE: 'app/environment.yml'
      TESTS_TO_RUN: 
      - bash: |
          set -eux
          source activate $CONDA_ENV_NAME
          python -m pytest  app/ --junitxml=junit/results.xml
          conda deactivate
        displayName: Run Tests
        env:
          CONDA_ENV_NAME: ${{ variables.CONDA_ENV_NAME }}

  • run-tests-in-conda-environment-template.yml – A template that uses YAML insertion to enrich the “TESTS_TO_RUN” step list with pre- and post-tasks that create and delete the Conda environment.
  • anaconda-test-pipeline.yml – A pipeline that sets a unique CONDA_ENV_NAME variable using the predefined $(Build.BuildNumber) variable, which is unique per build, and provides that value to the template, which creates, uses, and deletes the Conda environment.

Shared Docker Daemon

Using Docker to verify that the application builds and runs properly is quite common during build and test, and it works seamlessly on vendor-hosted agents. However, when using Docker-based agents that themselves use Docker (Docker from Docker), the experience may be unexpected since the agents share the Docker daemon. A shared Docker daemon is very similar to a shared Conda root folder: creating a container (or any Docker entity) named “foo” on one build agent is visible to all other agents that use the same Docker daemon, and deleting a container named “foo” also affects all other agents.

The result is that any Docker entities (containers, networks, volumes, ports, etc.) started and stopped by one build agent may be affected by the Docker processes of other agents, whether running in parallel or consecutively.
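
To make the failure mode concrete, consider two agents backed by the same daemon (an illustrative sketch; the image and container names are placeholders):

# Agent 1
docker run -d --name foo nginx:alpine   # starts a test container

# Agent 2, sharing the same Docker daemon
docker run -d --name foo nginx:alpine   # fails: the container name "foo" is already in use
docker rm -f foo                        # "cleans up" by force-removing Agent 1's still-running container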

Docker Uniqueness Pattern

The following sample shows how to implement the uniqueness pattern described above, using a Docker Compose file and a pipeline that uses Compose for testing, creating a uniquely named set of containers for each build.

# docker-compose.yml
version: '2'

volumes:
  app-drive:
    name: ${APP_DRIVE_NAME}

services:
  app:
    container_name: ${APP_CONTAINER_NAME}
    build: .
    restart: always
    ports:
      - "80:80"
    networks:
      app_network:
    environment:
      - DATABASE_URL=mysql+pymysql://user:pass@${DB_CONTAINER_NAME}/appdb
      - REDIS_URL=redis://${CACHE_CONTAINER_NAME}:6379

  db:
    image: mariadb:10.4.12
    container_name: ${DB_CONTAINER_NAME}
    restart: always
    volumes:
      - app-drive:/data
    networks:
      app_network:

  cache:
    image: redis:4
    container_name: ${CACHE_CONTAINER_NAME}
    restart: always
    networks:
      app_network:

networks:
  app_network:
    name: ${DOCKER_NETWORK_NAME}

# test-pipeline.yml
trigger:
  - main

variables:
  - name: APP_DRIVE_NAME
    value: shared-drive-$(Build.BuildId)
  - name: DOCKER_NETWORK_NAME
    value: network-$(Build.BuildId)
  - name: APP_CONTAINER_NAME
    value: app-$(Build.BuildId)
  - name: DB_CONTAINER_NAME
    value: db-$(Build.BuildId)
  - name: CACHE_CONTAINER_NAME
    value: cache-$(Build.BuildId)

steps:
  - script: |
      set -eux  # fail on error

      docker-compose up --build -d
    displayName: "Run Application"
    env:
      APP_DRIVE_NAME: ${{ variables.APP_DRIVE_NAME }}
      DOCKER_NETWORK_NAME: ${{ variables.DOCKER_NETWORK_NAME }}
      APP_CONTAINER_NAME: ${{ variables.APP_CONTAINER_NAME }}
      DB_CONTAINER_NAME: ${{ variables.DB_CONTAINER_NAME }}
      CACHE_CONTAINER_NAME: ${{ variables.CACHE_CONTAINER_NAME }}

  - template: smoke-test-template.yml
    parameters:
      API_URL: 'http://localhost:80'

  - script: |
      docker-compose down
      docker volume rm $APP_DRIVE_NAME

    displayName: "Stop Application"
    condition: succeededOrFailed()
    env:
      APP_DRIVE_NAME: ${{ variables.APP_DRIVE_NAME }}
      DOCKER_NETWORK_NAME: ${{ variables.DOCKER_NETWORK_NAME }}
      APP_CONTAINER_NAME: ${{ variables.APP_CONTAINER_NAME }}
      DB_CONTAINER_NAME: ${{ variables.DB_CONTAINER_NAME }}
      CACHE_CONTAINER_NAME: ${{ variables.CACHE_CONTAINER_NAME }}

  • docker-compose.yml – A Docker Compose file with three containers, a network, and a volume, showing the dependencies between containers while maintaining uniqueness for each element.
  • test-pipeline.yml – A testing pipeline that provides unique values for each of the parameters expected by the Docker Compose file.

Missing or outdated tools

Vendors usually keep their agents clean, tidy, and up-to-date to support their tenants’ demands and to keep the multi-tenant agents secure. On the other hand, organizations typically depend on internal feedback from the development teams, or a security vulnerability, to update their agents.

Common issues affecting the pipeline include outdated CLI or SDK versions that lack features required for the build or release, frameworks with known issues that break the build, or even a missing language runtime or specific version.

In this case, a rule of thumb for mitigation is to consistently provision and maintain your agents as Infrastructure as Code (IaC). That makes updating them at scale much easier and quicker than a manual process, and dependent teams experience shorter downtimes while the update takes place.
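
For instance, the tooling installed on a Linux agent image can be captured in a single provisioning script that runs when the image is built. The following is a minimal sketch; the script name, package list, and pinned versions are illustrative rather than taken from any specific setup:

#!/usr/bin/env bash
# provision-agent.sh – illustrative agent provisioning script (hypothetical name), baked into
# the agent image with a tool such as Packer, or run by a VM Scale Set custom script extension.
set -eux

# Keep the tool list and versions in source control so every agent rebuild is identical
apt-get update
apt-get install -y --no-install-recommends \
    git curl unzip jq \
    python3 python3-pip python3-venv \
    docker.io docker-compose

# Azure CLI via Microsoft's install script (assumes this provisioning script runs as root)
curl -sL https://aka.ms/InstallAzureCLIDeb | bash

# Pin Python tooling to known-good versions
python3 -m pip install --upgrade pip
python3 -m pip install pytest==7.4.4 tox==4.11.4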

On the process level, make sure the teams who own the infrastructure, including the build agents’ access and update process, have the capacity and ability to respond to such feedback from the development teams within a reasonable time. Furthermore, make it a habit to use the same scripts and bootstrapping code in the local development environment and in the build system to keep the environments on par. A specific use case of this method is using development containers to define the build agent’s runtime, as suggested by our Microsoft team members Eliise Seling in her post on using dev containers for Azure Pipelines and Stuart Leeks in his post on using them for GitHub Actions.
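
A lightweight way to apply this, sketched below with a hypothetical scripts/bootstrap.sh, is to keep a single bootstrap script in the repository and call it both from the pipeline and from the local development environment (for example, from a dev container’s postCreateCommand):

# excerpt from a pipeline that reuses the repository's bootstrap script
steps:
- script: ./scripts/bootstrap.sh
  displayName: "Bootstrap build environment"   # developers run the same script locally, e.g. via the dev container's postCreateCommand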

Final Words

This post aims to generalize the mitigations for the issues our team commonly encounters while migrating our customers’ DevOps processes to self-hosted agents. Using self-hosted agents is often an indicator of, and a milestone in, an organization’s automation maturity, but moving an existing, working pipeline from vendor-hosted agents to self-hosted ones can be a notoriously tricky task.

We hope that the patterns and samples we have collected make the move to self-hosted agents easier for other teams, regardless of the tool. If you want to try them yourself, we recommend following your DevOps tooling vendor’s guide for using self-hosted agents, such as the Azure Pipelines self-hosted agents documentation or the GitHub Actions self-hosted runners documentation.

Please feel free to share your feedback and the patterns and practices that help you use self-hosted agents more efficiently.
