August 1st, 2023

Best Practices for Performance Testing: A Guide for Developers

Introduction

As developers, an integral part of our work involves assessing and documenting system scalability to ensure reliability during peak load periods. In this article, we share the key insights we gained from performance testing a recent system.

In our most recent project, we were tasked with documenting the scaling metrics for a containerized workload capable of processing hundreds of thousands of events every minute. We had to record specific metrics such as the replica count for each component, along with the CPU and memory resources required for each replica. While we could rely on the documentation for external components like Azure Event Hub and its limits, the metrics for the components we created could only be determined through performance testing.

An additional request was to document the metrics and error logs a Site Reliability Engineer (SRE) should monitor and rely on during a performance investigation.

Initial Steps and Considerations

Our initial step was to establish our performance goals. Performance testing verifies that a system can handle a specified workload with a given set of resources over a defined period, and shows how those resources can scale out (or up) to accommodate additional load, up to a limit. The workload size and period are usually known well in advance and typically represent the maximum load the system is designed to handle. For instance, a specific system might be expected to process hundreds of thousands of events per minute for an hour.

How can we meet this load, sustained over a given time period? Would three initial replicas with 1 CPU core and 1 GB of memory per replica suffice, or would we need more? Given that an Event Hub scales by partitions, could we scale out additional replicas (up to 32) to meet the maximum load imposed by Azure Event Hub partitions?
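To frame those questions concretely, a back-of-envelope calculation like the sketch below is a useful starting point. The numbers (per-replica throughput, target load, partition cap) are hypothetical placeholders rather than measurements from our system; real answers only come from the tests themselves.

```python
import math

# Hypothetical inputs -- replace with figures measured in your own tests.
target_events_per_minute = 300_000      # "hundreds of thousands of events per minute"
events_per_replica_per_minute = 12_000  # observed throughput of one replica (1 CPU core, 1 GB)
max_replicas = 32                       # ceiling implied by the Event Hub partition count

required_replicas = math.ceil(target_events_per_minute / events_per_replica_per_minute)

if required_replicas <= max_replicas:
    print(f"{required_replicas} replicas should sustain the target load")
else:
    print(f"Need {required_replicas} replicas, but the cap is {max_replicas}; "
          "each replica must process more events, or the partition count must grow")
```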

We also considered other factors:

  • We needed to examine which external components or services (like gateways) warranted deeper scrutiny. Could there be service-level throttling and throughput limits unrelated to the number of replicas we run?
  • How would the database backend factor in? How much CPU and memory would it require? For example, an Azure SQL server could scale based on Database Transaction Units (DTUs) or remain static, irrespective of the number of replicas we have.
  • Is the workload size realistic? At times, we might need to dig deeper into the specific use case.

Best Practices

Identifying the Performance Engineer Early and Establishing Expectations

Our process began by identifying a Performance Engineer to assist us. Regrettably, due to time constraints, the Performance Engineer could only run the performance tests in their own performance environment, which was not yet ready. This experience underscores the importance of setting expectations at the outset, so the Development team can devise contingencies.

Numerous commercial tools are available for a Performance Engineer to build and execute performance tests. Our system, however, required configuration data to be seeded across services, which called for specific expertise to create that configuration dataset. The Performance Engineer would have had to collaborate closely with the Developers and invest time in acquiring business domain knowledge to craft the necessary tool or script. Once again, due to time pressures, the Performance Engineer couldn't allocate that time, which expanded our scope as we took on the responsibility ourselves.
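For illustration, the kind of seeding script in question might look like the sketch below. The endpoint, payload shape, and tenant names are hypothetical stand-ins; building the real dataset is exactly where the business domain knowledge comes in.

```python
import requests

# Hypothetical configuration API and tenant list -- the real dataset and endpoints
# depend on the services involved and the business domain.
CONFIG_API = "https://perf-env.example.com/api/config"
TENANTS = [f"perf-tenant-{i:03d}" for i in range(50)]

def seed_tenant(session: requests.Session, tenant: str) -> None:
    """Create one tenant's configuration so the event pipeline has the data it expects."""
    payload = {"tenantId": tenant, "consumerGroup": "perf", "enabled": True}
    response = session.post(f"{CONFIG_API}/tenants", json=payload, timeout=30)
    response.raise_for_status()

def main() -> None:
    with requests.Session() as session:
        for tenant in TENANTS:
            seed_tenant(session, tenant)
    print(f"Seeded configuration for {len(TENANTS)} tenants")

if __name__ == "__main__":
    main()
```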

It’s crucial to note that even if a Developer assumes this responsibility, they will still need to develop the requisite skill set if it’s lacking, as well as secure access to the pertinent logs and necessary tooling.

Mastering the APM (Application Performance Monitoring) Tool and Workspace

Performance testing yields a substantial volume of logs, so it's crucial to learn how to use the Application Performance Monitoring (APM) tool effectively; in our case, that was Kibana. This involves filtering logs on criteria like environment, log level, and application source. Once filtered, the logs must be grouped into helpful views with specific fields and time constraints.

Given the complexity of the task, this can be time-consuming, and a missed step or an incorrectly applied filter can waste valuable time. Additionally, if the APM workspace is shared, some business context becomes necessary. For instance, standardizing metric key values such as log levels becomes important, as different systems sharing the same APM workspace might define them differently (“Error” vs “error”).
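As a concrete example, assuming the Kibana workspace is backed by an Elasticsearch index and using field names that are ours to illustrate rather than a standard schema, an hour's worth of error counts per application could be pulled with the official Python client roughly like this:

```python
from elasticsearch import Elasticsearch

# Hypothetical index pattern, credentials, and field names -- adjust to your APM schema.
es = Elasticsearch("https://elasticsearch.example.com:9200", api_key="<api-key>")

response = es.search(
    index="logs-perf-*",
    size=0,
    query={
        "bool": {
            "filter": [
                {"term": {"environment": "performance"}},
                # Different systems may log "Error" or "error"; match both until standardized.
                {"terms": {"log.level": ["Error", "error"]}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    # Group the error counts by the application that emitted them.
    aggs={"by_app": {"terms": {"field": "service.name", "size": 20}}},
)

for bucket in response["aggregations"]["by_app"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```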

To avoid these pitfalls, it's important to dedicate time to learning and upskilling. Having a Subject Matter Expert (SME) who is proficient with the APM tool is an asset. The SME can help debug issues and save time on chores like log-level filtering, and their expertise is invaluable when creating APM-specific dashboards.

Maintaining a Dedicated Performance Environment

Throughout our performance testing phase, we found ourselves resorting to the Development environment because resource constraints prevented us from provisioning a dedicated performance environment. This decision brought several complications, including confusion over which versions were being tested and inaccuracies caused by simultaneous development work on other features. In some instances, we even had to halt other development work during performance testing, since other developers couldn't deploy or test their changes at the same time.

Reviewing the logs revealed further issues: misleading or irrelevant error messages unrelated to performance testing muddied the results. These challenges underline the necessity of a dedicated performance environment for accurate and efficient testing.

In addition, there must be full control over the creation and provisioning of replicas and their resources, such as CPU/memory. This control is vital because experimenting with different configurations is key to the performance testing process. It’s important to remember that this is an iterative and demanding process, necessitating flexibility and precision.
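If the workload runs on Kubernetes (an assumption on our part; the platform is not named above), that control can be scripted rather than applied by hand. The sketch below patches the replica count and per-replica resources of a hypothetical deployment between test iterations using the official Python client:

```python
from kubernetes import client, config

def apply_test_configuration(replicas: int, cpu: str, memory: str) -> None:
    """Patch replica count and per-replica resources for one performance-test iteration."""
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Hypothetical deployment and container names.
                            "name": "event-processor",
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory},
                                "limits": {"cpu": cpu, "memory": memory},
                            },
                        }
                    ]
                }
            },
        }
    }
    apps.patch_namespaced_deployment(name="event-processor", namespace="perf", body=patch)

# Example: start with 3 replicas at 1 core / 1 GB each, then scale out in later runs.
apply_test_configuration(replicas=3, cpu="1", memory="1Gi")
```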

Depending on the size of a dedicated performance environment, some organizations may find it cost-prohibitive. In these cases, they can automate standing up and tearing down test environments as needed, reducing resource expenditure during idle times. Alternatively, larger infrastructures might employ canary testing in production, letting a portion of users trial new features and effectively mimicking a performance testing environment without requiring a separate setup.

The Importance of Early, Frequent, and Local Testing

Developers should be able to execute performance tests locally with smaller loads. This strategy, which we employed before deploying to the “Development” performance environment, offers several benefits:

  • It informs the team about specific metrics and logs to monitor, which can expedite the documentation process.
  • It allows for early identification of performance issues, enabling quicker resolution.
  • It avoids the time-consuming build and deployment process that is often required in a formal performance test environment, where there’s usually a wait for a build and a release.
  • It helps to determine the level of automation needed for setting up and tearing down resources, assisting with the creation of initial scripts.

Our experience suggests that conducting lightweight performance tests frequently on a developer’s local machine, such as at the conclusion of each milestone, is a beneficial practice. This proactive approach leads to a more robust system and reduces the number of performance issues discovered once the system moves into the official performance testing environment.
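A lightweight local test does not require a commercial tool. The sketch below, pointed at a hypothetical local ingestion endpoint and using deliberately small numbers, is the kind of script we mean: enough to expose obvious bottlenecks and to learn which metrics are worth watching.

```python
import asyncio
import time

import aiohttp

INGEST_URL = "http://localhost:8080/events"  # hypothetical local ingestion endpoint
TOTAL_EVENTS = 5_000                         # keep local loads deliberately small
CONCURRENCY = 20

async def send_event(session: aiohttp.ClientSession, i: int) -> bool:
    payload = {"id": i, "enqueuedAt": time.time(), "body": "sample"}
    async with session.post(INGEST_URL, json=payload) as response:
        return response.status < 400

async def main() -> None:
    start = time.monotonic()
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def bounded_send(session: aiohttp.ClientSession, i: int) -> bool:
        async with semaphore:
            return await send_event(session, i)

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(bounded_send(session, i) for i in range(TOTAL_EVENTS))
        )

    elapsed = time.monotonic() - start
    ok = sum(results)
    print(f"sent={TOTAL_EVENTS} ok={ok} failed={TOTAL_EVENTS - ok} "
          f"elapsed={elapsed:.1f}s rate={TOTAL_EVENTS / elapsed:.0f} events/s")

asyncio.run(main())
```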

Additional Best Practices

  • Make sure the application emits performance metrics for both ingress and egress. Logging the number of processed, errored, and dropped records at set intervals enables a time-filtered dashboard that aids load pattern analysis, and capturing the timestamps carried in messages lets us monitor processing delays and spot lag on that dashboard. A minimal sketch of this kind of metric logging follows this list.
  • Ensure the developer deliberately decides which logs are Debug and which are Informational for ease of analysis. Wading through noisy, spammy logs is neither fun nor efficient.
  • Make sure errors are logged properly and are not discarded or buried within complex exception stacks, where they can be difficult to locate.
  • Strive to automate the setup and teardown of resources for a performance run as much as possible. Manual execution can lead to misconfigurations and wasted time resolving subsequent issues.
  • Maintain end-to-end visibility for logging and metrics, whether concerning the gateway or database. This could involve consulting Subject Matter Experts (SMEs) like a Database Administrator (DBA) to review backend metrics or a DevOps engineer to assist with gateway-related metrics such as request counts and throttling errors.
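Expanding on the first bullet above, the sketch below shows one way (an illustration, not a prescription) to count processed, errored, and dropped records and track processing lag from message timestamps, flushing an Informational log line at a fixed interval that a dashboard can chart over time.

```python
import logging
import threading
import time
from datetime import datetime, timezone

logger = logging.getLogger("perf.metrics")

class ThroughputMetrics:
    """Counts processed/errored/dropped records and tracks worst-case processing lag."""

    def __init__(self, interval_seconds: int = 60) -> None:
        self._lock = threading.Lock()
        self._processed = self._errored = self._dropped = 0
        self._max_lag_seconds = 0.0
        self._interval = interval_seconds

    def record(self, outcome: str, enqueued_at: datetime) -> None:
        """Record one message's outcome and its lag relative to the enqueue timestamp."""
        lag = (datetime.now(timezone.utc) - enqueued_at).total_seconds()
        with self._lock:
            if outcome == "processed":
                self._processed += 1
            elif outcome == "errored":
                self._errored += 1
            else:
                self._dropped += 1
            self._max_lag_seconds = max(self._max_lag_seconds, lag)

    def _flush_loop(self) -> None:
        while True:
            time.sleep(self._interval)
            with self._lock:
                logger.info(
                    "interval metrics: processed=%d errored=%d dropped=%d max_lag_s=%.1f",
                    self._processed, self._errored, self._dropped, self._max_lag_seconds,
                )
                self._processed = self._errored = self._dropped = 0
                self._max_lag_seconds = 0.0

    def start(self) -> None:
        """Start the background flush thread; call once at application startup."""
        threading.Thread(target=self._flush_loop, daemon=True).start()
```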

Conclusion

Performance testing is an essential part of the development process. It assures us that our system can scale and meet the expected load. The task may be demanding, but its benefits are clear when your system efficiently handles peak workloads without fail.

Throughout our recent project, we learned the value of preparing thoroughly for performance testing. Our key takeaways include:

  1. Early identification of a Performance Engineer can streamline the process and help us effectively use their expertise. The Performance Engineer must closely collaborate with the Development team to generate the right configuration data.
  2. Understanding the Application Performance Monitoring (APM) tool is crucial. Knowing how to filter and group logs, as well as understanding the tool’s workspace, can save considerable time and prevent potential errors.
  3. Having a dedicated performance environment is necessary. A shared environment can lead to issues related to version control and the risk of inaccurate test results.
  4. Frequent and early testing can help find performance issues sooner, allowing for early fixes. Regular lightweight performance tests conducted by Developers in their local environment can save considerable time and effort.
  5. Incorporating other best practices, such as automating as much as possible, ensuring clear segregation of log levels, producing performance metrics, and maintaining end-to-end visibility for logging and metrics, can also significantly enhance the performance testing process.

Finally, remember that performance testing is an iterative process. It demands a continuous feedback loop, continuous learning, and improvements. It is not just about doing the tests but also about understanding the results and acting on them to deliver a scalable, reliable, and efficient system.

Author

Andrew Vineyard
Software Engineer 2