Best practices to prepare for a high priority event

Premier Developer

App Dev Managers Kyle Kapphahn and Matt Hyon examine some proven practices to help you plan and execute a high priority online event.


Introduction

Your team has a major upcoming event centered around your online services. How do you best align your team and prepare your services for a successful event? We will walk you through proven practices that help your team execute the event successfully.

Event Management QuickStart Guide

Planning and Preparation: Build and test your Support Package
  • Playbooks for critical applications and services
  • Roles and processes for the event
  • Table-top simulations to test processes
  • Dry-run execution of contingency plans

Event Time: Communication roles and process
  • Roles that report out separate from roles handling issues
  • Communication channels up and running

Event Time: Monitoring telemetry data and system metrics
  • Data-driven decisions to identify potential issues
  • Pre-defined thresholds that will trigger contingency actions

After Event: Retrospective
  • Review what worked and what can improve
  • Update the Support Package playbooks for recurring/seasonal events

Planning

Proper planning and preparation are essential to a well-executed event. Part of that work should be building out your “Support Package” for the event: a playbook, with supporting documentation and processes, for the critical systems on which your event relies. Existing documentation such as architecture diagrams, support roles and processes is a good starting point for a playbook that covers the most critical applications and services and how each needs to be supported during the event. Building your Support Package also helps uncover potential support gaps and socialize the important support elements across your teams. This includes defining the thresholds for real-time telemetry and system metrics that will trigger actions in the playbooks. Just as you would test your software, test your Support Package throughout the process, from table-top scenario simulations to dry runs of your contingency plans.
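One way to make those thresholds concrete is to write them down as data next to the playbook action each one triggers. The following is a minimal Python sketch, not a prescribed format: the metric names, limits and actions are all hypothetical placeholders, and real values would come from your own load tests and playbooks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str    # telemetry metric name (hypothetical)
    limit: float   # value at or above which the action triggers
    action: str    # playbook step to run (hypothetical)


# Illustrative values only; replace with limits from your own testing.
THRESHOLDS = [
    Threshold("cpu_percent", 80.0, "scale out web tier"),
    Threshold("error_rate_percent", 2.0, "fail over to secondary region"),
    Threshold("p95_latency_ms", 1500.0, "enable static fallback page"),
]


def triggered_actions(snapshot: dict[str, float]) -> list[str]:
    """Return the playbook actions whose thresholds the snapshot meets or exceeds."""
    actions = []
    for t in THRESHOLDS:
        value = snapshot.get(t.metric)
        # Metrics missing from the snapshot are skipped, not treated as zero.
        if value is not None and value >= t.limit:
            actions.append(t.action)
    return actions


if __name__ == "__main__":
    print(triggered_actions({"cpu_percent": 91.0, "error_rate_percent": 0.4}))
```

Keeping the thresholds in one small table like this also gives a table-top simulation something specific to exercise: feed in a made-up snapshot and check that the team agrees with the actions it returns.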

Ask yourself some key questions when building your Support Package:

  • Is this a global event, or focused on specific regions or countries?
  • Is this a multi-day, around the clock event or for just a few short hours?
  • Does your team have the expertise to troubleshoot and resolve production issues quickly?
  • When something goes wrong, what is your escalation plan?
    • How quickly can you mitigate an issue?
    • How and when should you notify your stakeholders, customers or the general public?

How you answer these will shape the Support Package you need and identify where you may need additional assistance.

Preparation

Well ahead of the event, validate that the environment can handle the expected load; this is vital for stability during the event. In addition to building scalability and performance testing into your regular release pipelines, have Subject Matter Experts review your infrastructure, configuration and deployment, and use the on-demand assessments in Services Hub (available to Unified Support customers). These reviews can identify early any areas that need remediation, either because they may cause issues or because they fall outside recommended practices. Perhaps you’re expecting a dramatic increase in site traffic this year compared to prior years, and the configurations you have in production are not optimized for the anticipated load. Do you need help designing performance tests that reflect the changes you have seen in your application and its usage? Microsoft Test Services can help you design a testing strategy and execute performance tests. General health checks of this kind identify potential areas of improvement.

Also review your support model. Are you new to Azure, or do you have multi-cloud requirements? Perhaps the Azure Hi Pri (High Priority) Event Team could help round out your event support. Do you need deep technical expertise to help troubleshoot and remedy potential issues in production during the event? Engaging Premier Field Engineers to help review errors, logs and performance metrics during the event could add confidence that issues will be resolved quickly. Reviewing your architecture with technical experts, and perhaps even holding code reviews with Developer Support consultants, could also increase confidence and identify issues before the big event.

Do you have a good reporting mechanism in place? This means not only reporting and logging within your application, but also alerting and monitoring. Escalating critical issues to the technical teams that resolve them is one key point, but you should also decide when an issue should be escalated to your extended support team, to managers, or to senior leaders within your organization.
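The escalation question is easier to answer under pressure if the ladder is agreed in advance. As a rough sketch in Python, assuming a simple rule based on severity and how long an issue has been open, it might look like this. The severity names and minute values are hypothetical; only the audiences come from the paragraph above.

```python
# Hypothetical escalation ladder: (minutes an issue has been open, who is notified).
ESCALATION_LADDER = {
    "critical": [
        (0, "technical team"),
        (15, "extended support team"),
        (30, "managers"),
        (60, "senior leaders"),
    ],
    "major": [
        (0, "technical team"),
        (60, "extended support team"),
        (120, "managers"),
    ],
}


def audiences_to_notify(severity: str, minutes_open: int) -> list[str]:
    """List every audience that should have been notified by now."""
    # Unknown severities fall back to notifying only the technical team.
    ladder = ESCALATION_LADDER.get(severity, [(0, "technical team")])
    return [who for after, who in ladder if minutes_open >= after]


if __name__ == "__main__":
    print(audiences_to_notify("critical", 20))
```

The exact timings matter less than having them written down, so that nobody has to decide mid-incident whether leadership should already know.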

Event Time

You have prepared for the event, your applications and services are ready for the big day, and your Support Package is ready.

You have your Teams Meeting up and running, with status updates given hourly, or just before those “doorbusters” hit, to confirm you can handle the upcoming load. Your external support teams are on the meeting with you around the clock. Any issues are quickly escalated to the right team, and if there is a question on the Azure side, you engage the HiPri team immediately to mitigate. Regular communications go out to your leadership team.

Event time is now simply about executing on that preparation: opening the communication channels you planned to use and turning on your monitoring activities.

Retrospective

Although it may be tempting to skip this debrief exercise, reviewing both the successful aspects of the event support coverage and the areas for improvement is vital in preparing for your next major event. Both metrics-based and qualitative feedback from team members are important in identifying improvement opportunities. Reviewing the telemetry recorded during the event and correlating it with the decisions and actions taken will reveal the metrics that need to be refined for future events. Perhaps unexpected steps in the process were not covered in the playbook and resulted in delays in detecting and troubleshooting issues. These learnings should be applied to revise metrics for the next event. Qualitative assessments of the process and communication are also important in refining the approach for future events. Perhaps the mode or format of the communication message made it difficult to assess quickly. These lessons will guide your team not only in the technical areas that need attention but also in where processes can be streamlined and made more effective in the future.
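Correlating telemetry with actions can start very simply. The Python sketch below assumes you can export two lists from the event, one of threshold breaches and one of actions taken, each as (metric name, timestamp) pairs. That data shape and the function name are assumptions for illustration, and it handles only one breach per metric.

```python
from datetime import datetime


def response_delays(breaches, actions):
    """Pair each breach with the first later action on the same metric.

    Both arguments are lists of (metric_name, datetime) tuples.
    Returns {metric: minutes from breach to action}, or None for a
    metric where no action followed the breach.
    """
    delays = {}
    for metric, breached_at in breaches:
        later = [at for m, at in actions if m == metric and at >= breached_at]
        if later:
            delays[metric] = (min(later) - breached_at).total_seconds() / 60
        else:
            delays[metric] = None
    return delays
```

A long delay, or a breach with no matching action at all, points to a threshold that may need tuning or a playbook step that was missing.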
