Incident Report – PowerShell Gallery Downtime October 30, 2020

Sydney Smith

The PowerShell gallery experienced downtime on October 30, 2020. This report explains what caused the downtime, what actions were taken to mitigate it, and what steps we are taking to improve the PowerShell gallery experience going forward.

Downtime Impact

The downtime was declared at 2020-10-30 03:29 PDT, and was mitigated about 12 hours later at 2020-10-30 15:39 PDT. During this time packages were not available from the gallery, and the web interface was not accessible.

Root Cause of the Downtime

The downtime was the result of an attempt to fix ongoing statistics errors with the gallery. For roughly 3 weeks the PowerShell gallery experienced many server errors (roughly 100-200 per minute). The cause was a key that had reached the maximum value of a 32-bit signed integer (2,147,483,647) once total downloads passed 2 billion, which produced persistent int overflow errors on the gallery. This prevented new entries from being added to the ‘PackageStatistics’ table (required for the intermediary processing of statistics). The int overflow first occurred on 2020-09-18.
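To illustrate the failure mode, here is a minimal Python sketch that mimics a key column with a fixed maximum. The names `INT32_MAX`, `INT64_MAX`, `insert_row`, and `try_insert` are hypothetical and are not taken from the gallery code base; the sketch only shows why every insert fails once a 32-bit key is exhausted, and why widening the key to 64 bits removes the limit in practice.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647: largest value of a 32-bit signed int
INT64_MAX = 2**63 - 1   # largest value of a 64-bit signed int (bigint/long)


def insert_row(last_key: int, key_max: int) -> int:
    """Return the next key, or raise if it no longer fits the column type."""
    next_key = last_key + 1
    if next_key > key_max:
        raise OverflowError(f"key {next_key} exceeds column maximum {key_max}")
    return next_key


def try_insert(last_key: int, key_max: int) -> str:
    """Hypothetical helper: report whether an insert would succeed."""
    try:
        return f"inserted key {insert_row(last_key, key_max)}"
    except OverflowError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    # Every insert attempt fails once the key has reached the 32-bit maximum...
    print(try_insert(INT32_MAX, INT32_MAX))
    # ...but succeeds once the key column is widened to 64 bits.
    print(try_insert(INT32_MAX, INT64_MAX))
```

Because the failure happens on every new row rather than once, an exhausted key shows up as a steady stream of errors, which matches the persistent error rate described above.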

After an attempt to perform database migrations failed due to the persistent errors, manual updates were made to the database to fix inflated package statistics numbers. These changes triggered a series of deadlock and timeout errors that consumed all our available cloud resources. The result was a spike in DTU/CPU utilization for the database, and service availability fell as utilization rose. Availability became so low that the gallery was non-functional and was declared down.

Mitigating the Downtime

The first mitigation step was to restore the gallery database (DB) to a previous timestamp. We believed that an error in the attempted fix of gallery statistics had put the DB into a bad state, and that restoring the DB would revert those changes. The initial error was likely due to a trigger on the database that we did not account for.

Unfortunately, reverting the DB caused additional issues. The PowerShell gallery backend logs showed that the service had trouble connecting to the DB, with an error that the user credentials were wrong. This indicated that the user had been orphaned by the restore, so we re-created the user in the DB.

After this step, the backend logs showed that the service still had trouble connecting to the DB, this time with an error that login was failing. We determined that this error was caused by the DB restore dropping the DB from the gallery’s failover group. The next mitigation step was to re-add the DB to the gallery’s failover group. The final mitigation step was to restart the cloud services so they could re-connect to the failover group.

At this point the gallery started working again. We validated these fixes with customers as well as with our own testing, and continued to closely monitor the DTU/CPU utilization and service availability.

Statistics Errors

The gallery has had ongoing issues with the package statistics since August 2020.

These errors came from the gallery reaching a scale (more than 2 billion installations) that was not supported by the design of the statistics pipeline. The impact of this has been both incorrect and unavailable package statistics. The package statistics from 2020-09-18 through 2020-10-07 were never recorded, which meant we were unable to recover statistics from this period.

Restoring Statistics

We restored statistics in two ways: first we repaired statistics for individual packages (surfaced on a package’s page), and then we repaired aggregated statistics (surfaced on the gallery homepage and statistics page).

To repair package statistics, we changed the key for package statistics from an integer to a bigint/long, both in our main database and in the code that referenced it. Some pending data was dropped when the int overflow error first appeared. We retrieved specific ‘lost’ data from a restored database, but were unfortunately unable to recover some data (mentioned in Statistics Errors).

To repair the aggregated statistics, we then made parallel changes to our data warehouse.

Our repair items fall into three categories: detect, diagnose, and fix. By focusing on these three areas, we hope not only to improve the overall performance of the gallery but also to find and mitigate issues more quickly as they arise.

  • Detect:
    • Add more notifications to the production database
    • Create alerts for when critical metrics are reached in the DB
    • Improve post-deployment validation so that we can quickly roll back undesirable changes
  • Diagnose:
    • Send database logs to a central location outside of the service so that logs are more easily available
  • Fix:
    • Improve the deployment process for gallery cloud services
    • Better document (internal) procedures for recovery and communication during an outage
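As a rough illustration of the "alerts for critical metrics" item in the list above, the following Python sketch flags sustained database saturation. Everything in it is an assumption made for illustration: the function name `should_alert`, the 90% threshold, the three-sample window, and the sample values are hypothetical and do not describe the gallery's actual monitoring.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], threshold: float = 90.0,
                 consecutive: int = 3) -> bool:
    """Return True if `consecutive` samples in a row are at or above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run of high readings, or reset it on a normal one.
        run = run + 1 if value >= threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical DTU utilization percentages, one sample per interval.
    brief_spike = [35.0, 42.0, 91.0, 40.0, 38.0]
    saturated = [60.0, 92.0, 97.0, 100.0, 100.0]
    print(should_alert(brief_spike))  # a single high reading does not alert
    print(should_alert(saturated))    # sustained saturation does
```

Requiring several consecutive high samples is one way to avoid paging on short spikes while still catching the kind of sustained DTU/CPU saturation seen during this outage.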

 

We are also in the process of designing architectural changes to the PowerShell gallery, to ensure this is a reliable, performant, and supportable service moving forward.

Expectations going forward

In conjunction with these repairs, we are working to set and monitor Service Level Objectives (SLOs). Look forward to a future post detailing these expectations and how gallery users can track our progress against these objectives.
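To show how an outage like this one would be measured against an SLO, here is a small worked calculation in Python. The outage times come from the Downtime Impact section; the 31-day measurement window and the 99.9% availability target are assumptions chosen for illustration, not objectives the team has announced.

```python
from datetime import datetime, timedelta

# Times from the Downtime Impact section (both PDT, so no conversion is needed).
declared = datetime(2020, 10, 30, 3, 29)
mitigated = datetime(2020, 10, 30, 15, 39)
downtime = mitigated - declared

# Assumptions: a calendar-month window (October has 31 days) and a 99.9% target.
window = timedelta(days=31)
target = 0.999

# Fraction of the window during which the service was up.
availability = 1 - downtime / window
# Total downtime the target would allow within the window.
error_budget = window * (1 - target)

if __name__ == "__main__":
    print(f"downtime:       {downtime}")
    print(f"availability:   {availability:.4%}")
    print(f"allowed outage: {error_budget.total_seconds() / 60:.1f} minutes")
    print(f"budget used:    {downtime / error_budget:.1f}x")
```

Under these assumed numbers, a single outage of this length would exceed a monthly 99.9% budget many times over, which is why detection and rollback speed feature so heavily in the repair items.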

Reporting Issues

If you notice any issues with the PowerShell gallery, please open an issue in our GitHub repository.

If you are a package owner and have an issue with your package, please use our support alias: cgadmin@microsoft.com.

We continue to update the status of the PowerShell gallery at: aka.ms/PSGalleryStatus.

Sydney

PowerShell Team

 
