Multi-vendor Service Level Agreement (SLA) model based on Business Availability

App Dev Manager Dipanjan Ghanti shares an example of how to frame a multi-vendor SLA based on business availability.


Introduction

Enterprises often run business-critical applications that are supported by multiple vendors. While each vendor takes responsibility for the availability of its own component, the need to tie that availability to the overall business service availability is frequently overlooked. In traditional service level models, each organization has its own target service levels and is strictly accountable for its own availability, regardless of whether its unavailability resulted in a loss of business functionality.

The model suggested below addresses this gap through business availability and responsibility shared among multiple vendors. It is built around defining the required availability of the business functionality. When the business availability is not met, each organization’s contribution to that failure factors into the apportioning of penalties.

Before I get into the details of this suggested approach, let me first share the assumptions and definitions that are part of this model.

Assumptions

  • The model assumes that the system is expected to be available 24×7.
  • The business owning organization will oversee the evaluation & penalty determination process.
  • Business service unavailability will be measured and reported by the owning organization.
  • Each vendor will have its own tools & monitors to measure and report the unavailability of its components.
  • All unplanned outages will go through the Root Cause Analysis (RCA) process to determine the responsible organization.
  • Unavailability is measured at one-minute resolution.
  • This model does not consider data latency when evaluating availability.
  • The concept of Earn Back for Web Services is out of scope.

Definitions

  • Business Unavailability: The business is considered unavailable if its Web Services are unavailable.
  • Unplanned Down Time: Unplanned Down Time is the sum of all the times in a month when business service in relation to a specific application was unavailable due to unplanned (unscheduled) outages.
  • Scheduled Down Time: Scheduled Down Time is the sum of all the times in a month when business service was unavailable due to scheduled/planned outages.
  • Total Time: Total Time is the total length of the month. Total Time, Scheduled Down Time, and Unplanned Down Time are all expressed in minutes.
  • # of Failed Requests: The total number of failed test web service requests that are sent during the Total Time – Scheduled Down Time period. If the test web service request tool is down due to an unknown reason, then that portion of the tool’s unavailability will be considered as Scheduled Outage thereby putting it out of the SLA evaluation.
  • # of Total Requests: The total number of test web service requests that are sent during the Total Time – Scheduled Down Time period. If the test web service request tool is down due to an unknown reason, then that portion of the tool’s unavailability will be considered as Scheduled Outage thereby putting it out of the SLA evaluation.
  • Total Penalty: Total penalty is defined as the summation of penalties incurred by each vendor due to the business service unavailability. The penalty is determined as a percentage of the vendor’s monthly operational support fees amount (excludes SOWs or CRs created for non-operations project work).
  • Service Penalty: That portion of the Total Penalty that is assessed against a vendor for the failure to meet the required business service level.

Suggested Business Availability Model

Proposed below is a model that could be leveraged in determining Service Penalties in the event the business service unavailability exceeds the agreed service level commitment.

Business Availability Framework

This model is illustrated based on the diagram provided below. The diagram represents a hypothetical scenario and shows the availability of the components owned by each vendor in conjunction with the application’s business availability. The diagram should be read as outlined below:

Image sla 1

  • The blue pipe represents the business availability in a typical month.
  • The orange, brown, and green pipes represent Vendor1, Vendor2, and Vendor3’s component availability during the same month.
  • As represented in this diagram there was no Planned/Scheduled Down Time during the month.
  • The dotted red lines represent unplanned unavailability of the business service. As per the diagram above, the unplanned unavailability in the month equals (X + Y) minutes and represents “Unplanned Down Time”.
  • The “Business Unavailability” is calculated by dividing “Unplanned Down Time” by Total Time less the Scheduled Down Time:
    Business Unavailability (%) = Unplanned Down Time / (Total Time – Scheduled Down Time) × 100
  • As represented in this diagram,
    • The Vendor1-managed component was unavailable for (A + B) minutes in total.
    • The Vendor2-managed component was unavailable for (C + D) minutes in total.
    • The Vendor3-managed component was unavailable for (E + F) minutes in total.
  • Refer to your Service Level Agreement for the no-penalty threshold. As an example, your SLA might require availability of 99% or above. In that case, if the “Business Unavailability” percentage, i.e. (X + Y) / T %, where T is Total Time less Scheduled Down Time, is greater than (100 – 99)%, i.e. 1%, then a Service Penalty will be applicable.
  • For the model described below, let’s assume that the business was available for 96.0% of the month, i.e. unavailable for 4.0%. Refer to your Service Level Agreement to find the Penalty Percentage that will be applicable.
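The arithmetic described in the list above can be sketched in a few lines of Python. This is only an illustration: the function names, the 30-day month, and the 99% target are assumptions chosen for the example, and the actual threshold and Penalty Percentage come from your own SLA.

```python
def business_unavailability_pct(unplanned_min, total_min, scheduled_min=0):
    """Unplanned Down Time / (Total Time - Scheduled Down Time), as a percentage.

    All arguments are in minutes, matching the Definitions section.
    """
    measured_min = total_min - scheduled_min
    if measured_min <= 0:
        raise ValueError("Total Time must exceed Scheduled Down Time")
    return 100.0 * unplanned_min / measured_min


def penalty_applies(unavailability_pct, sla_availability_pct=99.0):
    """True when unavailability exceeds what the SLA allows (example target: 99%)."""
    return unavailability_pct > (100.0 - sla_availability_pct)


# Hypothetical 30-day month with no Scheduled Down Time.
total_minutes = 30 * 24 * 60          # 43,200 minutes
unplanned_minutes = 1728              # X + Y in the diagram (illustrative value)

pct = business_unavailability_pct(unplanned_minutes, total_minutes)
print(f"Business Unavailability: {pct:.1f}%")      # 4.0%
print("Service Penalty applicable:", penalty_applies(pct))
```

With these illustrative numbers the business was unavailable for 4.0% of the month, which exceeds the 1% allowed by a 99% target, so a Service Penalty would apply.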

Service Penalty Calculation:

In this model, the monthly business service unavailability penalty is owned by the organization(s) responsible for the cause (the unplanned outage). A key aspect of this model is that if a vendor’s unavailability did not impact the business service, then that unavailability period is exempted from the monthly SLA calculation. Based on this philosophy and the diagram above:

  • Only the unavailability periods that fall within the business unavailability zone(s) will be considered in the penalty calculation and ownership evaluation. Hence, as per the above diagram, the “A” minutes unavailability of Vendor1 and “E” minutes unavailability of Vendor3 are exempted as both fall outside the business unavailability zone.
  • Based on the same logic, the “B” minutes unavailability of Vendor1, (C + D) minutes unavailability of Vendor2, and the “F” minutes unavailability of Vendor3 fall within the zone when the business service was unavailable; hence they will be considered in the SLA calculation.
  • The penalty is owned by the organization(s) that is/are responsible for the outage.

Image sla 3
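The exemption rule in the list above amounts to intersecting each vendor outage with the business unavailability zones. The sketch below shows one way to do that in Python. The minute offsets are made up, and the placement of C in zone X and of B, D, and F in zone Y is an assumption about the diagram, not something the text states.

```python
def overlap_minutes(outage, zones):
    """Minutes of one vendor outage (start, end) that fall inside any business zone."""
    start, end = outage
    return sum(
        max(0, min(end, zone_end) - max(start, zone_start))
        for zone_start, zone_end in zones
    )


# Hypothetical minute offsets within the month.
business_zones = [(100, 160), (500, 620)]        # X and Y

vendor_outages = {
    "Vendor1": [(20, 50), (500, 540)],           # A (exempt), B (counted)
    "Vendor2": [(100, 160), (500, 580)],         # C, D (both counted)
    "Vendor3": [(300, 330), (500, 620)],         # E (exempt), F (counted)
}

counted = {
    vendor: sum(overlap_minutes(outage, business_zones) for outage in outages)
    for vendor, outages in vendor_outages.items()
}
print(counted)   # {'Vendor1': 40, 'Vendor2': 140, 'Vendor3': 120}
```

Outages A and E contribute nothing because they do not intersect either zone, which mirrors the exemption described above.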

Further analysis is required to identify the organization(s) responsible for each unavailability slot of the business service.

Image sla 4

Let us now zoom into the “Y” minutes outage slot, when Vendor1, Vendor2, and Vendor3 were unavailable. All the unavailable organizations are initially considered responsible for this business unavailability until the investigation confirms who actually is.

If multiple organizations contribute to an outage and their outage slots overlap with each other, either partially or fully, then the penalty amount for each organization for each overlapped outage slot is calculated separately and is adjusted (reduced) by multiplying it by the Penalty Factor. Note that in these situations, a separate Penalty Factor has to be calculated for each responsible organization for each sub-segment of the full segment of Business Unavailability. In the case of segment Y, the sub-segments (Y1, Y2, Y3) are each labeled with a subscript i identifying the responsible Vendor i.
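To make the sub-segment idea concrete, here is a hedged Python sketch that splits a business unavailability segment wherever the set of unavailable vendors changes. The text above does not spell out how the Penalty Factor is computed, so the sketch assumes an equal share (1 divided by the number of vendors down in that sub-segment). That equal-share rule, the function names, and the minute offsets are all assumptions for illustration. In practice the RCA outcome would decide which vendors are actually responsible.

```python
from fractions import Fraction


def split_segment(segment, vendor_outages):
    """Split a segment into sub-segments with a constant set of unavailable vendors."""
    seg_start, seg_end = segment
    cuts = {seg_start, seg_end}
    for outages in vendor_outages.values():
        for start, end in outages:
            for t in (start, end):
                if seg_start < t < seg_end:
                    cuts.add(t)
    points = sorted(cuts)
    result = []
    for lo, hi in zip(points, points[1:]):
        down = sorted(
            vendor
            for vendor, outages in vendor_outages.items()
            if any(start <= lo and hi <= end for start, end in outages)
        )
        result.append(((lo, hi), down))
    return result


def weighted_minutes(segment, vendor_outages):
    """Minutes attributed to each vendor after applying the assumed Penalty Factor."""
    totals = {}
    for (lo, hi), down in split_segment(segment, vendor_outages):
        for vendor in down:
            # ASSUMPTION: equal-share Penalty Factor, not defined in the article.
            factor = Fraction(1, len(down))
            totals[vendor] = totals.get(vendor, 0) + (hi - lo) * factor
    return totals


# Hypothetical offsets for segment Y and the outages that overlap it.
y_segment = (500, 620)
outages = {
    "Vendor1": [(500, 540)],
    "Vendor2": [(500, 580)],
    "Vendor3": [(500, 620)],
}

for (lo, hi), down in split_segment(y_segment, outages):
    print(f"{lo}-{hi}: {down}")
print({v: float(m) for v, m in weighted_minutes(y_segment, outages).items()})
```

With these inputs the segment splits into three sub-segments (all three vendors down, then Vendor2 and Vendor3, then Vendor3 alone), and the attributed minutes add up to the 120 minutes of segment Y, so no minute of business unavailability is counted twice.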

Parting Notes

This model is a significant departure from traditional service level models. Hence it is strongly recommended to go through a test period in which all aspects of determining the business availability and assessing each organization’s responsibility for unplanned outages are exercised. At the end of the test period, make any necessary adjustments to the model and the processes involved.
