Because incidents cause trouble and potential losses, processes such as business continuity management and disaster recovery planning exist to help keep the business up and running. After all, downtime equals potential losses.
Many think losses occur only when incidents happen, and many believe the recovery time objective/recovery point objective (RTO/RPO) is the most important metric for staying within the limits of the losses the organization is willing to accept. But is this true? Are there other metrics that can, should or must be used when working with event and incident handling?
Service management contains many metrics that can be considered when trying to optimize service-management processes. One aspect of service management is handling events and incidents. In an event, something may have an impact on the process, or a threshold may be exceeded. In an incident, something is broken or likely to break. An example of an event may be a slight increase in a server’s resource use, whereas an incident would be if the server has 90 percent of its resources in use. In the latter case, an outage may be imminent.
In light of this, it is imperative to understand the key difference between an event and an incident. An event is essentially something that happens. An incident, on the other hand, means that, as stated, something is broken or likely to break. Furthermore, any event or chain of events can turn into one huge incident or disaster if not handled properly.
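To make the distinction concrete, the sketch below classifies a monitoring sample. The 90 percent figure comes from the example above; the 70 percent event threshold and the function name are illustrative assumptions, not values taken from any standard.

```python
# Minimal sketch of the event/incident distinction described above.
# The 70% event threshold is an illustrative assumption; 90% follows the text.

def classify(resource_use_pct: float) -> str:
    """Classify a monitoring sample as normal, an event or an incident."""
    if resource_use_pct >= 90:
        return "incident"   # something is broken or likely to break
    if resource_use_pct >= 70:
        return "event"      # noteworthy change that may have an impact
    return "normal"

print(classify(75))   # event
print(classify(92))   # incident
```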
Consider the chain of events in incident resolution (figure 1).
There are multiple phases in the event/incident timeline:
- Event/incident phase—The event or incident occurs.
- Detection phase—The incident is actually detected.
- Ownership phase—Ownership of the event or incident is taken. This phase also initiates actions that should bring the service back to normal operation. Initially, this is done through procedures defined in the business continuity plan (BCP).
- Escalation phase (optional)—A decision needs to be made whether the event/incident should be escalated. When escalation occurs, the disaster recovery plan (DRP) is triggered.
- Recovery phase—The service is restored to normal operation.
- Lessons learned phase—An analysis, such as a root cause analysis (RCA), is performed, reports are issued, and feedback is given. This feedback may also result in changes to procedures, configurations and even corporate risk management.
Figure 2 was created by observing the time intervals between the phases; each interval corresponds to one of the metrics discussed below.
These timers can be mapped to phases:
- Time to detect (TTD)—How long does it take before the event is detected?
- Time to own (TTO)—Once the event is detected, how long does it take before someone takes ownership? After all, what good is event detection if nobody cares?
- Time to escalate (TTE)—Once ownership has been taken, there is some time to try and fix the event/incident before escalating it. During the escalation phase, the disaster recovery plan may be triggered, or additional help may be required from others.
- RTO/RPO—Once the problem has been escalated, a critical point in time has been reached. The service being provided is probably down and losses are occurring. These are the best-known metrics, as they basically state: Once it is broken, there is x amount of time to repair it, and the organization is willing to tolerate y amount of data loss.
- Mean time between failures (MTBF)—This is the average time between two occurrences of the same type of event or failure.
When looking at the previous mapping, and considering that there is a defined maximum tolerable downtime (MTD), the first conclusion that can be drawn is that the MTBF should always be higher than the MTD: If failures recur, on average, more quickly than the total downtime the enterprise can tolerate in a year, the downtime budget cannot be kept.
From an enterprise perspective, the metric MTD is used once an incident occurs. In essence, MTD is the sum of the time to detect (TTD), time to own (TTO), time to escalate (TTE) and recovery time objective (RTO). So, the MTD begins at the incident, and the incident should be resolved before the MTD is exceeded. When looking at enterprise losses, the MTD has a cash value associated with it. For the sake of simplicity, this is called the maximum tolerable loss (MTL). This metric should be derived from the business impact analysis (BIA). The MTL is also related closely to the acceptable losses, as defined by senior management.
In terms of risk management, the estimated loss for a single event is the single-loss expectancy (SLE). The MTBF is obtained by dividing the total time observed by the number of occurrences of this type of incident or event. This gives a rate of recurrence, which also means that the MTBF and the annual rate of occurrence (ARO) are related. After all, if there is an event that recurs every week, there are 52 occurrences per year, which results in an ARO of 52 for this type of event/incident.
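As a simple illustration, the sketch below expresses the MTD sum and the MTBF/ARO link in code. The four-hour TTD and the weekly recurrence come from the examples in this article; the TTO, TTE and RTO figures and all names are illustrative assumptions.

```python
# Illustrative sketch: MTD as the sum of the phase timers, and the MTBF/ARO
# relationship. The TTO, TTE and RTO values below are example assumptions.

HOURS_PER_YEAR = 365 * 24

def mtd_hours(ttd: float, tto: float, tte: float, rto: float) -> float:
    """Maximum tolerable downtime as the sum of the phase timers (hours)."""
    return ttd + tto + tte + rto

def aro_from_mtbf(mtbf_hours: float) -> float:
    """Annual rate of occurrence implied by the mean time between failures."""
    return HOURS_PER_YEAR / mtbf_hours

print(mtd_hours(ttd=4, tto=1, tte=1, rto=4))     # 10 hours in this example
print(round(aro_from_mtbf(mtbf_hours=7 * 24)))   # weekly event -> ARO of 52
```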
For as long as the incident is not detected, the enterprise is essentially bleeding cash, which is unacceptable. The same drawback happens when nobody takes ownership of the event.
For example, if a process suffers from an outage, and this process is responsible for US$24,000 in revenue per day, it is easy to see that each hour of outage results in a loss of US$1,000. If it takes four hours to detect this event or incident, the result is US$4,000 in missed revenue. The organization has probably determined that it can withstand an MTD of 10 hours per year. So, this comes to US$10,000 a year of acceptable losses.
But there is more to consider. Imagine the same process, but now it runs at 80 percent capacity. Is this an incident or an event? It may be acceptable that the server runs at reduced capacity for some time. But keep in mind that when dealing with this type of event, the organization is (using the data from the previous example) only generating US$19,200 per day, so it is basically losing US$4,800 per day. This may be perfectly acceptable, but remember that if this event or problem is not addressed, losses will accumulate beyond what has been defined as acceptable within a matter of days.
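A short worked version of the arithmetic in the two scenarios above, using the figures from the text (variable names are illustrative):

```python
# Worked example using the figures from the text above.

DAILY_REVENUE = 24_000                 # US$ generated by the process per day
HOURLY_REVENUE = DAILY_REVENUE / 24    # US$1,000 per hour

# Full outage that goes undetected for four hours
hours_to_detect = 4
print(f"Loss before detection: US${hours_to_detect * HOURLY_REVENUE:,.0f}")       # US$4,000

# Acceptable annual loss implied by an MTD of 10 hours per year
mtd_hours_per_year = 10
print(f"Acceptable annual loss: US${mtd_hours_per_year * HOURLY_REVENUE:,.0f}")   # US$10,000

# Degraded mode: the same process running at 80 percent capacity
capacity = 0.80
print(f"Daily shortfall at 80% capacity: US${DAILY_REVENUE * (1 - capacity):,.0f}")  # US$4,800
```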
It is worth discussing some issues that are not always considered: An MTD of 10 hours per year and four hours to detect and take ownership of an incident would imply that the enterprise has only a total of six hours to fix the problem, right? Correct, if the enterprise has only a single occurrence of this event per year.
But what if this event occurs twice a year? Because the MTD is 10 hours per year and there is a four-hour detection interval per event, the maximum time to repair each occurrence is one hour: Two events times four hours means eight hours spent on detection and ownership, leaving just two hours to repair the two events/incidents.
If the enterprise exceeds this, it is potentially dealing with unacceptable monetary losses because it is spending more time than is defined by the maximum tolerable downtime, hence probably also exceeding the amount of tolerable losses.
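The downtime-budget arithmetic can be sketched as follows; the function name is illustrative, and the figures are those used above:

```python
# Illustrative calculation of the repair time left per occurrence once
# detection and ownership have consumed part of the annual downtime budget.

def repair_budget_per_event(mtd_hours: float, aro: int, ttd_tto_hours: float) -> float:
    """Hours left for recovery per occurrence within the annual MTD."""
    remaining = mtd_hours - aro * ttd_tto_hours
    return remaining / aro

# One occurrence per year: 10 - 4 = 6 hours to fix
print(repair_budget_per_event(mtd_hours=10, aro=1, ttd_tto_hours=4))   # 6.0

# Two occurrences per year: (10 - 2 x 4) / 2 = 1 hour per event
print(repair_budget_per_event(mtd_hours=10, aro=2, ttd_tto_hours=4))   # 1.0
```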
So in terms of service management, it is crucial that the enterprise detect events as soon as they occur; this minimizes time spent in the TTD phase. Also, as soon as the event/incident is detected, ownership must be taken immediately.
Since the enterprise now knows that it has only a specified time to fix the problem, is the RTO specified in the BIA feasible or acceptable? In the case where an enterprise has an MTD of 10 hours per year, an ARO of two and a TTD per event of four hours, the enterprise cannot have an RTO of two hours because this would exceed the MTD by two hours.
If this is acceptable, then maybe the MTD should be set to 12 hours instead. Thus, it is important to make sure all metrics can be measured and optimized. These metrics should also be included when doing a BIA or in the risk management process.
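Continuing with the same numbers, a minimal feasibility check for a proposed RTO might look like this (names are illustrative):

```python
# Illustrative check of whether a proposed RTO fits within the annual MTD,
# given the expected number of occurrences and the detection time per event.

def mtd_required(aro: int, ttd_hours: float, rto_hours: float) -> float:
    """Annual downtime implied by the ARO, detection time and RTO."""
    return aro * (ttd_hours + rto_hours)

mtd = 10                                               # hours per year
needed = mtd_required(aro=2, ttd_hours=4, rto_hours=2)
print(needed)                                          # 12 hours, exceeding the 10-hour MTD by 2
print("feasible" if needed <= mtd else "revise the RTO or raise the MTD to 12 hours")
```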
Moving on to the service level agreement (SLA), there are in essence a few options (figure 3).
In some SLAs, the partners (vendors or other external or internal parties) handle the entire incident cycle from detection to closure (essentially outsourcing). Other support contracts kick in during the BCP phase (perhaps some form of vendor/application support), and some kick in during the disaster recovery (DR) phase (perhaps next-business-day delivery of hardware).
Having considered the full-outage case, what would happen if a process is merely underperforming? If the process baseline is established at 1,000 transactions per day with an average value of US$10 per transaction, a normal day would generate US$10,000 in revenue.
But what if the process is not monitored and starts underperforming? What if the process could no longer exceed 900 transactions per day? This would result in US$1,000 per day in missed revenue. And, because the issue is not detected, the enterprise is basically bleeding cash.
Even though this may not be considered an incident, it can still go undetected and keep causing losses. Perhaps US$10,000 in missed revenue per year is acceptable, but is losing US$1,000 per day for 11 days in a row acceptable? Assume the event is detected after 11 days. Would the RTO defined in the BIA still be valid? Or should measures be taken so that the RTO remains valid? At US$1,000 per day, the US$10,000 of acceptable annual losses translates into 10 days of underperformance. If so, an RTO of two days would imply that this event should be escalated after eight days at most (for an ARO of 1: 10 days - 2 days RTO = 8 days) to keep the losses at an acceptable level. If the ARO is two, the enterprise has only three days per occurrence to detect, own and escalate the event ([10 days - 2 x 2 days RTO] / 2 = 3 days).
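The escalation-window arithmetic for the underperformance example can be sketched as follows (names are illustrative):

```python
# Illustrative calculation of how many days remain to detect, own and
# escalate an underperformance event before the RTO can no longer be met.

def escalation_window_days(mtd_days: float, aro: int, rto_days: float) -> float:
    """Days available per occurrence for detection, ownership and escalation."""
    return (mtd_days - aro * rto_days) / aro

print(escalation_window_days(mtd_days=10, aro=1, rto_days=2))   # 8.0 days
print(escalation_window_days(mtd_days=10, aro=2, rto_days=2))   # 3.0 days
```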
The metrics defined in service management should be seen as valuable input when doing a BIA. They should also be used for measuring, reporting, baselining and updating the service management improvement plan, and for governance purposes. These metrics and the potential losses involved can help an enterprise determine when to escalate from BCP to DR. Depending on the organization's capabilities, a different kind of SLA may be chosen. The metrics included in the SLA should also be monitored and reported. Defining the SLA is, in essence, also a risk management exercise that requires a BIA.
Conclusion
There is no use in protecting a US$5 bill by purchasing a US$500 safe. However, one should keep track of the metrics outlined here, as they can help define correct and acceptable values for risk management, acceptable losses and RTOs. When creating SLAs, these metrics may also come in handy: Maybe the service desk has an average TTO of two hours; customers and partners should be aware of this.
That said, the function of service management is not only related to incident management. Parts of service management are also about designing new services that create additional value for customers and users. When creating a new service, it would be wise to define some of these metrics during the service design. After all, if the enterprise knows what to monitor, what the thresholds are, what the response times are and how long a service can be down, it can incorporate those requirements into the design and the assigned budgets, for example.
Sven De Preter, CCAP, CSCP
Is currently working for an organization that owns and runs several concert venues and theaters in Belgium, as well as a ticket-selling company with a full customer service center. He has worked at this organization as a senior network and system administrator for more than 19 years, spending much of that time on event and incident management, change management, data center virtualization (using VMware), connectivity and architecture, and serving as an operational team lead. He has also provided advice on several policies, procedures and guidelines.