The Service Disruption Continuum

By Bob Baird

A disruptive event doesn't have to be a major disaster to wipe out your business. It can be anything from a relatively minor problem, such as a malfunctioning network card, to a devastating one, such as a sudden regional disaster that not only destroys your data center but also shuts down surrounding roads, bridges, and other infrastructure.

When businesses take adequate protective measures, they can survive even major disasters. But without protective measures, a business can be wiped out by something as trivial as a coffee spill.

A disruptive event might cause a business interruption depending on the importance of disrupted services and resilience of underlying systems. A disruptive event might also result in data loss/corruption depending on the pervasiveness of the event and the data protection measures in force.

Many companies, when planning for disruptive events, tend to classify those situations into either a high-availability problem domain or a disaster-recovery problem domain. But many disruptive events don't clearly fall into one category or the other.

Instead, a more sensible way to deal with disruptive events is to ignore these distinctions and simply deal with a continuum of disruptive events as they apply to your business. This model, the service disruption continuum, addresses the issues related to preserving or restoring service both during and after a disruptive event.

The continuum places disruptive events at various points in the same matrix. This change of perspective allows an organization to analyze risks, investigate approaches and implement solutions without being bound by misleading domain boundaries. Here we take a closer look at the service disruption continuum, and how it can help prepare organizations to handle whatever disruptive event may come.

The Continuum
The service disruption continuum illustrates various disruptive events as a continuum of disruptions and consequences. Those events that are less severe disruptions but more frequent are at one end, and those that are more severe but less frequent are at the other. As you move along the continuum from left to right, you will see that each disruption is significantly more severe than the last, but also significantly less frequent.

The events on either end of the continuum have the greatest impact on a business. For example, on the right, an earthquake or hurricane could shut down a data center. But on the left, you will see that an operational mishap might corrupt a database with even worse consequences than a natural disaster if protective measures were not in place.

High-Availability vs. Disaster Recovery
High-availability solutions mainly address server outages at a single site, while disaster recovery solutions mainly address sudden, site-wide disasters. As you can see from Figure 1, there are many situations more serious than a server failure and more frequent than a sudden, regional disaster. Because each discipline has its own narrow focus, high-availability and disaster recovery use different objectives and metrics.

High-availability objectives are commonly specified as percent uptime. To achieve 99.99 percent uptime, you might have five outages per year of 10 minutes each, one 53-minute outage every year, or one four-day outage every 108 years. Similarly, you might recover from 100 outages in five minutes each, or recover from 99 outages in one minute each and one outage in 6.7 hours, and achieve the same average recovery time. Averages, however, mean very little when severe consequences are at stake. The only outage of any consequence is the one that requires 6.7 hours for recovery.
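The arithmetic behind these figures can be checked with a short sketch. It assumes a 365-day year, and the function names are illustrative only; they do not come from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def annual_downtime_budget(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime objective."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


def mean_recovery_minutes(outages: list[float]) -> float:
    """Average recovery time over a list of outage durations in minutes."""
    return sum(outages) / len(outages)


budget = annual_downtime_budget(99.99)
print(round(budget, 2))  # 52.56 minutes per year

# Two outage histories from the text, in minutes.
uniform = [5.0] * 100              # 100 outages of five minutes each
skewed = [1.0] * 99 + [6.7 * 60]   # 99 one-minute outages and one of 6.7 hours

# The averages are nearly identical, but the worst cases are very different.
print(mean_recovery_minutes(uniform), max(uniform))
print(round(mean_recovery_minutes(skewed), 2), max(skewed))
```

Comparing the maximum alongside the mean makes the point of the paragraph concrete: an uptime percentage alone cannot distinguish the two histories.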

By contrast, there are two metrics for disaster recovery: recovery time objective (RTO) and recovery point objective (RPO).

RTO is the elapsed time from service interruption until service is restored. It answers the question: "How long can you be without service?" RTO represents a time limit that you cannot exceed or you will face severe consequences. A unified high-availability and disaster-recovery approach would establish both an uptime objective and an RTO for each service.

RPO, on the other hand, is the point in time represented by the data upon service resumption. It answers the question: "How old can the data be?" We interpret RPO differently for real-time and transactional processes. With real-time processes, the world does not stand still; data has no value after a very short time. Conversely, transactional processes usually deal with information that has been committed at some known point in time, where the value of data remains relatively stable long after it is committed. Planning for high-availability assumes no loss of committed database transactions, although loss of data recently written to application files is commonly acceptable. The high-availability RPO for databases is assumed to be zero. By contrast, in disaster recovery planning, RPOs for databases and application files are explicit. A unified objective would ignore both high-availability and disaster recovery boundaries and establish an RPO for all services.

A service with a high RTO might have a zero RPO. For example, the loss of financial data which must be reported to the government is unacceptable, although resuming access to that data might take days. Conversely, the RPO for a service could be greater than its RTO. The RTO for a given service might be a few minutes, but the RPO might allow restoration to a two-day-old image.
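One way to picture the two objectives as independent dimensions is a small record per service. The sketch below is an illustration under stated assumptions: the class name, the service names, and the specific durations (three days, five minutes) are hypothetical stand-ins for the two cases just described.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # longest tolerable time without service
    rpo: timedelta  # oldest tolerable data when service resumes

    def is_met(self, downtime: timedelta, data_age: timedelta) -> bool:
        """True if an actual recovery stayed within both objectives."""
        return downtime <= self.rto and data_age <= self.rpo


# Hypothetical services mirroring the two cases in the text.
# High RTO, zero RPO: access may take days, but no data may be lost.
regulatory_reporting = RecoveryObjective(rto=timedelta(days=3), rpo=timedelta(0))
# Low RTO, high RPO: back in minutes, a two-day-old image is acceptable.
product_catalog = RecoveryObjective(rto=timedelta(minutes=5), rpo=timedelta(days=2))

print(regulatory_reporting.is_met(timedelta(days=2), timedelta(0)))
print(product_catalog.is_met(timedelta(minutes=3), timedelta(days=1)))
```

Keeping the two values separate avoids the temptation to rank services on a single "criticality" scale, which is exactly what the examples above argue against.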

Recovery time criticality, however, does not mean that applications with high RTO and RPO values are unimportant. Many strategically important applications have high RTO/RPO values.

A high-availability objective has a single dimension for expressing average recovery time (RPO is assumed to be zero for databases and undefined for application files). A disaster recovery objective has two dimensions expressing recovery time and recovery point. But in a service disruption continuum, recovery objectives for all services have both RTO and RPO dimensions. Figure 3 illustrates a 4x4 recovery-class matrix for classifying applications comprising a service and mapping applications to solution approaches.

Analysis of your business processes will determine the coordinates of your matrix and how you classify applications. For any given coordinate, there are a small number of solutions that will satisfy the corresponding RTO/RPO values.
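As a rough sketch of how such a classification might work, the following maps an application's RTO and RPO onto one cell of a 4x4 matrix. The band boundaries are purely hypothetical; as noted above, real coordinates come out of your own business analysis.

```python
from bisect import bisect_left

# Hypothetical upper bounds for the first three bands, in minutes.
# Anything above the last bound falls into the fourth band.
RTO_BOUNDS = [15, 4 * 60, 24 * 60]  # <=15 min, <=4 h, <=24 h, longer
RPO_BOUNDS = [0, 60, 24 * 60]       # zero loss, <=1 h, <=24 h, older


def recovery_class(rto_minutes: float, rpo_minutes: float) -> tuple[int, int]:
    """Return (rto_band, rpo_band), each 0..3, where 0 is the most demanding."""
    return (
        bisect_left(RTO_BOUNDS, rto_minutes),
        bisect_left(RPO_BOUNDS, rpo_minutes),
    )


# A service that must be back in 5 minutes but tolerates two-day-old data.
print(recovery_class(5, 2 * 24 * 60))   # (0, 3)
# A service that may take three days to restore but cannot lose any data.
print(recovery_class(3 * 24 * 60, 0))   # (3, 0)
```

The two sample calls land in opposite corners of the matrix, which is why a single-axis ranking cannot describe them.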

In theory, the solution that best satisfies the objectives of a given coordinate will cost less than the best solution for a coordinate having a lower RTO/RPO value. The goal is to implement the lowest cost solution that satisfies the RTO and RPO for a given application.
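The selection rule can be sketched in a few lines: filter out every approach that cannot meet both objectives, then take the cheapest of what remains. The catalog below is invented for illustration; the approach names, achievable recovery figures, and costs are assumptions, not vendor data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Solution:
    name: str
    rto_minutes: float  # recovery time the approach can achieve
    rpo_minutes: float  # data age the approach can achieve
    annual_cost: float


# Hypothetical catalog; all values are made up for this example.
CATALOG = [
    Solution("synchronous replication with failover", 5, 0, 500_000),
    Solution("asynchronous replication", 60, 15, 200_000),
    Solution("nightly backup and restore", 2 * 24 * 60, 24 * 60, 40_000),
]


def cheapest_solution(
    rto_minutes: float, rpo_minutes: float, catalog: list[Solution] = CATALOG
) -> Optional[Solution]:
    """Lowest-cost solution meeting both objectives, or None if none qualifies."""
    candidates = [
        s for s in catalog
        if s.rto_minutes <= rto_minutes and s.rpo_minutes <= rpo_minutes
    ]
    return min(candidates, key=lambda s: s.annual_cost, default=None)


print(cheapest_solution(4 * 60, 60).name)       # asynchronous replication
print(cheapest_solution(3 * 24 * 60, 0).name)   # synchronous replication with failover
```

Note the second call: a relaxed RTO does not make the cheap option acceptable when the RPO is zero, so both dimensions have to be checked together.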

However, there are usually several ways to implement a given solution, depending on the following considerations:

  • Not all solutions support all platforms.
  • Different workloads (e.g., read-intensive, write-intensive, batch update, etc.) place significantly different demands on system and network resources.
  • Some replication and backup/restore techniques are good for some kinds of data (e.g., database, application files, system images, logs, etc.) and bad for others.
  • Pricing structures differ among vendors, so the cost of a given solution can vary widely. Furthermore, a real solution usually combines components from several vendors and might include a healthy bit of customization. In such cases, component interoperability would be a major consideration.

Your last question might be: "What problems does the service disruption continuum address?" Very simply, you avoid two undesirable outcomes:

  • Gross overkill. In short, you avoid buying a tractor-trailer when you really need a pick-up truck. Of course, if you already have a tractor-trailer ready to transport your household goods from coast to coast, it might make sense to transport your car at the same time. But you might decide to drive your car coast to coast since you are traveling anyway. And you might ship valued personal effects via a secure delivery service.
  • Insufficient capacity. In short, you wouldn't try to haul a boat with a golf cart. It makes little sense to throw money away on a solution that won't do the job or can't scale to handle peak loads.

About the Author
Bob Baird is a senior solutions architect for Symantec, Business Continuity Practice. He has been a performance, storage systems, high availability, and disaster recovery architect for IBM, HP, and Symantec over his 40-year career. Bob also holds numerous patents in the area of data and storage management. You may reach him by telephone at 408-529-6594 or by e-mail at bob_baird@symantec.com.
