Disaster Recovery Prevention:

A Proactive Approach for Business Continuity for Critical Facilities


Business continuity is the ultimate goal of critical facilities, so disaster recovery planning is essential. However, an effective reactive plan for an unavoidable failure is not the only answer. Taking a proactive approach to preventing disasters before they occur is not only an alternative to what is widely practiced today, but also a complementary philosophy for facilities that cannot afford downtime. The basic idea is: plan for the moment after a failure, but first, fight not to fail at all!

The first assumption of every Disaster Recovery and Business Continuity (DR & BC) plan must be this: a disaster will happen. From unpredictable Mother Nature to the vulnerable power grid, we cannot calculate the “when,” but by effectively hardening facilities and implementing DR & BC plans we can be ready for the “what now.”

The DR & BC plan covers both the hardware and software required to support the critical business applications in the event of a disaster. The list (right) shows the causes of critical failures: 3% are natural disasters, 28% are technology-related, and the remaining 69% are in some way related to the human factor. The human factor includes design, maintenance, testing and, of course, human error.

The DR & BC plan is not limited to finding a space to restore the critical processes; it should also take a proactive, preventive approach that addresses each risk factor. A gap analysis that follows an all-levels risk assessment should generate action items such as geo-redundancy considerations, reliability improvements, maintenance and operational procedures, and human error minimization. This approach should not be local or sub-system oriented, but all-inclusive, in order to generate good value for the investment. Ultimately, the continuity of business and operations must always answer to the bottom line.

The proactive BC strategy must include three major targets for hardening at the facility level: a risk assessment, an implementation procedure, and improved operational processes.

The first target, a risk assessment, is a four-step process that encompasses a site evaluation, a vulnerability assessment, benchmarks, a gap analysis and, of course, recommendations.

The first step in the risk assessment is to develop resiliency metrics for mechanical, electrical, server, service, and application components. It is imperative to quantify reliability and recovery expectations at the multiple levels of power delivery.

Second, all single points of failure (SPOFs) must be identified within all the critical systems. Ideally this should include not only the facility infrastructure, but also the computational, communication and software/application layers. While identifying SPOFs, a probabilistic risk assessment (PRA) model must be developed that includes an evaluation of all redundancy requirements. Alongside this important step, a substantial database must be created and carefully organized so that it is effective for the steps that follow.
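As a minimal sketch of the SPOF step, assume the critical systems can be described as a dependency graph running from a source to a critical load. The Python fragment below removes each component in turn and checks whether the load can still be reached. The component names and the topology are hypothetical and are not taken from any real facility; a full PRA model would also attach failure probabilities to each component.

```python
from collections import deque

def reachable(graph, source, target, removed=None):
    """Return True if target can be reached from source while skipping `removed`."""
    if source == removed or target == removed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(graph, source, target):
    """List components whose loss alone cuts every path from source to target."""
    components = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    components -= {source, target}
    return sorted(c for c in components
                  if not reachable(graph, source, target, removed=c))

# Hypothetical power path: two utility feeds and two UPS modules,
# but only one shared switchgear between them.
power_path = {
    "grid": ["feed_a", "feed_b"],
    "feed_a": ["switchgear"],
    "feed_b": ["switchgear"],
    "switchgear": ["ups_1", "ups_2"],
    "ups_1": ["server_load"],
    "ups_2": ["server_load"],
}

spofs = single_points_of_failure(power_path, "grid", "server_load")
print(spofs)
```

In this illustrative topology the feeds and UPS modules are each duplicated, so only the shared switchgear is reported. The same check could in principle be run over the computational, communication and application layers if they are modeled in the same graph.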

The third step is the gap analysis, which compares the database with the findings.
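One way to picture the gap analysis, assuming the database holds a target value per system and the site assessment yields a measured value for the same systems, is a simple comparison of the two. The system names and availability figures below are invented for illustration only.

```python
# Hypothetical resiliency targets and assessed values (availability fractions).
targets = {"electrical": 0.99999, "mechanical": 0.9999, "network": 0.9999}
findings = {"electrical": 0.9995, "mechanical": 0.99995, "network": 0.9990}

def gap_analysis(targets, findings):
    """Return {system: shortfall} for every system that misses its target."""
    gaps = {}
    for system, target in targets.items():
        measured = findings.get(system)
        if measured is None:
            gaps[system] = None          # never assessed, which is itself a gap
        elif measured < target:
            gaps[system] = target - measured
    return gaps

gaps = gap_analysis(targets, findings)
for system, shortfall in sorted(gaps.items()):
    status = "not assessed" if shortfall is None else f"short by {shortfall:.5f}"
    print(system, status)
```

Each entry in the result is a candidate action item for the recommendations that follow.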

The fourth step is the outlining of recommendations for upgrades or alterations to optimize facility, plant, IT system, IT services, and application performance and resiliency.

The second target is the implementation procedure, whose goal is to implement the recommendations presented in the risk assessment. Using reliability modeling, the performance (reliability and availability) of each design option needs to be quantified against its cost so that design decisions can be made in the initial phase of the project. Since the costs associated with each reliability enhancement or redundancy increase are significant, sound decisions can only be made by quantifying the performance benefits and by analyzing the options against the respective cost estimates. An overall schedule must be developed containing all the project phases.

Commissioning is a key component that complements the implementation phase. Commissioning, simply stated, is the documented and systematic process of ensuring that all building subsystems perform interactively according to their intended design and operational function. Why is this so important? Because commissioning minimizes the occurrence of hidden malfunctions, i.e., failure. The commissioning process is site-specific and begins by verifying the performance of individual system components. Test procedures must be uniquely designed for each manufacturer's equipment and application to measure and verify specific performance parameters.
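The comparison of design options by reliability and cost can be illustrated with the standard steady-state availability formulas: availability of one component is MTBF / (MTBF + MTTR), components that are all required multiply their availabilities, and redundant components multiply their unavailabilities. The sketch below applies these formulas to two design options. The MTBF, MTTR and relative cost figures are hypothetical, the components are assumed to fail independently, and a real reliability model would be considerably more detailed.

```python
from math import prod

HOURS_PER_YEAR = 8760

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*avail):
    """All components are required: availabilities multiply."""
    return prod(avail)

def parallel(*avail):
    """Any one component suffices: unavailabilities multiply."""
    return 1 - prod(1 - a for a in avail)

def downtime_hours(avail):
    """Expected downtime per year for a given availability."""
    return (1 - avail) * HOURS_PER_YEAR

# Hypothetical figures for illustration only.
ups = availability(mtbf_hours=50_000, mttr_hours=8)
chiller = availability(mtbf_hours=20_000, mttr_hours=24)

# Each option maps to (system availability, relative cost).
options = {
    "single UPS + single chiller": (series(ups, chiller), 1.0),
    "redundant UPS + redundant chiller": (
        series(parallel(ups, ups), parallel(chiller, chiller)), 1.8),
}

for name, (avail, relative_cost) in options.items():
    print(f"{name}: availability={avail:.6f}, "
          f"downtime={downtime_hours(avail):.2f} h/yr, cost x{relative_cost}")
```

Putting expected downtime next to relative cost for each option is one way to make the trade-off explicit before committing to a design.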

Following the verification of individual modules, integrated testing of major systems must be performed. This testing is a cumulative exercise to verify the reliability of the design and the compatibility among all critical systems (electrical, mechanical, IT and environmental). The systems must be tested not only in standard operating modes, but also in failure and safety modes, to ensure there is redundancy within and among all systems. The intent of such testing is to simulate any “real life” disaster conditions that the facility could undergo.

The third target, improvements at the operational level, should include comprehensive maintenance procedures that reflect an understanding of the failure mechanisms of the equipment.

A proposed all-inclusive methodology for facility maintenance should be implemented during the proactive BC program. This methodology, known as total maintenance, combines preventive maintenance, reliability-oriented maintenance and corrective maintenance in its various stages.

Developing the BC strategy at the facility level is not enough. It should also be planned for at the enterprise/plant level. The best analogy here is the idea of a Local Area Network (LAN) versus a Wide Area Network (WAN). Global financial institutions and major ISPs operate facilities able to work in “stand-alone” or “cooperative” modes. The geo-redundancy concept was created as a proactive approach to BC and DR. This concept has become popular lately as there has been a movement from the off-DR facilities to fully active redundant sites. The stand-alone mode means building a facility with all the capabilities described above, including the comprehensive hardening process the facility underwent in order to meet the business profile. But what if this is not enough and the hardened facility “A” must be able to cooperate with facility “B”? The requirements of cooperation may include:

  • Sharing databases
  • Real time mirroring of data
  • Distributed H/W
  • Complementary coding systems
  • Access to distributed sensor networks

Thus the questions raised are:

  • Does Facility B have the same survivability standards as Facility A?
  • Does Facility B have the same protection against vulnerabilities as Facility A?
  • Did Facility B pass the same hardening process as Facility A?
  • Are there any circumstances under which one of the facilities will be unavailable for the other facility’s needs during the mission?

The methodology suggests that the facilities must have the same hardening capabilities to accomplish the business objective.

An example of this is an Internet shopper. Users hitting the “BUY” button are immediately sent from a low availability platform to a high availability platform to complete the sale. In the event of a server failure at the facility, the high availability platform will switch to a mirrored facility to complete the sale with no significant delays (unbeknown to the satisfied shopper).

Experience has shown that one of the most important pieces of the whole geo-redundant scheme, and one that can substantially strengthen the proactive BC plan, is the availability of the IT fail-over mechanism between facilities.
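The fail-over idea can be sketched in a few lines, assuming each facility exposes some form of health probe and that transactions are routed to the first healthy facility in priority order. The `Facility` class, the `is_healthy` probe and the `route_transaction` function are hypothetical names introduced for illustration, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Facility:
    name: str
    is_healthy: Callable[[], bool]   # hypothetical health probe

def route_transaction(facilities: Sequence[Facility]) -> str:
    """Return the name of the first healthy facility, in priority order."""
    for facility in facilities:
        if facility.is_healthy():
            return facility.name
    raise RuntimeError("no facility available to complete the transaction")

# Facility A has suffered a server failure; mirrored Facility B is up.
facility_a = Facility("A", is_healthy=lambda: False)
facility_b = Facility("B", is_healthy=lambda: True)

chosen = route_transaction([facility_a, facility_b])
print(chosen)
```

The sketch also shows why the fail-over mechanism itself must be highly available: if the routing logic or the health probes fail, the hardening of both facilities no longer helps the transaction complete.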

The best approach to BC is to take proactive steps to ensure operational continuity as opposed to concentrating a facility’s efforts on disaster recovery after the fact.

Yes, disaster recovery planning is a must, but it is just one piece of the BC plan; business continuity is, of course, the overarching goal. Additional practices suggested are hardening the facilities, improving operational availability, and physically spreading operations across sites with a high-resiliency fail-over mechanism.

This article was published in the Disaster Resource GUIDE for Facilities (Fall 2006).

About the Authors
Rick Einhorn is the Chief Marketing Officer of EYP Mission Critical Facilities®, Inc., (www.eypmcf.com) a global E/A firm with over 300 professionals and offices in ten cities across the US and the UK.

Kfir L. Godrich is a Principal and Director of Technology Development. His role in research and development for new technologies strengthens EYP MCF’s thought leadership in the mission critical industry.

They can be reached at (212) 277-0099 or reinhorn@eypmcf.com and kgodrich@eypmcf.com.