The Anatomy of a Disaster (and how to prevent one)
Information Availability & Security
Written by Ross Clurman   

The word “disaster” in “Disaster Recovery Planning” should not be taken literally; think of it as a generic term for “downtime.” No business owner likes downtime, because time lost usually translates to money lost.

So, when planning for disasters (or downtime), it is important to understand the stages of a disaster and how each relates to your overall prevention strategy.

Disaster Scenarios

There are two scenarios in which “disasters” occur: planned and unplanned.

Planned disasters include server maintenance, upgrades, system tests, etc. In the case of a planned disaster, your data is (hopefully) backed up and you have the staff in place to address any issues that may arise.

In most cases, planned disasters are simply categorized as downtime and can be recovered from without problems – if you are prepared.

Unplanned disasters include natural disasters, power outages, bugs, and human errors – someone spills coffee on the server rack, powers down the wrong server, etc. They can cause data loss and extensive downtime, and can be very disruptive to your business and your end users. They also tend to happen at the most inconvenient times possible, further compounding the frustration surrounding the event.

Four Stages

In either case, there are four identifiable stages that take place, and if your business relies on functioning Tier 1 applications, it is important to understand the characteristics of each stage. Knowing these characteristics will in turn enable you to understand how to effectively deal with issues that arise and how to plan for (and prevent) disasters.

Stage 1 – A problem occurs

The power fails. The application unexpectedly quits. The network goes down. Regardless of the problem, this is how a disaster starts. Your DR solution should monitor the following:

  • Hardware and power failures
  • Operating system performance
  • Application configuration and stability
  • Network congestion and isolation
  • Data availability, access, and corruption

In addition to monitoring them, you should be able to set tolerance levels and define the key performance indicators (KPIs) for each of these areas.
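As a rough illustration of what "setting tolerance levels" can mean in practice, the Python sketch below classifies KPI readings against warning and failure thresholds. The KPI names, the threshold values, and the helper names (`Tolerance`, `evaluate`, `check_all`) are all hypothetical; they are not taken from any particular DR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    """Acceptable range for one KPI (names and limits are illustrative)."""
    kpi: str
    warn_at: float  # reading at which a warning is raised
    fail_at: float  # reading at which the KPI counts as failed


def evaluate(tolerance: Tolerance, reading: float) -> str:
    """Classify a reading as 'ok', 'warning', or 'failure'.

    Assumes higher readings are worse (latency, load, disk usage).
    """
    if reading >= tolerance.fail_at:
        return "failure"
    if reading >= tolerance.warn_at:
        return "warning"
    return "ok"


# Hypothetical tolerance levels for three monitored areas.
TOLERANCES = {
    "cpu_load_percent": Tolerance("cpu_load_percent", warn_at=80, fail_at=95),
    "network_latency_ms": Tolerance("network_latency_ms", warn_at=200, fail_at=1000),
    "disk_used_percent": Tolerance("disk_used_percent", warn_at=85, fail_at=98),
}


def check_all(readings: dict[str, float]) -> dict[str, str]:
    """Return the status of every reading that has a defined tolerance."""
    return {
        name: evaluate(TOLERANCES[name], value)
        for name, value in readings.items()
        if name in TOLERANCES
    }
```

With these assumed thresholds, a CPU load of 91 percent would be reported as a warning and a disk that is 99 percent full as a failure. Real tolerance levels depend on your own applications and service commitments.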

Stage 2 – You (or your end users) realize there is a problem

Email is hanging. Your database is not accepting write commands. In many cases, this is the longest stage of a disaster and can be the most disruptive if your customers, or end users, experience any downtime.

For that reason, it is also important that you are automatically notified of issues. If not, your users are likely to notify you, and they can be slightly more abrasive.

The best DR solutions don’t simply rely on email as a notification system, because an email won’t reach the recipient if your network is down. Some systems can send a page and/or SMS notification, in addition to an email.
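The idea of not depending on a single notification path can be sketched as follows. This is a minimal Python illustration, assuming each channel is a function that returns `True` on success; the `send_email`, `send_sms`, and `send_page` functions are stand-ins invented for the example, not calls into a real messaging service.

```python
from typing import Callable, Iterable

# A channel is any callable that takes a message and returns True on success.
Channel = Callable[[str], bool]


def notify(message: str, channels: Iterable[tuple[str, Channel]]) -> list[str]:
    """Try every channel and return the names of those that delivered.

    A channel that raises or returns False is skipped rather than treated
    as fatal, so one dead path does not block the others.
    """
    delivered = []
    for name, send in channels:
        try:
            if send(message):
                delivered.append(name)
        except Exception:
            # This path is unavailable; keep going with the remaining ones.
            continue
    return delivered


# Stand-in senders (hypothetical). The email sender simulates a network outage.
def send_email(message: str) -> bool:
    raise ConnectionError("mail server unreachable")


def send_sms(message: str) -> bool:
    return True


def send_page(message: str) -> bool:
    return True


delivered = notify(
    "Database is rejecting writes",
    [("email", send_email), ("sms", send_sms), ("page", send_page)],
)
```

In this simulated outage the email attempt fails, but the alert still goes out by SMS and page. An empty result would mean that nobody was told, which is itself worth escalating.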

Stage 3 – Determining the course of action

Now that a problem has occurred and you are aware of it (either by an angry voice or an automatic notification), you must determine how to address it. Because disasters vary so widely and the number of possible solutions is effectively unlimited, it is impossible to say which solution is right for your organization. Before determining the appropriate course of action, your organization should know:

  • The cost – not only in terms of time, but how that time converts into dollars
  • The steps to implement and how long it will take
  • The benefits of one method versus another
  • The alternatives – whether there is a better, more permanent solution that can be implemented
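To make the first two points concrete, here is a small Python sketch of how time converts into dollars when comparing recovery options. The hourly downtime cost, the option names, and the figures are made-up assumptions for illustration; substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate course of action (all values are illustrative)."""
    name: str
    hours_to_implement: float  # downtime incurred while the fix is applied
    fixed_cost: float          # parts, licences, overtime, and so on


def total_cost(option: Option, downtime_cost_per_hour: float) -> float:
    """Fixed cost plus the dollar value of the downtime the option incurs."""
    return option.fixed_cost + option.hours_to_implement * downtime_cost_per_hour


def cheapest(options: list[Option], downtime_cost_per_hour: float) -> Option:
    """Pick the option with the lowest total cost."""
    return min(options, key=lambda o: total_cost(o, downtime_cost_per_hour))


# Assumed figures: downtime costs $5,000 per hour.
DOWNTIME_COST_PER_HOUR = 5000.0
restore = Option("restore from backup", hours_to_implement=6, fixed_cost=0)
failover = Option("fail over to secondary", hours_to_implement=0.5, fixed_cost=2000)

best = cheapest([restore, failover], DOWNTIME_COST_PER_HOUR)
```

Under these assumptions, restoring from backup costs $30,000 in lost time, while failing over costs $4,500 in total, so the option with the higher up-front cost is the cheaper one overall. Working this out ahead of time keeps the decision short when the clock is running.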

Don’t spend too long determining a solution, because every second counts (and costs you money).

Stage 4 – Failover and recovery occurs

The fourth and final stage of a disaster involves implementing the solution. This stage is the home stretch but, depending on the disaster, it may be as taxing as the prior three.

It may involve enabling an emergency power supply, replacing a network switch, loading a set of backed-up data, or rebuilding an entire server farm.

Depending on the disaster, and the solution used to recover, you may still have work to do. Some disaster recovery involves failing over to a secondary or remote server. Be sure to accommodate the process of getting back to the primary site in your recovery plan.

Your DR solution should proactively monitor KPIs and allow you to set the tolerance levels. In addition, you should be able to perform a manual failover and failback with roughly the same amount of effort – neither should be more taxing on your system and staff than the other.
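One way to picture failover and failback taking the same effort is to treat them as the same operation with the roles reversed. The Python sketch below is a deliberately simplified model; the `ServerPair` class and the site names are hypothetical, and a real switch would also have to handle data synchronization and client redirection.

```python
from dataclasses import dataclass


@dataclass
class ServerPair:
    """A primary/secondary pair, tracked only by which one is active."""
    active: str
    standby: str

    def switch(self) -> str:
        """Promote the standby, demote the active server, return the new active."""
        self.active, self.standby = self.standby, self.active
        return self.active


pair = ServerPair(active="primary-site", standby="secondary-site")
after_failover = pair.switch()  # failover: the secondary site takes over
after_failback = pair.switch()  # failback: the primary site resumes
```

Because both directions go through the same `switch` routine, the return trip to the primary site is no harder than the trip away from it.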

Your solution should reduce the amount of work required to recover from a disaster. This could include automatically updating internal DNS to reroute users to the “new” production servers, logging the exact point of failure so the amount of data lost can be determined, and even something as simple as notifying users that there was an outage, and that they may experience slower load times and degraded performance as repairs are made.
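As an example of why logging the exact point of failure matters, the Python sketch below estimates the data-loss window from two timestamps. It assumes you record when data was last successfully copied to the secondary server and when the failure occurred; the function names and sample times are illustrative only.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime, failed_at: datetime) -> timedelta:
    """Time span during which writes were accepted but not yet replicated."""
    if failed_at < last_replicated:
        raise ValueError("failure time precedes the last successful replication")
    return failed_at - last_replicated


def writes_at_risk(write_times: list[datetime],
                   last_replicated: datetime,
                   failed_at: datetime) -> list[datetime]:
    """Writes made after the last replication and up to the point of failure."""
    return [t for t in write_times if last_replicated < t <= failed_at]


# Sample timestamps (made up for the example).
last_ok = datetime(2024, 1, 1, 12, 0, 0)
failure = datetime(2024, 1, 1, 12, 4, 30)
writes = [
    datetime(2024, 1, 1, 11, 59, 0),  # already replicated
    datetime(2024, 1, 1, 12, 1, 0),   # at risk
    datetime(2024, 1, 1, 12, 4, 0),   # at risk
]

window = data_loss_window(last_ok, failure)
at_risk = writes_at_risk(writes, last_ok, failure)
```

Here the window is four and a half minutes and two of the three writes fall inside it. Without a precise failure timestamp, neither figure could be determined.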

Whatever the solution, make sure you fully understand the anatomy of the disaster so you can prevent future occurrences. Disaster recovery is never a perfect science, so review what went well and what didn’t so that you’re better prepared for the next disaster.


About the Author
Ross works with Neverfail to better align the product strategy of their award-winning suite of DR, high/continuous availability, and migration assistance software solutions. For more information, you can reach Ross by email at rclurman@neverfailgroup.com or by telephone at (512) 327-5777. www.neverfailgroup.com