Disaster/Recovery Darwin List

No matter how smart the technologist, there are always glitches and gotchas when planning for a disaster. The problems I have seen and the stories I have collected demonstrate that the best-laid plans are not necessarily the best plans.

The problems stem from assumptions made in the planning process or from the exclusion of non-technical personnel from that process. I have heard remarks from non-technical people that, when considered, were insightful and right on the mark.

The following 16 situations have been collected from clients, seminar and conference attendees, and friends. They may seem funny, even ridiculous. Remember that they are real occurrences, not the work of a joke writer. I hope they will stimulate you to brainstorm your planning process with a wide range of personnel and to consider what appear to be off-the-wall or outrageous thoughts. If you have stories of your own to tell, please send them to me at delphi-inc@att.net so I can continue the Darwin collection and prepare another paper.

  1. When Did We Last Try This? – An enterprise installed a backup power generator. It was sized correctly, and the initial test went as planned. The enterprise felt very comfortable that in the event of a power outage, the generator system would support their environment for days. The power outage occurred. The generator started and worked for about one minute, then stopped. It could not be restarted. The enterprise discovered that because the generator had not been exercised regularly, rust had accumulated in the fuel tank. Once the rust reached the fuel filter, the filter became completely clogged. It took hours to determine the problem and resolve it. Always, always exercise the backup facility at least once a month to ensure that it will work when needed.
  2. UPS Will Fit in the Closet – As the closet becomes the focal point for PoE and network infrastructure, the use of backup battery systems (UPS) in the closet has proliferated. At one site, the enterprise had budgeted for the UPS investment. The closet space survey determined that there was not enough floor space for the battery system to be laid out on the floor. So one of the technicians decided that the batteries could be stored vertically at the end of the closet. The facilities staff found out about the solution and stopped it immediately. The weight of the batteries, when stored vertically, would exceed the floor weight limitations. The facilities staff had to go to the lower floors and reinforce the floor above to allow the battery arrangement. The solution was successful, but the total cost, including the reinforcing material, far exceeded the UPS budget.
  3. We Powered What? – One of the smaller telephone companies built a network operations center for their infrastructure transmission facilities. One day they discovered that there was no backup power available at the site. Nothing had happened; they were simply planning ahead. The backup power was installed and demonstrated successfully. Eventually there was a power failure, but the equipment that was supposed to be powered did not work. The electrical outlets were color coded to show where the backup power was connected. Unfortunately, the color codes had been reversed. The only thing that worked after the power failed was the Christmas tree.
  4. Trying Out the Disaster/Recovery Plan – An organization developed and implemented a disaster/recovery plan. They wanted to try out the operation to see whether their plans and implementation were thorough and successful. So on a Monday they caused a failure. Everything worked as planned and there was no major interruption to the users. Of course it worked, because everyone had been informed the previous Friday of the planned failure. The users all downloaded what they needed onto their laptops so they would not be bothered. The test appeared successful only because everyone already had the resources they needed, independent of the infrastructure.
  5. The Vent is Where? – An electrical generator was installed for backup power. Everything worked as planned. There were no problems at all. However, when someone went into the closed garage, they realized that the generator exhaust was vented into that same enclosed space. The exhaust design had to be changed, with a commensurate increase in the budget.
  6. Saving the Data – A company that anticipated the effects of Hurricane Katrina decided that their data center site might be completely destroyed. In anticipation of the hurricane, the company's data files were loaded onto magnetic tapes and given to multiple employees. Sure enough, the data center was damaged. One magnetic tape was not returned for nine months. The company had assumed that each employee would be locatable after the hurricane and would automatically return the tape. It turned out that this employee had been evacuated out of state. The company had not planned for the size of the hurricane disaster and had not created an employee contact and location plan.
  7. Safe IT Equipment – An organization knew that there was a possibility that the first floor IT closets might be flooded under extreme conditions. So all of the equipment racks were filled with equipment above flood level, leaving the first three feet of rack space empty. No one had talked to the electricians. All of the electrical connections were installed at floor level, well below the flood level. In a flood, the equipment would be protected but the electrical connections would be damaged.
  8. The Data Center is Safe? – A new data center was constructed in an existing office building. They planned for every eventuality, including building closure due to protesters, employees staying on site for days, backup power... No one thought of the building itself. This data center, which connected all of the other data centers together, was directly above the loading dock. There was one critical pillar holding up that end of the building. If a truck damaged the pillar or if it collapsed, this critical data center would be destroyed.
  9. The Closet Generator – An enterprise installed a power generator for backup power. The installation was a success. When they tried out the generator, everything worked fine. Then they went to turn off the generator. The door would not open. The air intake for the generator was inside the generator room, and the running generator had created a vacuum. The staff had to pry open the door frame to get inside to shut it off.
  10. The Almost Complete Plan – A financial institution developed a comprehensive disaster/recovery plan. Unfortunately, six months after the plan was finalized, the building burned down. The employees had alternate working locations at temporary business centers, hotels and at home. They all had a place to work. What was missing from the plan was directory and location information, so employees called each other at home to find out where everyone was. It took about two weeks to complete the directory and location files.
  11. The Data Center Power Worked – The data center designers had included a well-planned backup power system at a California site. There were many very short power fluctuations (lasting seconds) that did not interrupt the data center operation. Finally, a major power outage occurred in the evening. What the designers had forgotten were the electrically controlled secure doors to the building. No one could get in. The impromptu solution was for anyone needing entry to call on a cell phone and have someone inside go to the entrance and let them in.
  12. Backup Phones in Conference Room – A manufacturer installed a VoIP system with PoE. The cost of UPS was beyond their budget. The emergency communications plan called for CENTREX phones to be installed in all the conference rooms. When there was a power failure, the conference room phones could be used. Finally a power failure occurred. The conference rooms had no windows and no one had installed backup lighting. With the lights out, the employees had to hunt around for flashlights to locate the CENTREX phones in the dark.
  13. Fueling the Center – Another lesson from Hurricane Katrina came from the number of closed roads. The closures lasted in some cases for weeks. Organizations that depended on diesel generators were able to operate for several days. The assumption was that fuel trucks would eventually arrive. They did not. After about a week of operation, these locations went down due to lack of fuel. A school system in the South decided that it would have to provide shelters during disasters and was concerned about fuel delivery. It designed its generator operation to run on natural gas, a supply that was less likely to be interrupted during a disaster.
  14. The Disaster/Recovery Staff – One of my early projects was the automation of a central bank. It was important that operations continue even if they had to be carried out manually. So we designed a manual backup system with adequate equipment and staff that would work. Our system had an automated recovery system that worked whenever we had a problem. One day I decided we needed to exercise the manual backup procedures to ensure they still operated correctly. Nothing worked. The equipment was there, but not the people. The different departments had gained so much confidence in our system that they had reassigned their staff, so the manual system could not operate.
  15. The Emergency Host – An organization decided to update its emergency procedures. They proposed that every building on their campus have an emergency host who would stand outside the building to direct emergency services. That sounds good until you consider explosions, chemical leaks, terrorism and earthquakes. The plan also assumed that the host would only be needed during work hours. They finally realized this cheap plan had many flaws and did not adopt it.
  16. It Worked Before – A data center with many computers on multiple floors was designed with a diesel generator system for backup power. The system was tested successfully for three years. Finally the power was lost. The generator would not start. The generator on the first floor depended on fuel tanks in the basement. Without electrical power, the fuel pumps did not work. Fuel cans had to be carried to the first floor to start the generator, which could then power the fuel pumps that kept the generator running.

Thinking about the possibility of a disaster and the means of recovering the IT operation involves much more than technical planning and implementation. As the stories above illustrate, there will always be some flaw or assumption that may prevent a successful recovery. So what do you do? Get the non-technical and business-oriented personnel in the organization not only to participate, but also to criticize any disaster/recovery plans. Follow these rules:

  • Get everyone possible to participate.
  • Open up the discussion to all equally. Do not lead with conclusions.
  • Do not discard any idea or problem even if it seems beyond what is reasonable. There may be a gem there.
  • List all the assumptions that influenced the plan.
  • Document all comments, then review them.
  • Finally, test the plan under real conditions even if your management objects; otherwise you will never know how well the plan works.
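
For readers who keep their planning notes in software, the rules about listing assumptions and testing the plan can be sketched as a small assumption register. This is only a minimal illustration in Python, not a tool used in any of the stories above. The record layout, the function name overdue, and the sample entries and dates are all invented for the example; the 30-day default simply mirrors the "at least once a month" advice in story 1.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Assumption:
    """One assumption behind the plan (hypothetical record layout)."""
    text: str
    raised_by: str
    last_exercised: Optional[date] = None  # None means never tested


def overdue(assumptions: List[Assumption], today: date,
            max_age_days: int = 30) -> List[Assumption]:
    """Return assumptions never exercised, or last exercised more than
    max_age_days before today."""
    stale = []
    for a in assumptions:
        if a.last_exercised is None:
            stale.append(a)
        elif (today - a.last_exercised).days > max_age_days:
            stale.append(a)
    return stale


# Illustrative entries only; the dates and roles are invented for the example.
register = [
    Assumption("Generator starts and keeps running under load", "facilities",
               date(2024, 1, 5)),
    Assumption("Fuel pumps work without utility power", "electrician"),
    Assumption("Employees can be located after an evacuation", "HR",
               date(2024, 3, 1)),
]

for item in overdue(register, today=date(2024, 3, 15)):
    print(f"Needs a test: {item.text} (raised by {item.raised_by})")
```

The point of the sketch is that an assumption nobody has tested is treated exactly like one whose test is out of date: both end up on the list for the next exercise.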

 


About the Author
Gary Audin may be reached at delphi-inc@att.net