Written by Kathleen Lucey, FBCI   

 


Hunting the Black Swans in Your Continuity Program


This is Vol. II, No. 9 in the DRG ongoing series on hunting and mastering the black swans in your continuity program.

“Black Swans” in your Continuity Program are those events that remain outside the range of normal expectations, and may well produce a significant negative impact when they occur. For reasons of budget, culture, or simple lack of awareness, we just do not see or deal with these potentially devastating exposures in our enterprise continuity capability. This series discusses some of the most common of these “black swans” in business continuity programs, those that are really staring us in the face and screaming for attention.

Already published:

Volume I

Quarry 1: Employee Availability for Response Activities.
Quarry 2: The Level of Individual Employee Commitment to BCM
Quarry 3: Exercising Your Plans
Quarry 4: Exercising Your Plans: Objectives and Annual Programs
Quarry 5: Exercising Your Plans: Business Unit Continuity Plans
Quarry 6: Exercising Your Plans: Technology Recovery Plans
Quarry 7: Exercising Your Plans: Logistics, Communications, and Support Plans
Quarry 8: Lessons Learned
Quarry 9: New Year's Resolutions
Quarry 10: 10 Steps to Building a Black Swan-free Business Continuity Management Program
Quarry 11: New Year's Resolutions
Quarry 12: Developing "Black Swan Sighting" Skills: Warm-up Exercises

Volume II:

Quarry 1: The Centrality of Power: Seeing the Connections
Quarry 2: Power Outages: Isolation Effects
Quarry 3: Power Outages: How Employers Can Get Involved
Quarry 4: Cascading Effects on the Support Fabric
Quarry 5: Deeper Dives to Narrower Terrains: Dive 1
Quarry 6: Deeper Dives in Wider Terrains: Dive 1
Quarry 7: Cascading Black Swan Events
Quarry 8: Cascading Black Swan Events 2: Avian Flu Outbreak

Volume II: Quarry 9: Black Swans in Our Midst: Debugging Your Response Preparations

The auditors have accepted your exercise program results for business continuity and disaster recovery and crisis management for years, and have checked that box as “passed”. Your management is pleased with the audit reports.

But wait, is this really good enough? While this type of exercise is certainly important for formal compliance purposes, it may not uncover the gaps, flaws, and omissions in data backup or recovery processes (notification, mobilization, deployment), nor will it make clear that those detailed technical procedures necessary to reconstruct information systems and synchronize data are incomplete and may be incorrect.

IF YOU ARE NOT LOOKING FOR ERRORS OR OMISSIONS, IT IS UNLIKELY YOU WILL FIND THEM.

The very first item on your to-do list is the identification of these errors or omissions. For example, if you are using the same technical staff who wrote the procedures to execute those procedures during a system re-creation exercise, this is not an adequate test of the completeness and correctness of the procedures. Only deployment of alternate personnel not so intimately involved with the design of the recovery strategy and its documentation is likely to pinpoint the errors or incomplete information within that procedure documentation.

It may seem self-evident that if we want to ensure the effectiveness, completeness, and feasibility of our plans, we should be testing to FIND ERRORS rather than to DEMONSTRATE compliance. In practice, this is not usually the case. It may sound like a small point, but as you will see, this orientation can make all the difference in how you find and correct inadequacies in individual plans and fine-tune such fundamental processes as evacuation, convocation, damage assessment, team member notification, and the actual details of locating backup data and installing staff in their new environment.

Testing for compliance purposes alone may conceal rather than reveal plan inadequacies in such disparate areas as staff evacuation, recall of backup data from offsite storage, relocation of staff to an alternate work area, technical changes to accommodate large numbers of staff working from home, and on and on.

In order to identify flaws in your plans, you need a specific scenario in which you can take into account the impacts of the event as they cascade through the environment. Yes, this is a lot more work than a straightforward re-creation of systems at an alternate site, with business department representatives relocated to test the data only after the re-created environment has been tested. In that kind of exercise you are assured that the necessary data will be delivered from their offsite location, will already be loaded in tape libraries, and will have been used successfully to re-create the validated systems before business staff members arrive on-site.

But this is playing with dolls and dollhouses, rather than accepting the necessity to duplicate at least some of the conditions of a real event.

What will you do if the roads are impassable? Can you get a helicopter to bring in your data? What if the alternate location you normally use has been compromised by the same event? What if your A-Team technical staffers are simply not available, whether because they are tending to their families, because they cannot get from their homes to the alternate location (and you have no plans in place to cover this possibility), because they are on vacation far away, or for any other reason? What if telecommunications are down or unavailable except for the internet? How will you adjust to this?

There are those who contend that it is not useful to construct such a scenario because it is not possible to anticipate every aspect of a disaster event. But this is like saying that if you cannot be certain to have covered everything, you should do no scenario play at all; that if you cannot be confident of covering all possible events, you should not attempt anything other than patently unrealistic testing. On the other hand, by defining scenarios that flesh out the challenges of dealing with the potential cascading impacts of a disruptive event, you will be able to uncover inadequacies that are likely to otherwise escape your notice.

Here are just a few examples where backup and/or testing operations are flagrantly inadequate, and yet they sneak into business continuity programs nonetheless. I have seen all of these first-hand, and I can tell you that the situation is not pretty.

  1. We know that one of the most common sources of delay in the declaration of a disaster event is difficulty in contacting the executive members after an off-hours interruption. And then, once assembled, they do not know what information they need to make a decision, because they have never rehearsed this or any other scenario that requires their input. Some of them do not even know what their default actions should be in case communications are disrupted. Because they have never participated in any exercises that helped them learn how to work together, they will never before have sorted out the various roles they will be asked to undertake, usually under very challenging circumstances. They are therefore ill-equipped to take on these responsibilities and are likely to induce recovery delays at a time when a speedy response may be most critical.

  2. For reasons of technology operational efficiency, you have adopted a tape backup creation strategy of “full once, incremental ever after”. This decision has several advantages for an IT production environment: minimization of backup run time, and minimization of staff and equipment costs. Your media and off-site storage costs may have crept up over the years, but these year-on-year costs are easily absorbed in your operating budget. If you stack incremental backups from several systems on a single tape volume, you can also save on media costs.

    Because of the very high cost of bringing back so many tape volumes from storage, you create special tapes in advance of any IT recovery test, sending them to offsite storage to be recalled to re-create the systems during the test. Your management may or may not be aware that these “special backup tapes”, used only for the planned re-creation test, have been created.

    This means that you are NEVER using the backup tapes that you create every day under the “full once, incremental ever after” strategy, and will never experience during a test the horrific volume of tapes that you will need to bring back, mount, and read in order to restore your interrupted systems to their most current state.

    If you should experience a physical failure in just one of those incremental volumes, the game may be over. Ditto if you fail to correctly re-create your backup creation and control system to accept your “special” backups.

    Because you have never experienced a test under these realistic conditions (recall of hundreds if not thousands of tape volumes, extreme head contention in the tape library, and possibly even restoration failure because of an inability to read one specific incremental volume in a series of hundreds), your organization has no idea that this threat exists. On the contrary, your organization has been happy with your current backup strategy and its relatively low costs.

    Yes, if there is never a need to do a system restoration, they may never know just how ineffective that strategy could become when they need it most. You cannot identify those skeletons until you take them out of the closet. The right time to do this is during a system re-creation exercise. Anything else is patently dishonest, and can have a devastating effect on your organization when it is already weakened by a serious interruption event. Your choice is to make this deficiency visible, or roll the dice that you NEVER will need to use these backup tapes. In which case, it can be argued, why are you creating them? This can be a very expensive delusion.
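The exposure in example 2 can be put into rough numbers. What follows is a minimal sketch, not a model of any real tape library: it assumes one incremental volume per day since the single full backup, and it assumes that each volume fails to read independently at a fixed rate. The function names and the 0.001 failure rate are hypothetical, chosen only to illustrate how a long chain of incrementals compounds risk.

```python
def volumes_in_restore_chain(days_since_full: int, incrementals_per_day: int = 1) -> int:
    """Count the volumes a restore must read: the one full backup plus every incremental since."""
    if days_since_full < 0 or incrementals_per_day < 0:
        raise ValueError("counts must be non-negative")
    return 1 + days_since_full * incrementals_per_day


def chain_restore_probability(volumes: int, read_failure_rate: float) -> float:
    """Chance that every volume in the chain reads cleanly, assuming independent failures."""
    if volumes < 0:
        raise ValueError("volumes must be non-negative")
    if not 0.0 <= read_failure_rate <= 1.0:
        raise ValueError("read_failure_rate must be between 0 and 1")
    # One unreadable incremental breaks the chain, so every volume must succeed.
    return (1.0 - read_failure_rate) ** volumes


if __name__ == "__main__":
    # Hypothetical per-volume read failure rate; substitute your own measured figure.
    ASSUMED_READ_FAILURE_RATE = 0.001
    for days in (30, 365, 3 * 365):
        volumes = volumes_in_restore_chain(days)
        chance = chain_restore_probability(volumes, ASSUMED_READ_FAILURE_RATE)
        print(f"{days:>5} days since full: {volumes:>5} volumes, "
              f"chance of a clean restore about {chance:.2f}")
```

Under these assumed figures, three years of daily incrementals means 1,096 volumes to recall and read, and only about a one-in-three chance that all of them read cleanly. The actual numbers for your environment will differ, which is exactly why the real tapes need to be exercised.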

  3. Here is yet another hidden black swan. After running significantly over budget and over time, you are finally completing the last step in the roll-out of a very complex ERP (Enterprise Resource Planning) system. It is now time to write the procedures for re-creating that system in the disaster recovery plan for this application. You follow the standard documentation guidelines. Your Business Continuity Coordinator points out that your new system interacts with 43 other applications for input and/or output data, and that you need to document how you plan to re-synchronize the ERP database with all of these applications, which may or may not have failed at the same time, and which may or may not have the same backup strategy as your ERP system. The project plan for the ERP project did not take into account the potential difficulty of re-synchronizing data for all of these applications after a failure of ERP and/or several of its interfacing applications, and you do not think that you will succeed in obtaining additional funding to do this work. You therefore request permission to defer until the next fiscal year the question of how you will re-synchronize data to a common point in time when there is a failure of the ERP system and/or one or more of the 43 systems that provide input to ERP or use output from ERP. Permission is granted.

    Therefore the work is not done when the system recovery documentation is created. A few months later you release responsibility for this system to a maintenance team, without specifying that the issue of data synchronization in recovery has not been addressed. In the meantime, your local Business Continuity Coordinator is replaced with a new person who is unaware that this requirement has been put aside. When a failure does occur that affects 23 of the interfacing applications but not the ERP system (which has itself failed over to a secondary node and a database with fully synchronous replication at another site), the gravity of the situation quickly becomes apparent as the restored systems, many of which have differing data restoration points, begin once again attempting to interact with the ERP system.
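The synchronization problem in example 3 can be made visible with very little effort once each application's data restoration point is written down. The sketch below is illustrative only: the application names, timestamps, and the `synchronization_gaps` function are hypothetical, and it assumes that the earliest restoration point among the interacting systems is the only point in time they all share. Any system ahead of that point holds data that at least one of its partners no longer has.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def synchronization_gaps(
    restore_points: dict[str, datetime],
) -> tuple[datetime, dict[str, timedelta]]:
    """Return the latest point in time common to all systems, and how far each system is ahead of it."""
    if not restore_points:
        raise ValueError("at least one application is required")
    # The earliest restoration point is the only state every system can agree on.
    common_point = min(restore_points.values())
    gaps = {
        app: point - common_point
        for app, point in restore_points.items()
        if point > common_point
    }
    return common_point, gaps


if __name__ == "__main__":
    # Hypothetical data: the ERP system failed over with synchronous replication,
    # so it is current as of the failure; the other systems were restored from backups.
    failure_time = datetime(2024, 1, 10, 14, 30)
    restore_points = {
        "ERP": failure_time,
        "billing": datetime(2024, 1, 10, 2, 0),
        "warehouse": datetime(2024, 1, 9, 23, 0),
    }
    common_point, gaps = synchronization_gaps(restore_points)
    print(f"Common point in time: {common_point}")
    for app, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
        print(f"  {app} is ahead of the common point by {gap}")
```

With this sample data, the common point is the warehouse system's restoration point, and the ERP database is 15 hours 30 minutes ahead of it. A table like this does not solve the re-synchronization problem, but it does show which interfaces must be reconciled before the systems are allowed to interact again.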

Yes, this is extremely challenging, and yet such issues are at the very heart of the business continuity profession. If we cannot step up to our responsibilities to reveal serious exposures when we see them (and it is our professional responsibility to see them), how can we expect those with custodial responsibilities for our organizations to do so?

 

About the Author:

Kathleen Lucey, FBCI, is President of Montague Risk Management, a business continuity consulting firm founded in 1996. She is a member of the BCI Global Membership Council, past member of the Board of the BCI, and the founding President of the BCI USA Chapter. IBM chose her as the first winner of its Business Continuity Practitioner of the Year Award in 1998. She speaks and publishes widely in both North America and Europe. Kathleen may be reached via email at kathleenalucey@gmail.com