Hunting the Black Swans in Your Continuity Program
This is the third in the DRG ongoing series regarding hunting and mastery of the black swans in your continuity program. Look for it on the first Wednesday of each month.
"Black Swans" in your Continuity Program are those events that remain outside the range of your normal expectations, and may well produce a significant negative impact when they occur. For reasons of budget, culture, or simple lack of awareness, we just do not see or deal with these potentially devastating exposures in our enterprise continuity capability. This series discusses some of the most common of these "black swans" in business continuity programs, those that are really staring us in the face and screaming for attention.
Quarry 1: Employee Availability for Response Activities.
Quarry 2: The Level of Individual Employee Commitment to BCM
Quarry 3: Exercising Your Plans
When you think about it, we have the emphasis dead wrong. Most of the articles you read and the vast majority of the commercially available BCM planning products are designed to BUILD plans, not to exercise them. But even if you spend 3 years completing BC Plans for all areas of your business, you will spend the rest of your tenure in that position in two areas: exercising the plans and updating them to reflect changed conditions. It turns out that this "forever after" phase is in fact far more difficult than the development period, no matter how complex or large that development phase may be. And maybe this is the reason why there are so few products that address it directly and so little literature on what this phase entails... and how to do it.
Because this is such a complex subject, I am going to address it in 5 parts. Today I will talk about exercise types. Over the next few months, I will talk about techniques for setting annual exercise objectives as well as setting individual exercise objectives, and then go on to tackle the specifics of exercising business unit plans, technology plans, and the most difficult of all, logistics, communications, and support plans. It never ceases to amaze me how little theory and how few tools are available to help continuity practitioners to do all of this.
There are three basic types of exercises: Notification, Tabletop, and Displacement. Each can range from short and simple to much longer and far more complex. You will sometimes find different classifications in professional standards and guidance documents, but the difference is really just a matter of terminology. Note that I use the word exercise in the sense of a "rehearsal", where the intent is to find what works, and equally to find what does NOT work. Much harm has been done to our profession by the concept of "passing a test" because this defeats the fundamental purpose of the exercise process, which is to uncover flaws. We are not so wonderful or so gifted that we get it all right the first time around. No show hits Broadway before having first worked out the bugs in regional theaters (except maybe "Spider-Man", but that is a long and complex story).
The types of exercises are based on the fundamental concept of displacement of people and operations.
Notification Exercises
In a notification exercise, the simplest type, people associated with a particular plan or business unit are contacted. This may involve a call tree where a Team Leader calls everyone or where each person contacted then has the responsibility to call the next person on the list. Automated Notification Systems can execute whatever call sequence you have defined, as many (or as few) times as you wish, and will keep a written audit trail of all communication attempts via a wide array of channels. The most sophisticated of these allow for alternate pathing based on contact success or failure and can be integrated into complex logic chains within BC Plans.
When only manual call trees were used, these exercises were often flawed:
- The contact information was usually verified and updated if necessary just prior to the exercise in order to be sure to "pass the test". Often those to be called were also briefed on the timing of the exercise in order to make sure that they would be available.
- Manual audit logs needed to be kept and callers needed to be assigned to make the calls, making the conduct of the exercise challenging, especially if it was conducted during off-hours.
- When each person on the call tree was to call the next person on the tree, the communication often broke down when an answering machine answered the call or there was no response. Often the action to be taken by the caller in that case was not explicitly specified, resulting in confusion and broken call sequences.
- Notification exercises could be effectively faked by creating false audit logs, particularly for those exercises involving a small number of people in a single unit.
All of these issues can be avoided by using one of the many automated notification systems. Such systems are available with a wide range of features.
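To make the point about explicitly specified fallback actions and audit trails concrete, here is a minimal sketch in Python. It is not the interface of any real notification product; the names Contact, AuditEntry, run_notification_exercise, and the attempt callback are all hypothetical, and the retry rule (try each channel in order, up to max_rounds passes) is simply one assumption about what a fallback policy might look like. The dispatcher works through the list centrally, so one unanswered call does not break the sequence, and every attempt is logged whether it succeeds or not.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple


@dataclass
class Contact:
    name: str
    channels: List[str]  # ordered by preference, e.g. ["mobile", "email"]


@dataclass
class AuditEntry:
    timestamp: str
    contact: str
    channel: str
    reached: bool


def run_notification_exercise(
    contacts: List[Contact],
    attempt: Callable[[Contact, str], bool],
    max_rounds: int = 2,
) -> Tuple[List[AuditEntry], List[str]]:
    """Try each contact's channels in order, for up to max_rounds passes.

    `attempt` is a hypothetical callback that returns True when the
    person confirms receipt. Returns the audit log and the names of
    everyone who could not be reached.
    """
    audit: List[AuditEntry] = []
    unreached: List[str] = []
    for contact in contacts:
        reached = False
        for _ in range(max_rounds):
            for channel in contact.channels:
                reached = attempt(contact, channel)
                # Log every attempt, successful or not.
                audit.append(AuditEntry(
                    datetime.now(timezone.utc).isoformat(),
                    contact.name, channel, reached))
                if reached:
                    break
            if reached:
                break
        if not reached:
            # Explicit fallback: record the failure and move on,
            # rather than letting the sequence stall.
            unreached.append(contact.name)
    return audit, unreached
```

In a real exercise the attempt callback would be replaced by whatever your notification system provides; the list of unreached names is the finding you take away from the exercise, which is the whole point of running it.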
If a notification exercise involves a single plan for a single business unit, technology system, or logistics team, it is called a Unit Exercise. If it involves more than one of any of these, it is called an Integrated Exercise. There are many levels of Integrated Exercises, ranging, for example, from a group of related business units, to all units within a single building or campus, to a Division with units in more than one location.
No Notification Exercise, whatever its complexity, requires the relocation of any of the participants.
Tabletop Exercises
In a tabletop exercise, the participants gather in one or more locations (commonly conference rooms or other large spaces with communications). A tabletop exercise is based on a specific artificial and yet realistic scenario, a bit like a movie script, where the exercise leader confronts the participants with a situation that requires that they execute their plans. Generally 4-10 different business units are involved, with critical recovery support units such as Incident Management, Business Continuity Coordination, Employee Support, etc. also participating. Often external participants such as First Responders from specialized units such as EMT or Hazardous Materials (HAZMAT) also participate. Internal participants then consult their recovery plans and follow the scenario rules for communications.
The "Director" of the tabletop sets the rules for interaction and communication with other areas of the organization or other locations where additional organization staff may be gathered. The Director also keeps participants informed as the scenario unfolds, using video clips and other material such as newscasts or bulletins.
Tabletop exercises generally are geared to complete within 2-4 hours, but may run for multiple days with complex multi-site scenarios. These multi-day events may be referred to as "war games", from which they originate.
Tabletop exercises are notable for the high level of benefits that they provide at a very reasonable cost. In recent years, they have gained in popularity. However, as you may imagine, writing the scenario for a complex tabletop is a little like writing an original screenplay. Running the exercise is like being the director of the movie where the actors are making it up as they go along. You can make your job easier by involving some of your internal people in the writing of the exercise, and by getting professional help for really complex events.
The biggest flaw of tabletops is that they are simply not done, because practitioners lack the requisite skills and do not know how to get the training they need. Much good work can be done by using a team approach, however. In the end, the effectiveness of an individual tabletop exercise is directly related to the ability of the scenario writer and exercise manager to make participants forget that this is just an exercise. When that happens, participants engage their brains as they would during a real event, and so the exercise becomes a rehearsal of emotions as well as logistics and communication. You have then created an effective rehearsal vehicle that allows participants to develop the group dynamics they would use in a real event.
Technology Recovery Exercises
This is the grand-daddy of continuity exercising: these were called Disaster Recovery (DR) tests when they became commonplace in the 1980s, and are often still called DR Tests.
A technology recovery exercise most commonly involves re-creation of a subset or all of the applications from a centralized data center at a third-party or internal alternate site. Extreme care must be taken when using any internal site to make certain that there is NO interaction with the Production Environment and therefore no chance to take down production.
Recovery of a single complex system (programs and data) would constitute a UNIT Exercise; recovery of multiple systems constitutes an INTEGRATED Exercise. This is what most people think of as a recovery test.
Many variables may be injected into this type of exercise, such as forced unavailability of the "A" team staff, last-minute notification of a need for exercise deployment, and combinations of related applications in order to test data synchronization strategies. Such variables enhance the similarity of the exercise to what would occur during a real interruption, and therefore are extremely valuable. Even in a technology recovery exercise, it is important to include the logistics and communications phases because these are commonly where much of the delay arises. And if you never exercise these phases, you will never be able to lessen these delays.
A technology recovery exercise may or may not involve relocation of staff. When there is a local alternate site, staff will generally relocate. If the alternate site is remote, at least some staff may work remotely, either from their office locations or their homes. As you do not know what will be the source of the interruption, it is advisable (and is generally done) to relocate staff to the alternate site.
The costs associated with the contracting of an alternate site can be considerable. If you vary your technology recovery scenarios, timing, and participating staff, your benefit will be much higher than if you do not. Using specific interruption scenarios, rather than always targeting a "smoke and rubble" event where the entire data center is lost, is beneficial. Partial interruptions occur much more often than a total loss and offer different recovery challenges.
Business Continuity Exercises
This type of exercise generally involves multiple business units, and is often conducted in cooperation with a technology recovery exercise. A common scenario involves the relocation of business unit staff to an alternate site to access re-created technology. However, companies are increasingly requiring their business staff members to regularly work from an external alternate site or from their residences to perform their ordinary business functions. And so this would generally involve a displacement of the employee from the normal work location – although it is not clear if working from home is really a displacement!
Many exercises that involve both technology recovery and displacement of business unit staff to one or more alternate sites use business unit staff to validate recovery of the systems to a pre-defined prior date. However, as more and more systems involve advanced failover and replication strategies, it will certainly become more common to use business unit staff to validate that these systems have been recovered to a level that is well within both the RTO and RPO for the test system. In this case there may be no required displacement of business staff; they may simply connect via the internet from their usual access devices to the re-created system. Such techniques are evolving rapidly now, but still require great care. Remember that the first rule of business continuity exercising is NOT TO AFFECT PRODUCTION PROCESSING.
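As an illustration of what validating a recovery against both the RTO (recovery time objective) and RPO (recovery point objective) amounts to, here is a small sketch in Python. The function and field names are hypothetical, and so are the timestamps and targets in the example; it assumes only that you recorded when the interruption was declared, when the system was usable again, and the timestamp of the most recent data present in the re-created system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    recovery_time: timedelta  # how long the system was unavailable
    data_loss: timedelta      # how far back the recovered data goes
    met_rto: bool
    met_rpo: bool


def evaluate_recovery(
    interruption_at: datetime,
    restored_at: datetime,
    last_recovered_data_at: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> RecoveryResult:
    """Compare what the exercise achieved against the RTO and RPO targets."""
    recovery_time = restored_at - interruption_at
    data_loss = interruption_at - last_recovered_data_at
    return RecoveryResult(
        recovery_time=recovery_time,
        data_loss=data_loss,
        met_rto=recovery_time <= rto,
        met_rpo=data_loss <= rpo,
    )


# Illustrative values only: a 4-hour RTO and a 10-minute RPO.
result = evaluate_recovery(
    interruption_at=datetime(2024, 1, 10, 9, 0),
    restored_at=datetime(2024, 1, 10, 12, 30),
    last_recovered_data_at=datetime(2024, 1, 10, 8, 45),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=10),
)
print(result.met_rto, result.met_rpo)  # True False
```

In this made-up case the system came back in 3.5 hours, inside the RTO, but the recovered data was 15 minutes old, outside the RPO. A result like that is a finding to act on, not a failed test.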
In August we will take a look at how to design effective annual exercise programs using all of these exercise types.
About the Author:
Kathleen Lucey, FBCI, is President of Montague Risk Management, a business continuity consulting firm founded in 1996. She is a member of the Board of Directors of the BCI, and the founding President of the BCI USA Chapter. IBM chose her as the first winner of its Business Continuity Practitioner of the Year Award in 1998. She speaks and publishes widely in both North America and Europe. Kathleen may be reached via email at firstname.lastname@example.org.