Hunting the Black Swans in Your Continuity Program

This is the sixth installment in the DRG ongoing series on hunting and mastering the black swans in your continuity program. Look for it on the first Wednesday of each month.

“Black Swans” in your Continuity Program are those events that remain outside the range of your normal expectations, and may well produce a significant negative impact when they occur. For reasons of budget, culture, or simple lack of awareness, we just do not see or deal with these potentially devastating exposures in our enterprise continuity capability. This series discusses some of the most common of these “black swans” in business continuity programs, those that are really staring us in the face and screaming for attention.

Already published:
Quarry 1: Employee Availability for Response Activities
Quarry 2: The Level of Individual Employee Commitment to BCM
Quarry 3: Exercising Your Plans
Quarry 4: Exercising Your Plans: Objectives and Annual Programs
Quarry 5: Exercising Your Plans: Business Unit Continuity Plans

Quarry 6: Exercising Your Plans: Technology Recovery

Today is the time of the autumn monsoon it seems, after the heat of summer in the City. And our subject today is back to where it all began: Disaster Recovery, or DR. What started out as a relatively simple mainframe recovery with locally connected terminals has now become extremely complicated. We have an absolute dependence on external networks. Our databases are huge, and our backup techniques are varied. Our hardware configurations are extremely complex. Technology reaches ever farther into our business functions, touching even such areas as critical care for hospital patients. And an ever-increasing number of our operations are outsourced – whether to "the cloud" or to a more traditional provider model. Our challenge is therefore multiplied: managing recovery of our own technology operations, as well as ensuring that our outsource partners are capable of meeting our availability requirements. This is a far more complex challenge than we had back in the early days.

In addition, our technology staffs are now busier than ever, as they struggle to keep up with an ever-increasing demand for applications, as well as a need to streamline computer operations to minimize power usage and become more "green". Have our exercise techniques kept pace with the increase in the number of applications and the divergence of suppliers? Have we increased our annual exercise time allocation as well as added staff to ensure that such activities are well-planned and effective, and meet the defined exercise objectives for the year? Are we making demonstrable progress toward greater maturity and verisimilitude within our exercise programs year-on-year? For our external technology partners, are we successfully working with them to ensure that they have the ongoing capability to meet our requirements?

The result is that technology recovery plan exercising has become the Wild West once again, and at the same time has increased geometrically in importance because of the extreme dependence of almost all business functions on an ever-increasing number of technology applications. It is impossible to cover all of the aspects of managing recovery of your technology operations in this column, but let's at least discuss some of the most important areas you need to address:

  • Electrical power providers/onsite supplemental diesel generators
  • Network carriers
  • Outsourced individual applications
  • "Cutting-edge" applications
  • The "Cloud"
  • Data Center in a third-party co-location site
  • Full outsourcing of IT staff, services, equipment, and site(s)

Yes, of course there are other challenges, some of which will become the subject of individual future columns.

Let's talk first about electricity and network services providers. These are the basic building blocks of technology (and incidentally of our modern society).

Many organizations have installed or are considering the installation of supplemental diesel generators. Make sure that you have a regular maintenance contract with a trusted service provider for your generator, as well as a realistic test program: one that tests the generator at full normal load. If you do not want to test during prime time, you can bring in a load bank to simulate your normal usage. And also make sure that you have two or even three fuel providers under contract to deliver additional fuel to you on a priority basis if there is a lengthy regional power outage. Remember that loss of power is still the proximate cause of the majority of technology outages.

Use of multiple telecommunications carriers is a common technique to ensure continued network services. You may also want two diverse points of entry to your site (demarcation points, or demarcs), providing two separate connections between your facility and the location where you connect to each carrier (the local loop). However, in many locations your local loop is managed by only one carrier. AND, as we learned to our horror in the aftermath of the fall of the World Trade Center towers, often cables from multiple carriers are routed through a single physical conduit! Find out what routing information you can from your carriers. Even with carrier diversity, provision of telecommunications services to your sites is likely to remain a single point of failure.

If you have multiple sites connected by an internal network, multiple-pathing architectures can achieve some resilience. If a link breaks, there will always be an alternate path that can be used. If you want to know whether the capacity of this backup path is sufficient for normal operations, you must test it. It certainly can be frightening to find out during an exercise that this backup capacity is radically insufficient to support a normal load; the answer is to be prepared to switch back to the primary seamlessly if the impact of using the backup is too great. But surely it is better to learn this from an exercise than to find out about this insufficiency when you really need that alternate path. After all, we exercise to identify just this kind of inadequacy!
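As a rough illustration of the capacity question, the short Python sketch below compares the normal load on each link with the capacity of its backup path. The link names, the traffic figures, and the 80% utilization threshold are all hypothetical assumptions; real numbers would have to come from your own monitoring and from the exercise itself.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str                    # hypothetical link label
    normal_load_mbps: float      # peak load observed on the primary path
    backup_capacity_mbps: float  # rated capacity of the alternate path


def backup_shortfalls(links, max_utilization=0.8):
    """Return (name, utilization) for links whose backup path would run
    above max_utilization when carrying the normal load."""
    shortfalls = []
    for link in links:
        utilization = link.normal_load_mbps / link.backup_capacity_mbps
        if utilization > max_utilization:
            shortfalls.append((link.name, round(utilization, 2)))
    return shortfalls


# Illustrative figures only; none of these come from a real network.
links = [
    Link("HQ-to-DataCenter", 400.0, 1000.0),
    Link("HQ-to-Branch", 90.0, 100.0),
    Link("DataCenter-to-DR", 600.0, 200.0),
]

for name, utilization in backup_shortfalls(links):
    print(f"{name}: backup path would run at {utilization:.0%} of capacity")
```

A paper calculation like this only tells you where to look. It does not replace actually moving the load onto the backup path during an exercise.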

Cutting-edge applications or new infrastructure designs may carry unknown or untested risks. (Note: any application whose recovery is untested does carry unknown risks.) The time for you as the responsible person for technology recovery to assess this risk is BEFORE you install a new application or BEFORE you commit to a radical new infrastructure design. And of course you should exercise the recovery of this application BEFORE it goes into full production. I know this is not always feasible: not all system life cycles even include testing the recovery of the application prior to its move to production. But this is your job, nonetheless, and you should not sign off on any new application for which production status is requested until that application or architecture has had at least an initial recovery exercise to identify any serious omissions or errors, and until you have had the opportunity to review the correctness and completeness of its recovery plan documentation.

Outsourced applications, whether run by the provider, "in the cloud", or at a co-location site that you contract, need to undergo basic recovery testing prior to being implemented into production. These systems need to have at least a preliminary recovery exercise to ensure both that the recovery documentation is sufficiently detailed and that the external staff recovering your application are logistically and technically capable of meeting the agreed-upon recovery requirements, for both data recovery (the recovery point objective, or RPO) and full system recovery (the recovery time objective, or RTO). The right time to do this is immediately upon implementation of the applications or as soon as possible thereafter.
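To make the RPO and RTO check concrete, here is a minimal Python sketch of how the timings recorded during a recovery exercise might be compared with the agreed targets. The application name, the timestamps, and the one-hour and four-hour targets are hypothetical; your own figures belong in the contract and in the exercise log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ExerciseResult:
    application: str
    outage_start: datetime      # moment the simulated interruption began
    last_good_backup: datetime  # timestamp of the data actually restored
    service_restored: datetime  # moment the application was usable again
    rpo: timedelta              # agreed maximum data loss
    rto: timedelta              # agreed maximum time to restore service


def evaluate(result):
    """Compare achieved data loss and recovery time with the agreed targets."""
    data_loss = result.outage_start - result.last_good_backup
    recovery_time = result.service_restored - result.outage_start
    return {
        "application": result.application,
        "rpo_met": data_loss <= result.rpo,
        "rto_met": recovery_time <= result.rto,
        "data_loss": data_loss,
        "recovery_time": recovery_time,
    }


# Hypothetical exercise log entry, for illustration only.
result = ExerciseResult(
    application="order-entry (hypothetical)",
    outage_start=datetime(2024, 1, 10, 9, 0),
    last_good_backup=datetime(2024, 1, 10, 8, 30),
    service_restored=datetime(2024, 1, 10, 14, 0),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
report = evaluate(result)
print(report)
```

In this made-up case the data loss is within the target but the recovery time is not, which is exactly the kind of finding you want from an exercise rather than from a real outage.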

If you outsource your entire technology environment, you should have special concerns for recovery exercises on an ongoing basis. The amount of time and resource to be allocated to recovery exercises should increase as the number of applications increases. As with the systems running in an environment you control fully, you should aim to increase the maturity and therefore reliability of the exercise program for your outsourced environment year-on-year. This needs therefore to be written into the outsource contract.

All of this is to say that technology recovery exercising is usually jam-packed with black swans. A swarm of black swans. A particularly strong and pernicious black swan is a recovery from a less-than-total outage. If you run exercises that are based on specific interruptions rather than the "worst-case scenario" of a total outage, you will discover a raft of unanticipated and difficult issues. The outdated assumption that we should be doing technology exercises only from an implicit "total outage" scenario prevents us from seeing the cascading risks that occur in a specifically defined outage scenario. For example, the specific nature of data resynchronization issues among multiple interfacing systems cannot be understood unless you use a partial outage scenario designed to expose these issues. Another issue is developing recovery team bench depth in the deployment of teams to address the most critical systems first, and then less critical systems, with specific scenarios that include different sets of applications. Many practitioners talk about this, but how many have conducted the kind of large-scale exercising that will rehearse and debug planned recovery activities based on developed "muscle memory" of existing interfaces among applications?
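One way to start scoping a partial outage scenario is to trace which applications sit downstream of the ones you plan to fail. The Python sketch below does this with a simple breadth-first walk over an interface map. The application names and the map itself are hypothetical, and a real inventory of interfaces will be far larger and messier.

```python
from collections import deque

# Hypothetical interface map: each application lists the applications
# that consume its data and so may need resynchronization if it fails.
FEEDS = {
    "billing": ["ledger", "reporting"],
    "ledger": ["reporting"],
    "crm": ["billing"],
    "reporting": [],
    "payroll": [],
}


def affected_by(failed, feeds):
    """Return the failed applications plus every downstream consumer,
    in breadth-first order."""
    seen = list(failed)
    queue = deque(failed)
    while queue:
        app = queue.popleft()
        for consumer in feeds.get(app, []):
            if consumer not in seen:
                seen.append(consumer)
                queue.append(consumer)
    return seen


# A scenario in which only the (hypothetical) CRM system is lost.
print(affected_by(["crm"], FEEDS))
```

A list like this shows which systems to include in the scenario. It says nothing about how hard the resynchronization will be; only the exercise can tell you that.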

There are many technical, logistical, and staffing issues to be addressed. You can exercise to develop expertise and even some "muscle memory" about how to best deploy resources to achieve maximum results in multiple scenarios (as well as getting a much better idea of what your current capability can actually deliver). But all of this requires a significant amount of personnel and financial resources, generally more than has traditionally been allocated to this function.

These are not simple issues and they are not easily addressed. This is a very large subject that deserves to be treated in its own very large volume. But at least we should attempt to maintain a realistic view of what a given technology recovery capability can deliver. The problem is not solved by pretending that a recovery capability is more effective than we know it to be.

And please remember one last point: meeting regulatory exercise requirements may not be nearly enough to deliver what your organization needs when the next unexpected major interruption occurs.

About the Author:
Kathleen Lucey, FBCI, is President of Montague Risk Management, a business continuity consulting firm founded in 1996. She is a member of the Board of Directors of the BCI, and the founding President of the BCI USA Chapter. IBM chose her as the first winner of its Business Continuity Practitioner of the Year Award in 1998. She speaks and publishes widely in both North America and Europe. Kathleen may be reached via email at kathleenalucey@gmail.com.