Meet The Experts

 

Operations Resilience: Bridging the Gap

Their data was replicated. They thought they were prepared. But when the unexpected struck, the damage was devastating.

It’s a story repeated over and over among major brands and organizations of all kinds. Experiencing extensive downtime during a vital sales period, the online retailer American Eagle Outfitters lost millions of dollars in revenue and suffered millions more in damage to their brand. JPMorgan Chase lost their online banking capabilities for several days, prompting their CEO to issue an apology. Bank of Singapore saw their entire mainframe environment go down for eight hours, even though they were running some of the most sophisticated technology available. When at least 24 state agencies in Virginia lost full IT support, problems for some agencies continued for weeks. A major European bank was fined $2.6 billion for losing tapes. And within the first half of 2011, United Airlines, US Airways, Alaska Airlines and Southwest Airlines all experienced significant data disruptions. Outages in recent years at Cisco Systems, Google, and PayPal further underscore the reality that nobody is immune.

The truth is that we have grown increasingly reliant on evolving technologies such as replication for IT disaster recovery. At the same time, we have diminished our emphasis on knowledge of the big picture as a defensive strategy. It should come as no surprise that today, gaps in these defenses are as plentiful as the variety of hardware and software solutions on which we rely. And our failure to understand the nature and significance of these gaps can have devastating consequences.

Widespread risk revealed

There’s evidence that many of those directly involved fail to recognize that mission-critical data isn’t getting the protection it deserves. A new Enterprise Disaster Recovery Survey by z/Journal¹, an independent publication targeted at users of IBM mainframes, tells a revealing story. More than 72% of all respondents believed their IT departments were sufficiently prepared for a major business disruption or disaster to ensure that systems, applications and data would be accessible within the timeframe they expect. Yet more than a third of respondents had already experienced a major business disruption or declared a disaster, with over a quarter of those going down for more than eight hours.

Unaware and unprepared

A February 2011 commissioned study² conducted by Forrester Consulting on behalf of 21st Century Software paints an even more disturbing picture. Forrester found that only 22% of the organizations surveyed had neither declared a disaster nor experienced a major business disruption. In addition, 43% of organizations had experienced some data loss during their disruption, and 46% had suffered more than eight hours of downtime. With downtime costing an average of $145,000 per hour, an eight-hour disruption costs nearly $1.2 million. How can this be happening? A full 62% of survey respondents admitted to missing a recovery time objective (RTO), recovery point objective (RPO) or service level agreement (SLA). Obviously they were unaware of the true extent of the risks when they set those objectives. Forrester attributes this gap between expectation and reality to three causes: failure to validate application recoveries; insufficiently frequent testing of DR plans; and RTOs that don’t account for the time needed to assess a situation, declare a disaster and run time-consuming processes such as restarts.
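
The arithmetic behind the $1.2 million figure above is straightforward. The short Python sketch below is illustrative only; it uses the survey-average hourly cost cited above, and the function name and structure are invented for this example rather than taken from any product or study.

    # Illustrative sketch: estimating the business cost of an outage,
    # using the average hourly downtime cost reported in the Forrester study.
    AVG_COST_PER_HOUR = 145_000  # USD per hour, survey average cited above

    def outage_cost(hours_down: float, cost_per_hour: float = AVG_COST_PER_HOUR) -> float:
        """Rough business cost of a disruption of the given length."""
        return hours_down * cost_per_hour

    if __name__ == "__main__":
        # An eight-hour disruption at the survey-average rate:
        print(f"${outage_cost(8):,.0f}")  # prints $1,160,000 -- 'nearly $1.2 million'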

A costly disconnect

This lack of preparedness isn’t so surprising in light of the disconnect reported in a May 12, 2011 article in InformationWeek⁴. The article cites a study by the Ponemon Institute, sponsored by Emerson Network Power, of outages at data centers based in the United States. In this investigation, only a third of the IT staff members responding felt their resources were adequate to recover from a failure. Yet when asked whether senior management fully supported efforts to prevent and manage unplanned outages, 75% of responding senior managers agreed, versus just 31% of employees at supervisor level and below. Of the 41 data centers studied, 95% had experienced one or more unplanned outages during the previous two years. In this case, the average estimated cost of downtime ranged as high as $11,000 per minute for telecom service providers and e-commerce companies, making losses of $1 million or more per outage a common occurrence.

A range of imperfect strategies

The Forrester study revealed the variety of approaches organizations use for mainframe disaster recovery (DR), which protects against failure of an entire site, and high availability (HA), which protects against failures within a single site. The 34% who reported relying on a high availability approach, such as consistency groups, to guard against individual component failures remain exposed to site failures, corruption and accidental deletions. So do the 20% who employ synchronous replication within a metro area. The 32% using local backup to disk for operational recovery lack the speed for true disaster recovery, as do the 22% using even slower tape backup. Periodic point-in-time copies give 28% of respondents faster recovery than backup can provide, but the copies need to be sent off-site for site-failure protection, and may be unrecoverable if they don’t include a properly synchronized backup of cross-application data. Asynchronous replication, used by 28% of respondents, trades the advantage of storing data at extended distances for some data loss upon site failure. And the 24% of those surveyed who use both asynchronous and synchronous replication still lack protection against corruption and accidental deletion. Within this group, a false sense of security exists because they largely ignore the way their applications use data. For example, legacy batch applications may use input datasets defined on tape. Even when those datasets no longer physically reside on tape, they are not restored synchronously with disk data. This seemingly small oversight could take an organization hours to work around and cost millions due to unmet business SLAs.

Why have so few organizations adopted a combination of approaches sufficient to eliminate these vulnerabilities? Perhaps it’s because the way they approach their DR challenge is flawed from the start.

A purpose beats a plan

Today, most organizations test their IT infrastructure from the perspective of a plan: “If this happens, my plan says I should react like so.” Few organizations instead consider the ultimate purpose of their efforts: “My aim is to have data in my customers’ hands within this period of time—let’s test to discover what possibly could prevent that from happening.” The result is that the maximum downtime stipulated in Service Level Agreements frequently bears no relation to the outage experienced when reality strikes. Did Bank of Singapore plan to be down for eight hours? Did JPMorgan Chase plan to have their online banking down for parts of three days? Did American Eagle Outfitters plan for their eight days of chaos? Without knowledge of your data’s usage, even the most detailed plan can’t anticipate the full consequences of a disruption, and therefore serves no real purpose.

The goal is operations resilience

A realistic assessment of cost versus risk drives an organization’s adoption of disaster recovery as a strategic business initiative. Not all business processes are critical, and the cost of ensuring uptime all the time for those supporting processes would be prohibitive. But mission-critical processes are a very different matter. Even 99.99% availability equals roughly 52.6 minutes of downtime a year—an eternity when processes involving such vital matters as e-commerce, public safety and plant operations are at stake.
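
Converting an availability target into the downtime it permits is simple arithmetic. The short Python sketch below is illustrative only and shows the calculation behind the figure above.

    # Illustrative sketch: converting an availability target into the
    # maximum downtime it allows over a year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    def annual_downtime_minutes(availability_pct: float) -> float:
        """Minutes of downtime per year permitted by the given availability percentage."""
        return (1 - availability_pct / 100) * MINUTES_PER_YEAR

    if __name__ == "__main__":
        for target in (99.9, 99.99, 99.999):
            print(f"{target}% -> {annual_downtime_minutes(target):.1f} minutes/year")
        # 99.99% -> 52.6 minutes/year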

Gartner, the world’s leading information technology research and advisory company, has found that typically about 20% to 30% of an organization’s IT portfolio falls into the critical category³. For these essential processes, the key to achieving extreme availability is to recognize that every system will fail, and know how to fix things in the blink of an eye. A service outage that users never notice is, for all practical purposes, no service outage at all. This is what operations resilience allows you to achieve—using an agile, automated application-recovery approach that truly incorporates the big picture.

Where have all the experts gone?

As IT infrastructures have gained complexity over the years, the C-level executives with ultimate legal responsibility find themselves in a far more precarious situation than many of them realize. These CXOs place their trust in their IT Infrastructure, Disaster Recovery and other IT Managers, Applications Managers, and outside managers. However, the very nature of a business’s silos, and a lack of commitment or knowledge within the business, can limit the resources available to the overall process. Worse yet, siloed IT groups can put solutions in place with little or no consideration of the effect on other groups within IT. When it comes down to storage versus applications versus DR, who talks to whom? Meanwhile, vital legacy knowledge in many organizations has evaporated through consolidations, sales, retirements, layoffs and ongoing outsourcing and insourcing. No matter how well the current IT staff may think they understand the latest and greatest technology they’ve acquired, they no longer possess the insights of those who helped build the information edifice on which those new technologies rest. What vulnerabilities lurk in this fusion of new technology with legacy applications and the unknown nuances they may hold? In many cases, it’s anyone’s guess.

Different priorities

The quest for assurance by the CXO typically follows a predictable path: a call from the business leads to attempts to fulfill the request and meet the organization’s DR requirements. This might involve assigning tasks to the DR Manager/Director, to the Infrastructure/Storage group, or to some other similar area. However, the call drives whoever is tasked to seek individual assurances from their data storage and applications managers, and to rely on applications processes or technologies that they assume will work and meet all their needs. Because these conversations rarely cross the silos, the gaps begin. This is driven by the reality that storage and applications differ significantly in the ways they view their IT challenges.

Storage views things from a bit-and-byte perspective. The storage team’s goal is to maximize the bits it can store for every buck it invests. Data sets or files are grouped into storage pools based on criteria such as size and usage pattern. The location of these files may move around daily, based on routines that were automated years ago to maximize storage utilization. So if a volume is inadvertently deleted, it can be hard to tell which applications are affected.

The applications manager, on the other hand, is preoccupied with maximizing uptime. It’s critically important to them what a given piece of data actually means to an application. Unfortunately, storage and applications rarely talk to each other. The result is that storage does a lot of things in its management of application data of which the applications manager may be totally unaware. And vice versa.

The gap that becomes a chasm

Outsourcing to large outside vendors does nothing to shrink this gap. While the involvement of highly reputable names and their leading-edge technology may help the CXO sleep better, those experts in storage and applications aren’t responsible for each other’s performance. Individually, your storage manager, your applications manager and their supporting vendors might offer an A+ level of DR capability. But together, the response to a disruption might merit a C- or worse. When trouble arrives, the outside vendors on whom your managers rely may not bear responsibility for dropping the ball. Their business will not be losing face in the marketplace, and though you can take action, ultimately it is your business, your name, your reputation and your profits at stake. In the end, your management finds itself watching that dropped ball bounce into a gap that has suddenly become a chasm.

When passing is failing

Don’t these vulnerabilities always become evident when people test their IT infrastructures? Clearly not. There’s a reason so many well-known organizations with competent IT departments have been surprised by costly data disruptions: the tests they run are inherently designed, with ample advance preparation, to deliver the results they desire. The z/Journal survey found that more than two-thirds of respondents invested anywhere from 51 to more than 200 extra man-hours preparing for their tests. This takes us back to our earlier discussion of why having a purpose beats having a plan. The typical test looks at anticipated failure scenarios for which a response has been thoroughly planned in advance. It doesn’t approach the challenge with the overarching purpose of ensuring operational resilience that meets service level agreement criteria for data availability under all possible eventualities. When you test what you plan for, it’s not hard to get a passing grade. Meanwhile, in the dark recesses of the gap between the management of storage and applications, the vulnerabilities that can shut you down remain unknown and unaddressed.

Business Process Model

Conventional backup-and-restore solutions can’t begin to adequately address the extensive, systemic, and largely invisible threats to business continuity. What’s needed is a Business Process Model that incorporates the needs of all process participants, and guides IT based on an informed assessment of risk.

In some organizations, executives who bear responsibility for business outcomes cannot see the details of the IT process as a whole. They’re said to be “on the hook and in the dark.” Sometimes specific individual processes may be well understood, while others are not. Such situations allow cross-functional responsibility gaps that present significant but unquantified risk. And while many organizations implicitly acknowledge the risk by claiming to make continuously available infrastructure a priority, only between 1% and 5% are actually using it today.

As organizations recognize their vulnerabilities, we are seeing a trend in which DR, concerned with data operations and applications, is becoming more integrated with Business Continuity Management, which directs the steps that need to be taken from a much less technical perspective. Simultaneously, modernization efforts are being seen in several key areas of IT.

Integrated recovery management

Gartner identifies backup and recovery modernization, continuously available infrastructure and data center modernization as the foundations of the process leading to reduced IT risk. Active/Active architectures, which employ multiple independent processing nodes in different locations along with automated failover and restart mechanisms for seamless, rapid recovery, play a vital role in this process. Along with consistency points and disk-based solutions, automated application recovery strategies are essential.

But what happens when your modernized architecture must operate with mission-critical application data in VSAM, non-VSAM or tape formats that were never written for an Active/Active configuration? An approach known as integrated recovery management can provide definitive answers.

An integrated recovery management solution pulls storage and applications together to reveal the true data recovery readiness of all your mission-critical applications. It tells the storage manager how applications are using data, and the applications manager how storage is storing it.
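
As a purely hypothetical illustration of that idea (the names, types and data in this Python sketch are invented for the example and do not describe any actual product API), such a tool cross-references each application’s data dependencies against what the storage layer actually replicates, and flags anything that would not come back consistently with the rest of the environment.

    # Hypothetical sketch of the integrated recovery management idea:
    # compare what each application needs against what storage replicates.
    from dataclasses import dataclass

    @dataclass
    class Dataset:
        name: str
        medium: str        # e.g. "disk", "tape", "virtual-tape"
        replicated: bool   # covered by the site's replication strategy?

    def recovery_gaps(app_dependencies: dict[str, list[Dataset]]) -> dict[str, list[str]]:
        """Return, per application, the datasets that would not be recovered
        consistently with the rest of the environment."""
        gaps: dict[str, list[str]] = {}
        for app, datasets in app_dependencies.items():
            exposed = [d.name for d in datasets if not d.replicated]
            if exposed:
                gaps[app] = exposed
        return gaps

    if __name__ == "__main__":
        # Example: a legacy batch job whose tape input sits outside the replication scope.
        deps = {
            "NIGHTLY_BILLING": [
                Dataset("PROD.BILLING.MASTER", "disk", replicated=True),
                Dataset("PROD.BILLING.INPUT", "tape", replicated=False),
            ],
        }
        print(recovery_gaps(deps))  # {'NIGHTLY_BILLING': ['PROD.BILLING.INPUT']}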

Invaluable insights

Using an integrated recovery management approach, you can test with a purpose instead of a plan. The ability to run a simulated test on multiple applications lets you determine whether your Service Level Agreements actually will be met, not simply whether a given preplanned response effectively deals with an already foreseen problem. You can gain insight into application dependencies you may never have envisioned. And you can run simulated application recoveries that ensure the procedures your application owners have in place to recover their data will work. As a key component of DR modernization, the infrastructure provided by integrated recovery management lets you ask critical questions that may never otherwise be pursued. Questions that can help ensure the quest for operations resilience fulfills its ultimate purpose of keeping data in the hands of those who need it—no matter what.
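
To make “testing with a purpose” concrete, here is a hypothetical Python sketch (the application names, simulated times and RTOs are invented for illustration) that compares simulated recovery times against each application’s recovery time objective.

    # Hypothetical sketch: comparing simulated recovery times against RTOs
    # to see which service level commitments would actually be met.
    def sla_report(simulated_minutes: dict[str, float], rto_minutes: dict[str, float]) -> None:
        """Print PASS/FAIL/NOT TESTED for each application, based on a simulated recovery run."""
        for app, rto in rto_minutes.items():
            actual = simulated_minutes.get(app)
            if actual is None:
                print(f"{app}: NOT TESTED")
            elif actual <= rto:
                print(f"{app}: PASS ({actual:.0f} min against an RTO of {rto:.0f} min)")
            else:
                print(f"{app}: FAIL ({actual:.0f} min against an RTO of {rto:.0f} min)")

    if __name__ == "__main__":
        sla_report(
            simulated_minutes={"ONLINE_BANKING": 95, "CARD_AUTH": 40},
            rto_minutes={"ONLINE_BANKING": 60, "CARD_AUTH": 60, "STATEMENTS": 240},
        )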

 

1 2011 z/Journal Enterprise Disaster Recovery Survey

2 Enterprises Are Not Properly Protecting Mission-Critical Data, a commissioned study conducted by Forrester Consulting on behalf of 21st Century Software, February 2011

3 Uptime All the Time, June 2011

4 “Data Center Outages Generate Big Losses,” by Chandler Harris, InformationWeek, May 12, 2011


About the Author

Rebecca M. Levesque is Executive Vice President and Chief Operating Officer at 21st Century Software, Inc. During the last 20 years, she has helped hundreds of companies develop business continuity and disaster recovery strategies. Today, her regular interactions with clients, industry leaders, industry analysts and vendors make her a widely recognized expert on business recovery matters. As the executive responsible for shaping 21st Century Software’s corporate vision, Rebecca frequently shares her expertise on enterprise operational integrity and recoverability in presentations on a multitude of DR issues to organizations around the world. Her insights have also been published and quoted in Disaster Recovery Journal, Contingency Planning & Management and IBM Systems Magazine, among other industry publications. She may be reached at (800) 555-6845 or via e-mail: rebeccal@21stcenturysoftware.com