Q&A with Josh Mazgelis of Neverfail

What role should disaster recovery planning play in organizations today?
•Disaster recovery planning and assurance need to be embedded within every company's ongoing IT operations management disciplines. With increasing expectations from the business around service levels and availability of key systems, a proliferation of external threats, and a growing choice of affordable disaster recovery solutions, no company should operate without an up-to-date, relevant DR plan in place. Understanding what is at risk is the first step in building a DR plan, so one needs to model an inventory of IT assets and understand how they interoperate and what overall service level these collaborating systems need to deliver to meet business imperatives. Of course, any service level targets need to fit within economic constraints, so the plan should take into account the costs of deployed business continuity infrastructure and potentially model the impacts of tightening or relaxing service levels on the cost base for protecting critical systems. Any costs also need to be set within the context of an understanding of the overall impact on the business, in terms of both economic loss and employee productivity, should a service outage occur. Once you have a holistic idea of the overall "business value" of your IT systems and your business-critical data, you can start to assess the kind of business continuity solutions you can afford to put in place. Finally, with the dynamic nature of virtualized server infrastructure, business continuity planning should be performed as part of a continuous set of operational disciplines. A DR plan that is six months behind current infrastructure topology is worse than having no plan at all, since an out-of-date plan wrongly sets an expectation with the business that all can be recovered according to the plan.
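
The cost trade-off described above can be made concrete with a rough model. The following Python sketch is only an illustration under stated assumptions: the option names, the cost and recovery figures, and the helpers `expected_annual_cost` and `cheapest_option` are all hypothetical, and the model simply adds the yearly cost of protection to the expected yearly loss from downtime.

```python
from dataclasses import dataclass


@dataclass
class ProtectionOption:
    name: str
    annual_cost: float     # yearly cost of the DR infrastructure
    recovery_hours: float  # expected time to restore the service


def expected_annual_cost(option, outages_per_year, loss_per_hour):
    """Protection cost plus expected downtime loss over one year."""
    downtime_loss = outages_per_year * option.recovery_hours * loss_per_hour
    return option.annual_cost + downtime_loss


def cheapest_option(options, outages_per_year, loss_per_hour):
    """Pick the option with the lowest expected total yearly cost."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, outages_per_year, loss_per_hour),
    )


# Hypothetical figures for a single business service (illustrative only).
OPTIONS = [
    ProtectionOption("nightly backup", annual_cost=5_000, recovery_hours=24),
    ProtectionOption("replication", annual_cost=20_000, recovery_hours=4),
    ProtectionOption("automated failover", annual_cost=60_000, recovery_hours=0.25),
]

if __name__ == "__main__":
    # Assumed: one outage every two years, at two different hourly loss rates.
    for loss_per_hour in (1_000, 10_000):
        best = cheapest_option(OPTIONS, outages_per_year=0.5, loss_per_hour=loss_per_hour)
        print(loss_per_hour, best.name)
```

With these made-up numbers, the cheapest choice moves from backup to replication as the hourly cost of an outage rises, which is the kind of service-by-service comparison the plan should support.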

What mistakes are organizations currently making when it comes to planning?
•In many organizations, no matter how mature their standard IT processes are, business continuity planning is still executed on a fairly informal basis. Recent natural disasters, together with embarrassing datacenter outages at some well-known cloud service providers, have heightened awareness of the need to revisit DR planning on a frequent basis, and many IT organizations are just beginning to formalize and standardize their IT processes in a BC/DR context. One mistake that organizations frequently make is failing to understand the interdependencies of system components, and how the recovery time of any one component can affect the complete disaster recovery plan. For example, a company may roll out a new business service delivered by multiple collaborating, potentially tiered, applications across a mix of physical and virtual machines. Application components deployed on virtual infrastructure may take advantage of the DR capabilities supplied with that infrastructure. However, these virtualized components may also depend on a legacy database server that still resides on physical hardware protected by a completely different set of DR infrastructure. While the recovery time objective (RTO) associated with the virtual machines may be shorter than that of the legacy servers, the overall RTO for the end-to-end business service will be defined by the slowest component to recover from a disaster. Expectations may be incorrectly set if the recovery performance of the entire service is equated with the RTO of the new virtual servers, without taking into account their dependence on a legacy database server protected by a different set of recovery infrastructure. In the event of a real disaster, IT operations management runs the risk of learning about critical dependencies the hard way, and the cost of complex, multi-component systems being down can become a tremendous burden on the organization.
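
The "slowest component" point above can be sketched in a few lines of Python. This is a minimal illustration rather than a planning tool: the component names, the RTO figures, and the `effective_rto` helper are hypothetical, and the sketch assumes all components begin recovering at the same time, so a service is back only when its slowest direct or indirect dependency is.

```python
# Hypothetical component RTOs in minutes (illustrative numbers only).
COMPONENT_RTO = {
    "web-vm": 15,
    "app-vm": 15,
    "legacy-db": 240,
}

# Hypothetical dependency map: component -> components it needs running.
DEPENDS_ON = {
    "web-vm": ["app-vm"],
    "app-vm": ["legacy-db"],
    "legacy-db": [],
}


def effective_rto(component, rto=COMPONENT_RTO, deps=DEPENDS_ON, _seen=None):
    """Return the worst RTO among a component and everything it depends on.

    Assumes every component starts recovering at the same moment.
    """
    seen = set() if _seen is None else _seen
    if component in seen:
        # Already counted on this walk (shared or circular dependency).
        return 0
    seen.add(component)
    worst = rto[component]
    for dep in deps.get(component, []):
        worst = max(worst, effective_rto(dep, rto, deps, seen))
    return worst


if __name__ == "__main__":
    for name in COMPONENT_RTO:
        print(name, "own RTO:", COMPONENT_RTO[name], "effective RTO:", effective_rto(name))
```

In this made-up example the virtual machines each have a 15-minute RTO on paper, but their effective RTO is the 240 minutes of the legacy database they depend on.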

Is managing disaster recovery for physical servers in a virtualized environment critical to an organization's overall DR plan?
•The realistic and most practical answer is "absolutely", but it can vary based on the organization and the role of those physical servers. As companies implement virtualization, it's the low-hanging fruit that gets converted first. As the number of remaining physical servers drops, the likelihood that those servers host some critical component of a larger business function increases. At the same time, while administrators are investing in new recovery tools for their virtualized platforms, protection of legacy physical systems is often overlooked and their recovery performance is left behind. This leads to a common recovery scenario where virtual servers are restarted quickly but admins then work to restore the physical servers using legacy, often manual, processes. Since those physical servers are often key application components, the end-to-end recovery time for the overall business service is limited by the recovery time of the legacy physical servers, which restricts the organization's capability for fast recovery. By enabling companies to manage both physical and virtual servers together, the recovery tasks can be automated throughout the datacenter with more predictable, higher recovery performance.

What does an example failover sequence look like for a hybrid environment?
•The idealized environment is one that accommodates both single server/application failover and the "big red button" site-wide failover. While a company may take different approaches to protecting different platforms or different tiers of the infrastructure, ultimately multiple DR infrastructures need to work in unison for site-wide failover to work in a predictable manner. In addition, the failover should occur from tier 1 down regardless of the platform or technology used, assuring that the most critical services are restored before moving on to less critical ones. In a hybrid physical/virtual datacenter, this may mean that different platforms need to fail over together in order to maintain that tiered sequence. One example might be a line-of-business CRM application that lives on virtual machines but depends on a SQL Server on a physical machine. In this instance, the physical machine hosting SQL Server needs to fail over before the virtual machines can be restarted. This pattern would continue for each of the defined recovery tiers.
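
The tiered sequence described above can be sketched as a dependency-aware ordering. The Python below is a minimal illustration, not a description of any product's behavior: the server names, tiers, and the `failover_order` helper are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) so that a server never fails over before the servers it depends on, and among servers that are ready at the same moment it picks the most critical tier first.

```python
import heapq
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: server -> (recovery tier, platform, dependencies).
SERVERS = {
    "sql-server": (1, "physical", []),
    "crm-app-vm": (1, "virtual", ["sql-server"]),
    "crm-web-vm": (1, "virtual", ["crm-app-vm"]),
    "reporting-vm": (2, "virtual", ["sql-server"]),
    "file-server": (3, "physical", []),
}


def failover_order(servers):
    """Return server names in failover order.

    A server is only eligible once all of its dependencies have failed
    over; among eligible servers, the lowest tier number goes first.
    """
    sorter = TopologicalSorter({name: deps for name, (_, _, deps) in servers.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    eligible = []     # min-heap of (tier, name)
    order = []
    while sorter.is_active():
        for name in sorter.get_ready():
            heapq.heappush(eligible, (servers[name][0], name))
        _, name = heapq.heappop(eligible)
        order.append(name)
        sorter.done(name)  # may make dependent servers eligible
    return order


if __name__ == "__main__":
    for step, name in enumerate(failover_order(SERVERS), start=1):
        tier, platform, _ = SERVERS[name]
        print(f"{step}. {name} (tier {tier}, {platform})")
```

In this example the physical SQL Server fails over first, the CRM virtual machines follow, and the tier 2 and tier 3 servers come last. If a critical server depended on a lower-tier one, the dependency would still be recovered first.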

With so many organizations deploying virtualization, what is the real need to protect physical servers?
•While there are very few companies out there that aren't yet using virtualization, there are probably far fewer that have moved to a 100% virtualized model. Even for those who are committed to virtualization, the process of getting to that stage takes time. Research conducted by VMware shows that its customers have virtualized 36% of their x86 servers and are planning to virtualize 76% of newly acquired servers, leaving a significant percentage of workloads on physical servers. While it's easy to relegate protection of those machines to legacy backup technologies, as business needs and capabilities advance, so should the level of protection across all of a company's servers.

What does the future look like for DR planning in virtualized environments?
•In today's virtualized datacenters, there's a bewildering choice of approaches, infrastructure and tools available to protect a business service delivered by a set of collaborating virtual servers. Each approach has different strengths, weaknesses, costs, levels of protection and resource requirements. While a fully virtualized environment will make it easier to simply "dial up" the level of protection needed, this comes with an associated infrastructure cost. In order to manage the trade-off between recovery performance and cost, DR planners will need to carefully model the interdependencies between business services and the underlying virtual servers that deliver these services, taking into account required service levels and costs on a service-by-service basis. Higher densities of virtual workloads, coupled with the pace of change enabled by virtual server provisioning and migration, put great pressure on keeping DR plans current. In the near future, IT professionals will turn to a new breed of DR planning and automation tools to meet these challenges.



Josh Mazgelis has been working in the storage and disaster recovery industries for close to two decades and brings a wide array of knowledge and insight to any technology conversation. He is currently working as a senior product marketing manager for Neverfail Group. Prior to joining Neverfail, Josh worked as a product manager and senior support engineer at Computer Associates. Before working at CA, he was a senior systems engineer at technology companies such as XOsoft, Netflix, and Quantum Corporation. Josh graduated from Plymouth State University with a bachelor’s degree in applied computer science and enjoys working with virtualization and disaster recovery.