5


Have Faith in Your DR Plan

We’ve reached a point in the maturity of DR technology where you don’t have to make the tradeoffs you were forced to make even just a few years ago. Today, it truly is possible to have your cake and eat it too.

DON'T SETTLE ON YOUR RECOVERY OBJECTIVES

One of the most significant ways that DR technology has matured is in its ability to deliver a reliably low Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Avoid Committing to a “Mostly Achievable” RTO

When it comes to responding to a disaster, one could argue that no variable is more important than recovery time (Figure 6). You can sort out everything else later; but while the organization is unable to function, the only thing that matters is restoring functionality quickly.

Many businesses today don’t even know what recovery time they’re capable of. Those that do have some idea—either from experience or from their careful estimates—may have RTOs that are many hours to many days in length. It’s not uncommon to hear of RTOs measured in weeks. These businesses are forced to sit and wait while backup data is rehydrated and VM conversions are completed. There’s just no reason to settle for this anymore.

When the organization is crippled and infrastructure is down hard for many days, that’s bad for business. There’s DR technology available today that can easily meet a four-hour SLA—and it’s not outlandishly expensive. In fact, it’s possible today to achieve an RTO that’s measured in blinks of an eye—in other words, nearly zero.

An OK RPO Is Not OK

If you’re using legacy backup and DR software, you’re likely measuring RPO in hours on the low end and days if you’re closer to average. When you run weekly full backups and nightly incrementals, your most recent recovery point can be nearly a full day old, which means you’re accepting far more potential data loss than you need to in the 2020s. There’s technology available now that can improve your recovery granularity to increments as small as five minutes.
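To put rough numbers on that, here’s a minimal back-of-the-envelope sketch in Python. The schedules and figures are illustrative assumptions drawn from the scenario above, not measurements from any particular product; it simply compares the worst-case data loss you accept with a nightly-incremental schedule versus five-minute replication increments.

# Worst-case data loss is roughly the gap between recovery points: the
# disaster lands just before the next backup or replication cycle completes.
schedules_minutes = {
    "Weekly fulls + nightly incrementals": 24 * 60,  # one usable recovery point per day
    "5-minute replication increments": 5,
}

for name, gap_minutes in schedules_minutes.items():
    print(f"{name}: worst-case data loss of about {gap_minutes / 60:.1f} hours ({gap_minutes} minutes)")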

Respondents ranked the following DR solution capabilities in order of importance:

1. Recovery time objective (RTO)
2. Ease of use
3. Ease of failback from DR location
4. Ability to keep VMs in their native format
5. Recovery point objective (RPO)

Figure 6: Survey respondents ranked Recovery Time Objective as the most critical factor in choosing a DR solution

DON'T ACCEPT OPERATIONAL INCONSISTENCY DURING CRISIS

Primarily for fiscal reasons, it’s not uncommon for the primary and secondary data centers to be heterogeneous. Different hardware, different software, different architectures. Unfortunately, skimping in this area can seriously backfire.

When a disaster strikes, the last thing you want to see is your team grasping at straws, trying to figure out how to operate the unfamiliar infrastructure at the DR site.

Of course, some organizations can justify spending enough to create a mirror image of the primary data center at the secondary location. This does solve the problem of inconsistency, but it’s wildly inefficient financially.

Fortunately, there’s a third option for DR in the 2020s. By leveraging VMware Cloud on AWS to spin up an on-demand software-defined data center (SDDC) only when you need it, you can build a DR site that presents the same vSphere interface your team uses at the primary site, and pay for that infrastructure only while it’s in use.

For somewhat relaxed SLAs, a VMware SDDC can be provisioned in 90 minutes or less, and VMs can start coming online at the DR site in less than two hours. For those with tighter SLAs, there’s a pilot light configuration where the SDDC is already provisioned and your first critical VMs can start coming online in five minutes.

This third option offers the best of both worlds: It easily fits within modest DR budgets, but also affords administrators the luxury of a homogeneous user experience between the primary and DR sites.
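To make the tradeoff concrete, here’s a minimal sketch in plain Python that checks each approach against the four-hour SLA mentioned earlier. The timings are the ballpark numbers quoted above, treated as assumptions rather than vendor guarantees.

# Approximate time until the first recovered VMs are online for the two
# VMware Cloud on AWS patterns described above.
DR_SITE_MODELS_MINUTES = {
    "On-demand SDDC (provision at failover time)": 120,  # ~90 min to provision, first VMs inside ~2 hours
    "Pilot light (SDDC pre-provisioned)": 5,             # first critical VMs in ~5 minutes
}

RTO_SLA_MINUTES = 4 * 60  # the example four-hour SLA

for model, first_vms in DR_SITE_MODELS_MINUTES.items():
    headroom = RTO_SLA_MINUTES - first_vms
    verdict = "meets" if headroom >= 0 else "misses"
    print(f"{model}: first VMs in ~{first_vms} min; {verdict} a {RTO_SLA_MINUTES}-minute RTO with {headroom} minutes of headroom")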


DON'T ACCEPT RUNBOOK FRAGILITY

Many DR runbooks involve lots of manual steps that humans must perform. When those humans are under immense pressure, those steps can be especially error-prone. Further, most automated DR runbooks in use today are somewhat brittle. If everything doesn’t go exactly according to plan, your failover bombs. And the middle of a major outage is precisely the time you do not want to be working out kinks in the system.

The inoculation is to perform regular failover testing. And while that sounds relatively simple, it has historically been anything but. Not so long ago, it was common practice to schedule full teams of IT staff to run through DR simulations over the weekend (causing an outage to the production environment) to make sure that all systems were recoverable as expected.

Today, however, this sort of large-scale wargame is wholly unnecessary. The technology exists today to simulate a failover event in a sandbox in the middle of the business day with your users none the wiser. Your DR vendor may even include a block of hours to allow you to conduct failover testing without incurring any additional fees. You can validate your failover capability at any time, and confidently tell your boss and your auditors that you can meet your SLAs.
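What that kind of routine, non-disruptive validation can look like is sketched below in Python. This is a hypothetical outline only: start_sandbox_failover, check_vm_responding, and teardown_sandbox are placeholder hooks for whatever your DR platform actually provides, and the VM names and four-hour RTO are illustrative assumptions.

import time

# Hypothetical sandbox failover test: recover copies of critical VMs into an
# isolated network, verify they respond, and confirm the elapsed time fits
# the RTO you've committed to. The callables passed in are placeholders for
# your DR platform's real tooling.
RTO_SLA_SECONDS = 4 * 60 * 60                              # example four-hour RTO
CRITICAL_VMS = ["erp-db-01", "erp-app-01", "file-srv-01"]  # illustrative names

def run_failover_test(start_sandbox_failover, check_vm_responding, teardown_sandbox):
    started = time.monotonic()
    sandbox = start_sandbox_failover(CRITICAL_VMS)  # isolated copies; production untouched
    try:
        unresponsive = [vm for vm in CRITICAL_VMS if not check_vm_responding(sandbox, vm)]
        elapsed = time.monotonic() - started
        if unresponsive:
            raise RuntimeError(f"Recovered VMs not responding: {unresponsive}")
        if elapsed > RTO_SLA_SECONDS:
            raise RuntimeError(f"Recovery took {elapsed:.0f}s, exceeding the {RTO_SLA_SECONDS}s RTO")
        print(f"Failover test passed in {elapsed:.0f}s; keep the output as evidence for auditors.")
    finally:
        teardown_sandbox(sandbox)                   # always tear the sandbox back down

Because the test runs against an isolated sandbox and always tears it down, it can be scheduled during the business day without touching production.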

DON'T OVERCOMPLICATE THINGS

Traditionally, backup was offered as a standalone product. Replication and DR came as a second, independent product. Both of those interfaced with your storage platform(s), which is yet another product altogether. If data security was a concern—and where isn’t it?—then encryption software was often a separate product (which interfered with storage efficiency and WAN optimization, by the way). You get the point.

In a DoorDash world where we can’t even be bothered to get off the couch to get our own dinner, there’s an expectation that IT shouldn’t be such a Rube Goldberg machine. There should be a unified platform for all data functions. As we take on the 2020s, it’s realistic to expect a single platform to manage primary storage, backups, DR, encryption, and mobility.

Quick quiz: What happens in a troubleshooting scenario when you have five different vendors involved? You know what happens—they all point fingers at each other and no one does anything; meanwhile, you’re down hard and your boss just kicked a trash can across the room.

Having a single point of support for primary storage, backup, and DR means that when disaster strikes, you’re going to actually call a vendor who’s got your back and you’re going to get help right away.


CONSIDERATION #5

Modern disaster recovery for this new decade should be simple and reliable, and it should provide operational consistency during a failover event. Aggressive RTOs and RPOs are within reach for most organizations today, and you shouldn’t settle for anything less than what the business needs.