Test, test and test some more: The importance of disaster recovery testing
With Gartner estimating that the average cost of network downtime is $5,600 per minute or $336,000 per hour, few would dispute that regular testing of a robust disaster recovery (DR) plan is essential for organizations. Even if you set aside the financial implications, the lost productivity, missed opportunities, brand damage, potential data loss and SLA pay-outs associated with system downtime should be enough to keep even the most hardened IT professional up at night.
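To make the arithmetic concrete, here is a minimal sketch that scales the per-minute figure quoted above to longer outages. The function name and the sample outage lengths are our own illustrative choices, and the calculation assumes a flat per-minute cost, which real outages rarely follow.

```python
# Gartner's average estimate as quoted in this article (USD per minute).
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost of an outage, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage lengths, chosen only for illustration.
    for label, minutes in [("1 hour", 60), ("4 hours", 240), ("1 day", 1440)]:
        print(f"{label}: ${downtime_cost(minutes):,.0f}")
```

Substituting your own organization's per-minute estimate for the default gives a rough figure to set against the cost of running a DR test.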
So, why are fewer organizations than you may think doing it? In recent research we conducted, which surveyed 150 technical and business decision makers drawn from a wide cross section of UK enterprises, we found that DR testing frequency is remarkably low. In fact, 57 percent test only annually or less often, while 6 percent don't test their DR at all. Moreover, of the organizations testing less frequently, the results of their last test led 44 percent of them to believe that their DR may be inadequate, while 22 percent encountered issues that would have led to sustained downtime.
Why aren’t we testing?
One explanation for this lack of DR testing is fear of what the tests will reveal. There are probably a substantial number of IT professionals who strongly suspect that their DR isn't fit for purpose, but who lack the means to replace it with one that is. It's a time-limited approach to risk management, but probably not as uncommon as it should be.
The level of confidence that enterprises have in their Plan B is determined largely by the extent and frequency with which they test it. Many organizations have SLAs to their own customers for services or contracts to deliver a certain quality of service. A tried and tested Plan B is crucial if they are to be sure that those SLAs and contractual obligations can be met. Given how dynamic -- and complex -- infrastructure typically is in larger organizations, the frequency of DR testing needs to mirror the pace of change.
Given the frequency and likelihood of outages, the infrequency -- or absence -- of testing among respondents to our survey is an eyebrow-raising finding. It is all the more surprising given the continual message that comes through from much of our research on cybersecurity: that the strategy and solution focus is very much on remediation of breaches rather than prevention. If your strategy for undoing the damage from a ransomware attack is to recover copies of locked data, regularly testing the way in which you plan to do so is probably a good idea -- particularly as the more sophisticated strains of ransomware also target backups. Recent history is packed with cautionary tales of those who failed to do so.
Time is also a factor when it comes to testing. Almost every organization is subject to budgetary and resource constraints. Technical skills shortages are endemic, and much of the available resource is expended on day-to-day production and application availability. DR testing simply gets pushed down the never-ending to-do list. Also, some DR approaches, be they secondary data centers or DRaaS, can be quite tricky to test. If you have to bring production systems down to test, or schedule the test out of hours, there is extra cost involved, which compounds the problem of low prioritization.
Testing should be more than just an afterthought
Any reader of this article is likely to be hyper-aware of the dynamic and fragile nature of technical ecosystems -- which makes the finding that more than half of organizations represented in our research are testing annually at best rather worrying.
The results of infrequent testing are predictable. Almost half of those practising infrequent testing fear their DR may be inadequate. Failure to test DR will, at some point, lead to a recovery failure. It really is only a matter of time.
Most of the reasons why DR testing doesn't happen as frequently as it needs to relate back to the importance afforded to DR within the organization itself. Do executives truly understand the definition of a disaster? As with cybersecurity, the most successful DR strategies involve DR being considered at the inception of new applications and services, rather than as an afterthought once they've gone live. Only if DR is integral to an application can solutions be aligned with confidence and a consensus be established that testing has to be as frequent as change. Business input is also vital to ensure a consensus on what is business critical. DR test schedules should reflect the varying levels of importance of different applications, data and services.
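One way to picture a schedule that reflects both business criticality and the pace of change is the minimal sketch below. The tier names, the maximum intervals and the `Service` fields are hypothetical values chosen for illustration, not findings from our research or a recommended policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical tiers and maximum days between DR tests (illustrative only).
MAX_DAYS_BETWEEN_TESTS = {"critical": 30, "important": 90, "standard": 365}


@dataclass
class Service:
    name: str
    tier: str                      # one of the keys in MAX_DAYS_BETWEEN_TESTS
    last_dr_test: date             # date of the most recent DR test
    last_significant_change: date  # date of the most recent significant change


def dr_test_due(service: Service, today: date) -> bool:
    """Return True if the service changed since its last test or its tier interval has lapsed."""
    changed_since_test = service.last_significant_change > service.last_dr_test
    interval = timedelta(days=MAX_DAYS_BETWEEN_TESTS[service.tier])
    overdue = (today - service.last_dr_test) > interval
    return changed_since_test or overdue


if __name__ == "__main__":
    billing = Service("billing", "critical", date(2024, 1, 1), date(2023, 12, 1))
    print(dr_test_due(billing, date(2024, 2, 15)))
```

In this sketch a test falls due either when the tier's interval has lapsed or when the service has changed since it was last tested, so testing tracks change rather than the calendar alone.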
We’re aware that, for many organizations, DR has the potential to seem every bit as complex as the infrastructure it is protecting. Our research has shown that a significant number of businesses are looking for DRaaS solutions to help them manage this complexity -- as well as to reduce the costs inherent in running a second physical data center. DRaaS solutions like our own can help to simplify organizations’ disaster recovery plans, offering increased flexibility, customized runbook functionality, optimized RPOs and near-zero RTOs, so businesses have more control over their disaster recovery plans. As outlined above, most traditional DR environments today are tested only every few months or once a year, if at all. These manual tests can be time consuming, expensive and highly disruptive, and DRaaS can help to address these challenges. With the right solution, IT no longer has to wonder if backups are viable or if data can be recovered. Testing can be conducted easily, as often as needed and with faster results, giving IT teams confidence in their environments.
Photo Credit: Olivier Le Moal/Shutterstock
Justin Augat is VP of Marketing at iland