Understanding disaster recovery options for SQL Server
One of the challenges IT and database administrators confront when implementing disaster recovery provisions is choosing from among the myriad options available. Existing high availability configurations designed to minimize downtime for critical applications may not be adequate for recovering fully from a widespread disaster. And existing disaster recovery provisions may not be as comprehensive or cost-effective as they could be. This article provides practical guidance to help administrators optimize disaster recovery plans for SQL Server, beginning with the disaster recovery plan.
A good disaster recovery plan builds on the organization’s Business Continuity Plan. BCPs generally cover the entire organization, and in most organizations, IT is one of the most if not the most critical department. There are two aspects of the BCP that are fundamental to DR planning: the business impact analysis and the threat assessment. The former determines which applications are mission-critical and the latter identifies those disasters the organization is most likely to experience.
While planning, keep in mind that the difference between "failures" and "disasters" creates important differences between the high availability (HA) and disaster recovery (DR) provisions needed. Failures are small in scale and short in duration, affecting a server, rack, or the power or cooling in a datacenter. Disasters are large in scale and long in duration, affecting entire datacenters in ways that preclude rapid localized recovery. Think hurricane, tornado, earthquake, fire and other major events.
These differences are determinative because the redundancy (systems, software and data) required to recover from a failure can be local -- on a Local Area Network -- while the redundancy required to recover from a disaster must be "long distance" across a Wide Area Network. For database applications that require high transactional throughput performance, the ability to replicate data synchronously across the LAN is a major advantage. It means the standby instance can be "hot" (in sync with the active instance), ready to take over immediately in the event of a failure.
To avoid adversely impacting on the throughput performance, data replication across the WAN must be asynchronous, with updates to the standby instance always lagging behind updates being made to the active instance. This replication lag makes the standby instance "warm" resulting in an inevitable delay during in the recovery process.
The inherent delay in DR provisions across the WAN is tolerable because disasters are rare and usually affect more than just the IT infrastructure. It also leads to differences in the Recovery Time and Recovery Point Objectives (RTO and RPO) established for HA and DR purposes.
RTO is the maximum tolerable duration of an outage. Mission-critical applications always have very low RTOs, on the order of a few seconds for HA, and high-volume online transaction processing applications generally have the lowest. For DR, RTOs of many minutes or even hours are common owing to the extraordinary cost of implementing provisions capable of fully recovering from a widespread disaster in just a few minutes.
RPO is the maximum period during which data loss can be tolerated. If no data loss is tolerable, then the RPO is zero. Because most data has great value (Otherwise why capture and store it?) low RPOs are common for both HA and DR provisions. For HA, synchronous data replication makes it easy to satisfy a low or zero RPO.
For DR, a low RPO can create the need for a tradeoff with RTO. Here’s why: To satisfy an RPO of zero, manual processes are needed to ensure that all data (e.g. from a transaction log) has been fully replicated on the standby instance before the failover can occur This extra effort has the effect of lengthening the recovery time.
The DR Options
With a recognition that DR is very different from HA, and that recovery times of many minutes or even hours are acceptable when recovering from a disaster, IT and database administrators have tremendous flexibility when choosing different DR provisions for different applications.
In the public cloud, DR offerings vary by service provider. Some have what could be called DIY (Do-It-Yourself) DR guided by templates, cookbooks and other tools. DIY is a viable option because, compared to HA, DR is relatively easy to implement by replicating data to "warm" standby instances in another availability zone or region. Some cloud service providers now have complete managed DR-as-a-Service offerings. DRaaS automatically replicates active instances (software and data) to one or more standby instances, also in another availability zone or region for protection against widespread disasters. Recovery for both approaches is manual.
While DR is different from HA, it is possible (and usually desirable) to add DR to an existing HA configuration. These multi-node, multi-site HA/DR configurations have the advantage of being implemented with a single solution vs. requiring separate provisions for HA and DR -- potentially being different for different applications.
There are two popular options for combined HA/DR solutions for SQL Server. One is SQL Server’s own Always On Availability Groups feature. This a robust solution capable of satisfying demanding RTOs and RPOs. But it protects only the user databases and not the entire SQL Server instance, and it requires the more expensive Enterprise Edition.
The other combined HA/DR option is third-party failover clustering solutions that are purpose-built for virtually all applications running on Windows Server and Linux on premises, in public, private and hybrid clouds. These solutions are implemented entirely in software may include some or all of the following: real-time data replication, continuous monitoring for detecting failures at the system and application level, and configurable policies for failover and failback. Some provide both HA monitoring and data replication, while others provide data replication that integrates with Windows Server Failover Clustering to enable use of SQL Server Failover Cluster Instances (FCIs) spanning datacenters or cloud regions. SQL Server FCIs are supported by both the Standard and Enterprise Editions of SQL Server for Windows.
Whatever your choice(s), keep in mind that the only thing harder than planning for how to recover from a disaster is needing to explain why you didn’t.
David Bermingham is Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 8 years: 6 years as a Cluster MVP and 2 years as a Cloud and Datacenter Management MVP. David holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare and education.