Architectural options for ensuring high availability and disaster recovery in SQL Server
Architecturally speaking, what’s the difference between configuring SQL Server for high availability (HA) and configuring it for disaster recovery (DR)? The terms HA and DR are so often paired that one might imagine them to be the same -- when in fact they are distinct approaches to mitigating different risks.
Let’s start by distinguishing the risks. Let’s say your mission critical SQL Server infrastructure resides in a datacenter in Orlando, Florida. It doesn’t matter whether it’s your own data center, a co-location facility, or a cloud data center. A category 5 hurricane causes a catastrophic failure of the data center and your critical SQL Server infrastructure goes offline.
That’s the scenario for which a DR solution is built. To protect your mission critical solution, you’d configure a second instance of SQL Server in a data center at a significant geographic distance -- far enough from Orlando that the hurricane (or earthquake or Texas-sized power grid failure) is unlikely to affect it. The DR solution would enable you to route your SQL Server workloads from the Orlando data center to the distant data center and continue operating even though the Orlando data center was offline.
But let’s say the problem isn’t as disastrous as a hurricane. Let’s say that a network segment inside the Orlando data center fails, or that a server goes offline due to a hardware failure, a software bug, or even human error. Any of these incidents could cause you to lose access to your vital SQL Server infrastructure -- and your business will not regain access until that event is corrected.
That’s a scenario for which an HA solution is built. To ensure that your SQL Server workloads can be accessed no less than 99.99 percent of the time -- the definition of HA -- you need to build a second instance of SQL Server in a separate but nearby location. Should your active instance of SQL Server suddenly go offline, the HA solution will immediately fail over to this standby instance of SQL Server, enabling your organization to carry on with minimal interruption.
Configuring infrastructure for HA and DR
While both these HA and DR scenarios rely on a secondary instance of SQL Server, the manner of configuring each solution differs.
An HA configuration enables rapid, automated failover from one instance of SQL Server to the other. It also provides services to continuously replicate data in storage attached to the primary instance to storage attached to the secondary instance (or instances). That way, if the secondary instance of SQL Server is called into service suddenly, it can go to work with an identical copy of the database that had been in use before the primary instance went dark.
There are two ways to configure this kind of highly available infrastructure. SQL Server itself includes an Always On Availability Group feature that builds on Windows Failover Clustering Services (WFCS) to create a multi-node Availability Group (AG). Each node in an AG runs its own instance of SQL Server, and each node can be situated in a separate but nearby location. A high speed network connects the AG nodes, and synchronous data replication services keep the key SQL Server databases synchronized among the nodes.
Alternatively, you can use a third-party SANless Clustering tool to enable HA. SANless clustering software integrates with WFCS to create a multi-node SQL Server failover cluster instance (FCI). Block-level data replication services in the SANless Clustering software ensure the synchronous replication of data among the FCI nodes. Should the primary node running SQL Server go dark for any reason, WFCS fails over to the secondary node, where the clustered instance of SQL Server can immediately begin interacting with a mirror image of the database that had been running on the primary node.
While both the AG and SANless Clustering approaches automate failover between nodes and both offer high-speed, synchronous data replication services, there are differences between these approaches that may be costly for some organizations.
Only the user-defined SQL databases are replicated between nodes in an AG; system databases (such as those managing passwords and jobs), as well as any non-SQL Server files in storage, are not. Replicating those databases and files is a separate management challenge. Moreover, if you want to replicate more than one SQL Server database from the primary to secondary nodes in a single AG, you must run SQL Server Enterprise Edition on each instance. If you have multiple SQL databases currently running on SQL Server Standard Edition, you’ll find that an expensive upgrade.
In contrast, the SANless Clustering approach to HA simply replicates blocks of data from primary to secondary storage. It is content-agnostic and will replicate user-defined and systemdatabases alike, as well as any other file in storage. Moreover, because it is just replicating ones and zeros, the SANless Clustering approach does not count how many SQL Server databases you are replicating. That means that if you have multiple SQL databases on your primary instance of SQL Server and still use SQL Server Standard Edition, you can use SANless Clustering to create just as effective an HA solution without having to install the more expensive SQL Server Enterprise Edition on each instance.
As for DR? The key differences between a DR architecture and the HA architectures described above are these: The physical distance between the primary and secondary infrastructures in a DR solution generally require the use of asynchronous rather than synchronous data replication. The data in secondary storage may be out of sync by several seconds or minutes, and for this reason a DR solution is usually configured for manual rather than automatic failover. These factors can result in a longer operational interruption than you would encounter in an HA configuration, but a shorter interruption than you would encounter without a DR solution or if the secondary instance had to be rebuilt from log files or backups.
HA and DR combined
As you might imagine, a combined HA/DR solution would deliver the highest levels of availability. A multi-node HA solution in a single geographic region would ensure 99.99 percent availability in the event of a localized failure of the primary SQL Server instance. Adding a third node to the AG or SANless Cluster -- located in a distant region -- would ensure that you could fail your infrastructure over to the distant data center if that category 5 hurricane threatened to flood your regional data centers and compromise your HA infrastructure.
But, again, such a combined configuration is likely to be constrained by cost. Supporting three nodes in a combined HA/DR configuration using an AG is going to require SQL Server Enterprise Edition in all three infrastructures (and keep in mind that SQL Server Enterprise Edition is licensed per CPU, not per node). If you use the SANless Clustering approach, you could build the entire solution using SQL Server Standard Edition and save a considerable amount of money.
Image credit: dizain / Shutterstock
Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 11 years: 6 years as a Cluster MVP and 5 years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.