Failover clustering in the Azure cloud: Understanding the options
A number of options are available for providing high availability protection for applications running in the Azure cloud. Some of these options are cloud-based services. Some are in the operating system or application software. And some are purpose-built by third parties. The numerous permutations and combinations available can make it extraordinarily difficult to choose the best and most cost-effective solution for each application.
In general, failover clusters are the best option for assuring high availability. Historically, failover clusters were relatively easy to configure and test in the enterprise datacenter using shared storage and standard features built into Windows Server. But in Azure and other public clouds, there is no shared storage. This creates a need to find other options for running mission-critical applications in a public or hybrid cloud environment. This article examines the options available for providing high availability (HA) for applications running within the Azure cloud. Special emphasis is given to SQL Server as a particularly popular application for Azure.
Options within the Azure Cloud
The Azure cloud offers redundancy within datacenters, within regions and across multiple regions. Within datacenters, redundancy is provided by Availability Sets that distribute servers across different Fault Domains in different racks to protect against failures at the rack level. Availability Sets afford redundancy for some hardware failures, but provide no redundancy for a datacenter-wide failure, such as the one that occurred in Azure’s South Central US Region in September 2018. Azure offers a 99.95 percent Service Level Agreement for Availability Sets, but the SLA defines downtime as when no server in the Availability Set has external connectivity.
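The effect of spreading servers across Fault Domains can be illustrated with a minimal sketch. It assumes simple round-robin placement, which is only an illustration and not a description of how Azure actually assigns virtual machines; the server names and function names are hypothetical.

```python
from collections import defaultdict


def place_round_robin(vm_names, fault_domain_count):
    """Assign each VM to a fault domain in round-robin order (illustrative only)."""
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % fault_domain_count].append(name)
    return dict(placement)


def survivors_after_rack_failure(placement, failed_domain):
    """Return the VMs still running if one fault domain (rack) goes down."""
    return [
        vm
        for domain, vms in sorted(placement.items())
        if domain != failed_domain
        for vm in vms
    ]


# Hypothetical four-node deployment spread over two fault domains.
placement = place_round_robin(["sql-1", "sql-2", "sql-3", "sql-4"], fault_domain_count=2)
survivors = survivors_after_rack_failure(placement, failed_domain=0)
print(placement)
print(survivors)
```

With two Fault Domains, losing one rack still leaves half of the servers running, which is the property the Availability Set SLA relies on. Nothing in this sketch survives the loss of the whole datacenter.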
For protection from single datacenter-wide failures, Azure is rolling out Availability Zones (AZs). Regions that support AZs have at least three datacenters that are networked with sufficiently high bandwidth and low latency to accommodate synchronous replication. Azure provides a 99.99 percent SLA for configurations using AZs, but again, the SLA only guarantees that at least one server will have external connectivity.
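To put the two SLA figures in perspective, the short sketch below converts an uptime percentage into the downtime it permits. It assumes a 365-day year purely as a convenient scale; the measurement period that actually applies is defined in Azure's SLA terms, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) that an uptime SLA permits over a period."""
    return period_minutes * (1 - sla_percent / 100)


for label, sla in [("Availability Sets", 99.95), ("Availability Zones", 99.99)]:
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.1f} minutes per year")
```

On this scale, 99.95 percent permits roughly 263 minutes of downtime a year and 99.99 percent roughly 53 minutes. Because the SLA counts only the loss of external connectivity for every server, application-level outages can add to those figures without breaching it.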
To protect against major disasters, Azure offers Region Pairs, where every region gets paired with another within the same geography, such as the US, Europe or Asia. The pairs are separated by at least 300 miles and are strategically chosen to enable rapid recovery during widespread network or power outages, or major natural disasters. Pairing also enables Microsoft to perform planned maintenance with minimal downtime.
While Azure certainly offers the infrastructure needed to deliver high service levels, it is incumbent upon the customer to leverage that infrastructure to ensure high availability at the application level.
Options in Windows Server and SQL Server
Windows Server Failover Clustering (WSFC) comes standard in the operating system. For this reason, many applications leverage this proven and powerful feature in high availability configurations in enterprise datacenters. But WSFC requires shared storage, and with no shared storage available in the Azure cloud, additional provisions are needed.
In the Datacenter edition of Windows Server 2016, Microsoft addressed this problem with the introduction of Storage Spaces Direct (S2D), which is software-defined storage that creates a virtual storage area network using locally attached storage. S2D requires that the servers reside within a single datacenter, however, making it incompatible with Availability Zones. This makes S2D a viable choice only for a single-site HA configuration. More robust, multi-site HA/DR protection will require the use of more flexible data replication and HA solutions.
Like many commercial and open source software offerings, SQL Server has two of its own HA/DR features: Failover Cluster Instances (FCIs) and Always On Availability Groups. FCIs (available since SQL Server 7) afford two major advantages: they are available in SQL Server Standard Edition, and they protect the entire SQL Server instance, including system databases. A major disadvantage has been their need for cluster-aware shared storage, but that changed in SQL Server 2016 with its support for S2D.
Always On Availability Groups is SQL Server’s most capable offering for both HA and DR. This option can deliver a recovery time of 5-10 seconds and a recovery point of seconds or less. It also offers readable secondaries for querying the databases (with appropriate licensing), and places no restrictions on the size of the database or the number of secondary instances. Among its disadvantages are the need to license the more expensive Enterprise Edition, which is cost-prohibitive for many applications, and its lack of protection for the entire SQL instance.
A notable disadvantage with all application-specific options is the need for administrators to implement different HA and DR provisions for different applications. The use of multiple HA/DR solutions can substantially increase complexity and costs (for licensing, training, implementation and ongoing operations), making this another reason why organizations increasingly prefer using application-agnostic third-party solutions.
Third-party Failover Clustering Software
With its application- and platform-agnostic design, purpose-built failover clustering software can provide a complete HA/DR solution for virtually all Windows and Linux applications in private, public and hybrid cloud environments.
Being application-agnostic eliminates the need to have different HA/DR provisions for different applications. Being platform-agnostic makes it possible to leverage, while not depending on, various capabilities and services in the Azure cloud.
These complete solutions include, at a minimum, real-time data replication, continuous monitoring capable of detecting failures at the application level, and configurable policies for failover and failback. Most failover clusters are able to satisfy mission-critical recovery time and recovery point objectives, and most also offer a variety of value-added capabilities.
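The interplay between application-level monitoring and configurable failover and failback policies can be sketched as follows. This is a minimal illustration under stated assumptions, not the behavior of any particular product: the class names, the consecutive-failure threshold and the node labels are all hypothetical, and the health check result is supplied by the caller rather than probed from a real application.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    failure_threshold: int = 3   # consecutive failed checks before failing over
    auto_failback: bool = False  # return to the primary once it is healthy again


class ClusterMonitor:
    """Tracks health checks for the primary and decides which node is active."""

    def __init__(self, policy):
        self.policy = policy
        self.active_node = "primary"
        self.consecutive_failures = 0

    def record_check(self, primary_healthy):
        """Process one application-level check result; return the active node."""
        if primary_healthy:
            self.consecutive_failures = 0
            if self.active_node == "secondary" and self.policy.auto_failback:
                self.active_node = "primary"  # failback
        else:
            self.consecutive_failures += 1
            if (self.active_node == "primary"
                    and self.consecutive_failures >= self.policy.failure_threshold):
                self.active_node = "secondary"  # failover
        return self.active_node


monitor = ClusterMonitor(FailoverPolicy(failure_threshold=3, auto_failback=True))
checks = [True, False, False, False, False, True]
history = [monitor.record_check(ok) for ok in checks]
print(history)
```

Requiring several consecutive failures before failing over is one common way to avoid reacting to a single transient error. Whether failback should be automatic or manual is a policy choice that depends on the application.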
Here is a comparison of three HA/DR configurations commonly used in the Azure cloud.
Clustering with Confidence
All of these options, whether implemented individually or in various combinations, can have a role to play in making HA and DR protections more effective and affordable for the full spectrum of enterprise applications, from those able to tolerate some data loss and downtime to those that demand five nines of uptime with minimal or no data loss. Just be sure that the option(s) you choose provide protection at the application level for all foreseeable failure scenarios.
David Bermingham is Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 8 years: 6 years as a Cluster MVP and 2 years as a Cloud and Datacenter Management MVP. David holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare and education.