Disaster recovery vs. business continuity
IT professionals thinking about disaster recovery configurations for critical SQL Server deployments in Windows environments naturally think in terms of remote sites and recoverability. If the primary datacenter goes offline in a disaster, the organization needs to be able to fail over to a separate datacenter somewhere unaffected by the same disaster.
But disaster recovery and business continuity -- your ability to rapidly resume critical business functions during emergency events -- are not the same. Planning for business continuity is a much more holistic endeavor, and while disaster recovery is an important part of that plan, it is just that: part of the plan. Before you can effectively configure for disaster recovery, your organization’s key stakeholders need to agree on which elements of your IT infrastructure are truly mission-critical. Once that’s been agreed upon -- and that’s not always easy -- you’re in a position to implement a disaster recovery plan that truly reflects the business continuity goals of the organization.
Business continuity is all about that: continuity. From an IT perspective, you’re focused on ensuring minimal disruption in access to your mission-critical systems. But which are your most important systems? And who makes that determination? You need to know what, from a business perspective, you need to prioritize -- because knowing which systems need to come online immediately and which can wait will be critical to the overall success of the business continuity plan.
One way to frame the question involves the cost of downtime. What is the cost to your business -- both monetarily and in terms of brand reputation -- if different systems go offline? The answer for components of your ERP system will differ from the answer for components of your DevOps or office productivity systems. The latter may be important for individual productivity, but having your SAP system offline for a day may be far more disruptive to the business than having your DevOps systems offline for a day. Or maybe not! That’s the point: Every organization needs to determine its own priorities and communicate those priorities in the business continuity plan.
Then dive even deeper: What’s the cost of having those critical systems offline for five minutes? For an hour? For 10 hours? How much data can you afford to lose? In other words, what are your recovery time and recovery point objectives (RTO and RPO)? Those insights will help you have a meaningful discussion with the occupants of your C-suite about the choices that need to be made and the application availability the organization can expect if a disaster does strike. Getting agreement on these expectations from the C-suite -- read: the people holding the purse strings -- will also give you the insight you need to configure a disaster recovery solution that delivers the agreed-upon business continuity goals.
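These questions lend themselves to simple arithmetic. The sketch below uses purely hypothetical cost figures -- your own stakeholders would have to supply real ones -- to show how a per-hour downtime cost translates into outage costs at different durations:

```python
# Illustrative downtime-cost arithmetic. The dollar figures are
# hypothetical placeholders, not benchmarks for any real organization.

def downtime_cost(cost_per_hour: float, hours_down: float) -> float:
    """Direct financial impact of an outage of a given length."""
    return cost_per_hour * hours_down

# Hypothetical: an ERP outage at $50,000/hour vs. a DevOps outage at $2,000/hour.
for system, per_hour in [("ERP", 50_000), ("DevOps", 2_000)]:
    for hours in (5 / 60, 1, 10):  # five minutes, one hour, ten hours
        print(f"{system}: {hours:5.2f} h offline -> ${downtime_cost(per_hour, hours):,.0f}")
```

Even a toy table like this makes the prioritization conversation concrete: the same 10-hour outage costs 25 times more in one column than the other.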
Configuring for continuity
Cloud service providers such as Microsoft Azure, AWS, and Google Cloud offer infrastructure options with service level agreements (SLAs) promising 99.99 percent availability. You can configure a Windows failover cluster instance (FCI) so that infrastructure in a remote data center can take over quickly if a disaster brings down the data center in which your production systems are running. However, those SLAs apply only to the virtual machines (VMs) running on the provider’s infrastructure, not to the applications running on those VMs. If your mission-critical applications fail over from the disaster-stricken data center to VMs in another data center, those VMs can’t deliver the business continuity you’re expecting unless you have ensured that they can access all the data the applications had been using in the original data center.
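It helps to translate an SLA percentage into concrete downtime. The short calculation below shows how much downtime per year a given availability figure actually permits:

```python
# Translating an availability SLA into allowable downtime per year.
# 99.99 percent ("four nines") is the infrastructure SLA figure cited above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes per (non-leap) year an SLA permits a service to be down."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

print(allowed_downtime_minutes(99.99))  # roughly 52.6 minutes per year
print(allowed_downtime_minutes(99.9))   # roughly 8.8 hours per year
```

Note again that those 52-odd minutes cover only the VMs themselves; an application that comes up without its data can be "available" by the SLA's definition and still useless to the business.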
Keep in mind that you can’t configure a cloud-based FCI with a shared storage area network (SAN) the way you can in an FCI running on premises. Each node in a cloud-based FCI needs its own copy of the data used by your critical applications. As part of your business continuity plan, then, you need to configure a mechanism to replicate local data from the cluster nodes in the primary data center to the standby nodes in your secondary data center.
There are several ways to do this when designing a solution intended to restore your mission-critical applications as quickly as possible. Some applications offer built-in services to replicate data among the nodes in a cluster. One question you’ll want to consider, though, is whether those application-specific synchronization mechanisms cover all the data you want replicated. The Always On Availability Groups (AG) feature of SQL Server Enterprise Edition, for example, replicates only user-defined SQL Server databases. System databases -- the ones that hold SQL Agent jobs, users, and passwords, for example -- are not replicated to the secondary cluster nodes. Nor, for that matter, does AG replicate any non-SQL Server data that might be important to other mission-critical applications whose availability you want to ensure.
If you’re not using Microsoft SQL Server -- or if you don’t want to incur the expense of the Enterprise Edition of SQL Server when the Standard Edition will do -- you can orchestrate data replication using a SANLess clustering solution. This approach provides block-level data replication services that are application agnostic, so configuring your DR solution with SANLess clustering ensures the replication of data associated with all the applications you are trying to protect (including any edition of SQL Server from 2008 forward). Indeed, you may want to rely on a SANLess cluster even if you are using SQL Server Enterprise Edition, because an application-agnostic mechanism replicates data more comprehensively than the one built into SQL Server Enterprise Edition.
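To make "block-level and application agnostic" concrete, here is a toy sketch: it compares fixed-size blocks of a primary volume against a standby copy and transfers only the blocks that differ. Real SANLess clustering products do this continuously at the storage-driver level; this is a conceptual illustration only, not how any vendor’s product is implemented:

```python
# Toy block-level replication: hash each fixed-size block of the source
# and copy to the replica only the blocks whose contents have changed.
# Because it works on raw bytes, it doesn't care whether the blocks hold
# SQL Server databases, flat files, or anything else.

import hashlib

BLOCK_SIZE = 4096

def replicate_changed_blocks(source: bytes, replica: bytearray) -> int:
    """Copy differing blocks from source into replica; return count copied."""
    copied = 0
    for offset in range(0, len(source), BLOCK_SIZE):
        src_block = source[offset:offset + BLOCK_SIZE]
        dst_block = bytes(replica[offset:offset + len(src_block)])
        if hashlib.sha256(src_block).digest() != hashlib.sha256(dst_block).digest():
            replica[offset:offset + len(src_block)] = src_block
            copied += 1
    return copied
```

After an initial full pass, subsequent passes move only changed blocks -- which is why block-level replication can keep a standby node current over a WAN link without the application’s involvement.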
With a cloud-based infrastructure configured to provide geographically distributed DR support for your truly critical applications, and a data replication solution designed to ensure that those applications have access to the data they need to be up and running within moments, you’re well on your way to a disaster recovery solution designed with business continuity in mind.
Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 11 years: 6 years as a Cluster MVP and 5 years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.