SQL Server: Recovering operations after a disaster
Datacenters disappear. Not often, but it happens. Hurricane Sandy in 2012 swallowed whole datacenters in Lower Manhattan. Hurricanes Irma and Maria wiped out virtually all the datacenters in Puerto Rico in 2017. Cloud datacenters are not immune, either. The Azure Central Region outage of 2018 occurred after a lightning strike queued up a sequence of automated responses that failed in a spectacularly Rube Goldberg-like manner and proceeded to take multiple Azure availability zones offline for the better part of a day.
Sometimes it’s possible to anticipate the arrival of a disaster; other times, the disaster arrives without warning. Either way, the production infrastructure your organization depends on is going to be offline for some time. That does not, however, mean that your operations must go offline. It means that you need to have a disaster recovery (DR) plan in place that can provide an alternative means of keeping your operations running when your primary infrastructure is out of commission.
Fundamentally, the ideal DR plan for SQL Server involves the creation of infrastructure sufficient to support your SQL Server operations from a geographically-distant location. The cloud is great for this, as all the major cloud service providers have availability zones built out in multiple regions around the globe. If your production systems are running on premises or in a cloud availability zone in the Southwest, for example, you might build out a DR infrastructure in an cloud availability zone in the Northeast (even, possibly, in an availability zone in a different vendor’s cloud), where the likelihood of any disaster affecting your infrastructure in the Southwest is infinitesimally small. In the event of a disaster affecting your Southwestern infrastructure, you would switch over your SQL Server operations to the DR infrastructure in the Northeast, enabling you to run your critical operations with minimal interruption.
But achieving this ideal raises questions. How can the infrastructure in the Northeast access the production data that had been active in the Southwest? A traditional data back-up solution won’t suffice if your goal is to keep what are called your recovery time objectives (RTO) and recovery point objectives (RPO) to a minimum. RTO is a measure indicating the maximum allowable time to get your solution back up and running. RPO is a measure indicating the maximum acceptable amount of data loss. If your last backup is four hours out of date when the Southwestern data center goes offline, you would end up restarting your database in the Northeast without the last four hours of transactions if you started up your DR infrastructure using the last backup. You might be able to reduce that four hour loss by rebuilding the database from the last four hours of log files (if you shipped them to the DR infrastructure), but that rebuilding effort would consume additional time and further delay the return of your SQL Server infrastructure to service.
A better approach would be to use a solution that replicates your SQL Server data from the production systems in the Southwest to the DR infrastructure in the Northeast in something approximating real time. Several solutions can accomplish this.
You could, for example, configure the Southwest and Northeast infrastructures as an Availability Group using the Availability Groups (AG) feature built into SQL Server. Alternatively, you could use a third-party solution to join the infrastructures in what is known as a SANless cluster. Both approaches cause data written to the production database (in, say, the Southwest) to be replicated to the DR infrastructure in the Northeast. Because of the physical distances involved, both approaches would rely on asynchronous replication technologies, which can introduce a lag in the writes to the DR infrastructure. However, the disparity between the production and DR databases would be minimal.
If the Southwest data center suddenly went dark, you might start up your instance of SQL Server in the Northeast missing a few seconds worth of data, but that’s likely to be well within your established RPO. Indeed, if you had ample warning of an impending disaster -- as one might in the face of a devastating hurricane -- you can plan accordingly. If you were to pause any new transactions on your production system for a few seconds -- long enough for the last transactions to be written to the DR infrastructure -- you could then start up your DR infrastructure and make that your production system with no loss of transactional data. And because all the infrastructure and data is already ready to go in the DR infrastructure, you simply need to bring it online to continue operations. Your recovery time would be a matter of seconds, not minutes or hours.
The key differences between using SQL Server’s AG solution or a third-party SANless clustering solution for DR boils down to price and functionality.
If you have more than one SQL Server database you want to replicate using the AG feature, you’ll need to deploy SQL Server Enterprise Edition in both your production and DR infrastructures. If you don’t otherwise need the features of SQL Server Enterprise Edition, that becomes a very expensive upgrade for the replication functionality you seek. The AG replication service also replicates only user-named SQL databases, so other files and databases in storage (including the SQL Server system databases for managing passwords, jobs, and other functions) will not be replicated to your DR infrastructure. Those will be missing unless you find another way to replicate them.
Third-party SANless Clustering tools are application agnostic, so which version of SQL Server you’re using, how many SQL Server databases you have, or what other files may be in storage on your production systems -- none of these details matter. The SANless Clustering tools simply replicate blocks of data at lightning fast speeds, so the data in storage on your DR infrastructure always reflects the data that was stored on your production systems. If disaster strikes, you’re ready to go.
Dave Bermingham is Director of Customer Success at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past 12 years: 6 years as a Cluster MVP and 6 years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.