Facebook outage 2021: A simple mistake with global consequences
In October, the internet was shaken by the Facebook outage that affected dozens of big-name companies, as well as millions of brands and businesses that advertise on Facebook’s platform. Because of something as simple as a misconfigured Domain Name System (DNS) record, every device running a Facebook app integration began hammering recursive DNS resolvers with retry traffic -- in effect, an unintentional Distributed Denial of Service (DDoS) attack. That retry storm overloaded resolvers well beyond Facebook's own network.
You might be thinking to yourself, "So what? A few sites were offline for a couple of hours." But the outage brought other issues to light. Communications for the very Facebook employees who could fix the problem were crippled. In some cases, staff couldn't even enter buildings because the physical badge system was offline.
So, now that the digital dust has started to clear, it's time for the post-mortem. What can we learn from this event, and how can you prevent it from happening to your organization? Because let's face it: if a platform as massive as Facebook can suffer a sprawling outage, companies large and small should be taking notes.
Configuration Management Matters
Configuration Management (CM) is a systems engineering process for establishing and maintaining consistency in a system's performance, functionality, and physical attributes. Essentially, CM gives you a programmatic way to make sure things don't run off the rails.
A server isn't responding after an update? Halt the rollout to the rest of the fleet -- in this case, the other servers -- and alert the right people about the event. Then roll back the update on the unresponsive server to restore service. By using CM for automation, you build checks against human error directly into the process.
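To make that concrete, here is a minimal sketch of what such an automated guardrail can look like. The fleet, version strings, and the deploy, health-check, rollback, and alert helpers are all hypothetical placeholders for whatever your own tooling actually provides; treat it as an illustration of the pattern, not a drop-in implementation.

```python
# Minimal sketch of a CM-driven staged rollout with an automatic halt and rollback.
# deploy/health_check/rollback/alert are hypothetical placeholders for your tooling.

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rollout")

FLEET = ["web-01", "web-02", "web-03", "web-04"]  # hypothetical hosts

def deploy(host: str, version: str) -> None:
    """Push the update to a single host (placeholder)."""
    log.info("deploying %s to %s", version, host)

def probe(host: str) -> bool:
    """Replace with a real probe: HTTP health endpoint, service status, etc."""
    return True

def health_check(host: str, retries: int = 3, delay: float = 5.0) -> bool:
    """Return True if the host responds after the update."""
    for _ in range(retries):
        time.sleep(delay)
        if probe(host):
            return True
    return False

def rollback(host: str, previous_version: str) -> None:
    log.warning("rolling back %s to %s", host, previous_version)

def alert(message: str) -> None:
    log.error("ALERT: %s", message)  # wire this to paging, email, etc.

def staged_rollout(version: str, previous_version: str) -> None:
    for host in FLEET:
        deploy(host, version)
        if not health_check(host):
            # One unhealthy host halts the rollout for the rest of the fleet.
            alert(f"{host} unresponsive after {version}; halting rollout")
            rollback(host, previous_version)
            return
    log.info("rollout of %s completed across the fleet", version)

if __name__ == "__main__":
    staged_rollout("v2.4.1", "v2.4.0")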
The Importance Of Testing In The Pipeline
The advent of DevOps has given us amazing power to automate as many manual processes as possible. It has taken teams from pushing code to production once a sprint, on average, to getting changes live within minutes of a commit landing in the repository. That said, a common problem we see is the lack of a real testing stage in the pipeline -- often nothing beyond a basic linter (static code analysis tool) run by Jenkins or GitHub.
What needs to be added to the pipeline is a stage that pushes the code to staging servers. These staging or dev servers are a sandbox for seeing what a change will do before it hits production. While Facebook's issue came from a configuration change -- not application code -- the lesson is the same.
It's imperative to have that testing area so you can be confident that what you're moving or changing in production won't make you the lead story on the front page of WIRED tomorrow. Lastly, always keep the staging servers matched to production as closely as possible for the most reliable testing.
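As an illustration, here is a minimal sketch of a pipeline gate that runs smoke tests against staging and refuses to promote a change if any critical endpoint fails. The staging host and endpoint paths are hypothetical; substitute whatever checks your own environment actually exposes.

```python
# Minimal sketch of a pipeline gate: deploy the change to staging, run smoke tests,
# and only promote to production if they pass. Host and paths are hypothetical.

import sys
import urllib.request

STAGING_BASE = "https://staging.example.internal"  # hypothetical staging host

SMOKE_CHECKS = [
    "/healthz",        # service is up
    "/api/version",    # the new build actually landed
    "/login",          # a critical user-facing path still responds
]

def smoke_test(base_url: str) -> bool:
    """Return True only if every critical endpoint answers with HTTP 200."""
    for path in SMOKE_CHECKS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                if resp.status != 200:
                    print(f"FAIL {path}: HTTP {resp.status}")
                    return False
        except Exception as exc:
            print(f"FAIL {path}: {exc}")
            return False
        print(f"OK   {path}")
    return True

if __name__ == "__main__":
    # Run this after the deploy-to-staging step; a nonzero exit code stops the
    # pipeline before anything reaches production.
    sys.exit(0 if smoke_test(STAGING_BASE) else 1)
```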
Rollback Planning And Drills Are Critical
So, you've done your due diligence and taken every step to make sure this simple update will go smoothly -- but after you push it, you find that something seemingly unrelated broke because of the change. That's a failure of orthogonality in development and design, but it happens to the best of us, so don't fret.
Moments like these are when you reach for the rollback plan discussed earlier: the steps you must take to get back to the state you were in before the change or update was pushed. If you've taken my CM advice to heart, you should already have this plan in place.
If not, develop a rollback plan -- preferably before your next push to production. Once that plan exists, run a mock scenario to make sure it works in practice, not just on paper. This is one of the many good reasons to run periodic cyber wargaming scenarios, an admittedly fun, interactive way to test your cybersecurity preparedness in an attack context.
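Here is one hedged sketch of what a drillable rollback can look like: snapshot the configuration before every change, and restore the most recent known-good copy on demand. The file paths and the systemctl restart hook are hypothetical stand-ins for your own services.

```python
# Minimal sketch of a rollback drill: keep timestamped snapshots of a config file
# and restore the most recent known-good one. Paths and restart hook are hypothetical.

import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path

CONFIG = Path("/etc/myapp/app.conf")          # hypothetical config under change control
SNAPSHOT_DIR = Path("/var/backups/myapp")     # where known-good copies live

def snapshot() -> Path:
    """Save a timestamped copy of the current config before any change."""
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = SNAPSHOT_DIR / f"app.conf.{stamp}"
    shutil.copy2(CONFIG, dest)
    return dest

def rollback() -> None:
    """Restore the newest snapshot and restart the service."""
    snapshots = sorted(SNAPSHOT_DIR.glob("app.conf.*"))
    if not snapshots:
        raise RuntimeError("no snapshots found; nothing to roll back to")
    shutil.copy2(snapshots[-1], CONFIG)
    # Hypothetical restart hook; replace with whatever actually reloads the service.
    subprocess.run(["systemctl", "restart", "myapp"], check=True)

if __name__ == "__main__":
    # Drill: take a snapshot, make a change, then prove the rollback path works.
    snapshot()
    rollback()
```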
Communication Alternatives Are A Necessity
With an increasing share of the workforce going remote, reliable communications matter more than ever. So, for a company like Facebook, it makes sense to use a homegrown SaaS such as its proprietary Messenger platform. However, scenarios exactly like this one are why you should also have a predetermined fallback.
While I'm sure this sounds like an obvious step, don't just bury the plan in your team's onboarding material. For that matter, if you're a smaller company that isn't onboarding hundreds of people a year, you may not even have a formal onboarding process to bury it in.
Secondly, the alternative communication method can change at any time. One way around this is to keep up-to-date documentation on what to do if the main form of contact is cut for any reason. Another is to have your IT department send an automated message -- by email or SMS -- telling employees what the backup channel is, should this ever happen.
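For example, here is a minimal sketch of that kind of automated fallback notice, assuming a hypothetical health endpoint for the primary chat platform, an internal SMTP relay, and an all-staff mailing list. It simply checks whether the primary channel responds and, if not, emails everyone the backup instructions.

```python
# Minimal sketch of an automated fallback notice: if the primary chat platform
# stops answering, email everyone the backup channel. The health URL, SMTP relay,
# recipients, and backup details below are all hypothetical placeholders.

import smtplib
import urllib.request
from email.message import EmailMessage

PRIMARY_CHAT_HEALTH = "https://chat.example.internal/healthz"  # hypothetical
SMTP_RELAY = "smtp.example.internal"                            # hypothetical
RECIPIENTS = ["all-staff@example.com"]                          # hypothetical

FALLBACK_NOTICE = (
    "Our primary chat platform appears to be down.\n"
    "Until further notice, use the backup conference bridge listed in the "
    "emergency communications doc, and watch the status page for updates."
)

def primary_chat_is_up() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_CHAT_HEALTH, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

def send_fallback_notice() -> None:
    msg = EmailMessage()
    msg["Subject"] = "Primary chat is down -- use the backup channel"
    msg["From"] = "it-alerts@example.com"
    msg["To"] = ", ".join(RECIPIENTS)
    msg.set_content(FALLBACK_NOTICE)
    with smtplib.SMTP(SMTP_RELAY) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    # Run this from a scheduler on infrastructure that does NOT depend on the
    # platform it is monitoring.
    if not primary_chat_is_up():
        send_fallback_notice()
```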
Depth And Redundancy Are Key
Now, this is closely tied to the last section on communication alternatives, but it applies to some degree to every lesson above. The right level certainly varies from company to company, but set redundancy to a level that would make any doomsday prepper jealous. And if you think it's overkill, go one step further.
A question you should constantly be asking yourself, for every possible scenario, is: does this have a backup? That's where the rollback plan comes in when a world-breaking change hits production, and Configuration Management itself is a backstop for manual human monitoring. With at least one level of backup in place, the entire livelihood of your company isn't at stake. At the end of the day, you can save yourself a plethora of headaches (and heartaches) by going back to basics and prioritizing redundancy throughout your environments.
Cody Michaels is an Application Security Consultant at nVisium. With over 10 years of secure programming and development experience, Cody has worked with individuals from startup levels all the way up to Fortune 500 companies. He has won hacking events including the Compuware Hack the Museum at the Henry Ford Museum. He is also a contributor to the Arctic Code Vault for his open source code contributions. Cody is known for presenting at Defcon meetups, various local security talks and the HackMiami conference.