What the 'Summer of Outages' showed us, and what we can do about it
Summer 2019 was a rough one for the internet, with systemic outages occurring frequently and in quick succession.
Some of these outages were caused by internal errors, others external, but two overriding causes emerged: greater network complexity and the frequency and pace of code change. In aggregate, these outages serve as a painful reminder of just how fragile the internet is, especially as networks and services grow increasingly interconnected and co-reliant.
The main outages were:
- On June 2, Google experienced outages that the company blamed on "high levels of network congestion in the eastern US." Several of its most popular services, including Search, Nest, YouTube and Gmail, ground to a halt. Not long after, Google Calendar went down, giving many end users a joking excuse to declare a day off.
- Cloudflare went down on June 24 due to a minor network leak, affecting domains that rely on this leading content delivery network (CDN). End users were locked out of popular services including Discord, Google and Amazon.
- On July 3, Google and Cloudflare were both hit by additional outages.
- Also on July 3, Facebook had problems loading images, videos and other data across key apps and services, including Instagram, WhatsApp and Messenger. Facebook blamed this on "an error triggered during a routine maintenance operation."
- Apple joined the club a day later, with a widespread three-hour cloud outage impacting the App Store, Apple Music and Apple TV.
- Finally, on July 11, Twitter experienced an hour-long web and mobile app outage, resulting from what the company called "an internal system change".
You can’t prevent such outages from happening, but you can better insulate your organization from this unpredictability by focusing on these five areas:
Keep a vigilant watch for outages in as many geographies, and from as many network perspectives, as possible: Whether or not your various end-user segments can access a website or service depends on a long chain of performance-impacting elements standing between them and your datacenter. These include CDNs, the cloud, regional and local ISPs, mobile networks and more.
The first step in preparing for or responding to an outage is detecting it, and that is nearly impossible if you are only testing availability nationally or in limited geographies. The same holds true if you’re only tracking from a small number of network vantage points, like the cloud or a handful of ISPs or mobile carriers. Such a narrow approach will leave you with significant blind spots. A broader reach gives you advance notice of more outages and provides a better opportunity to put backup plans in place, if available, or to communicate proactively with impacted end users, letting them know you’re working on the problem.
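To make the blind-spot point concrete, here is a minimal sketch of how availability results might be grouped by geography and network type. The `ProbeResult` record, the region and network labels, and the 50% threshold are all illustrative assumptions, not part of any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str    # hypothetical label, e.g. "us-east"
    network: str   # hypothetical label, e.g. "cloud", "isp", "mobile"
    ok: bool       # True if the availability check succeeded


def failing_segments(results, threshold=0.5):
    """Return (region, network) segments whose failure rate is at or above threshold."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        key = (r.region, r.network)
        totals[key] += 1
        if not r.ok:
            failures[key] += 1
    return sorted(k for k in totals if failures[k] / totals[k] >= threshold)


# Made-up sample data: the cloud probes pass while an ISP segment is down.
results = [
    ProbeResult("us-east", "cloud", True),
    ProbeResult("us-east", "cloud", True),
    ProbeResult("us-east", "isp", False),
    ProbeResult("us-east", "isp", False),
    ProbeResult("eu-west", "mobile", True),
    ProbeResult("eu-west", "mobile", False),
    ProbeResult("ap-south", "isp", True),
]

print(failing_segments(results))
# [('eu-west', 'mobile'), ('us-east', 'isp')]

# Restricting the same data to cloud vantage points hides both problems.
cloud_only = [r for r in results if r.network == "cloud"]
print(failing_segments(cloud_only))
# []
```

In this made-up data set, a cloud-only view reports nothing wrong even though end users on one ISP cannot reach the service at all.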
Reduce mean time to detect and mean time to repair: While early detection and notification of an outage are useful, end-user goodwill will only last so long. It’s not enough to simply know an incident is happening; you also need to find out what is causing it, and fast. In some cases, the problem will be something within your own firewall that you can fix. In other cases, the fault will be something beyond your direct control, like a cloud service, CDN or carrier network.
Even if the problem is something you can’t directly address, this knowledge is power, because it means you aren’t sending your IT Ops teams and site reliability engineers (SREs) into wasted hours of war-rooming. That kind of effort leads to alert fatigue, burnout and lost time that could have been spent proactively improving availability over the long term.
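For teams that want to track these two numbers, the arithmetic is simple. The sketch below assumes each incident record carries three timestamps, and it measures time to repair from the moment of detection; some teams measure it from the start of the incident instead, so treat that choice and the field names as assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the outage actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_delta(deltas):
    """Average a non-empty collection of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average gap between start and detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to repair, measured here from detection to resolution."""
    return mean_delta(i.resolved - i.detected for i in incidents)


# Made-up incident log for illustration.
incidents = [
    Incident(datetime(2019, 7, 3, 10, 0), datetime(2019, 7, 3, 10, 10),
             datetime(2019, 7, 3, 11, 10)),
    Incident(datetime(2019, 7, 11, 14, 0), datetime(2019, 7, 11, 14, 30),
             datetime(2019, 7, 11, 15, 0)),
]

print(mttd(incidents))  # 0:20:00
print(mttr(incidents))  # 0:45:00
```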
Enable BGP route tracing: The internet is basically a circuit relaying data signals and packets across different network paths. Several protocols manage this data flow, one of which is Border Gateway Protocol, or BGP. BGP governs how traffic is routed between autonomous systems, the independently operated networks that make up the internet. The internet relies on it to work, but misrouting can arise due to hijacks, policy misconfigurations, route flaps and peering issues. This can lead to packets being inadvertently sent to the wrong destination, or expiring altogether.
One visible example of a BGP leak involved Google in November 2018. In a case of “grand theft internet,” Google services traffic from a variety of countries and websites was directed to IP addresses belonging to overseas ISPs including TransTelekom Russia and China Telecom, instead of to Google servers. This resulted in the packets being sent to various unintended destinations before being terminated, or black-holed.
Initial reports of the incident suggested this might have been a malicious BGP hack, given that the countries involved have histories of internet censorship. However, it was later discovered that the faulty redirects were actually the result of human error: in this case, misconfigurations in the peering between Google and MainOne, a Nigerian ISP. Google had established that peering relationship to better support its growing Nigerian presence.
As network build-outs continue at a rapid pace, such BGP mishaps may become more common. While you may not be able to do much about an incident when it affects an external provider, you can more closely track BGP leaks within your own application delivery chain, so that you can identify problems more quickly, rule out certain causes and proceed to remediation.
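One simple form of BGP tracking is to compare each observed route announcement against the origin network you expect for your prefixes. The sketch below assumes announcements arrive as a prefix plus an AS path from some route-monitoring feed, which is not modeled here. The prefixes and AS numbers are reserved documentation values, and the `EXPECTED_ORIGINS` table and `check_announcement` helper are hypothetical.

```python
import ipaddress

# Hypothetical expectations: prefix -> set of ASNs allowed to originate it.
EXPECTED_ORIGINS = {
    ipaddress.ip_network("203.0.113.0/24"): {64500},
    ipaddress.ip_network("198.51.100.0/24"): {64501},
}


def check_announcement(prefix, as_path):
    """Classify one observed announcement against the expected origins.

    as_path is ordered from the observing peer to the origin,
    so the origin AS is the last element.
    """
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]
    for expected_net, origins in EXPECTED_ORIGINS.items():
        # subnet_of() is also True when the two networks are equal.
        if net.subnet_of(expected_net):
            if origin not in origins:
                return "unexpected-origin"
            if net != expected_net:
                return "more-specific"
            return "ok"
    return "unmonitored"


# Made-up observations for illustration.
observed = [
    ("203.0.113.0/24", [64510, 64505, 64500]),
    ("203.0.113.0/24", [64510, 64511]),
    ("203.0.113.128/25", [64510, 64500]),
    ("192.0.2.0/24", [64510, 64502]),
]

for prefix, path in observed:
    print(prefix, check_announcement(prefix, path))
# 203.0.113.0/24 ok
# 203.0.113.0/24 unexpected-origin
# 203.0.113.128/25 more-specific
# 192.0.2.0/24 unmonitored
```

This checks only the origin of each route. A leak that keeps the correct origin but sends traffic through an unintended intermediate network would pass, so a fuller check would also inspect the rest of the AS path.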
Automate testing early and often: It’s never a great idea to run new code directly on a production system. But in the rush to release code, this often happens, leading to problems. Google conducts tens of thousands of new code deployments a day to thousands of services, seven of which have more than a billion users each around the globe.
Not surprisingly, SREs, who have expertise in both IT ops and coding and who bear responsibility for maintaining system availability in the face of near-constant software change, recently reported that incident management is a huge part of their job. At the time of the survey, almost half of respondents noted they had worked on a service incident in the past week.
With the pace of software rollouts not expected to slow anytime soon, organizations must become more adept at balancing velocity and quality. Increased automation of functional software testing, conducted at the earliest possible phases of the development cycle, is critical to this, as are comprehensive regression testing and rollback capabilities.
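As a rough illustration of pairing automated regression checks with a rollback path, the sketch below promotes a candidate release only if every check passes and otherwise keeps the current version in place. The check functions, version strings and `deploy_with_rollback` helper are hypothetical stand-ins; real checks would exercise a staging or canary environment rather than return canned values.

```python
def deploy_with_rollback(current_version, candidate_version, checks):
    """Return (version_to_serve, failed_check_names).

    The candidate is promoted only when every check passes;
    otherwise the current version stays live.
    """
    failed = [name for name, check in checks.items()
              if not check(candidate_version)]
    if failed:
        return current_version, failed
    return candidate_version, []


# Hypothetical regression checks. Each takes a version and returns True on success.
def homepage_loads(version):
    return True


def checkout_total_is_correct(version):
    # Simulate a regression introduced in version 2.1.0.
    return version != "2.1.0"


checks = {
    "homepage_loads": homepage_loads,
    "checkout_total_is_correct": checkout_total_is_correct,
}

print(deploy_with_rollback("2.0.3", "2.1.0", checks))
# ('2.0.3', ['checkout_total_is_correct'])
print(deploy_with_rollback("2.0.3", "2.1.1", checks))
# ('2.1.1', [])
```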
Measure third parties and hold them accountable: Third parties, ranging from software components integrated into your site to external infrastructures like the cloud and CDNs, can have a huge impact on your site’s availability. Any organization relying on third parties must keep a close eye on them in order to ensure its own availability.
When it comes to the cloud specifically, businesses should avoid putting all their eggs (data and apps) in one basket (a single cloud service provider). Implementing a multicloud strategy as a form of backup and protection can involve a fair amount of time and effort, including testing failover strategies in advance and ensuring cloud-to-cloud interactions (supporting replication) are fast and reliable. This is one use case where monitoring from the vantage points of the various clouds themselves is appropriate; however, as noted above, cloud-only monitoring should never be used to comprehensively gauge real end-user experiences.
Conclusion: The recent spate of outages has reinforced the fact that the internet is very much like a house of cards, and it is virtually impossible to avoid major outages and their cascading impact. As the web grows more interconnected, the likelihood of unplanned downtime impacting your business will only grow. Fortunately, there are steps businesses can take to better anticipate and respond to these events. It may be hard to hear, but planning for failure is a necessity. If it can happen to the likes of Google, Facebook and Apple, it can -- and inevitably will -- happen to you.
Mehdi Daoudi is the co-founder and CEO of Catchpoint, a leading digital experience intelligence company. His team has expertise in designing, building, operating, scaling and monitoring highly transactional Internet services used by thousands of companies that impact the experience of millions of users. Before Catchpoint, Mehdi spent 10+ years at DoubleClick and Google, where he was responsible for Quality of Services, buying, building, deploying, and using monitoring solutions to keep an eye on an infrastructure that delivered billions of transactions daily.