4 types of outages to avoid in 2024 (with cautionary tales from 2023)
Throughout 2023, the tech world witnessed several high-profile outages. Collectively, these incidents cost millions of dollars and affected tens of millions of end users. These losses were not entirely in vain, though: outages like these serve as cautionary tales for organizations facing the ongoing challenge of maintaining seamless operations in increasingly complex, interdependent environments.
Reflecting on these incidents helps us see how we can safeguard against similar disruptions happening in the future. In this article, we’ll delve into four major types of outages, with lessons to help companies enhance their resilience in 2024 and beyond.
Type 1: Infrastructure failures
2023’s cautionary tale: Datadog
In March 2023, Datadog experienced a significant service outage which took almost two days to resolve. The issue impacted their web application and cost the company about $5 million. It also took multiple shifts of hundreds of engineers to diagnose and fix.
The outage was partially attributed to an operating system update that disrupted network connectivity in some of their compute clusters. A post-mortem revealed that the specific trigger was a security update, which led to network connectivity problems across thousands of nodes. This issue directly impacted the infrastructure layer, affecting the network stack and the communication between containers, which in turn caused service disruptions for users. For almost a day, users were unable to access the platform, services, or APIs.
What can we learn from Datadog?
Infrastructure outages can have widespread effects, particularly in distributed systems like Datadog's, where an issue in one part of the infrastructure can cascade and cause broader system failures. In Datadog’s case, the underlying issue was not a lack of monitoring but an update that was rolled out in parallel across their nodes.
The severity and scope of this outage underscore how the complex dependencies of cloud-based, high-scale, distributed systems make root cause analysis so challenging, even for companies like Datadog, which have top-of-the-line monitoring in place.
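One common safeguard against this failure mode is to roll updates out in small waves and pause as soon as a wave looks unhealthy, instead of touching every node at once. The following is a minimal sketch of that idea, not a description of Datadog's tooling: the node names, the wave sizes, and the `apply_update` and `is_healthy` callbacks are all hypothetical placeholders for whatever deployment and health-check mechanisms an organization already has.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def batches(nodes: Sequence[str], sizes: Iterable[int]) -> List[List[str]]:
    """Split nodes into successive waves; a final wave takes any remainder."""
    waves: List[List[str]] = []
    start = 0
    for size in sizes:
        if start >= len(nodes):
            break
        waves.append(list(nodes[start:start + size]))
        start += size
    if start < len(nodes):
        waves.append(list(nodes[start:]))
    return waves


def staged_rollout(
    nodes: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: installs the update on one node
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    wave_sizes: Iterable[int] = (1, 5, 25),
) -> Tuple[List[str], bool]:
    """Update nodes wave by wave and halt after the first unhealthy wave.

    Returns (nodes_updated_so_far, halted).
    """
    updated: List[str] = []
    for wave in batches(nodes, wave_sizes):
        for node in wave:
            apply_update(node)
            updated.append(node)
        # Check the whole wave before widening the blast radius.
        if not all(is_healthy(node) for node in wave):
            return updated, True
    return updated, False


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(40)]
    broken = {"node-3"}  # pretend the update breaks networking on this node
    done, halted = staged_rollout(fleet, lambda n: None, lambda n: n not in broken)
    print(f"updated {len(done)} of {len(fleet)} nodes, halted={halted}")
```

In this sketch a bad update stops after a handful of nodes, so the remaining fleet keeps serving traffic while engineers investigate.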
Type 2: Unknown dependencies
2023’s cautionary tale: Cloudflare
In November 2023, Cloudflare, along with Workday, experienced significant outages stemming from a power failure at a data center in Oregon. This impacted Cloudflare’s customer-facing control plane and analytics services. They were able to restore partial access via a disaster recovery facility, but full resolution took two days.
The company acknowledged that their high availability systems, designed to prevent such incidents, failed due to unknown dependencies. The outage revealed that some critical systems depended on the specific data center that experienced the power failure. These dependencies were not fully recognized or accounted for in their contingency planning.
What can we learn from Cloudflare?
This incident highlights the importance of understanding all system dependencies, not just those that are apparent or planned for in disaster recovery scenarios. It also underscores why both the architecture and the monitoring solution that watches it need to stay fully online, even when a single facility goes down.
The outage at Cloudflare was not just a data center infrastructure failure; it was compounded by the presence of critical system dependencies that were not caught in time, leading to a more significant disruption than the company's high availability strategies could mitigate.
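A dependency audit of this kind can be approximated with a simple graph walk: for each service, follow its dependencies transitively and flag anything that runs in only one site. The sketch below assumes a hand-maintained (or discovered) dependency map and site inventory. The service names and data center labels are invented for illustration and do not reflect Cloudflare's actual architecture.

```python
from typing import Dict, List, Set


def transitive_deps(service: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every service reachable from `service` via dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def exposed_services(
    graph: Dict[str, List[str]],
    sites: Dict[str, Set[str]],
    failed_site: str,
) -> Dict[str, List[str]]:
    """Map each service to the components it relies on that run ONLY in failed_site."""
    exposed: Dict[str, List[str]] = {}
    for service in graph:
        reachable = transitive_deps(service, graph) | {service}
        single_homed = sorted(d for d in reachable if sites.get(d) == {failed_site})
        if single_homed:
            exposed[service] = single_homed
    return exposed


if __name__ == "__main__":
    # Hypothetical inventory: service -> direct dependencies.
    graph = {
        "control-plane": ["config-store", "auth"],
        "analytics": ["log-pipeline"],
        "log-pipeline": ["queue"],
        "config-store": [],
        "auth": [],
        "queue": [],
    }
    # Hypothetical inventory: service -> data centers it runs in.
    sites = {
        "control-plane": {"dc-a", "dc-b"},
        "analytics": {"dc-a", "dc-b"},
        "log-pipeline": {"dc-a", "dc-b"},
        "config-store": {"dc-a", "dc-b"},
        "auth": {"dc-a", "dc-b"},
        "queue": {"dc-a"},  # the hidden single-site dependency
    }
    for service, culprits in exposed_services(graph, sites, "dc-a").items():
        print(f"{service} would fail with dc-a because of: {', '.join(culprits)}")
```

Here "analytics" looks redundant on paper because it runs in two sites, yet it still depends, two hops away, on a queue that lives in only one. The hard part in practice is keeping the dependency map accurate, which is exactly where unknown dependencies hide.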
Type 3: Application-level outages
2023’s cautionary tale: Instagram
In May 2023, Instagram experienced an outage that prevented users from accessing the platform for about 75 minutes. More than 175,000 users reported problems at the peak of the outage, according to Downdetector.
Users of the platform encountered an error message stating “Couldn’t load feed” and were unable to resolve the issue through refreshing. The stream of errors indicated server-side issues, such as server instability or authentication mismatches. However, the problem was more likely due to a single point of aggregation failure within the application, as the network itself was functioning and packets were forwarding.
What can we learn from Instagram?
This incident shows how difficult it can be to trace an outage to its real root cause. In Instagram’s case, the fact that the network was operational and packets were forwarding correctly implied that the underlying network infrastructure was not the source of the problem. Furthermore, the error message that users were seeing pointed towards a problem at the application layer rather than network or hardware infrastructure. The presence of 5xx errors, which are indicative of server-side issues, suggests that the problem originated from within the application's backend infrastructure, such as issues with server stability or authentication mechanisms.
Outages like this illustrate how fragile production environments can be: if even one element of a complex service delivery chain fails, the whole application can be rendered unusable. To avoid these types of disruptions, companies should implement a comprehensive monitoring system that not only tracks application performance and alerts engineers to anomalies or degradations in real time, but can also trace the issue to its true root cause. This system should monitor various metrics, including error rates, response times, and system load.
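To make the error-rate metric concrete, here is a minimal sketch of a sliding-window monitor that raises an alert when the share of 5xx responses crosses a threshold. The class name, the 60-second window, the 5% threshold, and the minimum request count are all illustrative assumptions, not values taken from any particular monitoring product.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Tracks the fraction of 5xx responses over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05,
                 min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, status_code: int) -> None:
        """Record one response; timestamps are assumed to be non-decreasing."""
        self._events.append((timestamp, 500 <= status_code <= 599))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self) -> bool:
        return (len(self._events) >= self.min_requests
                and self.error_rate() > self.threshold)


if __name__ == "__main__":
    monitor = ErrorRateMonitor()
    for i in range(100):
        monitor.record(i * 0.1, 200)       # healthy traffic
    for i in range(20):
        monitor.record(10 + i * 0.1, 503)  # burst of server errors
    print(f"error rate: {monitor.error_rate():.1%}, alert: {monitor.should_alert()}")
```

A check like this only says that something is wrong on the server side. Tracing the failure to its root cause still requires correlating it with response times, system load, and the dependencies between services.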
Alongside monitoring, having a well-defined incident response plan ensures quick and effective action when issues arise. This plan should include procedures for identifying, diagnosing, and resolving issues, as well as communicating with users about the status of the service.
Type 4: Network infrastructure outages
2023’s cautionary tale: Microsoft Azure
In June 2023, Microsoft experienced a series of outages affecting various cloud services, including their Azure portal. A “spike in network traffic” was the preliminary root cause of this outage. In response, Microsoft implemented load balancing and auto-recovery operations to mitigate the issue.
These outages significantly disrupted Microsoft's cloud services, affecting a range of applications and users. Because several cloud-based services were affected, including Azure, OneDrive, Teams, and SharePoint Online, the incidents pointed to a widespread network issue rather than a problem isolated to a single application or service.
What can we learn from Microsoft?
The primary cause of the outage was identified as a significant increase in network traffic, which suggests issues related to network capacity or traffic management. It was later revealed that a hacktivist group, "Anonymous Sudan", carried out DDoS attacks against the Azure portal in an attempt to overwhelm the network infrastructure.
This incident underscores the importance of robust network infrastructure and traffic management strategies, whether the spikes in traffic come from DDoS attacks or any other unexpected surges. It also shows how using advanced observability tools (in addition to cybersecurity tools) is crucial. These observability tools can detect unusual spikes or patterns in traffic that could indicate a potential problem or an impending attack, as well as trace the real root cause of anomalies and outages when they occur.
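As a rough illustration of what detecting an unusual spike can mean, the sketch below compares each new traffic sample against a rolling baseline and flags samples that sit several standard deviations above it. The class name, the baseline size, the z-score threshold, and the floor on the standard deviation are assumptions chosen for the example. Production systems typically use more robust models that account for seasonality.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class TrafficSpikeDetector:
    """Flags samples far above a rolling baseline of recent normal traffic."""

    def __init__(self, baseline_size: int = 30, z_threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.baseline: Deque[float] = deque(maxlen=baseline_size)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, requests_per_second: float) -> bool:
        """Return True if the sample looks like a spike relative to the baseline."""
        is_spike = False
        if len(self.baseline) >= self.min_samples:
            mu = mean(self.baseline)
            # Floor the deviation so a very flat baseline does not turn
            # ordinary jitter into alerts (and to avoid division by zero).
            sigma = max(pstdev(self.baseline), 0.05 * mu, 1.0)
            is_spike = (requests_per_second - mu) / sigma > self.z_threshold
        if not is_spike:
            # Only normal samples update the baseline, so a sustained attack
            # does not gradually become the new "normal".
            self.baseline.append(requests_per_second)
        return is_spike


if __name__ == "__main__":
    detector = TrafficSpikeDetector()
    for rps in [100, 102, 98, 101, 99, 100, 103, 97, 500, 101]:
        if detector.observe(rps):
            print(f"spike detected at {rps} requests/second")
```

A detector like this cannot tell a DDoS attack from a legitimate surge, such as a product launch. It only surfaces the anomaly early enough for traffic management and security tooling to respond.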
A quick summary of lessons learned in 2023
In 2023 we saw examples of the many different kinds of issues that can lead to far-reaching service outages: operating system updates causing disruptions, unrecognized dependencies creating cascading effects, backend infrastructure failures leading to server-side issues, and – of course – attacks by threat actors.
These outages were costly for the companies that experienced them, in both money and reputation. But they were also valuable opportunities for reflection, and for making infrastructural and operational changes that strengthen resilience. For those of us observing from the outside, they serve as cautionary tales, showing us how we can protect our own environments from disruptions and threats in 2024 and beyond.
Amir Krayden is CEO and co-founder of Senser.