Weathering the alert storm
The more layers a business adds to its IT and cloud infrastructure, the more alerts it creates to detect issues and anomalies. As a business heads towards a critical mass, how can it prevent its DevOps teams from being bombarded by ‘alert storms’ as they try to differentiate between real incidents and false positives?
The key is to continuously review and update an organization's monitoring strategy, specifically targeting the removal of unnecessary or unhelpful alerts. This is especially important for larger companies that generate thousands of alerts due to multiple dependencies and potential failure points. Identifying the ‘noisiest’ alerts, or those that are triggered most often, will allow teams to take preventive action to weather alert storms and reduce ‘alert fatigue’ -- a diminished ability to identify critical issues.
Effectively addressing and minimizing the impact of alert storms ensures that critical issues aren’t overlooked or hidden among the noise, helping to prevent service outages, reduce downtime, and avoid operational disruptions.
Overwhelmed by alerts
Alert storms happen when an organization’s monitoring platform generates an excessive number of alerts all at once or in quick succession. Microservices architectures can be particularly susceptible due to numerous service dependencies, potential failure points, and upstream and downstream relationships. For example, when one service experiences an issue, teams receive an alert not only for that service but also from every other service trying to communicate with it.
Consequently, alert storms can cause confusion, delay incident response, and lead to alert fatigue.
Many organizations depend on threshold and anomaly configurations to reduce alert storms. However, temporary measures such as manually muting and unmuting alerts are difficult to replicate at scale. A lasting solution must be scalable, easy to maintain and, where possible, automated.
Identify and evaluate
The priority is to identify the primary components of an alert storm -- the alerts that are triggered most often. These can be broadly categorized as either predictable or unstable.
Predictable alerts form consistent patterns and include notifications about the start and end of automated backups, as well as warnings about regularly occurring issues that, while undesirable, tend not to pose any real danger. Unstable alerts, on the other hand, occur at an unnecessarily high frequency, typically informing teams that something is switching back and forth between different states. In both scenarios, such alerts can quickly overwhelm a team, potentially causing genuine warnings to be missed in all the noise.
Once identified, these alerts can be adjusted by extending a monitoring tool’s evaluation window, thereby reducing their frequency and increasing their effectiveness. The evaluation window defines how often relevant data is evaluated and compared with alert conditions. By extending the evaluation window, the system considers more data points before making a decision, so an alert is only issued when a limit violation persists across consecutive evaluations rather than for a single transient spike.
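The effect of a longer evaluation window can be sketched as follows. This is a simplified model rather than any particular monitoring product’s implementation: an alert fires only if every data point in the window violates the threshold.

```python
def should_alert(datapoints, threshold, window):
    """Fire only when the last `window` datapoints ALL exceed the threshold."""
    if len(datapoints) < window:
        return False  # not enough data to evaluate yet
    return all(value > threshold for value in datapoints[-window:])

cpu_percent = [40, 95, 50, 88, 96]  # a flapping metric with brief spikes
print(should_alert(cpu_percent, threshold=90, window=1))  # True: one spike fires
print(should_alert(cpu_percent, threshold=90, window=3))  # False: not sustained
```

With a window of one, every brief spike pages someone; with a window of three, only a sustained violation does.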
Control, correlate and automate
Managing who receives notifications -- and when -- can be very helpful. Alerts can be limited only to those people who can handle the issue in question, or the teams affected by it. In addition, notifications can be suspended entirely in the event of maintenance work or upgrades. It’s possible to schedule deliberate downtime for such instances or implement one as needed during an unplanned outage, to prevent IT teams from being overwhelmed by a flood of notifications.
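Routing and suppression can be sketched in a few lines. The routing table, service names, and maintenance-window format below are all hypothetical; real monitoring platforms provide equivalents as configuration.

```python
from datetime import datetime, timezone

# Hypothetical routing table: service -> teams who can act on its alerts.
ROUTES = {"database": ["dba-oncall"], "payments": ["payments-team"]}

# Planned downtime windows: (start, end, service). Alerts inside are suppressed.
MAINTENANCE = [(
    datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
    "database",
)]

def in_maintenance(service, now):
    return any(start <= now <= end and svc == service
               for start, end, svc in MAINTENANCE)

def recipients(alert, now):
    """Return only the teams who should be notified; none during maintenance."""
    if in_maintenance(alert["service"], now):
        return []  # planned downtime: no notification flood
    return ROUTES.get(alert["service"], ["default-oncall"])

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(recipients({"service": "database"}, now))  # [] -- in maintenance
print(recipients({"service": "payments"}, now))  # ['payments-team']
```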
It's important to note that alert storms tend to involve several similar notifications. To reduce the volume, alerts can be grouped together based on shared criteria or by notification channel, so that teams receive only one notification per issue. For example, high-impact events and incidents may require an email alert or a pre-defined workflow, while less urgent alerts might be easier to consume in a chat channel, where related alerts and timelines are more visible. A good monitoring solution needs to provide a variety of notification channels so customers can choose the ideal one for any situation. Furthermore, event correlation enables teams to address related alerts as a single event, rather than as separate issues that require separate handling, thus preventing duplicated effort. Categorizing alerts reduces noise and allows teams to concentrate on resolving root causes.
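Grouping can be sketched as collapsing alerts that share a correlation key into a single notification. The key used here -- a (service, alert type) pair -- is a simplified, hypothetical stand-in for a real correlation strategy.

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse related alerts into one notification per underlying issue."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["service"], alert["type"])].append(alert)
    return [
        {"summary": f"{kind} on {service} ({len(members)} related alerts)",
         "members": members}
        for (service, kind), members in groups.items()
    ]

storm = [
    {"service": "checkout", "type": "timeout", "id": 1},
    {"service": "checkout", "type": "timeout", "id": 2},
    {"service": "search", "type": "timeout", "id": 3},
]
for note in group_alerts(storm):
    print(note["summary"])
# → timeout on checkout (2 related alerts)
# → timeout on search (1 related alerts)
```

Three raw alerts become two notifications, each still carrying its member alerts for investigation.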
Finally, automation plays a crucial role in preventing alert storms. An approach where alerts trigger predefined actions to resolve issues without the need for human intervention can decrease the severity of an incident as well as MTTR (mean time to resolve).
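A minimal sketch of this pattern, with hypothetical alert types mapped to predefined remediation actions; anything without a known-safe action escalates to a human:

```python
# Hypothetical runbooks: alert type -> predefined remediation action.
RUNBOOKS = {
    "disk_full": lambda alert: f"pruned old logs on {alert['host']}",
    "service_down": lambda alert: f"restarted {alert['service']}",
}

def auto_remediate(alert):
    """Run the predefined action for a known alert type; escalate otherwise."""
    action = RUNBOOKS.get(alert["type"])
    if action is None:
        return "escalated to on-call"  # no safe automation known
    return action(alert)

print(auto_remediate({"type": "disk_full", "host": "web-01"}))
# → pruned old logs on web-01
```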
Resilience and reliability
Alert storms can have an adverse effect on an organization by causing delays in incident response, increasing the risk of service outages, reducing performance, and leading to alert fatigue. As the number of both genuine and false-positive alerts continues to increase, organizations must implement scalable techniques that will improve the resilience and reliability of their monitoring system. An integrated approach to incident management also needs to be kept top of mind to enable teams to seamlessly switch between normal DevOps operations and focused incident work. Only by implementing these techniques and cutting through the noise will businesses truly be able to weather the alert storm.
Stefan Marx, Senior Director of Platform Strategy, Datadog. Stefan has been working in IT development and consulting for over 20 years. His main areas of activity are planning, developing and operating applications with a view to addressing the requirements and problems behind specific IT projects.