Taking control of the scary things: churn, incidents and downtime
Three little words herald major impact (and fear) for organizations: churn, incident, and downtime. Given that Gartner reports companies might be at risk of losing up to half a million USD each hour from severe incidents (based on losses and time to remediate), boards should take the health of a company’s digital operations seriously. Thankfully, those responsible for digital operations and incident response have a plethora of capabilities and services at their disposal that can drastically reduce the impact downtime and instability have on their organization.
With a long recession forecast for the UK, leveraging these tools to better understand, plan, and predict is crucial. Achieving this state of operational maturity means businesses are equipped with the right analytics, communications, understanding, and ability to take action to manage all threats and incidents -- and try to prevent as many as possible from occurring in the first place. True operational maturity goes beyond the technology in place to also cover the people and processes involved. These ‘human’ elements are no less vital since they are associated with important metrics and outcomes such as hours worked, staff burnout, and attrition.
What is operational maturity?
Every organization falls into one of five stages of operational maturity, from manual to preventative. The goal is to achieve the preventative state of operational maturity, but many organizations find themselves much less prepared. The five stages can be described as follows (each building on the one before it):
1. MANUAL -- there are no inbound integrations with observability tools (incidents are initiated manually).
2. REACTIVE -- the organization has only some inbound integrations but no defined processes for managing incidents.
3. RESPONSIVE -- there are defined call-out schedules and multiple escalation levels, with teams moving towards full-service ownership.
4. PROACTIVE -- inbound and outbound integrations, service dependencies, change events, and response plays are all in place to fix issues before customers are aware.
5. PREVENTATIVE -- the organization adopts event intelligence features and/or consumes analytics to allow predictive remediation.
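As a rough illustration of how the five stages build on one another, the ladder can be sketched in Python. This is only a sketch under stated assumptions: the `Capabilities` fields and the `maturity_stage` function are hypothetical names invented for this example, and collapsing each stage into a single yes/no flag is a simplification of the descriptions above, not a formal assessment method.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    MANUAL = 1
    REACTIVE = 2
    RESPONSIVE = 3
    PROACTIVE = 4
    PREVENTATIVE = 5


@dataclass
class Capabilities:
    """Hypothetical yes/no summary of what an organization has in place."""
    inbound_integrations: bool = False       # observability tools open incidents
    schedules_and_escalations: bool = False  # call-out schedules, escalation levels
    proactive_tooling: bool = False          # outbound integrations, dependencies,
                                             # change events, response plays
    event_intelligence: bool = False         # event intelligence and/or analytics


def maturity_stage(caps: Capabilities) -> Stage:
    """Return the highest stage reached, climbing one rung at a time.

    Because each stage builds on the one before it, the climb stops at the
    first missing capability, even if later capabilities are present.
    """
    rungs = [
        (caps.inbound_integrations, Stage.REACTIVE),
        (caps.schedules_and_escalations, Stage.RESPONSIVE),
        (caps.proactive_tooling, Stage.PROACTIVE),
        (caps.event_intelligence, Stage.PREVENTATIVE),
    ]
    stage = Stage.MANUAL
    for in_place, reached in rungs:
        if not in_place:
            break
        stage = reached
    return stage


if __name__ == "__main__":
    caps = Capabilities(inbound_integrations=True, schedules_and_escalations=True)
    print(maturity_stage(caps).name)
```

Stopping at the first missing capability reflects the idea that a rung cannot be skipped: analytics without inbound integrations still leaves incidents being raised by hand.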
As a business ascends the operational maturity 'ladder' towards the preventative state, it will find with each rung that incidents are managed more smoothly, more quickly, and with fewer resources.
There are two critical factors that underlie the ladder of maturity: responsiveness and proactiveness. Simply put, responsiveness is how quickly and efficiently an organization is able to manage urgent, unplanned, and mission-critical work as it appears. An organization’s responsiveness is the result of the training, processes, and solutions it has in place to identify and remediate incidents as they occur. Important questions to ask when identifying an organization’s operational maturity level include:
- "How long does it take for an incident to be acknowledged?"
- "How quickly are we able to mobilize responders?"
- "How much time does it take us to resolve incidents?"
- "How many hours of disturbance and interruption do our responders have in a typical month?"
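The first and third of these questions can be answered from incident timestamps. The following is a minimal sketch, assuming each incident record carries the time it was triggered, acknowledged, and resolved; the `Incident` class and both function names are hypothetical and do not come from any particular incident management product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident being triggered and acknowledged."""
    delays = [(i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(delays))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident being triggered and resolved."""
    delays = [(i.resolved_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(delays))


if __name__ == "__main__":
    # Illustrative data only, not taken from any report.
    sample = [
        Incident(datetime(2023, 1, 9, 9, 0), datetime(2023, 1, 9, 9, 4),
                 datetime(2023, 1, 9, 9, 40)),
        Incident(datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 14, 10),
                 datetime(2023, 1, 9, 15, 20)),
    ]
    print("Mean time to acknowledge:", mean_time_to_acknowledge(sample))
    print("Mean time to resolve:", mean_time_to_resolve(sample))
```

Tracking these two averages month by month gives a simple view of whether responsiveness is improving. Both functions raise an error on an empty list, since an average over no incidents is undefined.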
If responsiveness is how an organization responds to an incident, proactiveness should be thought of as how quickly an organization identifies that incident. Too often, customers are the first to notice and alert a business to the problem. A team internal to that business then manually creates a ticket, and the incident response process can finally begin. But there’s a better way. With the right approach to digital operations, an organization can be the first to know when an incident has occurred and resolve it -- even before a customer is impacted. When determining a company’s level of proactiveness, it’s important to consider:
- "Who or what is identifying our incidents?"
- "What is the process for alerting the appropriate team about the incident in question?"
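One way to put a number on the first question is to record who or what first reported each incident and look at the split. The sketch below assumes a simple list of source labels such as "monitoring" and "customer"; the labels and the `detection_share` function are hypothetical, chosen only for illustration.

```python
from collections import Counter


def detection_share(sources: list[str]) -> dict[str, float]:
    """Return the fraction of incidents first reported by each source."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


if __name__ == "__main__":
    # Illustrative labels: one entry per incident, naming who noticed it first.
    first_reports = ["monitoring", "customer", "monitoring", "monitoring"]
    for source, share in detection_share(first_reports).items():
        print(f"{source}: {share:.0%}")
```

A rising share of incidents first caught by monitoring, and a falling share first reported by customers, would suggest the organization is becoming more proactive.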
The road to maturity
Achieving the end state of full operational maturity will depend on where you’ve come from and, specifically, the state of the business’s IT operations and infrastructure. If those functions are focused on mere survival, begin by acknowledging and supporting the teams who keep the plates spinning, and then map out a strategy to reach stability. Lack of resources doesn’t mean a plan shouldn’t be made -- be prepared.
Greater levels of operational maturity and adoption of digital transformation introduce benefits such as a faster response to incidents and the ability to manage workloads within core hours. This is important because it allows work to be distributed evenly across teams and reduces toil and stress, which will result in lower attrition. With defined call-out schedules and escalation procedures, reliability of response improves. This directly improves the stability of the operational environment and dependent applications, reducing the costs and reputational damage caused by unexpected events and, in turn, reducing customer dissatisfaction and churn.
There are numbers behind this. PagerDuty’s 2022 State of Digital Operations Report demonstrated, based on customer data, that 42 percent of technical teams worked more hours in 2021 than the year before. Most (54 percent) were interrupted outside of normal working hours with break-fix work. Those with greater operational maturity suffered less from costly, unplanned work.
Operational maturity ensures excellence, removes worries
Together, operational maturity, DevOps, and full-service ownership offer a model of accountability and control of the digital environment. Automation is inevitably a critical part of this advanced state: such tools support a rapid, focused response to operational events and incidents. Under the hood, these tools often use machine learning to filter out the 'noise', alert operatives only when needed, and reduce the 'alert fatigue' that has typically been associated with on-call engineering roles.
Now, more than ever, it’s important that the board appreciates the extent to which maturity in digital operations supports the organization’s bottom line: by being proactive and preventative in the management of incidents, and by working to ensure that small fire risks never become blazing infernos. To that end, senior leadership must not only invest in digital operations but also understand how the challenges of churn, incidents, and downtime are best combated. Every business is a digital business, to a greater or lesser extent, and must pay more than lip service to its digital operational needs if it is to survive and thrive.
Lee Fredricks is Director, Solutions Consulting, EMEA at PagerDuty.