AIOps of the future: Building confidence in your brand
Technology dominates just about every sphere of modern-day society. If you are like most people, you see it in your everyday life. We increasingly buy online, with U.S. retail e-commerce sales now totaling $768 billion. Likewise, we increasingly work online, with 58 percent of Americans, or 92 million people, now telecommuting at least once a week.
For the most part, online consumers and remote workers take the technology behind their personal and professional activities for granted. We need groceries, so we open a grocery app, fill our virtual carts, check out and -- voilà -- the order is at our door in just a few hours. We apply the same expectations to remote work tools and, well, just about every technology we encounter throughout the day. It should just… work.
So what happens when technology doesn’t work? Well, look at Facebook and its companies WhatsApp and Instagram. In October 2021, all of these services went down and stayed offline for (cringe!) six hours. Users took to the internet to roast the company, productivity in offices crashed and revenue plummeted an estimated $99.75 million.
Although the scale of damages varies, outages like Facebook’s affect companies every day, all day long, with serious consequences. Poorly performing technology hurts sales and, sometimes worse, tarnishes the brand and stunts long-term growth.
While customers might not have turned away from a Goliath like Facebook, consider what you’d do if your usual grocery app went down for six hours. You’d likely switch to another one of the numerous available grocery apps. And, you’d probably open that new app for future purchases.
Consequently, companies are under immense pressure to deliver continuous service assurance. That’s easier said than done though. Incidents are inevitable in our digital world. To remain competitive, companies must continuously deliver bigger and better technologies and implement them into complex, ephemeral and often fragile IT environments.
Knowing they walk a fine line between achieving faster continuous development and maintaining availability of their apps and services, companies increasingly turn to artificial intelligence for IT Operations (AIOps) to help detect incidents and outages. But, as we’ll see, digital-first, cloud-reliant companies need advanced AIOps that moves beyond just incident detection.
Why events-based AIOps is insufficient
First, let’s look at how AIOps technology came to be: As modern microservices and ephemeral architectures emerged, manual monitoring and traditional systems management tools were no longer sufficient. The sheer amount of data overwhelmed these old ways of monitoring, and dynamic infrastructures rendered rigid, rules-based tools woefully inadequate for responding to more modern, unforeseen incidents.
Enter AIOps. AIOps applied machine learning (ML) and other related technology to event data, setting a baseline for system performance and then observing fluctuation in this performance. If the AIOps tool found anomalies outside of normal operating behavior, it would notify the appropriate team members and provide the context necessary to resolve the disruption.
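The baseline-and-deviation idea described above can be illustrated with a minimal sketch. It is not how any particular AIOps product works: a trailing mean and standard deviation stand in for the ML model, and the function name, window size, threshold and sample latencies are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing baseline.

    The baseline is the previous `window` samples. A sample is flagged when
    it lies more than `threshold` standard deviations from the baseline mean.
    (Hypothetical helper; the window and threshold values are assumptions.)
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative response times in milliseconds, with one obvious spike.
latency_ms = [101, 99, 100, 102, 98, 100, 101, 250, 100, 99]
spikes = find_anomalies(latency_ms)
print(spikes)
```

In this sketch only the spike itself is flagged. The samples that follow it are compared against a baseline that now contains the spike, which is one reason real tools tend to use more robust baselines than a plain trailing mean.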
Although these AIOps tools were a vast improvement over previous monitoring methods, there was a problem: The tools, which unfortunately are still in use today, only told DevOps practitioners and SREs when an incident had already occurred. In other words, they were reactive, and end users likely were already experiencing downtime.
In fact, my company recently released the Moogsoft State of Availability Report, which finds that 45 percent of customers alert companies to a problem before their tools do.
That’s obviously not an optimal user experience and does not bode well for customer confidence. Indeed, your users might have found a new grocery app by the time your team even realizes there’s an issue.
Companies can no longer wait for tools to detect incidents that already impact the end user. They must adopt modern AIOps solutions that analyze more data, provide early incident detection and drastically shorten the mean time to detection (MTTD).
Modern AIOps tools detect incidents early
Modern companies need to know something is wrong before their customers do. They need to anticipate outages. That’s why next-generation AIOps solutions add traces, metrics and logs to event data. By ingesting data from across the IT ecosystem, these tools can detect anomalies early in the incident lifecycle.
Yes, the unusual operation of an app or service could be a fluke. But usually not. Modern AIOps tools uncover the system anomalies that often become performance issues or outages. As the solutions’ algorithms automatically find anomalies, they also de-duplicate alerts to eliminate noise and non-incidents.
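To make the de-duplication step concrete, here is a small sketch under stated assumptions: alerts are plain dictionaries, two alerts count as duplicates when they share the same service and check, and repeats within five minutes of the last occurrence are folded into one record. The field names and the window are hypothetical, not taken from any specific tool.

```python
def deduplicate(alerts, window_seconds=300):
    """Collapse repeated alerts for the same (service, check) pair.

    An alert that arrives within `window_seconds` of the last occurrence of
    the same key is counted against the existing record instead of producing
    a new one. (Hypothetical schema: "service", "check", "ts" in seconds.)
    """
    latest = {}   # key -> most recent deduplicated record for that key
    result = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["check"])
        record = latest.get(key)
        if record is not None and alert["ts"] - record["last_seen"] <= window_seconds:
            record["count"] += 1
            record["last_seen"] = alert["ts"]
        else:
            record = {
                "service": alert["service"],
                "check": alert["check"],
                "first_seen": alert["ts"],
                "last_seen": alert["ts"],
                "count": 1,
            }
            latest[key] = record
            result.append(record)
    return result


raw_alerts = [
    {"service": "checkout", "check": "latency", "ts": 0},
    {"service": "checkout", "check": "latency", "ts": 60},
    {"service": "search", "check": "errors", "ts": 90},
    {"service": "checkout", "check": "latency", "ts": 120},
    {"service": "checkout", "check": "latency", "ts": 1000},
]
deduped = deduplicate(raw_alerts)
for record in deduped:
    print(record)
```

Five raw alerts become three records here: the first three checkout alerts merge into one, while the one at 1000 seconds falls outside the window and opens a new record.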
Of course, just like their predecessors, these tools don’t stop at detection. They also correlate alerts, connecting time-series metrics and events with service details, to find the incident’s probable root cause. And they automate the entire incident response, routing, remediating and auto-closing incidents -- all in one place. Bringing this data together is becoming mission-critical for DevOps and SRE teams, which manage and maintain an average of 16 monitoring tools and, in some cases, up to 40.
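One simple way to picture the correlation step is to combine alerts with service details such as a dependency map. The sketch below is a deliberately naive heuristic, not a description of any vendor's algorithm: among the services currently alerting, it treats as probable root causes those whose direct dependencies are not alerting themselves. The service names and the `depends_on` map are invented for illustration.

```python
def probable_root_causes(alerting, depends_on):
    """Return alerting services with no alerting direct dependency.

    `depends_on` maps a service to the services it calls. Only direct
    dependencies are checked, which is a simplifying assumption.
    """
    alerting = set(alerting)
    return sorted(
        service
        for service in alerting
        if not (set(depends_on.get(service, ())) & alerting)
    )


# Hypothetical topology: web -> checkout -> payments-db
depends_on = {
    "web": ["checkout"],
    "checkout": ["payments-db"],
    "search": [],
}
candidates = probable_root_causes(["web", "checkout", "payments-db"], depends_on)
print(candidates)
```

Here the alerts on `web` and `checkout` are explained by the alert on `payments-db`, so only the database is surfaced as a candidate.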
What are the outcomes?
Because these tools automatically find pertinent anomalies, DevOps practitioners and SREs can decrease their MTTD, reacting to incidents early rather than waiting to respond until they impede the customers’ experience. In fact, it’s not uncommon for teams to see time-to-detect shorten by as much as 50 minutes. And teams can resolve with confidence. Because advanced AIOps provides probable root cause and takes the guesswork out of the incident lifecycle, teams naturally achieve faster mean time to recover (MTTR).
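For readers who want to track these two metrics themselves, the arithmetic is simple. The sketch below assumes MTTD is the average time from the start of an incident to its detection and MTTR is the average time from the start to recovery; teams define these slightly differently, and the `Incident` fields and sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    started: float    # minutes since some reference point
    detected: float
    resolved: float


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recover: average of (resolved - started)."""
    return mean(i.resolved - i.started for i in incidents)


# Illustrative incident log, all times in minutes.
incidents = [
    Incident(started=0.0, detected=12.0, resolved=45.0),
    Incident(started=0.0, detected=8.0, resolved=30.0),
    Incident(started=0.0, detected=10.0, resolved=60.0),
]
print(f"MTTD: {mttd(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")
```

Because recovery is measured from the start of the incident in this sketch, every minute shaved off detection also comes off MTTR.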
A seamless user experience isn’t the only upshot of modern AIOps. By reducing unplanned work and automating toil, the tool gives time back to DevOps practitioners and SRE teams. (This is much needed, considering most teams spend far more time on monitoring than any other activity.) As a result, they further improve the customer experience by building innovations like new features and platform improvements.
As our digital economy grows, so too does the number of consumers and workers relying on technology -- and so do expectations for seamless user experiences. Modern companies hoping to stay ahead of consumer sentiment while keeping pace with the change and complexity of modern IT environments don’t just need AIOps. They need advanced, intelligent AIOps that supports continuous availability by detecting problems before they scare off customers.
Image credit: Momius/depositphotos.com
Richard Whitehead is Chief Technology Officer for Moogsoft and brings a keen sense of what is required to build transformational solutions. He’s a DevOps Institute Ambassador and serves on the DevNetwork AI/ML Advisory Board. A former CTO and Technology VP, Richard brought new technologies to market and was responsible for strategy, partnerships and product research.