The shift from on-call engineers to agentic incident management


Every engineering team I’ve been a part of has had a 10x engineer. They were active contributors to design reviews -- developing a deep, intuitive understanding of the product and every new feature. Beyond writing code, they reviewed every pull request, tracked every change across the product, and kept a mental map of how all the pieces fit together. They were in the right Slack channels, constantly evaluating process or infrastructure changes to understand how their team might be impacted.
They built operational dashboards, and spent the first 15 minutes of their day scanning key metrics to learn what “normal” looked like so they could spot anomalies instantly. They knew their upstream and downstream dependencies, tracked bugs and releases, and stayed up to date on the tools and platforms their team was built on. All of this context led to one inevitable, risky outcome: whenever something broke, they were the only one who knew where to look and how to fix it.
They became the go-to person for everything. And soon, they were on call 24/7. Eventually, these engineers either got promoted into leadership or they burned out and left altogether, taking all of that critical context with them.
The team would struggle, only for someone else to step up in their place -- and the cycle would repeat.
This is a pattern engineering leaders have accepted for years. But when you take a step back, you have to ask: does it really have to be this way?
What if every engineer had access to a system that functioned like a 10x engineer on-call, available at all times?
This isn’t a new problem. Previous attempts to solve it were limited by the underlying technology -- relying on brittle, hand-scripted responses to known failure modes. These systems required predefined branches, hardcoded thresholds, and assumptions about what could go wrong. But real production outages are rarely caused by what you expect. They’re caused by unknown unknowns.
That’s where the latest generation of LLMs has made significant strides. Their ability to reason across large amounts of unstructured data -- to plan, write and critique code, interpret logs, even understand raw inputs like images and video -- finally brings the cognition layer needed to address this challenge head-on.
So, are we there yet?
We’re getting closer, but we’re not quite at the point where every company can benefit from this approach at scale. From experience, this isn’t a problem that can be solved in isolation. To make real progress, there are a few key shifts that will need to happen first:
- Tech stacks will require much more standardization than they have today:
The way companies build and deploy software (even within the same org) can look radically different. But we’re starting to see patterns emerge: centralized platform teams curating more coherent developer experiences, and fragmented layers like front-end stacks beginning to converge around common standards. These foundational shifts will be critical in enabling AI systems to reason more effectively across environments.
- What we define as an incident will need to change:
Most incident response today is reactive by design. Engineers instrument systems, then handpick a few signals to trigger alerts. But the real user-impacting issues often aren’t flagged until something breaks. The next wave of observability will come from agentic platforms that can differentiate between a harmless CPU spike and a meaningful error caused by a production code change. For example, we’ve already seen success with systems that flag newly modified files with suggested downstream checks (see the sketch after this list). This is critical for catching issues before they cascade.
- The mindset shift will be just as critical as the technology shift:
In many orgs, the current state of ops is tolerated as “good enough.” Solving this problem requires investment in parts of the stack that don’t always feel urgent -- until something goes wrong. Scaled B2C companies like Netflix are ahead here because service quality is the product. Imagine a streaming service going down during the Super Bowl -- it’s unthinkable, and that urgency drives a fundamentally different approach to reliability and tooling. The same needs to happen in B2B environments.
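To make the second point concrete, here is a minimal sketch of the kind of change-correlation an agentic platform might run before paging anyone. It is an illustration under assumed inputs, not any particular product's implementation: the Alert, Change, and triage names are hypothetical, and a real system would pull changes from the deploy pipeline and signals from the observability stack rather than from in-memory objects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    service: str
    signal: str          # e.g. "cpu" or "error_rate"
    value: float
    observed_at: datetime

@dataclass
class Change:
    service: str
    files: list[str]
    deployed_at: datetime

def triage(alert: Alert, recent_changes: list[Change],
           window: timedelta = timedelta(hours=2)) -> dict:
    """Classify an alert by checking whether it coincides with a recent code change.

    A CPU spike with no nearby deploy is treated as noise to observe; an
    error-rate alert that lands shortly after a deploy to the same service is
    escalated along with the changed files and suggested downstream checks.
    """
    # Changes to the same service that landed shortly before the alert fired.
    related = [
        c for c in recent_changes
        if c.service == alert.service
        and timedelta(0) <= alert.observed_at - c.deployed_at <= window
    ]

    if alert.signal == "error_rate" and related:
        suspect_files = sorted({f for c in related for f in c.files})
        return {
            "severity": "escalate",
            "suspect_files": suspect_files,
            # Hypothetical downstream checks an agent might suggest.
            "suggested_checks": [
                f"re-run integration tests touching {f}" for f in suspect_files
            ] + ["compare error signatures before and after the deploy"],
        }

    if alert.signal == "cpu" and not related:
        return {"severity": "observe", "reason": "spike with no correlated change"}

    return {"severity": "review", "reason": "needs human or agent follow-up"}

# Example: an error-rate alert 20 minutes after a deploy to the same service.
now = datetime.now()
alert = Alert("checkout", "error_rate", 0.12, now)
changes = [Change("checkout", ["payments/retry.py"], now - timedelta(minutes=20))]
print(triage(alert, changes)["severity"])  # escalate
```

The point of the sketch is the ordering: correlate the signal with what actually changed first, then decide whether it is an incident, rather than alerting on raw thresholds alone.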
Rethinking Reliability in the Age of AI
We’re at a real inflection point. AI tools are finally starting to match the complexity of modern systems, but getting value from them takes more than dropping in an LLM and hoping for the best.
With the right structure and tooling, you can start to externalize the context that usually lives in one engineer’s head. This change will let more people operate with the same level of visibility and effectiveness, and the person who used to be on call 24/7 will no longer be the single point of failure or perpetually on the edge of burnout.

Anand Sainath is Head of Engineering and Co-founder, 100x, the leading AI solution for software troubleshooting. He leads the development and scaling of 100x's AI agent, which analyzes tickets, alerts, logs, metrics, traces, code, and knowledge to identify and remediate production issues. Previously, Anand held engineering leadership roles at Moveworks and Tableau, where he played a key role in scaling operations and driving innovation.