Delivering resilience for IT operations in 2021
Enterprise operations leaders today are facing a challenge: Support the rapidly growing and evolving needs of the business without losing control of the complex infrastructure that is needed to do so.
In recent years, and especially in the accelerated digital transformation spurred on by the pandemic, it’s become common practice to increase productivity by siloing development, with multiple teams working autonomously to rapidly deploy code. In simpler times, in organizations running on a handful of applications, it was possible to operate according to a linear, predictable blueprint of development. The dev team was able to identify and debug code to keep their applications, and therefore the business, running smoothly.
However, over time, as each new piece of code is released, connected, or stacked onto existing code, the network evolves organically into a sprawling, interconnected architecture. Incidents that occur may not be tied to the code within an application, but to how multiple applications interact with each other in ways that the dev teams could not have foreseen. The breakneck speed of growth and change makes the network dynamic and unpredictable.
In many ways, this is a good thing, as it enables the enterprise to become agile and keep pace with business demands. This is the environment in which innovation can thrive. It’s also the environment where unpredictability and uncertainty lurk, so it’s imperative that operations leaders adapt their operational model to assure the health of the organization.
Tasking developers with identifying and fixing bugs in their code can be a time-consuming and costly endeavor when an incident turns into a brown-out or cascading failure.
If that’s not the answer, then how does an enterprise operations leader contain -- or even prevent -- these unpredictable interactions between applications before they threaten the health of the entire system?
The solution is observation and control to manage incidents and create resilience.
When an EMT arrives on the scene of an emergency call, she makes an on-the-spot assessment that includes measuring the patient’s vital signs. The priority is always first to stabilize the patient -- administer CPR, staunch the bleeding, set the bone. The search to determine the root cause of the patient’s state of distress is a secondary concern.
This same philosophy of stability first should be adopted by operations leaders responsible for the body of applications and the ultimate health of the enterprise that depends on it. The best way to accomplish this is to implement a solution that provides both observability and control.
Just as code fixes are no longer a sufficient means to manage complex cloud networks, the traditional solutions for monitoring applications do not provide full visibility over the interactions happening between them, or the means to control the incidents that arise from those interactions.
Operators should instead seek out a solution that goes beyond observability by providing highest-level visibility and a set of vital signs (redundancy, latency, concurrency, and bandwidth) that will alert them when something goes amiss. Again, the EMT’s priority is to apply pressure and dress a wound to prevent the injury from causing more serious harm to the patient. Similarly, the operator can install a solution that enables him to see a sudden spike in requests and exert classic backpressure to stabilize the network and prevent cascading failures that have ripple effects throughout the enterprise.
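To make the idea of backpressure concrete, here is a minimal sketch of one common form of it: capping the number of requests in flight and turning away the excess during a spike. It is an illustration only, not a description of any particular product. The class name `ConcurrencyLimiter`, its methods, and the limit of 3 are all hypothetical choices made for this example.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical admission gate: allow at most `max_in_flight` requests at once."""

    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self.in_flight = 0   # the "concurrency" vital sign
        self.rejected = 0    # how many requests were pushed back
        self._lock = threading.Lock()

    def try_acquire(self):
        """Admit a request if there is capacity; otherwise push back."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                self.rejected += 1
                return False
            self.in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished, freeing a slot."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1


# Simulate a sudden spike of five simultaneous requests against a limit of 3.
limiter = ConcurrencyLimiter(max_in_flight=3)
admitted = [limiter.try_acquire() for _ in range(5)]
print(admitted)           # [True, True, True, False, False]
print(limiter.rejected)   # 2

# Once one request completes, capacity is available again.
limiter.release()
print(limiter.try_acquire())  # True
```

The point of the sketch is the order of operations: the spike is contained first, at the boundary between services, and nobody has to find or fix the code that caused it before the system is stable again.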
By detecting a disruption in the network and responding in real time to exert control, the operator is able to easily maintain stability of the environment as a whole. No time or money is lost to outages while dev teams hunt for the root cause in the code.
As already large-scale, cloud-native environments grow more complex, the shift to an operational model that pairs high-level observability with the critical measure of control is going to become increasingly necessary to keep pace with business demands. It will become the key to operational excellence for a well-managed, resilient network.
Danial Faizullabhoy is the Chief Commercial Officer at Glasnostic. Prior to joining Glasnostic, Danial was the CEO of enterprise infrastructure solution Cypherpath and the President and CEO of BroadLogic. He has an extensive background in IT infrastructure, security, and engineering.