Logs, metrics and traces -- unlocking observability [Q&A]

Ensuring observability has always involved three pillars: logs, metrics and traces. In reality, however, most organizations simply store this information in silos that are incapable of communicating with one another.

Jeremy Burton, CEO of Observe, believes organizations need to go beyond the three-pillars approach of past, failed solutions and instead view observability as purely a data problem. We talked to him to learn more.

BN: Why do we need observability? Why have alternative approaches seemingly fallen short in reducing things like mean-time-to-resolution, in your opinion?

JB: Imagine for a second that you're blindfolded and dropped into an unknown location. When you remove the blindfold, you might see the Golden Gate Bridge to the west, the Bay Bridge to the east and the Transamerica Pyramid to the southeast. By correlating those three data points, you'd likely be able to determine that you're in the Marina District.

Of course, this example assumes you're familiar with San Francisco. Now, imagine that you've been dropped into an unfamiliar city and you only have a couple of data points. And those data points are low-res photos that you simply cannot correlate. The odds of determining your 'unknown' location are slim at best.

This second scenario is the perfect analogy for how most site reliability engineering (SRE) teams operate today. 'Unknown' problems -- things never seen before -- arise in production every day because modern applications are updated every day. SREs are using siloed tools -- with incomplete data and no ability to correlate -- in an attempt to find the root cause of unknown problems.

This is why, despite pouring $17 billion into tooling each year, organizations are still seeing a negligible impact on mean-time-to-resolution (MTTR). Observability promises to solve this problem, but to deliver on that promise a new approach is needed. First, the data cannot be siloed; it must be in one place. Second, we can't have big gaps in the data; it must be complete. Third, investigating unknown issues is about being able to quickly access related contextual information, so the data must be easy to navigate. Finally, we have to make it all easier to use. For a decade there has been a shortage of DevOps engineers, so we need to find more intuitive ways to access these insights.

BN: What business benefits can organizations expect if they embark on an observability strategy?

JB: Observability, when done right, promises to reduce the time taken to resolve incidents, thereby leading to a better customer experience and a better quality of life for internal DevOps, SRE and engineering teams. It should also lower costs -- despite the fact that modern applications generate an order of magnitude more telemetry than their predecessors.

BN: Can you give us a brief description of Observe and what makes its approach to observability different?

JB: Founded in 2017, Observe is the only company looking to solve the observability problem from the data on up. Working with brands such as Topgolf, Reveal and F5 Networks, Observe eliminates silos of logs, metrics and traces and instead stores all data in a central, low-cost data lake.

As data streams into the lake, Observe curates it into a graph of connected datasets -- such as containers, pods and customers -- called the Data Graph. Access to related contextual data is immediate, cutting troubleshooting times in half.
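To make the idea of a graph of connected datasets concrete, here is a minimal sketch in Python. The dataset names, fields and link structure below are hypothetical illustrations, not Observe's actual schema or product behavior.

```python
# Illustrative only: a toy "graph of connected datasets" in plain Python.
# Dataset names (pods, containers, customers) and link fields are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    rows: list[dict]
    # links maps a local field -> (target dataset name, target field)
    links: dict[str, tuple[str, str]] = field(default_factory=dict)

pods = Dataset(
    name="pods",
    rows=[{"pod_id": "p-1", "container_id": "c-9", "customer_id": "cust-42"}],
    links={"container_id": ("containers", "container_id"),
           "customer_id": ("customers", "customer_id")},
)
containers = Dataset("containers", [{"container_id": "c-9", "image": "api:1.4.2"}])
customers = Dataset("customers", [{"customer_id": "cust-42", "plan": "enterprise"}])

graph = {d.name: d for d in (pods, containers, customers)}

def related(dataset: Dataset, row: dict, link_field: str) -> list[dict]:
    """Follow a link from one dataset's row to the rows it points at."""
    target_name, target_field = dataset.links[link_field]
    target = graph[target_name]
    return [r for r in target.rows if r[target_field] == row[link_field]]

# Starting from a pod, hop to its container and its customer in one step each.
pod = pods.rows[0]
print(related(pods, pod, "container_id"))  # -> [{'container_id': 'c-9', 'image': 'api:1.4.2'}]
print(related(pods, pod, "customer_id"))   # -> [{'customer_id': 'cust-42', 'plan': 'enterprise'}]
```

The point of the sketch is simply that once datasets carry explicit links to one another, "show me the related context" becomes a one-hop lookup rather than a manual search across separate tools.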

Observe is capable of ingesting over a petabyte of data per day into a single customer's instance while providing a 'live' mode for interactive debugging. Observe also offers a range of generative AI features designed to make common tasks more accessible even for the most junior users.

BN: Why can't incumbent vendors adapt to meet the observability needs of modern organizations?

JB: Incumbent vendors view observability as being about deriving insights from logs, metrics and traces -- otherwise known as the 'three pillars.' Typically, they have pieced together point solutions via a mix of M&A and internal development. They tick the boxes but, because their data is siloed under the covers, they are unable to correlate across those pillars -- for example, navigating from a spike in a metric to the associated spans and then on to the relevant logs.

In addition, these offerings were never architected for modern, microservice-based applications or the data volumes they generate. They employ clunky methods of moving data in and out of their tools, and customers are hit with overage charges when they least expect them.

BN: Modern applications and infrastructure seem to generate a lot of telemetry data -- estimates say it grows 40 percent per year -- how can this level of data growth be handled without blowing the budget?

JB: A new architecture is needed to handle these data volumes, and it should have two characteristics:

  • A separation of storage and compute. Data must be able to be ingested quickly and cost-effectively -- ideally at the cost of blob storage such as AWS S3. Queries, which use compute, should be capable of being run (and billed!) on demand and should scale elastically with query complexity.
  • A query engine that does not require an index. Older databases need indexes to deliver performance, and indexes are expensive to build and maintain, particularly at large scale. Modern query engines are scan-based and efficiently organize and prune data. At scale they are much more cost-efficient yet still deliver the required performance.

These two capabilities combined can reduce overall observability costs by 2-10x; a rough sketch of the pruning idea follows below.
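As an illustration of pruning without an index, here is a minimal Python sketch. The file layout, manifest format and object keys are assumptions made for the example, not a description of Observe's or any particular engine's internals.

```python
# Illustrative only: how a scan-based engine can skip data without an index.
# The manifest format and file paths are made up for this example.

from dataclasses import dataclass

@dataclass
class DataFile:
    path: str     # object key in blob storage, e.g. an S3 key
    min_ts: int   # smallest event timestamp in the file (epoch seconds)
    max_ts: int   # largest event timestamp in the file

# A manifest of files already sitting in cheap object storage.
manifest = [
    DataFile("s3://telemetry/2024/05/01/part-000.parquet", 1714521600, 1714525200),
    DataFile("s3://telemetry/2024/05/01/part-001.parquet", 1714525200, 1714528800),
    DataFile("s3://telemetry/2024/05/02/part-000.parquet", 1714608000, 1714611600),
]

def files_to_scan(files: list[DataFile], start: int, end: int) -> list[DataFile]:
    """Keep only files whose [min_ts, max_ts] range overlaps the query window.

    No index is consulted; pruning relies purely on cheap per-file metadata,
    so compute is only spent scanning files that could contain matching events.
    """
    return [f for f in files if f.max_ts >= start and f.min_ts <= end]

# Query a short window: only the second file overlaps, the rest are skipped.
print([f.path for f in files_to_scan(manifest, 1714526000, 1714527000)])
```

The same idea extends to pruning on other lightweight metadata (service name, severity, and so on), which is what lets a scan-based engine stay fast as data volumes grow.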

BN: Why do so many observability vendors refer to three pillars (logs, metrics, traces) and how does Observe handle them?

JB: Vendors know that when troubleshooting 'unknown' problems, it's no longer enough to have a one-dimensional view (i.e. just the metrics or just the logs). A decade ago, most organizations that bought APM tools also had a logging tool so they could investigate why they had a problem.

This might have been acceptable then, but only because the rate of change of those applications was slow, with perhaps a few well-tested releases a year. Today, applications change every day and, whether you like it or not, we test in production. It is now critical to be able to seamlessly troubleshoot unknown issues and get to the root cause before ruining a customer's experience or, worse, damaging the brand.

Here at Observe, we took a very different approach. From day one, Observe was architected to ingest all types of event data -- logs, metrics and traces -- into a central, low-cost data lake. Observe then curates the event data to make it fast to query and easy to navigate. When users can see how all the components in their application are connected -- and can easily navigate between them -- it becomes much easier to investigate 'unknown' problems.
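To show why keeping all three types of telemetry in one place helps, here is a minimal sketch of a shared event envelope. The field names are hypothetical and chosen for illustration; they are not Observe's actual schema.

```python
# Illustrative only: normalizing logs, metrics and traces into one common
# event shape so they can live in a single store and be queried together.

def to_event(kind: str, timestamp: float, attributes: dict, body=None) -> dict:
    """Wrap any piece of telemetry in one shared envelope."""
    return {"kind": kind, "timestamp": timestamp, "attributes": attributes, "body": body}

events = [
    # A log line.
    to_event("log", 1714526001.2, {"pod": "p-1", "level": "ERROR"},
             body="connection refused to payments-db"),
    # A metric data point.
    to_event("metric", 1714526001.5, {"pod": "p-1", "name": "http_5xx_total"}, body=7),
    # A trace span.
    to_event("span", 1714526001.1,
             {"pod": "p-1", "trace_id": "abc123", "duration_ms": 950},
             body="POST /checkout"),
]

# Because everything shares one envelope, a single query can pull the related
# context for an incident window across all three 'pillars' at once.
window = [e for e in events
          if 1714526000 <= e["timestamp"] <= 1714526002
          and e["attributes"].get("pod") == "p-1"]
for e in window:
    print(e["kind"], e["body"])
```

In this toy form, correlating a metric spike with the spans and logs from the same pod and time window is just one filter over one collection, rather than three separate searches in three separate tools.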

