Why real-time analysis is key to making better use of data [Q&A]
Businesses of all types generate ever larger quantities of data, but while this should be an invaluable resource to drive decision making, the sheer volume can create difficulties.
Analyzing data in real time is the ideal, but it can be surprisingly hard to achieve. We talked to Ariel Assaraf, CEO of data streaming specialist Coralogix, to find out how enterprises can face the challenges posed by real-time analysis.
BN: What challenges do companies face today when analyzing data in real time?
AA: First of all, most companies are not actually analyzing data in real time. For example, traditional logging solutions rely on indexing to analyze the data and provide insights. In that case, we're already looking at some latency before the data flows through the analysis pipelines and scheduled queries can be run.
On top of that, these solutions are notoriously expensive because the data needs to be stored in order to extract insights. With this approach, all data is essentially treated the same and is paid for at a base rate. The cost of analysis then grows proportionally to the data itself. For modern applications, exponential data growth means that the cost to analyze it is outpacing revenue.
These challenges and limitations of real-time analysis mean companies are generally forced to cherry-pick which data to analyze and this, of course, leads to coverage gaps. This makes it more difficult to identify issues, especially unknown issues. When an issue is identified, it's not uncommon that the data needed to troubleshoot is not available or is missing context.
In addition, processing data in real time is not enough on its own, as it misses the value of long-term analysis, which requires data state. This is particularly true for modern applications, which can see significant spikes and fluctuations over time.
What we really need to overcome these challenges is to combine real-time data analysis with stateful analytics and decouple that from storage so we can reduce costs and improve performance.
BN: What are the pros and cons of analyzing data in real time versus storing large amounts of data in, say, data lakes?
AA: Teams need the ability to monitor their data at a granular level. This involves both performance trends over time and real-time alerts for quick issue remediation.
Data lakes are great for long-term ad hoc queries, but they aren't good at handling the high query concurrency that alerting and data enrichment use cases need. Unlike data warehouses, data lakes can ingest both structured and unstructured data, which makes achieving low-latency analytics a challenge.
For the most part, companies that store data in a data lake employ subject matter experts who are able to extract insights when needed. Still, compared to traditional analytics solutions that index and store data, this may be a more economical option.
Real-time analytics are crucial for a proactive incident response and immediate issue resolution, but without data trends, real-time analysis can only take us so far.
For example, it's great to know that at this moment I have X latency between data calls, but knowing that it's more than doubled over the past six months adds much more value and context (although that is a more expensive query).
The best way to achieve high-performance system monitoring is to combine real-time analytics with state transformations which allow us to track data trends over time. With effective anomaly detection, we can immediately alert when system behavior changes. This can dramatically reduce the time it takes to identify and resolve issues.
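To make the idea of pairing real-time analysis with state concrete, here is a minimal Python sketch. It is an illustration only, not Coralogix's implementation: the names `RunningStats` and `detect_anomalies`, the z-score rule, and the `threshold` and `warmup` values are all assumptions chosen for the example. It keeps a small running summary of the stream (using Welford's online algorithm) instead of storing raw events, and flags readings that deviate sharply from the baseline seen so far.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance: constant-size state, no raw events kept."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; defined as 0.0 until two readings exist.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def detect_anomalies(latencies_ms, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the baseline seen so far.

    threshold and warmup are illustrative defaults, not recommended settings.
    """
    stats = RunningStats()
    for i, value in enumerate(latencies_ms):
        if stats.count >= warmup and stats.stddev > 0:
            z_score = abs(value - stats.mean) / stats.stddev
            if z_score > threshold:
                yield i, value
        # Every reading, including flagged ones, updates the baseline.
        stats.update(value)


if __name__ == "__main__":
    stream = [100, 102, 98, 101, 99, 100, 103, 250, 101]
    for index, latency in detect_anomalies(stream):
        print(f"reading {index}: {latency} ms deviates from the running baseline")
```

The point of the sketch is that the state is a handful of numbers per metric, so trend-aware alerting does not require indexing or storing every event first.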
BN: What kind of applications and use cases require real-time data? How has the need for expediency changed in the last few years?
AA: We mostly see the need for real-time data in modern internet or cloud-native software companies, where every minute counts and thousands of customers notice every small lag or incident.
Having said that, these days, companies in just about every industry are becoming software companies. Traditional industries like insurance and finance, for example, are breaking into tech in a big way.
With the number of people depending on these companies and the implications if something goes wrong in their systems, there is a significant emphasis on reducing time to identify and resolve issues, as well as removing or reducing latency across all processes. Both of these goals depend on real-time data.
BN: How can companies ensure they are able to analyze data in a timely and cost-effective way? What are some best practices?
AA: A lot of companies are now working on new approaches that solve the issue of exponential data growth. The approach that exists in the market today is to use storage tiers. But then you have to compromise the quality and the speed of analytics. In that case, you need to decide where you want to store the data before you really know what insights it might contain.
What we're doing at Coralogix is using Streama, our streaming analytics engine, to ingest and analyze everything in real time, including even the most stateful transformations and analytics. Then only data that is frequently searched is sent to hot storage and the rest can be sent to archive. Data in the archive can still be queried at any time with relatively low latency. We're able to return query results from the archive in about one minute.
This essentially decouples data analysis from storage, which both improves performance and is more cost-effective.
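The "analyze first, then decide where to store" pattern described in this answer can be sketched in a few lines of Python. This is a hypothetical illustration and does not reflect how Streama works internally: the `TierRouter` class, the `hot_threshold` parameter, and the `source` field on each event are all assumptions made for the example.

```python
from collections import Counter


class TierRouter:
    """Hypothetical router: analyze every event first, then choose a storage tier."""

    def __init__(self, hot_threshold: int = 3):
        self.hot_threshold = hot_threshold  # illustrative cut-off, not a real setting
        self.search_counts = Counter()      # how often each source has been searched
        self.events_seen = Counter()        # streaming state, updated for every event

    def record_search(self, source: str) -> None:
        """Note that a user queried data from this source."""
        self.search_counts[source] += 1

    def route(self, event: dict) -> str:
        """Update the streaming state, then return 'hot' or 'archive'."""
        source = event["source"]
        self.events_seen[source] += 1  # analysis happens regardless of the tier
        if self.search_counts[source] >= self.hot_threshold:
            return "hot"
        return "archive"


if __name__ == "__main__":
    router = TierRouter(hot_threshold=2)
    router.record_search("checkout")
    router.record_search("checkout")
    print(router.route({"source": "checkout", "msg": "payment ok"}))
    print(router.route({"source": "debug", "msg": "cache miss"}))
```

Because the counters are updated before the tier is chosen, the analysis covers every event even when most of the raw data ends up in the cheaper archive tier.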