Building next-gen operational intelligence at scale
In today’s digital era, operational visibility is a prerequisite for businesses across sectors such as manufacturing, transportation and retail. That visibility, however, depends on a massive influx of rapid, real-time data, and managing it can be challenging -- especially for organizations that don’t have the right infrastructure in place.
This data generally takes the form of events such as clicks, telemetry, logs and metrics, often collected as time series or machine data. Unlike transactional data, which is typically ingested in batches, event data is collected via real-time streaming.
The advent of streaming data has also revolutionized many industries and job functions. For instance, Site Reliability Engineers (SREs) can now gain unprecedented visibility into the root causes of issues, simplifying their efforts to ensure seamless backend operations and smoother user experiences.
Automation also becomes both easier to implement and more precise. In areas where humans cannot respond fast enough, teams can create triggers that execute in response to specific events. This could take the form of an algorithm selling stock when its price rises above a certain threshold, or a machine learning model automatically replenishing retail inventory when it falls below a specific level.
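As a minimal sketch of what such a trigger might look like, the Python snippet below checks each incoming price event against a threshold and fires a placeholder action. The event shape, the sell_stock function and the threshold value are all hypothetical, and a real deployment would consume events from a streaming platform rather than an in-memory list.

```python
# Minimal sketch of an event-driven trigger (hypothetical names throughout).
PRICE_THRESHOLD = 150.0  # sell when the price rises above this level


def sell_stock(symbol: str, price: float) -> None:
    """Placeholder for the real action, e.g. a call to a trading API."""
    print(f"SELL {symbol} at {price:.2f}")


def handle_event(event: dict) -> None:
    """Inspect a single price event and trigger the action if needed."""
    if event["price"] > PRICE_THRESHOLD:
        sell_stock(event["symbol"], event["price"])


if __name__ == "__main__":
    # Stand-in for a live stream of price events.
    events = [
        {"symbol": "ACME", "price": 148.2},
        {"symbol": "ACME", "price": 151.7},  # crosses the threshold
    ]
    for event in events:
        handle_event(event)
```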
Analysts and executives are also empowered with more current (and more granular) data to better understand the effectiveness of product launches, market strategies and feature updates. They can make decisions more rapidly, enabling their organization to stay ahead of the competition.
Prerequisites
To leverage the full potential of real-time event data, teams need a database built for the job. While streaming data has been around for almost a decade, few databases today are optimized for storing, querying or analyzing this kind of fast-moving data at scale.
The first requirement is a real-time pipeline. Traditionally, data was collected after the fact in batches: it came in, was cleaned up and was organized into files before finally being ingested into a database. However, batch processing cannot keep up with the speed or dynamism of real-time data, so streaming data risks going stale and losing any time-sensitive insights it may have had.
The answer lies in solutions like Apache Kafka or Amazon Kinesis. Engineered for high throughput and minimal latency, these platforms can reliably handle millions of events per second.
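To make this concrete, here is a minimal sketch of producing events into such a pipeline using the kafka-python client. The broker address, topic name and event fields are illustrative assumptions rather than part of any specific deployment.

```python
# Minimal Kafka producer sketch using the kafka-python client.
# Broker address, topic name and event fields are illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "user_id": "u-123",
    "action": "click",
    "timestamp": int(time.time() * 1000),
}

# Send one event to a hypothetical topic; real pipelines emit these
# continuously at very high rates.
producer.send("clickstream-events", value=event)
producer.flush()
```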
With streaming data, scale and concurrency are two key challenges. Streams can involve massive quantities -- up to billions of events per hour and perhaps even trillions per week or month. At the same time, analytics applications must accommodate hundreds of users generating thousands of queries per second, especially in a crisis like an application outage.
In some situations, teams may also need to analyze historical and recent data side by side, especially for outlier and anomaly identification. Most databases don’t have this feature, and instead teams must switch between a transactional database and a cloud data warehouse -- a cumbersome dance that risks delays when an organization can least afford it.
Today, the two most common types of databases are analytical (OLAP) and transactional (OLTP). Yet each has limitations -- OLTP databases excel in high-performance access but falter in analytical ability and scaling, whereas OLAP databases provide powerful analytics but fail when it comes to speed or supporting many users at once.
Databases also must be future-proof to avoid expensive migrations, which can be some of the hardest projects any team can face. Migrations often arise from a fundamental mismatch between data architectures and operational requirements (which often change, especially as businesses grow their customer bases). A team may have chosen a platform that isn’t as scalable as they need, or perhaps they now need to run rapid, real-time analytics on their streaming data but their current database doesn’t support these operations.
Solution
To address these challenges, a database needs to ingest, process and analyze massive volumes of real-time data, all while supporting many users and a high rate of queries per second.
Apache Druid is one platform that handles all of these needs. Built for speed, scale and streaming data, Druid can seamlessly ingest data from streams and make it available for immediate querying -- no more waiting on batch jobs or pre-aggregating data in advance. Druid also enables a huge range of capabilities, from interactive visualizations and dashboards to real-time analytics.
Druid includes features designed specifically for the challenges of streaming event data. One capability is query on arrival. While many databases must ingest and persist events before they can be queried, Druid ingests each streaming event directly into memory on data nodes, where users can then immediately query the data. All of this data is subsequently processed and persisted into deep storage for added resilience.
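As an illustration of querying data moments after it arrives, the sketch below runs a SQL query against Druid's SQL API for events from the last five minutes. The router URL (default port assumed) and the datasource name are hypothetical.

```python
# Sketch: querying the last five minutes of streaming data via Druid's SQL API.
# The router URL and datasource name ("clickstream-events") are assumptions.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # default router port assumed

query = """
SELECT action, COUNT(*) AS events
FROM "clickstream-events"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
GROUP BY action
ORDER BY events DESC
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()

# The default result format is an array of JSON objects, one per row.
for row in response.json():
    print(row["action"], row["events"])
```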
Druid also guarantees consistency -- critical in data streams where millions or billions of daily events risk duplication. Druid’s exactly-once ingestion removes the need for developers to write their own deduplication code.
Druid can also keep pace with the velocity and volume of streaming data. It separates querying, data management and other functions into different nodes and automatically rebalances data as nodes scale up and down. This ensures reliability, removes the need for downtime during upgrades and maintenance, and enables confidence in working with large quantities of data.
Customer story: Confluent Cloud
Founded by the creators of Apache Kafka, Confluent offers fully managed, Kafka-based solutions. Confluent leverages Druid as the foundation for several vital services that provide operational visibility, including Confluent Health+ for alerts, Confluent Cloud Metrics API for external-facing metrics, and Confluent Stream Lineage for mapping data relationships and exploring event streams.
Before Druid, Confluent utilized a NoSQL database. However, as their data grew, they found that their existing database could not scale, accommodate high-cardinality metrics and time series data, or maintain subsecond response times in the face of increased users and query volumes. Not only was Druid capable of fulfilling all these requirements, but it also included a native Kafka integration, removing the need for extra workarounds.
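To give a flavor of that native integration, the sketch below submits a simplified Kafka ingestion supervisor spec to Druid. The endpoint, topic, datasource, broker address and field names are illustrative assumptions, and a production spec would carry far more tuning detail.

```python
# Sketch: submitting a simplified Kafka ingestion supervisor spec to Druid.
# Topic, datasource, broker address and field names are illustrative assumptions.
import requests

# Assumes the router (default port) proxies indexer API calls to the Overlord.
SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream-events",
            "timestampSpec": {"column": "timestamp", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["user_id", "action"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Druid's Kafka indexing service tracks offsets alongside segments,
# which is how it provides the exactly-once ingestion described above.
response = requests.post(SUPERVISOR_URL, json=supervisor_spec)
response.raise_for_status()
print(response.json())  # returns the supervisor id on success
```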
Today, Confluent operates 100+ Historical nodes to handle batch data and over 70 MiddleManager nodes for ingesting and querying real-time data, all deployed on Amazon Elastic Kubernetes Service (EKS) clusters. Each second, this environment ingests more than three million events and responds to 250+ queries. In addition, Confluent keeps seven days of data in their Historical nodes for querying and retains data for up to two years in Amazon S3 storage.
William To is senior product evangelist at Imply.