Change data capture: The critical link for Airbnb, Netflix and Uber
The modern data stack (MDS) is foundational for digital disruptors. Consider Netflix. Netflix pioneered a new business model around video as a service, but much of their success is built upon real-time streaming data.
They’re using analytics to push highly relevant recommendations to their viewers. They’re monitoring real-time data to maintain constant visibility into network performance. They’re synchronizing their database of movies and shows with Elasticsearch to enable users to quickly and easily find what they’re looking for.
This has to be in real time, and it has to be 100 percent accurate. Old-school extract, transform, load (ETL) is simply too slow. To fill this need, Netflix built a change data capture (CDC) tool called DBLog that captures changes in MySQL, Postgres and other data sources, then streams those changes to target data stores for search and analytics.
Netflix required high availability and real-time synchronization. They also needed to minimize the impact on operational databases. CDC reads the database’s transaction log and replicates changes to target databases in the order in which they occur, so it captures changes as they happen, without locking records or otherwise bogging down the source database.
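To make the log-based idea concrete, here is a minimal sketch in Python. It is not DBLog or any vendor's implementation. The `ChangeEvent` type, the `apply_changes` function and the in-memory `search_index` are hypothetical names invented for illustration, and a plain list stands in for the database log. The point it shows is that a consumer replays events in log order and remembers how far it has read, so the source tables are never queried or locked.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a (hypothetical) change log."""
    lsn: int                              # log position, strictly increasing
    op: str                               # "insert", "update" or "delete"
    key: str                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(log: List[ChangeEvent],
                  target: Dict[str, Dict[str, Any]],
                  last_lsn: int = 0) -> int:
    """Replay events newer than last_lsn onto target, in log order.

    Returns the new checkpoint so the next run can resume from it.
    """
    for event in sorted(log, key=lambda e: e.lsn):
        if event.lsn <= last_lsn:
            continue  # already applied on an earlier run
        if event.op == "delete":
            target.pop(event.key, None)
        else:  # insert and update both write the latest row image
            target[event.key] = dict(event.row or {})
        last_lsn = event.lsn
    return last_lsn


# A list standing in for the source database's log (assumed data).
log = [
    ChangeEvent(1, "insert", "title:1", {"name": "Show A", "year": 2020}),
    ChangeEvent(2, "update", "title:1", {"name": "Show A", "year": 2021}),
    ChangeEvent(3, "insert", "title:2", {"name": "Show B", "year": 2019}),
    ChangeEvent(4, "delete", "title:2"),
]

search_index: Dict[str, Dict[str, Any]] = {}
checkpoint = apply_changes(log, search_index)
```

Because the checkpoint is returned and passed back in, running `apply_changes(log, search_index, checkpoint)` a second time applies nothing, which is the property that lets a real pipeline restart after a failure without duplicating changes.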
Data is central to what Netflix does, but they’re not alone in that regard. Companies like Uber, Amazon, Airbnb and Facebook are thriving because they truly understand how to make data work to their advantage. Data management and data analytics are strategic pillars for these organizations, and CDC technology plays a central role in their ability to carry out their core missions.
The same can be said of just about any company operating at the top of their game in today’s business environment. If you want your company to operate as an A-player, you need to modernize and master your data. Your competitors are definitely already doing it.
Sub-Second Integration Is the New Standard at Airbnb and Uber
In today’s world, a strong customer experience calls for real-time data flows. Airbnb recognized the value of CDC technology in creating a great CX for their customers and hosts. They, too, built their own CDC platform, which they call Spinal Tap. Airbnb’s dynamic pricing, availability of listings, and reservation status demand flawless accuracy and consistency across all systems. When an Airbnb customer books a visit, they expect workflows to be very fast and 100 percent accurate.
For Uber, immediacy is arguably even more important. Whether a customer is waiting for a ride to the airport or ordering a food delivery, timing matters a great deal. Just like Netflix and Airbnb, Uber developed its own CDC platform to synchronize data across multiple data stores in real time. Again, a common set of requirements emerged. Uber needed a solution that was extremely fast and fault tolerant, with zero data loss. It also needed a solution that wouldn’t drag down performance on its source databases. Once again, CDC fits the bill.
Change Data Capture for the Rest of Us
In the old days, overnight batch-mode ETL might have been adequate to provide a daily executive update or operational reports. Today, real time is increasingly the norm. If information is power, then immediate access to information is turbo power.
That’s why CDC is rapidly becoming a foundational requirement for the modern data stack. It’s all well and good that big companies like Netflix, Airbnb and Uber have the resources to build custom CDC platforms, but what about everyone else?
Off-the-shelf CDC solutions are filling that gap, delivering the same low-latency, high-quality streaming pipelines without the need to build them from scratch.
Unfortunately, they’re not all created equal. Most companies operate a collection of systems that handle ERP, CRM or specialized operational functions such as procurement or HR. These run on different database platforms, with incongruent data models. If a company operates mainframe systems, then they’re likely dealing with arcane data structures that don’t easily fit alongside modern relational data.
This makes heterogeneous integration especially important. It requires connecting to multiple data sources and targets, including transactional systems and databases like SAP, Oracle, DB2 and Salesforce. It means delivering real-time streaming data to platforms like Databricks, Kafka, Snowflake, Amazon DocumentDB, and Azure Synapse.
To drive artificial intelligence (AI) and advanced analytics, enterprises need to push their data to a common MDS platform. That means ingesting information from a variety of sources, transforming it to fit a unified model for analytics, and delivering it to a modern cloud-based data platform.
Change data capture technology serves as a critical link in the data-driven value chain -- first by automating data ingestion from source systems, then transforming it on the fly and delivering it to a cloud data platform. Real-time CDC automation ensures that the right information gets to the right place, immediately.
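The ingest, transform and deliver steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product. The source names (`crm`, `erp`), the field names (`AccountId`, `CUST_NO` and so on), the unified record shape and the `run_pipeline` function are all hypothetical. The sketch shows change records from two systems with incongruent data models being mapped to one model as they stream through, then handed to a delivery function.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def from_crm(change: Record) -> Record:
    """Map a hypothetical CRM change record to the unified model."""
    return {
        "customer_id": str(change["AccountId"]),
        "email": change["Email"].lower(),
        "source": "crm",
    }


def from_erp(change: Record) -> Record:
    """Map a hypothetical ERP change record to the unified model."""
    return {
        "customer_id": str(change["CUST_NO"]),
        "email": change["EMAIL_ADDR"].lower(),
        "source": "erp",
    }


# One transform per source system; adding a source means adding an entry.
TRANSFORMS: Dict[str, Callable[[Record], Record]] = {
    "crm": from_crm,
    "erp": from_erp,
}


def run_pipeline(changes: Iterable[Tuple[str, Record]],
                 deliver: Callable[[Record], None]) -> int:
    """Transform each incoming change on the fly and deliver it.

    Returns the number of records delivered.
    """
    delivered = 0
    for source, change in changes:
        deliver(TRANSFORMS[source](change))
        delivered += 1
    return delivered


# A list standing in for the cloud data platform (assumed target).
warehouse: List[Record] = []

incoming = [
    ("crm", {"AccountId": "A-17", "Email": "Pat@Example.com"}),
    ("erp", {"CUST_NO": 1042, "EMAIL_ADDR": "LEE@EXAMPLE.COM"}),
]

count = run_pipeline(incoming, warehouse.append)
```

In a production setting the `deliver` callable would write to the target platform rather than append to a list, and the transforms would be driven by schema metadata rather than written by hand.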
Because they focus only on data that has changed, streaming CDC pipelines offer tremendous efficiency advantages over the batch-mode operations of the past. The best CDC solutions can deliver 100+ terabytes of data from source to target in less than 30 minutes, with zero data loss.
The shift to cloud computing is well under way. Cloud analytics, in particular, offer distinct advantages for companies that truly understand the transformational role of data. Leading companies in every industry are aligning their strategic visions around data analytics. They’re digitizing their interactions with customers and using algorithms to study data, extract insights, and take action. AI and machine learning are ingesting vast amounts of information, discovering correlations, and identifying anomalies.
Whether you’re leading the way in digital disruption or simply trying to keep up with the pack, CDC technology will play a pivotal role in making the modern data stack a reality and opening the door to digital transformation.
As first published in VentureBeat.
Gary Hagmueller is the CEO of Arcion, the world’s only cloud-native, CDC-based data replication platform. Gary is a proven leader who has created over $7.5 billion in enterprise value through two IPOs and four M&A exits over his more than 20 years in the tech industry. Gary holds an MBA from the Marshall School of Business at the University of Southern California, where he was named a Sheth Fellow, as well as a bachelor’s degree in Business Administration from Arizona State University. As the father of twin teenage boys, he is clearly experienced in project management and negotiation skills. For more information on Arcion, visit www.arcion.io/, and follow the company on LinkedIn, YouTube and @ArcionLabs.