Shifting left to improve data reliability [Q&A]
The concept of 'shifting left' is often used in the cybersecurity industry to refer to addressing security earlier in the development process.
But it's something that can be applied to data management too. Shifting left in this sense means performing data reliability checks sooner. The ability to execute data reliability tests earlier in the data pipelines helps keep bad data out of production systems.
We talked to Rohit Choudhary, CEO and co-founder of Acceldata, to find out more about how data teams can use this technique to quickly identify and fix the root cause of data incidents.
BN: We know that shift left is used in the security industry to refer to moving the process of security earlier in the development process. What does shift left mean when it comes to data reliability?
RC: As data analytics becomes increasingly critical to an organization's operations, more data than ever is being captured and fed into analytics data stores, with the goal of helping enterprises make decisions with greater accuracy. Because this data comes from a variety of sources, data reliability is essential to equip enterprises with the insight to make the right decisions based on the right information.
Shifting left, when it comes to data reliability, is an approach to ensure that data entering your environment is of the highest quality and can be trusted. This means performing data reliability checks before data enters the data warehouse and data lakehouse.
BN: What are the benefits of implementing this approach?
RC: Historically, most organizations would only apply data quality tests in the final consumption zone due to resource and testing limitations. Executing reliability tests earlier in the data pipeline keeps bad data out of the transformation and consumption zones, and helps data teams reclaim time and money -- remediation gets increasingly expensive the later issues are detected.
BN: What challenges have organizations faced when it comes to data reliability, and what are the repercussions of an ineffective strategy?
RC: The biggest data reliability challenge organizations face is waiting too long to perform data quality tests. Data pipelines that manage data supply chains typically move data through three zones: the landing zone (where source data arrives), the transformation zone (where data is transformed into its final format), and the consumption zone (where data, now in its final format, is accessed by users). When tests are performed only in the consumption zone, it's often costly and time consuming to adjust strategies that were formed on bad data. This is why performing checks before data enters the transformation zone is key.
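The zone model above can be sketched as a simple quality gate that runs in the landing zone, before any transformation. This is a minimal illustration, not any particular product's API; the row fields and rule names are hypothetical assumptions for the example.

```python
# Landing-zone quality gate: validate rows as they arrive, before the
# transformation zone ever sees them. Field names and rules are illustrative.

def validate_row(row):
    """Return the list of rule names this row violates (empty = clean)."""
    failures = []
    if row.get("order_id") is None:
        failures.append("missing_order_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        failures.append("invalid_amount")
    return failures

def landing_zone_gate(rows):
    """Split a batch into clean rows (safe to transform) and rejects."""
    clean, rejects = [], []
    for row in rows:
        failures = validate_row(row)
        if failures:
            rejects.append({"row": row, "failures": failures})
        else:
            clean.append(row)
    return clean, rejects
```

Only the `clean` list is handed to the transformation zone, so a bad upstream feed never pollutes downstream strategy.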
There are a few consequences organizations face when working with unreliable data. The first is increased operational costs: flawed data leads to costly decisions and organizational inefficiencies, and poor data usually isn't recognized until considerable time and money have already been misallocated toward misguided goals. The second is security and compliance risk: bad data can produce false positives and create vulnerabilities in the environment that even security teams might not recognize until after a breach has occurred. The last is growth limitations: when an organization uses bad data -- even inadvertently -- every touch point of the business is affected. Companies cannot build the right products, serve their customers effectively, or optimize resource allocation if they are operating off the wrong targets.
BN: What role does data observability play in shaping reliability?
RC: Shifting left is essential, but it's not something that can simply be turned on. Data observability plays a key role in shaping data reliability: with the right platform in place, you can ensure that only good, healthy data enters your systems. High-quality data helps an organization gain competitive advantage and continuously deliver innovative, market-leading products, while poor-quality data delivers bad outcomes and bad products, which can break the business.
BN: What steps can organizations take to effectively implement the shift left approach?
RC: There are five capabilities organizations need in order to implement the shift left approach effectively. The first is support for data-in-motion platforms: supporting platforms such as Kafka, and monitoring pipelines running as Spark jobs or Airflow orchestrations, allows data pipelines to be monitored and metered. The second is support for files: files often deliver new data into pipelines, so performing checks on the various file types and capturing file events to know when to run incremental checks is important. Third is circuit breakers: APIs that feed data reliability test results into your pipelines so the pipelines can decide to halt data flow when bad data is detected, preventing it from infecting other data downstream. The fourth is data isolation: when bad data rows are identified, they should be stopped from further processing, isolated, and subjected to deeper checks that dig into the root of the problem. The last is data reconciliation: with the same data often held in multiple places, the ability to reconcile it keeps the various copies in sync.
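Two of the capabilities above -- circuit breakers and data isolation -- can be sketched together in a few lines. This is an illustrative sketch only: the threshold, exception type, and quarantine list are assumptions for the example, where a real deployment would call a data observability platform's APIs and a pipeline orchestrator instead.

```python
# Sketch of a pipeline stage with a circuit breaker and a quarantine.
# Bad rows are isolated for deeper checks; if too large a share of the
# batch fails, the breaker trips and halts the flow so bad data never
# reaches downstream zones. All names here are illustrative.

class CircuitBreakerTripped(Exception):
    """Raised to halt the pipeline when too much bad data is detected."""

def run_stage(rows, is_valid, quarantine, max_bad_ratio=0.1):
    """Pass valid rows downstream; isolate invalid ones in `quarantine`.

    Raises CircuitBreakerTripped when the share of bad rows exceeds
    max_bad_ratio, stopping the batch before the transformation zone.
    """
    good = [r for r in rows if is_valid(r)]
    bad = [r for r in rows if not is_valid(r)]
    quarantine.extend(bad)  # isolate bad rows so further checks can run on them
    if rows and len(bad) / len(rows) > max_bad_ratio:
        raise CircuitBreakerTripped(f"{len(bad)}/{len(rows)} rows failed checks")
    return good
```

A caller would catch `CircuitBreakerTripped` to stop the orchestration and alert the data team, while the quarantined rows remain available for root-cause analysis.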