The separation of storage and compute [Q&A]
Business intelligence and analytics projects have traditionally been based on the concept of the enterprise data warehouse, which combined compute and storage in a monolithic platform to deliver the performance required for demanding analytics. More recently, the trend has been toward data lakes, but these followed a similar approach of putting all data in a single environment -- initially Hadoop -- for storage and analysis.
We spoke with Justin Borgman, CEO of Starburst Data, about why he believes the separation of storage and compute in the data processing and analytics sector is a trend that will continue to gain momentum.
BN: Why is the separation of storage and compute such a hot topic?
JB: First, let’s start with what separation of storage and compute actually means. Before the cloud, you’d buy all your hardware upfront, along with the associated licenses and service contracts, and stock your data center with all the resources you might need to store and analyze your data. If your peak usage required 100 machines, then you’d buy 100 machines -- even if you only needed all those resources for a few hours a day. The rest of the time, your expensive hardware would be sitting dormant, rapidly depreciating.
Those costs were never recouped. They were sunk capital expenditures amounting to wasted money.
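The waste Borgman describes is easy to quantify. A minimal sketch of the arithmetic, using hypothetical prices (the per-machine-hour rate and busy-hour count below are illustrative assumptions, not real cloud rates):

```python
# Hypothetical cost comparison: always-on hardware vs. on-demand compute.
# Prices and hours are illustrative assumptions, not actual cloud pricing.

def always_on_cost(machines: int, price_per_machine_hour: float,
                   hours_in_day: int = 24) -> float:
    """Cost of keeping every machine running all day, used or not."""
    return machines * price_per_machine_hour * hours_in_day

def on_demand_cost(machines: int, price_per_machine_hour: float,
                   busy_hours: float) -> float:
    """Cost of spinning machines up only for the hours they do real work."""
    return machines * price_per_machine_hour * busy_hours

if __name__ == "__main__":
    machines, price, busy = 100, 0.50, 3  # 100 machines, $0.50/hr, 3 busy hours/day
    fixed = always_on_cost(machines, price)          # 100 * 0.50 * 24 = $1200/day
    elastic = on_demand_cost(machines, price, busy)  # 100 * 0.50 * 3  = $150/day
    print(f"always-on: ${fixed:.0f}/day, on-demand: ${elastic:.0f}/day")
```

Under these assumed numbers, the always-on cluster costs eight times as much per day as on-demand compute doing the same three hours of work -- the gap only widens as peak capacity grows relative to average utilization.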
Now, with the advent of cloud architectures, you can leave your data in the cheapest storage layer (AWS S3, Azure ADLS, Google’s GCS, or on-prem S3-compatible object storage like Minio or Ceph), then spin up compute resources when you need them, and only for as long as you need them. You only pay for compute when you’re actually running your analytics.
BN: What are some other important benefits of this type of architecture?
JB: Cost is obviously one of the most important benefits of separating the two. If compute resources are separated from storage and only turned on as needed to interact with data, then businesses save money by paying only for what they use. Shrink the cluster when you’re not using it, and save money. Similarly, you now have complete control over performance. Turn the dial the other way and expand the cluster to apply more computing resources to the workload at hand.
Another benefit is the ability to point your compute at different types of storage so that you can access data wherever it lives, rather than having to load it all into one database. For example, you may have some data in Hadoop and Teradata on-prem and some data in the cloud. If you use query engine technologies that embrace storage/compute separation such as the open source Presto engine, you can now query anything, anywhere. This allows you to create one unified query layer across all of your data sources without having to move any data around.
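In Presto, each data source is registered as a catalog, and every table is addressed as catalog.schema.table, so a single SQL statement can join across systems. A sketch of what such a federated query looks like (the catalog, schema, and table names below are hypothetical placeholders; in a real deployment they come from the catalogs configured on the cluster):

```python
# Sketch of a federated Presto query joining Hadoop-resident data with
# Teradata-resident data. All names are hypothetical placeholders.

def qualified(catalog: str, schema: str, table: str) -> str:
    """Presto addresses every table as catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"

def federated_query() -> str:
    sales = qualified("hive", "retail", "sales")           # data in Hadoop
    customers = qualified("teradata", "crm", "customers")  # data in Teradata
    return (
        f"SELECT c.region, SUM(s.amount) AS total\n"
        f"FROM {sales} s\n"
        f"JOIN {customers} c ON s.customer_id = c.id\n"
        f"GROUP BY c.region"
    )

print(federated_query())
```

The point of the sketch is the addressing scheme: because both sources appear as catalogs in one SQL namespace, the join happens in the query engine's compute layer, and no data has to be copied into a shared database first.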
BN: What industries benefit from this type of architecture?
JB: Let’s take large retailers as an example. If a CEO or marketing manager wants a report of every product sold at every location across the country from the day before, the month before, the holiday season, etc., separation of storage and compute lets them have this report in a matter of seconds by scaling up the compute resources on demand to do the job and then shutting them down afterwards. In the past, the company would have had to keep the necessary hardware resources on standby in order to generate those results. The rest of the day? Those expensive resources would sit idle, depreciating.
Another example would be a large bank. After decades of mergers and acquisitions, the bank has accumulated dozens of different database systems. Trying to get a holistic view of the bank’s mortgage lending business would require creating massive ETL pipelines to get fresh copies of the data moved into a central location for analysis. With storage/compute separation, every database is just another data source that can be queried by one universal query interface. Analysts can now get their answers in seconds or minutes by querying the data where it lives rather than having to wait days or weeks for the data to get copied into one system for analysis.