The essential role of an open data stack in building an open lakehouse [Q&A]
There is a movement underway to bring about a set of intelligent data apps that will require a new type of modern data platform to support them. theCUBE Research identifies this as the 'Sixth Data Platform' -- an open, multi-vendor, modular platform.
We spoke to Justin Borgman, co-founder and CEO of Starburst, who believes an Icehouse architecture is the ideal foundation for building an open data lakehouse, underpinned by flexibility and open technologies.
BN: What are the challenges posed by proprietary formats and non-standard SQL syntax from cloud data warehouse vendors, and how do these contribute to vendor lock-in?
JB: Proprietary formats and non-standard SQL syntax from cloud data warehouse vendors present significant challenges for organizations, often leading to vendor lock-in. These challenges arise due to the lack of interoperability and portability of data and queries across different platforms. Firstly, proprietary formats restrict data mobility, making it difficult to migrate data seamlessly between different cloud data warehouses or even between on-premises and cloud environments. This lack of portability can result in significant migration costs and efforts, tying organizations to a specific vendor.
Secondly, non-standard SQL syntax complicates application development and maintenance. Developers must learn and adapt to vendor-specific SQL dialects, which increases complexity and reduces code reusability. This reliance on vendor-specific syntax makes it harder for organizations to switch vendors without significant rewriting of queries and applications.
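To make the dialect problem concrete, here is a minimal sketch showing the same seven-day date offset written three ways; only the ANSI form travels unchanged between engines such as Trino:

```python
# Minimal sketch: the same "seven days from today" expression in three SQL
# dialects. Only the ANSI form is portable across engines.

QUERIES = {
    # Snowflake-specific DATEADD(unit, amount, date)
    "snowflake": "SELECT DATEADD(day, 7, CURRENT_DATE)",
    # BigQuery-specific DATE_ADD(date, INTERVAL n unit)
    "bigquery": "SELECT DATE_ADD(CURRENT_DATE(), INTERVAL 7 DAY)",
    # ANSI SQL interval arithmetic, accepted by Trino and other engines
    "ansi": "SELECT CURRENT_DATE + INTERVAL '7' DAY",
}

for dialect, sql in QUERIES.items():
    print(f"{dialect:10s} {sql}")
```

Multiply this single expression by thousands of queries and stored procedures, and the cost of switching vendors becomes clear.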
Overall, these challenges contribute to vendor lock-in by creating dependencies on specific platforms and increasing switching costs. To mitigate vendor lock-in, organizations should prioritize solutions that support open standards and formats, facilitating data portability and reducing dependence on any single vendor.
BN: How does the emergence of more open data storage options increase customer choice and force vendors to compete based on business value?
JB: The emergence of open data storage options like Apache Iceberg dramatically expands customer choice and compels vendors to compete on business value. Apache Iceberg is an open table format for large-scale data storage that offers benefits such as schema evolution, transaction support, and efficient data pruning. By embracing Apache Iceberg or similar open solutions, customers gain the flexibility to manage their data across different storage systems without being locked into proprietary formats. For example, a company can store its data in Apache Iceberg format on cloud storage services like Amazon S3 or on premises and pair it with an open execution engine, such as Trino or Starburst. Even proprietary vendors like Snowflake recognize this need and now support querying Iceberg as well. (Of course, with Snowflake, you would run into the earlier topic of proprietary SQL syntax, but at least Iceberg gets you halfway there.)
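As a minimal sketch of that pairing (the host, catalog, schema, and table names below are placeholders, not a real deployment), the open-source trino Python client can query an Iceberg table stored on S3:

```python
# Minimal sketch: querying an S3-backed Iceberg table through Trino using the
# open-source "trino" Python client (pip install trino). All connection
# details and names are hypothetical, for illustration only.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical Trino coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",         # catalog configured for S3-backed Iceberg tables
    schema="sales",
)

cur = conn.cursor()
cur.execute("SELECT order_id, total FROM orders WHERE order_date = DATE '2024-01-15'")
for row in cur.fetchall():
    print(row)
```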
This increased flexibility forces vendors to differentiate themselves by providing additional value beyond basic storage functionality. Vendors may offer value-added services like automated schema management, performance optimization tools, or integration with popular analytics platforms, ultimately driving better outcomes for organizations leveraging open data storage solutions.
BN: Why is it important to encourage interoperability and compatibility among different tools and systems in fostering innovation and flexibility for organizations?
JB: Encouraging interoperability and compatibility among different tools and systems is paramount for fostering innovation and flexibility within organizations. As I mentioned earlier, Apache Iceberg, for example, promotes compatibility across various data processing frameworks like Apache Spark, Apache Hive, and Trino (formerly PrestoSQL).
This compatibility allows organizations to store data in Iceberg format and seamlessly access and analyze it using different processing engines without the need for data movement or transformation. Additionally, tools like Trino provide a unified query interface across different data sources, including traditional relational databases, cloud storage services, and data lakes. This interoperability enables organizations to query and analyze data from disparate sources in a cohesive manner, fostering collaboration and accelerating insights.

By embracing interoperable solutions, organizations can leverage the strengths of different technologies while avoiding vendor lock-in and compatibility challenges. This approach not only promotes innovation by enabling experimentation with new tools but also enhances flexibility by empowering organizations to adapt their technology stack seamlessly to evolving business requirements.
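As a minimal sketch of that unified query interface (the catalog and table names are assumptions about how a deployment might be configured), a single Trino statement can join an Iceberg table on object storage with a table in an operational PostgreSQL database:

```python
# Minimal sketch: one Trino query federating two sources. The "iceberg" and
# "postgresql" catalogs, and all schema and table names, are hypothetical
# examples of how catalogs might be configured.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.region, SUM(o.total) AS revenue
    FROM iceberg.sales.orders AS o          -- Iceberg table on object storage
    JOIN postgresql.public.customers AS c   -- live operational database
      ON o.customer_id = c.id
    GROUP BY c.region
""")
print(cur.fetchall())
```

Notably, if the customers table later moves into the lake, only the catalog prefix in the query changes; the rest of the statement, and any application built on it, stays the same.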
BN: How has community-driven innovation influenced the development of the open lakehouse concept, which you see as an Icehouse architecture, and what benefits does it offer compared to traditional vendor-centric approaches?
JB: Community-driven innovation has played a pivotal role in shaping the open lakehouse concept, demonstrated by what we refer to as the Icehouse architecture. This model combines the strengths of data lakes and data warehouses, leveraging open-source technologies like Apache Iceberg, Apache Spark, and Trino, among others.
Community collaboration has led to significant advancements in these technologies, enabling the integration of data lake storage with robust query engines and advanced analytics frameworks. For instance, Apache Iceberg provides a unified table format for data lakes, offering features like schema evolution and transaction support, which were previously only available in traditional data warehouses.
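As a minimal sketch of what schema evolution looks like in practice (using the open-source PyIceberg library; the catalog and table names are placeholders), adding a column is an atomic, metadata-only commit:

```python
# Minimal sketch: transactional schema evolution with PyIceberg
# (pip install pyiceberg). The catalog name and table identifier are
# placeholders; load_catalog() reads connection details from PyIceberg's
# configuration.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# The column is added as a single atomic metadata commit; no data files are
# rewritten, and readers of older snapshots are unaffected.
with table.update_schema() as update:
    update.add_column("country_code", StringType(), doc="ISO 3166-1 alpha-2")
```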
The Icehouse architecture benefits from the agility and innovation fostered by community-driven development. Unlike traditional vendor-centric approaches, which often lock organizations into proprietary formats and technologies, the open lakehouse concept promotes interoperability and flexibility. Organizations can leverage a diverse ecosystem of tools and systems, selecting the best-fit solutions for their needs without being tied to a single vendor. Additionally, the Icehouse architecture enables the seamless integration of new technologies and methodologies as they emerge, offering flexibility, scalability, and agility that traditional vendor-centric data warehousing approaches cannot match.
BN: How does decoupling storage and compute layers and embracing open standards for data management help organizations to maintain greater control over data sovereignty and governance practices within the open lakehouse model?
JB: By separating storage from compute, organizations can choose the most suitable storage solution for their needs without being locked into a specific compute platform. For example, with Apache Iceberg, organizations can store data in open formats on various storage systems like Amazon S3 or Azure Data Lake Storage, ensuring data portability and flexibility. Embracing open standards for data management further enhances control over data sovereignty and governance. Open standards like Apache Iceberg provide features such as schema evolution and transaction support, enabling organizations to maintain data integrity and consistency across different compute engines and analytical tools.
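As a minimal sketch of that decoupling (all endpoints and names below are hypothetical), the same S3-backed Iceberg table can be served by a Trino cluster or scanned directly with PyIceberg, with no copy of the data made for either path:

```python
# Minimal sketch: one S3-backed Iceberg table consumed by two independent
# compute paths. All connection details and names are placeholders.

# Path 1: a Trino cluster queries the table.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="analyst", catalog="iceberg", schema="analytics")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())

# Path 2: PyIceberg reads the identical table files directly into pandas,
# with no warehouse compute involved at all.
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("analytics.events")
df = table.scan(row_filter="event_date >= '2024-01-01'").to_pandas()
print(len(df))
```

Because both paths resolve to the same files and the same table metadata, governance policies attached to the storage layer apply regardless of which engine does the reading.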