Object storage: A better way to scale big data environments than traditional HDFS-based Hadoop
With technologies such as artificial intelligence, machine learning, IoT and advanced analytics hitting critical mass, it’s no surprise that the big data market continues to grow rapidly. According to a forecast by Statista, the big data market reached $42 billion in 2018 and is expected to reach $64 billion by 2021.
Big data presents major opportunities for organizations to gain new insights, deliver better products and improve operations, but the traditional storage approach to big data is fraught with many challenges. It’s time for another way.
Big Data Gets Bigger
To put big data’s rapid growth in perspective, consider Cisco’s latest Global Cloud Index report. Cisco projects that the volume of stored big data will reach 403 exabytes (EB) by 2021, a nearly 8x increase from 51 EB in 2016. What’s driving the surge?
The report explains, "The growing importance of data analytics -- the result of big data coming from ubiquitously networked end-user devices and IoT alike -- has added to the value and growth of data centers. They touch nearly every aspect of an enterprise, whether internal/employee-related data, communication or processes, or partner- and customer-facing information and services." As a result, according to the report, big data alone will represent 30 percent of data stored in data centers by 2021, up from 18 percent in 2016.
The type of data being generated by external and internal users offers enterprises the potential for rich new insights, as well as the opportunity to fuel innovative new applications involving AI and machine learning. For analytics, the data is collected and analyzed using a method that’s been around for over a decade. This common approach to big data analytics relies on two primary components: 1) an HDFS-based Hadoop platform combining compute and storage (e.g., Cloudera, MapR) to collect and process the data, and 2) analytics software (e.g., Spark, Hive, Presto) to mine that data.
With this approach, HDFS copies or moves data from primary, secondary or archive storage to a Hadoop platform so that it can be queried by the analytics software. The combination of compute and storage within the Hadoop platform has traditionally given this method one major advantage: performance. Five years ago, when deployments were smaller, the performance benefit made this approach the best way to tackle big data in most settings. However, with the rapid growth in big data, it has proven to entail significant disadvantages.
The Disadvantages of Traditional HDFS-based Hadoop Approaches
The biggest drawback is cost -- it’s expensive to scale Hadoop deployments to handle data growth because adding storage requires adding more compute at the same time, as the two are coupled together in this architecture. Compute is never cheap, so this becomes a significant waste of CAPEX as big data swells.
There are other major issues as well. Moving data into the Hadoop big data cluster for analysis is inefficient and time-consuming, making it difficult to derive value from the data promptly. Also, because Hadoop big data clusters operate as independent silos, each must provide its own data protection mechanisms. By default, HDFS keeps three copies of every block for data protection, which triples the raw storage required and thereby increases cost and complexity. Furthermore, HDFS has a single point of failure in its NameNode, and the workarounds to address this are neither simple nor efficient.
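To make the replication overhead concrete, here is a minimal back-of-the-envelope sketch. The 3x factor is the HDFS default; the 1.5x erasure-coding overhead (e.g., an 8-data/4-parity scheme) is an illustrative assumption, since actual schemes vary by product and configuration.

```python
# Illustrative only: raw capacity needed to hold a given amount of usable
# data under HDFS-style 3x replication versus an assumed object-store
# erasure-coding scheme with 1.5x overhead (e.g., 8 data + 4 parity).

def raw_capacity_tb(usable_tb: float, overhead_factor: float) -> float:
    """Raw storage required for `usable_tb` of usable data."""
    return usable_tb * overhead_factor

usable = 1000.0  # 1 PB of usable data, expressed in TB

hdfs_raw = raw_capacity_tb(usable, 3.0)  # three full copies of each block
ec_raw = raw_capacity_tb(usable, 1.5)    # assumed erasure-coding overhead

print(f"HDFS 3x replication: {hdfs_raw:.0f} TB raw")
print(f"Erasure coding 1.5x: {ec_raw:.0f} TB raw")
print(f"Difference:          {hdfs_raw - ec_raw:.0f} TB")
```

Under these assumptions, every petabyte of usable data costs an extra 1.5 PB of raw disk with triple replication compared to erasure coding, before accounting for the compute that must be purchased alongside it.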
Reducing Cost and Complexity with Object Storage
How can organizations overcome the challenges of traditional HDFS-based Hadoop approaches to achieve their big data analytics needs in a cost-effective way? The answer is to incorporate object storage.
It is now possible to add external S3-based object storage to a Hadoop environment, decoupling the compute and storage subsystems so each can scale independently. In addition, benchmark tests have shown that the performance of S3-based object storage is comparable to that of traditional HDFS-based approaches. As a result, users gain the scalability, flexibility and cost advantages of separating storage and compute without a significant tradeoff in performance. Decoupling compute and storage also simplifies storage management and analytics workflows, increasing overall operating efficiency at a significantly lower TCO.
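As a sketch of what this looks like in practice: Hadoop’s S3A connector lets analytics jobs address an external S3-compatible object store through s3a:// paths. The property names below are the standard S3A settings; the endpoint and credential values are placeholders.

```xml
<!-- core-site.xml (illustrative): point the S3A connector at an
     S3-compatible object store. Endpoint and keys are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://object-store.example.com</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place, a Spark or Hive job can read s3a://bucket/path directly instead of hdfs:// paths, so storage capacity can grow in the object store while the compute tier scales separately.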
In short, with big data growing so dramatically, organizations would be wise to consider whether their current infrastructure is best suited to their analytics needs moving forward. If not, object storage could provide an ideal solution for ensuring they can maximize the strategic value of their data in the most efficient and cost-effective way possible.
Gary Ogasawara heads up Cloudian’s global Engineering team, with operations in Silicon Valley, Milan, Tokyo, and Beijing. His responsibilities cover product development, deployment, and operations. Prior to Cloudian, Gary led the Engineering team at eCentives, a search engine company. He also led the development of real-time commerce and advertising systems at Inktomi, an Internet infrastructure company. Gary holds a Ph.D. in Computer Science from the University of California at Berkeley, specializing in uncertainty reasoning and machine learning.