Data and containers and the keys to success
In the beginning, workloads, tools, and requirements for big data were simple because big data wasn’t really all that big. When we hit 5TB of data, however, things got complicated. Large data sets weren’t well suited to traditional shared storage like NAS, and reading terabytes of data in long sequential scans didn’t work well with it either.
As big data evolved, the analytics tools graduated from hand-coded MapReduce jobs and frameworks like Hive and Pig to tools like Spark, Python, and TensorFlow, which made analysis easier. These newer tools brought additional requirements that traditional big data storage couldn’t handle, including millions of files, read-write workloads, and random access for updates. The only constant was the data itself.
Then the data evolved too.
Now it’s not just massive data sets, but also lookup tables, the models themselves, and the code that creates the models. No longer are we dealing with just big files. Now we need to accommodate all types of data, from tables to streams to files and beyond. Through it all, the storage requirement never disappeared. In fact, with the proliferation of data types, storage has only grown in importance.
Building a Modern Data Environment
Today, big data has expanded from traditional frameworks to a range of newer technologies such as Spark, Kafka, and SQL-on-Hadoop. Despite a number of different approaches to building a data environment, the requirements remain the same: reliable, scalable storage and flexibility.
Whether you’re optimizing for speed, application portability, or flexibility, data must be available to every customer application. Storage is the foundation that allows applications to access large data sets and preserve application state.
Optimizing for Speed
As data volumes grow, optimizing your data platform for speed becomes paramount. Capturing data quickly and durably is already a challenge; when that data is highly regulated, getting it right becomes even more critical. Data storage and containers are the keys to success.
U.S. financial services is an example of an industry where data is highly regulated. The SEC requires any company using algorithms to make trading decisions to be able to reproduce how each trading decision was made by capturing both test data and algorithms. With hundreds of thousands of portfolios, each with their own risk profile, some financial services companies are looking at datasets of 200TB or more. With datasets this big, you quickly realize that copying the data and algorithms used to make every decision isn’t scalable.
One way to optimize for speed is to run everything on containers and use a storage solution that supports data snapshots. Snapshots are read-only and immutable, and become part of the "source of truth." Snapshots of data sets, models, and code give financial services companies the ability to turn back the clock and see exactly what analysis produced the decision to buy or sell a specific financial instrument.
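To make the snapshot idea concrete, here is a minimal sketch of capturing a point-in-time copy of a data volume and tagging it with the model version and code commit that produced a decision. It assumes a Kubernetes cluster whose storage driver supports CSI volume snapshots; the namespace, volume, snapshot class, and label values are all hypothetical.

```python
# Minimal sketch: capture an immutable point-in-time snapshot of a data volume
# and tag it with the model and code versions behind a trading decision.
# Assumes a Kubernetes cluster with a CSI driver that supports volume snapshots;
# the namespace, PVC, snapshot class, and label values are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {
        "name": "positions-2024-06-30",
        "labels": {
            "model-version": "risk-model-3.2",  # which model made the decision
            "code-commit": "9f1c2ab",           # which code built the model
        },
    },
    "spec": {
        "volumeSnapshotClassName": "csi-snapclass",
        "source": {"persistentVolumeClaimName": "positions-data"},
    },
}

api.create_namespaced_custom_object(
    group="snapshot.storage.k8s.io",
    version="v1",
    namespace="trading",
    plural="volumesnapshots",
    body=snapshot,
)
```

Because the snapshot is read-only, the labeled data, model version, and code commit together form an auditable record that can be restored later to replay how a decision was made, without copying the full 200TB data set.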
Optimizing for Application Portability
For companies with clusters in multiple remote locations, application consistency and portability are primary concerns. Ensuring that the same systems and software are delivered to remote and edge deployments is complicated. Not only do models, code, and containers need to be managed, kept up to date, and secured, but data also has to be guaranteed consistent and trackable from source to field and back.
Application portability can be achieved by using a data platform with storage that selectively mirrors from a central source to field deployments and also handles container images. Distributed storage is built out in each location, and containers are then used to run applications for remote deployment. When changes are made to existing applications, or when customers buy new applications, the mirroring and data replication functions built into the data platform push updates from the central cluster, ensuring customers have the newest applications. Mirroring is seamless and consistent, and each facility can run its own mix of analytics workloads and containers.
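The mechanics differ from platform to platform, but the underlying idea can be sketched as comparing a manifest of artifact versions at the central cluster against each edge site and shipping only the differences. The Python sketch below is a hypothetical illustration of that bookkeeping, not any particular product’s mirroring API; the manifest contents and artifact names are invented.

```python
# Hypothetical sketch of selective mirroring: compare what the central cluster
# holds against what an edge site holds, and push only the artifacts (container
# images, models, data sets) that are missing or out of date. This illustrates
# the idea, not any vendor's mirroring API; manifests and the "push" step are
# placeholders.
from typing import Dict

# Version manifests: artifact name -> version tag or content hash (made-up values)
central: Dict[str, str] = {
    "images/fraud-detector": "v2.4",
    "models/risk-model": "sha256:9f1c2ab",
    "data/reference-tables": "2024-06-30",
}

edge_site: Dict[str, str] = {
    "images/fraud-detector": "v2.3",        # one version behind
    "models/risk-model": "sha256:9f1c2ab",  # already current
}

def plan_mirror(source: Dict[str, str], target: Dict[str, str]) -> Dict[str, str]:
    """Return only the artifacts that are missing or stale at the target site."""
    return {name: ver for name, ver in source.items() if target.get(name) != ver}

for artifact, version in plan_mirror(central, edge_site).items():
    # In a real platform this would trigger the replication call; here we just
    # report what would be pushed.
    print(f"mirror {artifact}@{version} -> edge site")
```

Keeping a per-site manifest like this is also what makes it possible to know exactly which applications and data versions are running in every facility.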
As a bonus, security is centralized so companies have absolute control over what is running, captured, seen, and used in each remote cluster.
Optimizing for Processing Application Flexibility
As companies grow and evolve, they often deal with scale and agility issues. So how can storage help with processing application flexibility?
A modern storage solution can support development platforms such as Docker, handle vast amounts of data, and interact seamlessly with containers and Kubernetes. A distributed, scalable storage layer supports persistent, stateful data and can synchronize between multiple data centers, including backup data centers, cloud providers, and business continuity facilities.
Using mirroring to distribute containers and data across multiple locations gives companies the agility to update application versions and know precisely what is running in each location, and with what data. With storage that scales as user data scales, more containers can be deployed quickly whenever additional compute power is needed.
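As a rough illustration of that last point, if the containers are orchestrated by Kubernetes and state lives on a shared storage layer, adding compute becomes a matter of changing a replica count. The sketch below assumes a Kubernetes cluster reachable from the local kubeconfig; the deployment and namespace names are hypothetical.

```python
# Minimal sketch: scale an analytics workload out when more compute is needed.
# State lives on the shared storage layer, so new containers can start working
# against the same data immediately. Assumes a Kubernetes cluster; the
# deployment and namespace names are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod
apps = client.AppsV1Api()

def scale_workers(name: str, namespace: str, replicas: int) -> None:
    """Patch the deployment's replica count; data stays on shared storage."""
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Burst to 20 workers for a heavy batch run; scale back down the same way afterwards.
scale_workers("analytics-workers", "analytics", replicas=20)
```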
Data Demands and Modern Storage
Enterprises have many requirements pressing against their massive data sets, including auditing, tracking, governance, source of record, immutable records, GDPR, and data sovereignty. By starting with storage requirements, companies can then make choices about the application environment that works best for them.
As data demands grow and mature, so too must storage capabilities. Modern storage deployments are no longer monolithic systems, but must integrate a myriad of technologies and accommodate on-premises, cloud, and the edge.
Despite a wide variety of requirements and approaches to big data across companies and industries, the goal is the same: handle large amounts of data to provide insights quickly. Ultimately, the underlying data platform ensures smooth development and deployment with successful outcomes.
Paul Curtis is a Senior Systems Engineer at MapR, where he provides pre- and post-sales technical support. Prior to joining MapR, Paul served as Senior Operations Engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks and advertisers. Previously, Paul was Systems Manager for Spiral Universe, a company providing school administration software as a service. He has also held senior support engineer positions at Sun Microsystems, as well as enterprise account technical management positions for both Netscape and FileNet. Earlier in his career, Paul worked in application development for Applix, IBM Service Bureau, and Ticketron. His background extends back to the ancient personal computing days, having started his first full time programming job on the day the IBM PC was introduced.