The challenge of creating high performance, fault tolerant storage [Q&A]
The growth of the Internet of Things, increased reliance on analytics to support decision making, and greater use of video mean businesses are storing more and more data. That data has become a crucial asset, and storing it so that it's both accessible and safe is a major challenge. Solutions from major vendors are costly, but data storage operating system specialist RAIDIX has come up with a product that can offer fault tolerance on commodity hardware.
We spoke to Sergey Platonov, product owner at RAIDIX, to find out more about the solution and why data storage is now a major challenge.
BN: Are things like big data and virtualization driving demand for storage?
SP: They definitely are. As the name suggests, Big Data deals with enormous data volumes that need to be processed and stored effectively. In some scenarios the data has a short lifecycle, as in e-commerce, smart order routing or fraud detection: information falls into the pipeline, undergoes analysis and is removed for good. Other data-rich verticals, such as healthcare, scientific research or data warehousing, require long-term storage with instant data availability. Both scenarios pose performance and reliability challenges for the industry.
In our experience, the demand for scale-out solutions accounts for 40 percent of all incoming requests. Smooth scalability becomes a critical requirement -- and more often than not, we are talking petabytes and even exabytes here.
As for virtualization, it goes hand in hand with shared storage. Hypervisors simply won't operate unless a scalable storage infrastructure is provided. Virtual machine deployment and data backup require levels of fault tolerance that only NAS (network-attached storage) solutions can meet.
BN: Is there a difference between high availability and fault tolerance?
SP: Yes and no. Fault tolerance is the ability of a system to survive failures, be it a blackout, a corrupted disk in the array or any other malfunctioning component. The data has to remain intact, no matter what. To reach that goal, fault-tolerant environments duplicate hardware components and use smart software for failure prediction and quick remediation.
High availability is more about quality of service and performance under force majeure circumstances, i.e. in degraded mode when one or several components fail. High availability is measured in terms of allowed downtime: you may come across 99.999 or 99.9999 percent uptime figures claimed by prominent data storage providers. Translated into daily downtime, these come to an imperceptible 864 milliseconds and 86.4 milliseconds, respectively. Data storage providers therefore make specific commitments on reconstruction time and efficient failover.
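As a back-of-the-envelope check on those figures, here is a minimal sketch (Python, illustrative only) that converts an uptime percentage into the downtime it allows per day and per year.

```python
# Convert an uptime percentage into the downtime it allows.
# Illustrative arithmetic only; vendors define their SLA windows in detail.

SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 s
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000 s

def allowed_downtime(uptime_percent: float) -> tuple[float, float]:
    """Return (downtime per day, downtime per year) in seconds."""
    down_fraction = 1.0 - uptime_percent / 100.0
    return down_fraction * SECONDS_PER_DAY, down_fraction * SECONDS_PER_YEAR

for uptime in (99.999, 99.9999):
    per_day, per_year = allowed_downtime(uptime)
    print(f"{uptime}% uptime: {per_day * 1000:.1f} ms/day, {per_year:.1f} s/year")

# 99.999%  uptime: ~864 ms of downtime per day (~5.3 minutes per year)
# 99.9999% uptime: ~86.4 ms of downtime per day (~32 seconds per year)
```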
In fact, these characteristics tend to blur together in the world of data storage because they are so closely linked. On the one hand, without data redundancy and duplication there is no basis for recovery. At the same time, mathematicians team up with software developers to deliver smart RAID algorithms that can rebuild data faster than it could be read from a physical disk.
BN: How can performance be balanced against resilience?
SP: With the spread of all-flash storage, clustered systems and other options at very different price points, enterprises and vendors have become increasingly preoccupied with the notion of reconstruction cost. What does that mean?
When a specific component fails, data storage systems employ Reed-Solomon calculations and advanced RAID technology to minimize or eliminate the negative impact. To reconstruct the contents of the failed element, the system has to read data from a number of other storage devices, which is a labor-intensive and expensive operation. In a clustered environment this becomes an even more daunting task, since all requests are processed over the network -- another bottleneck in the recovery chain. In a nutshell, reconstruction poses two key challenges to solution providers: rebuild time and the resources it requires.
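To make that read amplification concrete, here is a deliberately simplified sketch using single XOR parity (the one-checksum special case; production systems use full Reed-Solomon over Galois fields, but the cost structure is the same): rebuilding one lost block means reading every surviving block in the stripe.

```python
import functools
import os

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe of four data blocks protected by one parity block.
data_blocks = [os.urandom(16) for _ in range(4)]
parity = functools.reduce(xor_blocks, data_blocks)

# Simulate losing disk 2. Rebuilding it requires reading *every*
# surviving block in the stripe plus the parity -- this read
# amplification is the reconstruction cost described above.
failed = 2
survivors = [blk for i, blk in enumerate(data_blocks) if i != failed]
rebuilt = functools.reduce(xor_blocks, survivors + [parity])

assert rebuilt == data_blocks[failed]
print(f"rebuilt block {failed} by reading {len(survivors) + 1} other blocks")
```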
The tech community has come up with a number of new approaches to keep 'excessive' redundancy at bay. One of them is known as Butterfly Codes, or Regenerating Codes. This model reduces the number of repair read requests by a factor of two compared with Reed-Solomon codes; simply put, it takes three read requests to fix three errors. Another buzzword in this domain is LRC (Local Reconstruction Codes), which allows the system to focus its repair efforts on a particular logical area or equipment component instead of rebuilding a larger chunk of the infrastructure.
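The local-repair idea behind LRC can be shown with a toy example (XOR parities only; the global parities a real LRC adds for multi-failure protection are omitted here): data blocks are split into small groups, each with its own local parity, so a single failure is repaired by reading only its group.

```python
import functools
import os

def xor_all(blocks: list[bytes]) -> bytes:
    """XOR a list of equal-length blocks together."""
    return functools.reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [os.urandom(16) for _ in range(6)]

# Two local groups of three data blocks, each with its own local parity.
# A real LRC would add global parities on top for multi-failure protection.
groups = [data[0:3], data[3:6]]
local_parity = [xor_all(group) for group in groups]

# Repairing one failed block touches only its local group:
# three reads here instead of reading across the entire stripe.
failed = 4                                  # block 4 lives in group 1
group_id, offset = divmod(failed, 3)
survivors = [blk for i, blk in enumerate(groups[group_id]) if i != offset]
rebuilt = xor_all(survivors + [local_parity[group_id]])

assert rebuilt == data[failed]
print(f"repaired block {failed} with {len(survivors) + 1} local reads")
```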
BN: Do enterprises need to deploy different storage solutions to meet the needs of different applications?
SP: If we are talking about data-intensive verticals like High Performance Computing, Big Data or High Performance Data Analytics, using multiple storage solutions might be a good idea for safety and scalability reasons. However, the division only works as long as the system can tie particular applications to particular locations, which is hardly ever the case with the cloud storage deployed by most enterprises.
In practice, we are faced with a disk volume running a multitude of virtual machines with all sorts of applications. In data storage jargon this issue has been dubbed the 'IO blender': a state of affairs in which mapping virtual locations to specific workload streams is challenging, if not impossible.
As always, there is a workaround. The answer could be another form of tiering that addresses spatial and temporal locality. In other words, the system knows which data is requested on a regular basis and where it is stored, so the information can be cached and fetched quickly on demand. The notion of 'where' shifts from the volume level down to the virtual machine (VM) or logical block address (LBA) level. Experts talk of a new VM-aware storage that would eventually break the notorious IO blender down into tangible bits.
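As a rough sketch of what such locality-aware tiering might look like (hypothetical names, nothing RAIDIX-specific), the snippet below caches blocks keyed by virtual machine and logical block address, so frequently requested data is served from a fast tier.

```python
from collections import OrderedDict
from typing import Callable

class BlockCache:
    """Toy LRU cache keyed by (vm_id, logical block address).

    A minimal illustration of locality-aware tiering: hot blocks stay in
    the fast tier, cold ones fall back to the slow tier. Real VM-aware
    storage tracks far richer access statistics than plain LRU.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._cache: OrderedDict[tuple[str, int], bytes] = OrderedDict()

    def read(self, vm_id: str, lba: int,
             slow_tier_read: Callable[[str, int], bytes]) -> bytes:
        key = (vm_id, lba)
        if key in self._cache:                  # temporal locality: cache hit
            self._cache.move_to_end(key)
            return self._cache[key]
        block = slow_tier_read(vm_id, lba)      # miss: fetch from the slow tier
        self._cache[key] = block
        if len(self._cache) > self.capacity:    # evict the least recently used
            self._cache.popitem(last=False)
        return block

# Usage (slow_tier_read is a placeholder for the backend read path):
#   cache = BlockCache(capacity=1024)
#   block = cache.read("vm-42", lba=8192, slow_tier_read=read_from_hdd_tier)
```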
BN: How does the RAIDIX solution differ from a conventional RAID installation?
SP: RAIDIX is built around high performance, resilience and quality of service (QoS). The solution employs sophisticated mathematical algorithms for faster RAID calculations. Even when hardware fails or slows down, RAIDIX sustains high performance while recovering the data.
We also have our own proprietary RAID levels, such as RAID N+M, which interleaves data blocks across N disks with M checksums and lets the user choose how many disks to allocate to checksums. RAID N+M requires at least eight disks and can sustain the complete failure of up to 64 drives in the same group.
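Purely as illustrative arithmetic (the disk counts and sizes below are made up, not RAIDIX recommendations), the trade-off in an N+M layout looks like this: M checksum disks buy tolerance of any M simultaneous failures at the cost of M disks' worth of capacity.

```python
def raid_nm_overview(n_data: int, m_checksums: int, disk_tb: float) -> dict:
    """Capacity and fault-tolerance arithmetic for an N+M stripe layout."""
    total_disks = n_data + m_checksums
    return {
        "disks in group": total_disks,
        "tolerated simultaneous failures": m_checksums,
        "usable capacity (TB)": n_data * disk_tb,
        "space efficiency": round(n_data / total_disks, 3),
    }

# Example: 20 data disks plus 4 checksum disks of 10 TB each.
print(raid_nm_overview(n_data=20, m_checksums=4, disk_tb=10.0))
# {'disks in group': 24, 'tolerated simultaneous failures': 4,
#  'usable capacity (TB)': 200.0, 'space efficiency': 0.833}
```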
What really makes RAIDIX stand out from the crowd is its effective use of artificial intelligence in data storage. The QoSmic feature, for one, allows priorities to be assigned to active client applications (rather than entire storage nodes) based on their behavior and operational patterns.
Budget-wise, the RAIDIX technology enables significant cost savings since it's compatible with standard off-the-shelf hardware, and scales up easily in line with the customer's needs.
Image Credit: Eugene Kouzmenok/Shutterstock