Why the IoT presents major data challenges [Q&A]
Internet of Things (IoT) devices are generating huge volumes of data and that represents a challenge for organizations when it comes to processing and storing it.
We spoke to Karthik Ranganathan, CTO of the company behind the distributed SQL database YugabyteDB, to find out how businesses can cope with the complexity and performance issues that handling IoT data and its associated metadata raises.
BN: In terms of data and engineering the back-end, what's the biggest question you should ask yourself before starting an IoT project?
KR: The most basic requirements of an IoT platform are availability and scalability. When you have connected devices all over the world, those devices send back data that you want to both collect and analyze over time. Furthermore, you want to process that data so that you can derive insights and act upon them. That is the main value proposition of IoT, and it seems simple enough. However, it does create engineering questions that you must be able to answer.
Take the example of a connected car -- you always want to know how much fuel is in it and where it is located (geographically speaking), so that you can determine if you have enough gas to get to the next station. You need sensors continually sending data to make that work, and you have to process that data. The resources need to be engineered so that this data processing can happen quickly irrespective of how many other cars start sending back information, and without any interruptions due to infrastructure failures.
BN: What are the data challenges of IoT?
KR: Your data tier has to be always available. You can't have a zone outage or a cloud outage impacting your business. Any outage means that a data set will be incomplete, calling the resulting analytics into question. An IoT solution simply cannot afford to compromise on availability.
Furthermore, whether or not you decide to store all of the data all of the time, the number of sensors can change. Growing data loads put pressure on your existing solutions, which creates issues if you can't scale at a rate that matches the increase in data.
The other challenge is that the devices are geographically distributed. This means you need the database to have geo-distribution features as well, in order to replicate and distribute the data across multiple geographies, allowing you to query it with low latency and removing the risk of data loss due to an outage.
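One way to picture the resilience that geo-replication buys is a majority quorum across regions: a write commits as long as most replicas acknowledge it, so losing any single region loses no data. This is a hypothetical sketch (region names and the acknowledgement logic are invented for illustration; real distributed SQL databases such as YugabyteDB use Raft-style consensus for this):

```python
# Toy majority-quorum write across three regions. With replication factor 3,
# a write needs 2 of 3 acknowledgements, so any one region can be down
# without losing data or availability. Region names are made up.
REPLICAS = ["us-east", "eu-west", "ap-south"]

def quorum_write(value, down=frozenset()):
    """Attempt a replicated write; commit only if a majority acknowledges."""
    acks = [r for r in REPLICAS if r not in down]
    majority = len(REPLICAS) // 2 + 1
    return {"acked_by": acks, "committed": len(acks) >= majority}

print(quorum_write(42)["committed"])                                 # True
print(quorum_write(42, down={"eu-west"})["committed"])               # True
print(quorum_write(42, down={"eu-west", "ap-south"})["committed"])   # False
```

The same majority rule applies on reads, which is why a single zone or region outage doesn't call the data set's completeness into question.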
BN: How do you deal with the scale of data?
KR: The best way to handle this is to have a database that can scale out -- so you can just add more nodes to support more data processing, deal with the read and write requests that come in, etc. Any other approach merely postpones the problem, kicking the can down the road rather than actually dealing with it.
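The scale-out idea above can be made concrete with a consistent-hash ring, one common way distributed databases spread data across nodes. This toy sketch (not any particular database's actual sharding scheme) shows why adding a node is cheap: only a fraction of keys move to the new node, rather than everything being reshuffled.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: adding a node moves only a fraction of keys."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes       # virtual nodes per physical node, for balance
        self.ring = []             # sorted list of (hash, node) pairs
        for n in nodes:
            self.add_node(n)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

keys = [f"device-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")  # scale out by one node
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4, not 100%
```

Because only the keys that now belong to the new node relocate, capacity can grow incrementally as the sensor fleet grows.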
BN: What are the best practices for securing IoT databases?
KR: The distributed nature of the problem makes security a top concern. In the context of a database you need to be able to provide the expected security features -- authentication and role-based access control. These let you manage who can access the data. Additionally, advanced security features such as encryption on the wire and encryption at rest are critical to protect the data from breaches.
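The role-based access control idea can be sketched as a mapping from roles to permitted actions. This is an illustrative model only (the role and table names are invented); in a real SQL database this is enforced server-side via GRANT/REVOKE rather than in application code.

```python
# Hypothetical RBAC check. Each role maps to a set of (table, action)
# permissions; a user holds one or more roles.
ROLE_PERMISSIONS = {
    "analyst": {("readings", "SELECT")},
    "ingest":  {("readings", "INSERT")},
    "admin":   {("readings", "SELECT"), ("readings", "INSERT"),
                ("devices", "SELECT"), ("devices", "UPDATE")},
}

def is_allowed(roles, table, action):
    """Grant access if any of the user's roles permits the action."""
    return any((table, action) in ROLE_PERMISSIONS.get(r, set()) for r in roles)

print(is_allowed(["analyst"], "readings", "SELECT"))  # True
print(is_allowed(["analyst"], "devices", "UPDATE"))   # False
print(is_allowed(["ingest", "analyst"], "readings", "INSERT"))  # True
```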
BN: How can enterprises reduce complexity while still gaining the benefits the data offers?
KR: There are multiple forms of complexity -- developing the app through which people derive insights from the connected devices is one layer. In this case, one basic challenge is that your insights are only as good as the data you collect. Thus, it's critical that you can scale to collect as much data as needed. Additionally, as developers try to build apps on top of IoT platforms, they’ll want a language that enables them to pose the right type of queries.
Furthermore, there are also significant operational complexities. First and foremost, your platform needs to be resilient. You should be able to withstand common outages, such as a node going down. Otherwise, key events might get missed, and end users won't be able to trust the insights your app pushes to them. If you are using a cloud provider to deploy your database, you should deploy across availability zones, which protects you against a zone outage.
BN: Why is metadata an important part of the equation and why does it make life more difficult?
KR: There are two types of data in the IoT. First, there are just the raw readings from every single device. In this case, a reading or two lost here and there should theoretically not be that big a deal. However, it is critical to gather, maintain, and process accurate information so that the insights you derive are not skewed in any way.
Then there's the information about the devices, users, privacy policies, and so on. You need to know how many devices you're tracking, who's using them, what rights they have, etc. This data, while smaller in volume, is critical, and hence must be transactional and consistent. Ultimately you need two databases and you have to keep them in sync, which creates complexity.
BN: Isn't this going to need a lot of time and effort?
KR: Many database solutions do indeed require more time and effort than organizations can stomach as they scale. Handling the two different workload types creates a level of complexity that many database solutions can't handle at scale. Fortunately there is the option of distributed SQL databases, which consolidate what would otherwise be multiple databases into tables within a single database, significantly reducing the demands placed on IT and the time each project consumes.
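The benefit of having both workload types in one database is that a single transaction can span the high-volume raw readings and the critical device metadata. This sketch uses Python's built-in sqlite3 as a stand-in for a transactional SQL database (the schema and names are invented for the example):

```python
import sqlite3

# One database, two kinds of tables: high-volume raw readings plus
# smaller-but-critical device metadata. sqlite3 stands in for a
# distributed SQL database here; table and column names are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE devices  (device_id TEXT PRIMARY KEY, owner TEXT);
    CREATE TABLE readings (device_id TEXT, ts INTEGER, fuel_pct REAL);
""")

def decommission(device_id):
    """Remove a device and its readings atomically -- no second store to sync."""
    with db:  # one transaction covers both metadata and raw data
        db.execute("DELETE FROM readings WHERE device_id = ?", (device_id,))
        db.execute("DELETE FROM devices  WHERE device_id = ?", (device_id,))

db.execute("INSERT INTO devices VALUES ('car-1', 'alice')")
db.executemany("INSERT INTO readings VALUES ('car-1', ?, ?)",
               [(1, 80.0), (2, 79.5)])
decommission("car-1")
print(db.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 0
```

With two separate databases, a crash between the two deletes could leave orphaned readings; a single transactional database removes that whole class of sync problem.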
BN: Won't all this be very expensive?
KR: To keep this affordable you need a database that allows you to start with what you need and scale as you go. Scaling should be easy, and each node should be able to store as much data as you need. You should not have to scale out simply because your database can't handle the volume of data requests. Ultimately, if you invest in a scalable solution at the outset, your costs should not balloon.
BN: How can database latency be reduced?
KR: There are a number of best practices for organizing your data to ensure low latency. If you start with that mindset, you save yourself a lot of hassle later on.

Firstly, the database should be engineered for performance from the ground up. The language the database is written in has a profound impact on performance. Databases implemented in Java and Go typically require garbage collection, which leads to higher query latencies, especially as memory size increases. C++, meanwhile, can deliver very high performance because it doesn't require garbage collection.

Secondly, it's important to take into account how the data is organized on disk. A good data model makes it easy and efficient to retrieve data even as data sizes grow.

Thirdly, the type of disk matters a lot. SSDs usually improve performance if the database is architected to take advantage of them. Additionally, different types of disks come with different tradeoffs in performance and resilience, so it is important to think through these choices. For example, locally attached SSDs give low latency, but require a distributed database that replicates data across nodes for resilience against failures.
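The on-disk organization point can be illustrated with a time-series layout: clustering readings by a composite key of device and timestamp turns "latest readings for one device" into a tight range scan instead of a full-table filter. Again sqlite3 stands in for the database, and the schema names are invented:

```python
import sqlite3

# Composite primary key (device_id, ts): rows for one device are stored
# contiguously in key order, so per-device, time-ordered queries stay fast
# as total data volume grows.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE readings (
        device_id TEXT,
        ts        INTEGER,
        fuel_pct  REAL,
        PRIMARY KEY (device_id, ts)
    )
""")
rows = [(f"car-{d}", t, 100.0 - t) for d in range(100) for t in range(100)]
db.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Range scan: touches only one device's contiguous slice of the key space.
latest = db.execute(
    "SELECT ts, fuel_pct FROM readings "
    "WHERE device_id = ? ORDER BY ts DESC LIMIT 3", ("car-7",)
).fetchall()
print(latest)  # [(99, 1.0), (98, 2.0), (97, 3.0)]
```

The same principle carries over to distributed databases, where the choice of partition and clustering columns determines whether a query hits one node's contiguous data or scatters across the cluster.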