Biggest Hadoop mistakes and how to avoid them


Hadoop, for all its strengths, is not without its difficulties. Business needs, specialized skills, data integration, and budget all have to factor into planning and implementation. Even when they do, a large percentage of Hadoop implementations fail.

To help others avoid common mistakes with Hadoop, I asked our consulting services and enterprise support teams to share their experiences working with organizations to develop, design and implement complex big data, business analytics or embedded analytics initiatives. These are their top seven mistakes, and some advice on how to avoid them.

Mistake 1: Migrate everything before devising a plan

As tempting as it can be to dive head first into Hadoop, don’t start without a plan. Migrating everything without a clear strategy will only create long-term issues resulting in expensive ongoing maintenance. With first-time Hadoop implementations, you can expect a lot of error messages and a steep learning curve.

Successful implementation starts by identifying a business use case. Consider every phase of the process -- from data ingestion to data transformation to analytics consumption, and even beyond to other applications and systems where analytics must be embedded. It also means clearly determining how Hadoop and big data will create value for your business.

My advice: Maximize your learning in the least amount of time by taking a holistic approach and starting with smaller test cases. Like artisan gin, good things come in small batches!

Mistake 2: Assume relational database skillsets are transferable to Hadoop

Hadoop is a distributed file system, not a traditional relational database (RDBMS). You can’t migrate all your relational data and manage it in Hadoop the same way, nor can you expect skillsets to be easily transferable between the two.
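
As a rough illustration of the difference, consider how you would pull a filtered set of records. A relational database evaluates "SELECT * FROM orders WHERE amount > 100" inside the engine; with files sitting in HDFS you typically read raw data off the distributed file system and do the work in your own code or in a framework such as Spark. The sketch below assumes pyarrow built with HDFS (libhdfs) support, and the host, port, and file path are placeholders.

    # Sketch: reading a raw file straight from HDFS and filtering it client-side.
    # Assumes pyarrow with libhdfs; host, port, and path are hypothetical.
    import pandas as pd
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    with hdfs.open_input_stream("/data/raw/orders.csv") as stream:
        orders = pd.read_csv(stream)        # the whole file is parsed client-side

    large_orders = orders[orders["amount"] > 100]   # filtering happens in your code, not in a SQL engine
    print(large_orders.head())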

If your current team lacks Hadoop skills, it doesn’t necessarily mean you have to hire all new people. Every situation is different, and there are several options to consider. It might work best to train up existing people and add a few new hires. You might be able to plug skills gaps with point solutions in some instances, but growing organizations tend to do better in the long run with an end-to-end data platform that serves a broad spectrum of users.

My advice: While Hadoop does present IT organizations with skills and integration challenges, it’s important to look for software that, combined with the right people, agility, and functionality, will make you successful. More tools are now available that automate some of the more routine and repetitive aspects of data ingestion and preparation, for example.

Mistake 3: Treating a Hadoop data lake like a regular database

You can’t treat a data lake on Hadoop like a regular database in Oracle, HP Vertica, or Teradata, for example. Hadoop’s structure is totally different. It also wasn’t designed to store anything you’d normally put on Dropbox or Google Drive. A good rule of thumb is: if it can fit on your desktop or laptop, it probably doesn’t belong on Hadoop!

Data in a lake exists in a very raw form. Think of a box of Lego: it contains everything you need to build a Star Wars figurine, but it’s not a figurine out of the box. People imagine a data lake to be pristine, clear, and easy to navigate, but as your organization scales up to onboard hundreds or more data sources, it often ends up three miles wide, two inches deep and full of mud! IT time and resources can easily get monopolized, creating hundreds of hard-coded, error-prone data movement procedures.

My advice: Take the proper steps up front to understand how best to ingest data and build a working data lake. Otherwise, you’ll end up with a data swamp: everything will be there, but you won’t be able to derive any value from it.
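
One small habit that helps keep a lake from turning into a swamp is landing every source under a predictable, partitioned path from day one. The layout, source names, and use of the standard "hdfs dfs" command-line client below are only an illustrative sketch of that idea, not a prescribed standard.

    # Sketch: land incoming files under a consistent raw-zone layout such as
    #   /datalake/raw/source=<system>/ingest_date=<YYYY-MM-DD>/
    # so every dataset stays findable. Paths and source names are placeholders.
    import subprocess
    from datetime import date

    def land_raw_file(local_path: str, source: str) -> str:
        """Copy a local file into the raw zone under a dated, per-source directory."""
        target_dir = f"/datalake/raw/source={source}/ingest_date={date.today():%Y-%m-%d}"
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, target_dir], check=True)
        return target_dir

    # Example: land_raw_file("orders_2024_01_01.csv", source="crm")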

Mistake 4: I can figure out security later

High profile data breaches have motivated most enterprise IT teams to prioritize protecting sensitive data. If you’re considering using big data, it’s important to bear in mind that you’ll be processing sensitive data about your customers and partners. Never, ever, expose credit card and bank details, national insurance numbers, proprietary corporate information and personally identifiable information about clients, customers or employees. Protection starts with planning ahead, not after deployment.

My advice: Address each of the following security solutions before you deploy a big data project (a minimal configuration check is sketched after this list):

  • Authentication: Control who can access clusters and what they can do with the data
  • Authorization: Control what actions users can take once they’re in a cluster
  • Audit and tracking: Track and log all actions by each user as a matter of record
  • Compliant data protection: Utilise industry standard data encryption methods in compliance with applicable regulations
  • Automation: Prepare, blend, report and send alerts based on a variety of data in Hadoop
  • Predictive analytics: Integrate predictive analytics for near real-time behavioural analytics
  • Best practices: Blending data from applications, networks and servers as well as mobile, cloud, and IoT data
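
To make "planning ahead" a little more concrete, here is a minimal pre-deployment check, offered only as a sketch: it reads a local copy of core-site.xml and confirms that Kerberos authentication and service-level authorization are switched on. The file location is an assumption, and a real review would cover far more (encryption at rest and in transit, audit log retention, and so on).

    # Sketch: a pre-deployment sanity check on two standard Hadoop security settings.
    # Assumes core-site.xml is readable at the path below; adjust for your cluster.
    import xml.etree.ElementTree as ET

    REQUIRED = {
        "hadoop.security.authentication": "kerberos",  # who can access the cluster
        "hadoop.security.authorization": "true",       # service-level authorization
    }

    def check_security(conf_path="/etc/hadoop/conf/core-site.xml"):
        found = {p.findtext("name"): p.findtext("value")
                 for p in ET.parse(conf_path).iter("property")}
        for name, expected in REQUIRED.items():
            actual = found.get(name)
            status = "OK" if actual == expected else f"NOT SET CORRECTLY (found {actual!r})"
            print(f"{name}: {status}")

    check_security()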

Mistake 5: Common strategic mistakes

HiPPO is an acronym for the "highest paid person's opinion." Trusting one person’s educated opinion over data may work occasionally, but Hadoop is complex and requires strategic inquiry to fully understand the nuances of when, where, and why to use it. To start, it’s important to understand what business goals you’re trying to reach with Hadoop, who will benefit, and how the spend will be justified. Most big data projects fail because the expected business value is never clearly defined, let alone achieved.

Once a data problem has been established, next determine whether or not your current architecture will help you achieve your big data goals. If you’re concerned about exposure to open source or unsupported code, it may be time to explore commercial options with support and security.

My advice: Once a business need for big data has been established, decide who will benefit from the investment, how it will impact your infrastructure, and how the spend will be justified. Also, try to avoid "science projects" -- technical exercises with limited business value.

Mistake 6: Bridging the skills gap with traditional ETL

Plugging the skills gap can be tricky for organizations considering how to solve big data’s ETL challenges. There just aren’t enough IT pros with Hadoop skills to go around. On the other hand, some developers proficient in Java, Python, and HiveQL, for example, may lack the experience to optimize performance on relational databases. When Hadoop and MapReduce are used for large scale traditional data management workloads such as ETL, this problem intensifies.
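
To make the skills point concrete, here is a deliberately small sketch of the kind of cluster-side ETL step such developers write, expressed with PySpark rather than hand-written MapReduce. The input and output paths, column names, and cleansing rules are all hypothetical.

    # Sketch: a minimal ETL step on the cluster using PySpark.
    # Paths, columns, and rules are placeholders for a real pipeline.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-etl-sketch").getOrCreate()

    raw = spark.read.csv("hdfs:///datalake/raw/source=crm/", header=True, inferSchema=True)

    cleaned = (
        raw.dropDuplicates(["order_id"])                    # de-duplicate on the business key
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount").isNotNull())             # drop rows that failed the cast
    )

    cleaned.write.mode("overwrite").parquet("hdfs:///datalake/curated/orders/")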

Some point solutions can help plug the skills gap, but these tend to work best for experienced developers. If you’re dealing with smaller data sets, it might work to hire people who’ve had the proper training on big data and traditional implementations, or work with experts to train and guide staff through projects. But if you’re dealing with hundreds of terabytes of data, for instance, then you’ll need an enterprise-class ETL tool as part of a comprehensive business analytics platform.

My advice: Technology only gets you so far. People, experience, and best practices are essential for successful Hadoop projects. When considering an expert or a team of experts as permanent hires or consultants, you’ll want to consider their experience with "traditional" as well as big data integration, the size and complexity of the projects they’ve worked on, the companies they’ve worked with, and the number of successful implementations they’ve done. When dealing with very large volumes of data, it may be time to evaluate a comprehensive business analytics platform that’s designed to operationalize and simplify Hadoop implementations.

Mistake 7: I can get enterprise-level value on a small budget 

The low-cost scalability of Hadoop is one reason why organizations decide to use it. But many organizations fail to factor in data replication/compression (storage space), skilled resources, and the overall management of integrating big data with their existing ecosystem.

Remember, Hadoop was built to process a variety of enormous data files that continue to grow. And once data is ingested, it gets replicated! For example, if you have 3TB you want to bring in, that will immediately require 9TB of storage space, because Hadoop replicates each block three times by default (replication is part of what makes Hadoop’s parallel processing so powerful).

So, it’s absolutely essential to do proper sizing up front. This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels. While you can compress data, compression affects performance and needs to be balanced against your expectations for reading and writing data. Also, storing the data may cost 3x more than you initially planned.
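
A back-of-the-envelope calculation makes the sizing point. Every number below -- the replication factor, the assumed compression ratio, and the growth rate -- is a placeholder to swap for your own estimates.

    # Sketch: rough storage sizing for a Hadoop cluster. All inputs are assumptions;
    # plug in your own replication factor, compression ratio, and growth estimate.
    def estimate_storage_tb(raw_tb, replication=3, compression_ratio=2.0,
                            yearly_growth=0.5, years=3):
        on_disk = raw_tb / compression_ratio * replication      # compressed, then replicated
        return on_disk * (1 + yearly_growth) ** years           # projected growth over the horizon

    # 3TB of raw data, 3x replication, 2:1 compression, growing 50 percent a year:
    print(f"{estimate_storage_tb(3):.1f}TB needed within three years")   # ~15.2TB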

My advice: Understand how storage, resources, growth rates, and management of big data will factor into your existing ecosystem before you implement.

Wael Elrifai, senior director of Sales Engineering, Pentaho.

Published under license from ITProPortal.com, a Future plc Publication. All rights reserved.
