From big data pilot to production
According to Gartner, more than half of all big data projects fail to make it beyond the pilot stage. It’s therefore important to consider what it takes to make big data projects successful in a production environment, and to build in the right elements from the outset.
If you’ve had a successful pilot, it’s a good time to step back, generalize, and reexamine any expedient choices you may have made. And if you’re not yet at the pilot stage, even better -- you can build your pilot on sound principles and a proven approach, which will make the project go faster.
Define a strategy for success
Successful big data initiatives are underpinned by a strategy. Your strategy will help determine architecture investments, identify stakeholders who will carry over from pilot to production, and identify which datasets will be landed first. The strategy should consist of:
- An opportunity assessment -- Explore and analyze some of the opportunities for your organization to create new value from big data, whether on its own or in combination with existing data sources.
- An analysis of desired new capabilities -- Determine which new capabilities you need in the architecture. This is where Hadoop and other important technologies like Spark play a role.
- A roadmap built on a collaborative vision from business and technology stakeholders -- Working together, business and technology stakeholders should identify and prioritize the most valuable use cases to invest in over the next 12 to 18 months. Identify different datasets that will land in the Hadoop cluster, and develop an understanding of how business milestones and analytics initiatives align with big data initiatives.
- An architecture definition that supports your use cases -- This high-level architecture should identify the core functions of a big data solution that will support the use cases you identified. It should account for real-time data handling, long-term storage, data access management, datasets exported to Hadoop for end users, a way to consistently capture metadata, and more. If you build your architecture with your top use cases in mind, it will be easier to support future use cases as well.
Moving to development
Once you’ve created a strategy and have a roadmap to guide your way, it’s time to move to development. Lack of business sponsorship is often the cause of stalled or abandoned big data projects, so start with a use case that can be implemented quickly and will get end users involved early. This will help accelerate time to value while also ensuring end users’ long-term support.
The next step is tackling operations and supportability. You can set up the environment and DevOps procedures in parallel with the pilot for maximum speed. Or you can prove the value first with a pilot, then tackle how to go to production in a supportable way as a fast follow-up.
Put together an operations support plan to help the applications and operations teams work together and develop the skills needed to support the environment. Operations must understand how to set up, configure, and support Hadoop clusters; if the team comes from a system administration or database administration background, bear in mind that the learning curve for managing Hadoop clusters is fairly steep and additional expertise may be needed (see my article, Don’t Expect Your DBA to Do a Hadoop Expert’s Job). Organizations often find it difficult to scale their activities around Hadoop and to develop production data flows.
From development to production
When moving from development to production, some core elements are crucial to ensure success:
- Capacity planning -- Using a few months’ worth of data, create test runs in the pre-production environment. This lets you measure how much data will be processed and estimate disk space, CPU, and memory requirements; from there you can extrapolate how much capacity a few years’ worth of data will consume in production (see the capacity sketch after this list). Combine this with an analysis of future use cases in the roadmap, which may require more resources than the first ones being piloted.
- Performance planning -- It’s easier to configure and optimize a Hadoop cluster for target applications if you first establish a baseline. Run benchmarks like TeraSort to estimate performance, give the team a baseline of expected cluster throughput, and surface any configuration issues (see the benchmark sketch after this list).
- DevOps practices -- Moving from development to production requires a strong partnership between your development and operations teams. The teams often need to meet daily when testing data ingestion in pre-production to analyze log data and troubleshoot issues.
- Model monitoring -- It’s important to measure the effectiveness of predictive analytics models, to keep monitoring them as conditions change, and to continuously test any improvements you make (see the drift-monitoring sketch after this list).
- Supportability -- For continued success, supportability and the ability to meet SLAs are the most crucial elements of all, especially as you deploy multiple big data applications in production. You need to know what’s going on with your data feeds and jobs, be able to fix issues, manage resource allocation among competing uses, and follow best practices for software development, testing, and deployment.
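To make the capacity-planning step concrete, here is a minimal back-of-the-envelope sketch in Python. The ingest rate, growth rate, replication factor, and overhead figures are hypothetical placeholders -- substitute the numbers you actually measure in your pre-production test runs.

```python
"""Rough capacity extrapolation from pilot measurements.

A minimal sketch: all constants below are assumed placeholder values,
not recommendations -- replace them with measured figures.
"""

DAILY_INGEST_GB = 50.0   # measured average raw ingest per day (assumed)
ANNUAL_GROWTH = 0.30     # assumed year-over-year growth in data volume
HDFS_REPLICATION = 3     # default HDFS replication factor
OVERHEAD = 1.25          # headroom for intermediate/temp data (assumed)

def projected_storage_tb(years: int) -> float:
    """Project the HDFS capacity consumed after `years` of ingest."""
    total_gb = 0.0
    rate = DAILY_INGEST_GB
    for _ in range(years):
        total_gb += rate * 365
        rate *= 1 + ANNUAL_GROWTH  # each year ingests more than the last
    return total_gb * HDFS_REPLICATION * OVERHEAD / 1024

for y in (1, 2, 3):
    print(f"Year {y}: ~{projected_storage_tb(y):,.1f} TB of HDFS capacity")
```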
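For the performance baseline, TeraSort is typically driven from the Hadoop examples jar. The sketch below wraps the teragen and terasort steps and reports throughput; the jar path is an assumption that varies by distribution, and the row count is deliberately small for a smoke test -- scale it up for a meaningful baseline.

```python
"""Time a TeraSort baseline run -- a minimal sketch.

Assumes a working Hadoop client on the PATH; adjust EXAMPLES_JAR
for your distribution.
"""
import subprocess
import time

EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"  # assumed path
ROWS = 10_000_000  # TeraGen rows are 100 bytes each -> ~1 GB here

def run(*args: str) -> float:
    """Run a hadoop examples-jar command; return elapsed wall-clock seconds."""
    start = time.time()
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, *args], check=True)
    return time.time() - start

# Generate the input data, then time the sort itself.
run("teragen", str(ROWS), "/benchmarks/terasort-input")
elapsed = run("terasort", "/benchmarks/terasort-input", "/benchmarks/terasort-output")

gb = ROWS * 100 / 1e9
print(f"Sorted {gb:.1f} GB in {elapsed:.0f}s ({gb / elapsed * 3600:.1f} GB/hour)")
```

Recording this throughput figure before any application tuning gives the team a reference point for spotting regressions and misconfigured nodes later.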
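For model monitoring, one simple approach is to track the model’s score distribution over time. The sketch below uses the population stability index (PSI), a common drift measure; the 0.2 alert threshold is a rule-of-thumb assumption, and the random samples stand in for your real validation and production scores.

```python
"""Monitor a predictive model's score distribution for drift.

A minimal sketch using the population stability index (PSI); the 0.2
threshold is a rule-of-thumb assumption, not a universal standard.
"""
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline score sample and a current one."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
    b_frac = np.histogram(baseline, edges)[0] / len(baseline)
    c_frac = np.histogram(current, edges)[0] / len(current)
    b_frac = np.clip(b_frac, 1e-6, None)  # avoid log(0) on empty bins
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

# Stand-in data: compare "production" scores against a validation baseline.
baseline_scores = np.random.beta(2, 5, 10_000)
current_scores = np.random.beta(2.5, 5, 10_000)
score = psi(baseline_scores, current_scores)
print(f"PSI = {score:.3f}" + ("  -> investigate drift" if score > 0.2 else ""))
```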
Finally, it’s important to understand that moving from pilot to production is a different game. If done properly, your big data initiatives should benefit from continuous improvement as team members learn what works and apply their knowledge to the additional use cases you’ve identified.
Ron Bodkin is Think Big's President and Founder. He founded Think Big to help companies realize measurable value from Big Data. Previously, Ron was VP Engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making. Prior to that, Ron was Founder of New Aspects, which provided enterprise consulting for aspect-oriented programming.