How automation is changing data science and machine learning
Almost any article you read about how automation will affect our future falls into one of two narratives. The first is that it will definitely lead to a better future, as it always has since the industrial revolution. Of course, some people will lose their jobs, but as history shows, new jobs will be created. And not just new jobs, but better jobs. The other narrative is that this time is different. The robots are becoming more and more intelligent and capable, and the number of jobs and industries they’ll destroy will far exceed the number of jobs they create. Of course, it’s impossible to tell which of the two narratives will become a reality. What we can tell is that both narratives share the same starting point: more and more parts of our jobs and lives are being automated.
Take for example the process of driving. For many years now, we have been taking small parts of the driving process and automating them. For a better driving experience, we built cruise control. For route planning, we developed the GPS. Now, we are able to tackle more complex problems like lane merging and emergency braking. In the next couple of years, we will most certainly have fully autonomous cars driving on the roads. Just last week Waymo, Alphabet’s self-driving subsidiary, officially received the very first California permit to test their vehicles in the state without a human behind the wheel.
All these developments illustrate an important lesson: automation happens in steps. And this process can be observed in many jobs. The only major variable is the timeframe over which it happens. If you’re a doctor, it’s going to take a while until major parts of your job are automated. If you’re an entry-level accountant, the timeframe is much shorter. Just as with driving, gradual automation is happening in data science as well. Although this is a complex topic, I’ll focus on two important developments that are changing the data science landscape:
- Data science platforms
- Automated machine learning
Before I start talking about data science platforms, let me give you a short introduction to what data science is. In essence, it is a field that uses tools taken from computer science, statistics, and machine learning to extract insights from data. In other words, if you have some data and you want to make decisions or predictions based on it, you use data science. However, extracting information from big data sets can be expensive. To implement any type of big-data project, a company must first build a data infrastructure. Think of it as the different pieces of technology that run all the tools a data scientist needs. The issue is that for many years building such an infrastructure was like building a car from parts. It was possible, but you needed people with highly specialized skills, and it took a lot of money and time. Fortunately, this is changing. What we have seen in the past few years is the appearance of platforms that automate this process. Take for example the various cloud-based platforms that make it much easier to develop and maintain big-data infrastructures, from my own team to others in the market like Amazon Web Services (AWS), Google, Microsoft, and Anaconda.
Automated big data platforms are only part of the story. Although they enable us to set up and maintain data infrastructures more easily, somebody still needs to write code to clean the data and experiment with machine learning models. This process can be quite time-consuming and needlessly complex. To understand what I mean, let me walk you through the data science workflow.
Usually, every data science project consists of three parts: data processing, modeling, and deployment. When it comes to how time is spent on each of these parts, we have this rule of thumb: 50-60 percent is spent just on processing the data, and the rest is spent on modeling and deployment. As you can imagine, this is a very inefficient use of a data scientist’s time. But it’s necessary, because real-world data is messy and algorithms can’t deal with messy, unstructured data sets. Modeling, on the other hand, is an iterative process. For any specific problem, it’s impossible to know beforehand which exact algorithm is going to be the best. As a result, a data scientist has to try out many different algorithms until they arrive at a well-performing one.
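To make the iterative nature of modeling concrete, here is a minimal sketch of trying several algorithms on the same data and comparing them with cross-validation. It assumes scikit-learn is installed; the synthetic data set, the three candidate models, and the helper name `compare_models` are illustrative choices of mine, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a cleaned, real-world data set (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate algorithms to try; the selection is arbitrary.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated accuracy for each candidate model."""
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in models.items()
    }


scores = compare_models(candidates, X, y)
best_name = max(scores, key=scores.get)

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

Which model comes out on top depends entirely on the data, which is exactly why this loop has to be repeated for every new problem.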
Another source of inefficiency is that it can be difficult to take a model and package it in a way that can be launched in production. Many times the machine learning pipeline that was built during the modeling part needs to be broken apart and reconstructed to make sure it’s production-safe. To address all these inefficiencies, in the past few years analytics vendors have started developing products that take the entirety of this workflow and integrate it into one end-to-end platform. Think of these platforms as operating systems for data science. They bring three big innovations. First, they automate a lot of the data processing part. Second, they make it very easy to keep track of all the developed models and their parameters. Third, they make it easier to launch algorithms and models into production.
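One common way to reduce the gap between modeling and production is to keep preprocessing and model together in a single object, so nothing has to be taken apart later. The sketch below shows that idea with a scikit-learn `Pipeline` serialized with Python's `pickle` module. It is an illustration under my own assumptions (synthetic data, a scaler plus logistic regression), not a description of how any of the commercial platforms work internally.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real training set (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Bundle the preprocessing step and the model into one object.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# The fitted pipeline records its own parameters, which helps with tracking.
tracked_params = pipeline.get_params()

# Serialize to bytes and restore, as a deployment step might do.
serialized = pickle.dumps(pipeline)
restored = pickle.loads(serialized)

original_predictions = pipeline.predict(X[:5])
restored_predictions = restored.predict(X[:5])
print((original_predictions == restored_predictions).all())
```

The example keeps the serialized pipeline in memory to stay self-contained; in practice it would typically be written to a file or a model registry.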
To give you a few examples, Alteryx has a smart and easy-to-use data science platform. They’re a company that had their IPO a little more than a year ago and have already seen their share price double. Other players in the sector include KNIME, RapidMiner, and H2O.ai. Just as cars have all this technology that makes drivers more efficient, these platforms promise to help people navigate the data science workflow better.
However, that still doesn’t mean you can get rid of your data scientist. There’s still a need for a person who can build and interpret models. The issue is that these skills are hard to come by, and they’re also expensive to pay for. One of the biggest hurdles companies face when trying to build advanced analytics projects is the lack of skilled data scientists. This is where my next topic comes into play: automated machine learning. Some of these analytics companies have gone a step further and started integrating automated machine learning systems into their platforms. These are the kind of systems where, with just minimal intervention, you can drop your data in and get a collection of models out. The biggest advantage these systems bring to the table is that they open predictive analytics to a much wider audience. They can be incredibly powerful tools in helping non-technical employees solve simpler prediction problems like customer churn. Although there are more and more companies that offer automated machine learning solutions, like DataRobot, I would like to mention a couple of interesting open source Python projects that anyone can try out.
The first of these tools is Featuretools, which is an amazing library for automatically building features. If you’re unfamiliar with feature engineering, it is the process of building input variables for a machine learning model.
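As a rough illustration of what feature engineering means in practice, the sketch below builds per-customer input variables by aggregating a table of transactions. It is written by hand with pandas and does not use the Featuretools API; the tables, column names, and the helper `build_customer_features` are hypothetical and exist only for this example. Automated tools aim to generate aggregations of this kind without each one being written out manually.

```python
import pandas as pd

# Hypothetical toy tables: one row per customer, many rows per transaction.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.5, 7.5],
})


def build_customer_features(customers, transactions):
    """Aggregate raw transactions into one feature row per customer."""
    aggregated = (
        transactions.groupby("customer_id")["amount"]
        .agg(
            n_transactions="count",
            total_amount="sum",
            mean_amount="mean",
            max_amount="max",
        )
        .reset_index()
    )
    # A left join keeps customers that have no transactions at all.
    features = customers.merge(aggregated, on="customer_id", how="left")
    # For those customers, a count and total of zero is meaningful;
    # the mean and max are left as missing values.
    features["n_transactions"] = features["n_transactions"].fillna(0).astype(int)
    features["total_amount"] = features["total_amount"].fillna(0.0)
    return features


features = build_customer_features(customers, transactions)
print(features)
```

Each resulting column (number of transactions, total spend, and so on) is a candidate input variable for a model such as a churn predictor.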
While Featuretools takes care of only a small but important part of the model building pipeline, the following two projects are actual automated machine learning libraries, able to test and build advanced machine learning models automatically.
One of these libraries is TPOT, which uses genetic programming to come up with performant machine learning pipelines automatically. Although it’s quite user-friendly, it can take a long time to come up with solutions.
The other library is auto-sklearn, which is based on meta-learning: a very interesting concept that uses knowledge gained from thousands of previous datasets and machine learning models to decide which models and parameters could work best for a given data set.
Although these projects are still in their infancy, they can give great insight into how automated machine learning systems work and the benefits they can bring to the data science process.
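To give a feel for how an evolutionary search over models can work, here is a deliberately simplified sketch. It is not TPOT's algorithm or API: TPOT evolves whole pipelines with genetic programming, while this toy version only mutates the hyperparameters of a single decision tree and keeps the best candidates in each generation. It assumes scikit-learn is installed; the search space, population size, and function names are all hypothetical.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the sketch is self-contained (assumption, not real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space: a candidate is one choice per hyperparameter.
SEARCH_SPACE = {
    "max_depth": [1, 2, 3, 5, 8, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

_score_cache = {}


def fitness(candidate):
    """Mean 3-fold cross-validated accuracy of one candidate configuration."""
    key = tuple(sorted(candidate.items()))
    if key not in _score_cache:
        model = DecisionTreeClassifier(random_state=0, **candidate)
        _score_cache[key] = cross_val_score(model, X, y, cv=3).mean()
    return _score_cache[key]


def random_candidate(rng):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def mutate(candidate, rng):
    """Copy a candidate and resample one randomly chosen hyperparameter."""
    child = dict(candidate)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def evolve(generations=4, population_size=6, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        survivors = ranked[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    best = max(population, key=fitness)
    return best, fitness(best), history


best, best_score, history = evolve()
print(best, round(best_score, 3))
```

Because the best candidates always survive into the next generation, the best score can only stay the same or improve. The price is the number of models that have to be trained along the way, which is why searches of this kind can take a long time on real data.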
With all these tools geared towards automating parts of the data science workflow, the big question that arises is: "Can the job of a data scientist be fully automated?" The short answer is no. Automated machine learning does not eliminate the hard parts of a data scientist’s job, such as listening to clients, understanding the business problem, and figuring out how to craft a solution. It automates the repetitive and time-consuming parts of the job. What these tools can achieve, however, is to help analytical toolsets become more widely adopted by the general public. And we have already seen this happening with a new breed of 'citizen data scientists': people from non-technical backgrounds who work on and deliver analytical projects through the use of data science tools.