Automated Machine Learning

Traditionally, the development of predictive solutions is a challenging and time-consuming process that requires expert resources in software development, data engineering, and data science. Engineers must complete the following tasks in an iterative, cyclical manner.

  1. Preprocess, clean, and feature engineer the data
  2. Select an appropriate model
  3. Tune hyperparameters
  4. Analyze results
  5. Repeat
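The manual loop above can be sketched with scikit-learn; the dataset, scaler, and parameter grid below are illustrative choices, not a prescription.

```python
# A minimal sketch of the manual workflow: preprocess, pick a model,
# tune hyperparameters, and analyze results on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: preprocessing lives inside the pipeline so it is re-fit per fold
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

# Step 3: cross-validated grid search over one hyperparameter
grid = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Step 4: analyze results on data the search never saw
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```

In practice you would repeat this whole loop (step 5) with new features and new candidate models, which is exactly the part AutoML tries to take off your plate.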

As the industry identified the blockers that make machine learning development costly, we (as a community) began looking for ways to automate the process so that intelligent solutions are easier and faster to deploy. Model selection and tuning, in particular, can be automated, making the analysis of results easier for expert and non-expert developers alike.

Automated machine learning takes a defined dataset with a specific target feature and automatically iterates over it with different algorithms and combinations of input variables to select the best model. The goal is to make developing these solutions require fewer resources, less domain knowledge, and less time.
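At its core, this is a loop over candidate algorithms that keeps the one with the best cross-validated score. A minimal sketch with scikit-learn follows; the candidate set here is an arbitrary illustration, and real AutoML libraries also search feature combinations and hyperparameters.

```python
# Toy version of the AutoML core loop: try several algorithms on the
# same dataset and keep the best mean cross-validation score.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# Score every candidate with 5-fold cross-validation
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```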

How it Works

Most Auto ML libraries available today are built to solve supervised learning problems. If you are unfamiliar, there are two main categories of machine learning.

  • Supervised Learning: you have input variables and output variables, and you apply algorithms to learn the mapping function from inputs to outputs.
  • Unsupervised Learning: you have input variables but no output variables to map them to. The goal is typically to identify trends and patterns in the data.

Note that there is a third category, semi-supervised learning, which is simply a combination of the two categories above; we will not get into it here.
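A quick side-by-side of the two main categories, fitting scikit-learn estimators on the same inputs; the specific estimators are arbitrary choices for illustration.

```python
# Same inputs X; supervised learning uses the labels y,
# unsupervised learning ignores them and looks for structure.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn the mapping from inputs X to known outputs y
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: no labels, just group similar rows into clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:1]), sorted(set(km.labels_)))
```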

In order to use automated machine learning, your dataset must be feature engineered, meaning you manually develop the transformations that turn raw data into a machine-learning-ready dataset for your problem. Most Auto ML libraries have built-in transformation functions covering the most popular transformation steps, but in my experience these functions are rarely enough to get data machine learning ready.

Once you have feature engineered your dataset, the developer simply needs to determine the type of algorithm they need. Most supervised learning algorithms can be classified as:

  • Classification: The output variable is one of a fixed set of outcomes. For example, predicting whether a customer will return to a store is either a “yes” or a “no”. Classification is further broken into binary classification (2 outcomes) and multiclass classification (3 or more outcomes).
  • Regression: The output is a numeric value. For example, predicting the price of a car or house.

When given an algorithm type, Auto ML libraries run iterations over your dataset to determine the best combination of features and the best hyperparameters for each algorithm. In effect, they train many models and hand the engineer the best one.
I would like to highlight the difference between engineering columns for machine learning and selecting the appropriate columns for machine learning. For example, let's assume I want to predict how many point-of-sale transactions will occur during every hour of the day. The raw dataset is likely transactional, so a developer will need to summarize the data at the hour level, i.e., grouping, summing, and averaging. Often, developers will also create custom functions to describe trends in the dataset. This process is feature engineering.
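A minimal sketch of that hourly summarization with pandas; the column names and values below are made up for illustration.

```python
# Hypothetical point-of-sale feature engineering: roll raw
# transactions up to the hour level by grouping, counting,
# summing, and averaging.
import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 09:15", "2023-01-01 09:40",
        "2023-01-01 10:05", "2023-01-01 10:50", "2023-01-01 10:55",
    ]),
    "amount": [12.50, 8.00, 30.00, 5.25, 19.75],
})

hourly = (transactions
          .groupby(transactions["timestamp"].dt.hour)
          .agg(transaction_count=("amount", "size"),
               total_sales=("amount", "sum"),
               avg_sale=("amount", "mean"))
          .rename_axis("hour")
          .reset_index())
print(hourly)
```

The resulting table — one row per hour with count, sum, and average — is the kind of machine-learning-ready dataset an Auto ML library expects as input.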

Feature selection comes after feature engineering. I may summarize my dataset with 10 different columns that I believe will be useful, but Auto ML libraries may select the 8 best columns out of the 10.
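That kind of feature selection can be sketched with scikit-learn's `SelectKBest`, standing in here for whatever an Auto ML library does internally; the synthetic dataset is purely illustrative.

```python
# Sketch of feature selection: keep the 8 most informative of 10
# engineered columns, scored by a univariate ANOVA F-test.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 engineered columns, only some of which carry signal
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=8).fit(X, y)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```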

The difference between feature engineering and feature selection is huge. Most libraries will handle common or simple data engineering processes; however, the majority of the time a data engineer will need to manually create those transformations in order to use Auto ML libraries.

When Auto Machine Learning libraries are used in the development process, the output is usually a dataset containing metadata on the training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. Being able to automatically choose the best model out of many training iterations with different algorithms and feature columns means we can automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns every time. With Auto Machine Learning solutions, we can choose not only the best algorithm but also the best feature combination and hyperparameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then a logistic regression model trained on 5 columns in another release, without any code edits. It is simple, and that simplicity is exactly what makes it so powerful.
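The run-metadata dataset described above might look like the following; picking a deployment candidate is then just a sort. The run names and metric values here are made up for illustration.

```python
# Illustrative run-metadata table like the one an AutoML library
# emits: one row per training run, with the algorithm, the number
# of features it used, and its evaluation metric.
import pandas as pd

runs = pd.DataFrame({
    "run_id": ["run_01", "run_02", "run_03"],
    "algorithm": ["decision_tree", "logistic_regression", "knn"],
    "n_features": [4, 5, 6],
    "accuracy": [0.87, 0.91, 0.84],
})

# Deployment candidate = the run with the best metric
best = runs.sort_values("accuracy", ascending=False).iloc[0]
print(best["run_id"], best["algorithm"], best["accuracy"])
```

Because the selection is just a query over this table, a release pipeline can ship whichever algorithm/feature combination won that run, with no code edits between releases.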

Available Libraries

MLBox, a Python library for automated machine learning. Key features include distributed data processing, robust feature selection, accurate hyperparameter tuning, deep learning support, and model interpretation.

TPOT, an automated machine learning Python library that uses genetic programming to optimize machine learning pipelines. Like other automated machine learning libraries, it is built on top of scikit-learn.
The AutoML with TPOT demo is now available.

Auto-sklearn, a Python library that's great for all the scikit-learn developers out there. It sits on top of scikit-learn to automate the hyperparameter and algorithm selection process.

AzureML, an end-to-end platform for machine learning development and deployment. The platform enables faster iterations by managing and tracking experiments, and it fully supports most Python-based frameworks like PyTorch, TensorFlow, and scikit-learn. The Auto ML feature is baked into the platform to make model selection easy.
The AutoML with AzureML demo is now available.

Ludwig, a TensorFlow-based platform for deep learning solutions released by Uber to enable users with little coding experience. The developer simply provides a training dataset and a configuration file identifying the desired features and labels.

Check out the libraries above! Automated machine learning is fun to play around with and apply to problems. I will be creating demos and walkthroughs of each of these libraries; once they are public, you will be able to find them on my GitHub.

3 Keys for Your Organization to Get the Most from AI

Organizations are constantly weighing the cost and benefit of investing in Artificial Intelligence (AI) solutions. Introducing advanced predictive analytics to a company can push them to the bleeding edge of innovation and past their competitors, however, the hurdle is often difficult to get over. But why?

First, understand how we define AI

The term AI can mean several different things; however, the most commonly used definition refers to the idea of intelligent machines, which is in slight contrast to the aspirational machine with human-level intelligence. There are endless ways to implement Artificial Intelligence, but the primary ways are via machine learning and deep learning.

Machine learning uses labeled historical data to train a model to understand patterns and make accurate predictions on new and unlabeled data points.

Deep learning is a subcategory of machine learning revolving around neural networks. While neural networks have been around for decades, they have truly exploded in research and use in the last five to ten years. Deep learning is used to solve problems that traditionally required a human, such as image recognition, text/speech analytics, and decision making (i.e., game playing).

For the sake of this article, we will use AI as a synonym for machine learning and deep learning, even though AI in general may refer to software and hardware having human-level intelligence which is not achieved using current methodologies.

Let’s focus on 3 key areas to get the most from AI

At 10th Magnitude, our data intelligence community focuses on bringing analytics to solve our customers’ problems through data science, reporting, and big data pipeline projects.

Outside of developing solutions we encourage organizations to focus on the following cultural- and process-oriented areas to truly get the most out of their AI solutions.

Know Your Business Use Case(s) and Collaborate 

The majority of AI applications are powered by machine learning, which is used to solve a very specific problem using data. The first thing that I do with a customer who is new to machine learning is understand and identify all of their business problems. These problems often turn into new use cases for machine learning or deep learning.

Since the use cases are derived from the business itself, the key to creating a successful solution is collaboration between the data science team and business stakeholders. Additionally, stakeholders are likely the individuals who will need to approve the completion or evaluate the success of the developed application.

Therefore, understanding what is needed to solve the problem and then relating the problem back to the data is crucial. Keeping the stakeholder aware of the development cycle allows them to understand the challenges data scientists encounter when creating new predictive analytics workflows.

Additionally, this enables the organization to develop a data-driven culture. Involving business users in the development room gives non-technical folks insight into what is possible, allowing them to spot other areas where AI can be of use.

10th Magnitude believes in the idea of data-driven design, where we use data to solve problems, power applications, and change the way an organization thinks about their business.

Don’t Stop After Development

Developing a machine learning solution is difficult. It is an iterative cycle where individuals go back and forth with the business to understand the problem, gather data, and train models. Developing these solutions takes time; however, once development is done, you are only partially finished with the project.

More often than not, we see customers give up on a solution after the development portion because the model did not perform as well as hoped or the cost to put it all in production is simply too high. It takes a lot of work to move a solution from a development environment to a production one.

For example, we recently worked with an organization to build and develop a model to detect anomalies in their different pieces of equipment. It took 2-3 weeks to develop the solution and an additional two weeks to set it up in production; we had to move the code to a production workload and build release pipelines for two environments, model consumption, model monitoring, and more.

As data scientists, we often forget the difficulty and the amount of time it takes to move a solution to production so that the organization is able to see the true benefit.

It is important to keep in mind that empowering an application or workflow with machine learning is about more than just the application. It also gives people the ability to see what is possible with the data.

Automation is Your Friend

Usually, data scientists are not familiar with automated build and release pipelines, but this skill is quickly becoming a requisite for properly participating in the predictive analytics space. DevOps is the process and culture of delivering value to customers in a sustainable manner. As predictive insights grow organically within an organization, individuals need to be available to develop new solutions, not maintain existing ones. Automation is extremely useful in data science projects, specifically for deploying changes to production with automated tests, retraining existing models on new data, and monitoring model performance.

No data scientist should bring “right-click and deploy” predictive solutions to production; unfortunately, that happens more often than one would hope. Using Visual Studio Team Services (VSTS), we enable our customers to version control their code for team collaboration and set them up with automated build and release pipelines to train, test, and deploy their code.

As more data is collected, the solution will need to be retrained on a cadence to keep the model up to date so that it continues to make good predictions. While retraining may seem like a trivial manual task, the time it takes a data scientist to update a model could be used to create new solutions or enhance old ones.

Often clients will only focus on surfacing results to their end users via reports, applications, or workflows; they forget that they need to build an interface to their solution for themselves.

Data scientists are responsible for maintaining the quality of a solution over time; therefore, the metadata gathered from testing the solution (success criteria, training time, etc.) should be stored and visualized to understand its current and historical performance.

Conclusion

Developing machine learning and deep learning applications is far from easy. Clients often struggle with the amount of effort it takes to create custom solutions, or they get so bogged down in technical details that they forget why their business started on the path to AI in the first place.

So, what are the keys to successfully incorporate AI into your organization? To start, collaboration between the data science team and business stakeholders, understanding the data science process, and deploying solutions using DevOps. This process makes predictive analytics possible for data science teams of all sizes even as it changes the mindset of the organization as a whole.

If you’re ready to bring AI into your day-to-day, 10th Magnitude has the solutions to incorporate it seamlessly and painlessly, ensuring that you get the benefits without missing a beat.