Automated Machine Learning with MLBox

In continuation of my AutoML blog series, we will be evaluating the capabilities of MLBox.

What is MLBox

MLBox is a popular and powerful automated machine learning Python library. As noted by the MLBox documentation, it provides features for:

  • Fast reading of data
  • Distributed data processing
  • Robust feature selection
  • Accurate hyper-parameter tuning
  • State-of-the-art machine learning and deep learning models
  • Model interpretation

MLBox is similar to other automated machine learning libraries in that it does not automate the entire data science process, but augments a developer's ability to quickly create machine learning models. MLBox helps developers build the optimal model and select the best features to predict the label of your choice.

One drawback of the MLBox library is that it doesn't necessarily conform to a data scientist's process; rather, the data scientist has to work the way the library expects. For example, I will often use three datasets when developing machine learning solutions in an attempt to avoid overfitting: train, validation, and test. Having these three datasets is rather difficult to do with MLBox.

Let's get started using the MLBox library!

Developing with MLBox

Installing MLBox

For this demo we will be using Anaconda virtual environments, and I will be using Visual Studio Code as my IDE. For more information on how to use the Anaconda distribution with Visual Studio Code, check out this blog I wrote. Additionally, I will be developing on a Windows machine; note that Windows support for MLBox is currently an experimental release.

We will also need a Linux machine to do our MLBox development; if you do not have one available, you can create one by following these instructions.

  1. First let’s create a new Anaconda Environment.
    conda create -n MLBoxEnv python=3.6
    conda activate MLBoxEnv
  2. Next we will run the following installs.
    pip install setuptools
    pip install mlbox

Training a Model

As with the other AutoML libraries, we will be using the Titanic dataset, where we will use specific features to predict whether or not a passenger survived the catastrophe. For more information about the dataset, check out the Kaggle competition.

  1. Please download the data from the GitHub repository here. Save the file to a data folder in your application directory. Please note that the application directory I will be using is the TPOT directory in my repository.

  2. Now that we have our data and MLBox installed, let's read our datasets into memory and start preparing them for a machine learning algorithm. MLBox has its own Reader class for efficient, distributed reading of data; one key feature of this class is that it expects a list of file paths to your training and test datasets. Interacting with my datasets was slightly foreign at first, but once I learned that the Reader class creates a dictionary object holding pandas DataFrames and our target (label) column as a pandas Series, it was easier to work with.

    from mlbox.preprocessing import *
    from mlbox.optimisation import *
    from mlbox.prediction import *
    train_path = ["./data/titanic_train.csv", "./data/titanic_test.csv"]
    reader = Reader(sep=",", header=0)
    data = reader.train_test_split(train_path, 'Survived')

    There are a few things worth noting about the train_test_split function. A dataset is only considered a test set if no label column is present; otherwise, it will be merged with the train set. Being able to provide a list of file paths is a nice feature because it allows developers to easily ingest many files at once, which is common with bigger datasets and data lakes. Since the function automatically scans for the target column, there is little work for the developer to do to even identify a test dataset. Additionally, it determines whether the problem is regression or classification based on our label and will automatically encode the column as needed.
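Concretely, here is a hypothetical mock of the dictionary layout that train_test_split produces (the "train"/"test"/"target" keys follow the MLBox documentation; the column values are illustrative placeholders, not real output):

```python
import pandas as pd

# Hypothetical mock of the dictionary MLBox's Reader builds: pandas
# DataFrames under "train"/"test" and the label as a pandas Series.
data = {
    "train": pd.DataFrame({"Pclass": [3, 1], "Age": [22.0, 38.0]}),
    "test": pd.DataFrame({"Pclass": [3], "Age": [26.0]}),
    "target": pd.Series([0, 1], name="Survived"),
}

# Everything downstream (Drift_thresholder, Optimiser, Predictor)
# consumes this same dictionary.
```

Knowing this shape makes it easy to inspect or tweak the data between MLBox steps.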

  3. One really nice feature of MLBox is the ability to automatically remove drift variables. I am by no means an expert at explaining drift, but the idea is that the process or observed behavior may change over time. In turn, the data slowly changes, and the relationships between the features change as well. MLBox has built-in functionality to deal with this drift; we will use a drift transform.

    data = Drift_thresholder().fit_transform(data)
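To build intuition for what a drift threshold measures, here is a small illustrative sketch of the underlying idea (my own toy example, not MLBox's implementation): train a classifier to distinguish train rows from test rows using a single feature; if the classifier succeeds, that feature's distribution has drifted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# A stable feature has the same distribution in train and test;
# a drifted feature's distribution has shifted in the test set.
stable = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 1, 500)])
drifted = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])
is_test = np.concatenate([np.zeros(500), np.ones(500)])

aucs = {}
for name, feature in [("stable", stable), ("drifted", drifted)]:
    # Train a classifier to tell train rows from test rows on one feature.
    clf = LogisticRegression().fit(feature.reshape(-1, 1), is_test)
    scores = clf.predict_proba(feature.reshape(-1, 1))[:, 1]
    # AUC near 0.5 -> indistinguishable (keep); near 1.0 -> drift (drop).
    aucs[name] = roc_auc_score(is_test, scores)
```

Features whose "drift score" exceeds a threshold are the ones a transform like Drift_thresholder would remove.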

  4. As with all automated machine learning libraries, the key feature is not necessarily the algorithms but the ability to select the appropriate features and optimal hyper-parameters for the algorithm. Using MLBox's Optimiser class, we are able to search a dimensional space to figure out the best set of parameters. Therefore, to optimize, we must create a parameter space and select the scoring metric we wish to optimize.

    opt = Optimiser(scoring='accuracy', n_folds=3)
    opt.evaluate(None, data)
    space = {
        'ne__numerical_strategy': {"search": "choice", "space": [0]},
        'ce__strategy': {"search": "choice", "space": ["label_encoding", "random_projection", "entity_embedding"]},
        'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
        'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
    }
    best_params = opt.optimise(space, data, 10)
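MLBox performs this search with Hyperopt under the hood. As a rough illustration of what the "search"/"space" notation means, here is a simplified sampler (my own sketch, not MLBox or Hyperopt code):

```python
import random

random.seed(0)

# Simplified reading of the space notation above: "choice" picks one
# option from the list, "uniform" draws a float between two bounds.
space = {
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__max_depth": {"search": "choice", "space": [3, 4, 5, 6, 7]},
}

def sample(space):
    params = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            params[name] = random.choice(spec["space"])
        else:  # "uniform": space holds [low, high]
            low, high = spec["space"]
            params[name] = random.uniform(low, high)
    return params

candidate = sample(space)
```

The optimiser repeatedly draws candidates like this (more cleverly than pure random sampling), scores each with cross-validation, and keeps the best.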

  5. Next we can use the Predictor class to train a machine learning model.

    model = Predictor().fit_predict(best_params, data)

    The line of code above will create a folder called save and export an sklearn pipeline that you can reuse for model deployment or further validation. Additionally, it exports feature importances, a CSV of test predictions, and a target encoder object so that you can map the encoded values back to their original values.

For more information on MLBox, please check out their GitHub repository or the official documentation page. MLBox is a great library to assist data scientists in building a machine learning solution. For a full copy of the demo Python file, please refer to my personal GitHub.

Automated Machine Learning with TPOT

For part two of my automated machine learning series, I am focusing on TPOT, a Python library that uses genetic programming to optimize data science pipelines. TPOT's success and popularity have grown extraordinarily since its initial commit in late 2015. As of March 20, 2019, TPOT has 286 watchers, 5,441 stars, and 969 forks on GitHub.

TPOT stands for Tree-based Pipeline Optimization Tool. Its goal is to help automate the development of ML pipelines by combining a flexible tree representation of pipelines with stochastic search algorithms to develop the best scikit-learn pipeline possible. Once the best predictive pipeline has been found, TPOT will export it as Python code so that a data scientist can continue developing from there. In addition to faster development and great models, my experience with TPOT is that it is a great learning tool for newer data scientists who want to understand how to develop better models manually.

One advantage of automated machine learning is the ability to automatically retrain different types of models with different parameters and columns as your data changes. This enables a data scientist to provide a solution that is dynamic and intelligent; however, automated machine learning is compute intensive and time consuming because it trains several different models. TPOT gives users the ability to export their model to a Python script, avoiding the need to apply automated machine learning to every retraining process and allowing for fast retraining of a model with the same high-performing accuracy.

Like most automated machine learning libraries, TPOT helps automate everything but data acquisition, data cleaning, and complex feature engineering. TPOT, like scikit-learn, does provide some simple and dynamic feature engineering functions.

In the first part of my automated machine learning series I evaluated the Azure Machine Learning AutoML library. Unlike the end-to-end platform that Azure Machine Learning provides, TPOT is a standalone package meant for developing the best models. In my experience, TPOT is an excellent package that can be used in conjunction with platforms like Azure ML and MLflow to not only train the best model, but also manage the data science lifecycle.

The best way to familiarize yourself with TPOT is to get started! Check out the demo I have created and the accompanying code on my GitHub. Please note that, staying in line with our Azure Machine Learning AutoML example, we will be using the Titanic dataset, which is also an example solution provided by the TPOT developers. The walkthrough I provide is slightly different from theirs, particularly surrounding the one-hot encoding of a variable that I deemed unnecessary.

Auto Machine Learning with Azure Machine Learning

I recently wrote a blog introducing automated machine learning (AutoML). If you have not read it, you can check it out here. With there being a surplus of AutoML libraries in the marketplace, my goal is to provide quick overviews and demos of the libraries I use to develop solutions. In this blog I will focus on the benefits of the Azure Machine Learning Service (AML Service) and the AutoML capabilities it provides. The AutoML library of Azure Machine Learning differs from many other libraries because it also provides a platform to track, train, and deploy your machine learning models.

Azure Machine Learning Service

An Azure Machine Learning Workspace (AML Workspace) is the foundation for developing Python-based predictive solutions and gives the developer the ability to deploy them as web services in Azure. The AML Workspace allows data scientists to track their experiments, train and retrain their machine learning models, and deploy machine learning solutions as containerized web services. When an engineer provisions an AML Workspace, the resources below are also created within the same resource group and are the backbone of Azure Machine Learning.

The Azure Container Registry gives a developer easy integration for creating, storing, and deploying web services as Docker containers. One added feature is easy, automatic tagging to describe your containers and associate them with specific machine learning models.

An Azure Storage account enables fast, dynamic storage of information from our experiments, i.e. models and outputs. After training an initial model using the service, I would recommend manually navigating through the folders; doing this will give you deeper insight into how the AML Workspace functions. Simply and automatically capturing metadata and outputs from our training procedures is crucial to visibility and performance over time.

When we deploy a web service using the AML Service, we allow the Azure Machine Learning resource to handle all authentication and key generation code. This allows data scientists to focus on developing models instead of writing authentication code. Using Azure Key Vault, the AML Service allows for extremely secure web services that you can expose to external and internal customers. 

Once your secure web service is deployed, Azure Machine Learning integrates seamlessly with Application Insights for all code logging and web service traffic, giving users the ability to monitor the health of the deployed solution.

A key feature allowing data scientists to scale their solutions is the offering of remote compute targets. Remote compute gives developers the ability to easily get their solution off their laptop and into Azure with a familiar IDE and workflow. Remote targets allow developers to pay only for the run time of the experiment, making for a low cost of entry into the cloud analytics space. Additionally, there was a service in Azure called Batch AI, a queuing resource that handled several jobs at one time. Batch AI was integrated into Azure Machine Learning, allowing data scientists to train many machine learning models in parallel on separate compute resources.

Azure Machine Learning provides data prep capabilities in the form of a "dprep" file, allowing users to package their data transforms into a single line of code. I am not a huge fan of dprep, but it is a capability that makes it easier to handle the data transformations required to score new data in production. Like most platforms, the AML Service offers specialized "pipeline" capabilities to connect machine learning phases such as data acquisition, data preparation, and model training.

In addition to remote compute, Azure Machine Learning enables users to deploy anywhere they can run Docker. Theoretically, one could train a model locally and deploy it locally (or to another cloud), using Azure only to track experiments for a cheap monthly rate. However, I would suggest taking advantage of Azure Kubernetes Service for auto scaling of your web service to handle upticks in traffic, or of a more consistent compute target in Azure Container Instances.

Using Azure Machine Learning’s AutoML

Now it's time to get to the actual point of this blog: Azure Machine Learning's AutoML capabilities. In order to use them, you will need to pip install `azureml-sdk`. This is the same Python library used to simply track your experiments in the cloud.

As with any data science project, it starts with data acquisition and exploration. In this phase of developing we are exploring our dataset and identifying desired feature columns to use to make predictions. Our goal here is to create a machine learning dataset to predict our label column.

Once we have created our machine learning dataset and identified whether we are going to implement a classification or a regression solution, we can let Azure Machine Learning do the rest of the work to identify the best feature column combination, algorithm, and hyper-parameters. To automatically train a machine learning model using Azure ML, the developer needs to define the settings for the experiment and then submit the experiment for model tuning. Once submitted, the library will iterate through different machine learning algorithms and hyperparameter settings, following your defined constraints. It chooses the best-fit model by optimizing an accuracy metric. The parameters or settings available to auto train machine learning models are:

  • iteration_timeout_minutes: time limit for each iteration. Total runtime = iterations * iteration_timeout_minutes
  • iterations: Number of iterations. Each iteration produces a machine learning model.
  • primary_metric: metric to optimize. We will choose the best model based on this value.
  • preprocess: When True the experiment may auto preprocess the input data with basic data manipulations.
  • verbosity: Logging level.
  • n_cross_validations: Number of cross validation splits when the validation data is not specified.
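Collected into code, settings like these are typically gathered into a dictionary and passed to the SDK's AutoMLConfig object. The values below are illustrative placeholders, not recommendations:

```python
import logging

# Illustrative settings only; with azureml-sdk installed they would be
# passed along as AutoMLConfig(task="classification", **automl_settings).
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 30,
    "primary_metric": "accuracy",
    "preprocess": True,
    "verbosity": logging.INFO,
    "n_cross_validations": 5,
}
```

Keeping the settings in one dictionary makes it easy to version and tweak the experiment constraints between runs.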

The output of this process is a dataset containing metadata on the training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. The ability to automatically choose the best model out of many training iterations with different algorithms and feature columns enables us to easily automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time, and the only difference is the dataset the model was trained on. But with automated machine learning solutions we are able to choose not only the best algorithm, but also the best feature combination and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release, without any code edits.

My One Complaint

My one complaint is that installing the library is difficult. The documentation states that it works with Python 3.5.2 and up; however, I was unable to get the proper libraries installed and working correctly using a Python 3.6 interpreter. I simply created a Python 3.5.6 interpreter and it worked great! I am not sure whether this was an error on my part or Microsoft's, but the AutoML capabilities worked as expected otherwise.

Overall, I think Azure Machine Learning's AutoML works great. It is not groundbreaking or a game changer, but it does exactly as advertised, which is huge in the current landscape of data tools, where it often seems as if many do not work as expected. Azure ML will run iterations over your dataset to figure out the best model possible, but in the end predictive solutions depend on the correlations in your data. For a more detailed example of Azure Machine Learning's AutoML feature, check out my walkthrough available here.

Automated Machine Learning

Traditionally, the development of predictive solutions is a challenging and time-consuming process that requires expert resources in software development, data engineering, and data science. Engineers are required to complete the following tasks in an iterative, cyclical manner:

  1. Preprocess, feature engineer, and clean data
  2. Select appropriate model
  3. Tune Hyperparameters
  4. Analyze Results
  5. Repeat

As the industry identified the blockers that make the development of machine learning solutions costly, we (as a community) aimed to figure out a way to automate the process in an attempt to make it easier and faster to deploy intelligent solutions. Selecting and tuning models can be automated to make the analysis of results easier for non-expert and expert developers alike.

Automated machine learning is the ability to take a defined dataset with a specific target feature and automatically iterate over it with different algorithms and combinations of input variables to select the best model. The purpose is to make developing these solutions require fewer resources, less domain knowledge, and less time.

How it Works

Most AutoML libraries available are used to solve supervised learning problems. If you are unfamiliar, there are two main categories of machine learning:

  • Supervised Learning: you have input variables and output variables, and you apply algorithms to learn the mapping function from input to output.
  • Unsupervised Learning: you have input variables but no output variables to map them to. The goal is typically to identify trends and patterns in the data to make assumptions.

Note that there is a third category, semi-supervised learning, which is simply a combination of the two categories above; we will not get into it here.
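A minimal contrast of the two main categories, using scikit-learn on toy data of my own (any supervised/unsupervised algorithm pair would illustrate the same point):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [9.0], [10.0]])
y = np.array([0, 0, 1, 1])  # known outputs -> supervised setting

# Supervised: learn the mapping from inputs X to outputs y.
clf = LogisticRegression().fit(X, y)
pred = clf.predict([[0.5], [9.5]])

# Unsupervised: no y at all; just group similar rows together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The supervised model predicts a labeled outcome for new inputs, while the unsupervised model only reveals structure (here, two clusters) in the data.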

In order to use automated machine learning, your dataset must be feature engineered. Meaning, you manually develop transformations to create a machine learning dataset that addresses your problem. Most AutoML libraries have built-in transformation functions covering the most popular transformation steps, but in my experience these functions are rarely enough to get data machine learning ready.

Once you have feature engineered your dataset, you simply need to determine the type of algorithm you need. Most supervised learning algorithms can be classified as:

  • Classification: The output variable is one of a set number of outcomes. For example, predicting whether a customer will return to a store is either a "yes" or a "no". Classification is further broken into multiclass classification (3 or more outcomes) and binary classification (2 outcomes).
  • Regression: The output is a numeric value. For example, predicting the prices of a car or house.

When given an algorithm type, AutoML libraries will run iterations over your dataset to determine the best combination of features and the best hyperparameters for each algorithm; in turn, they actually train many models and give the engineer the best one.

I would like to highlight the difference between having to engineer columns for machine learning and selecting the appropriate columns for machine learning. For example, let's assume I want to predict how many point-of-sale transactions will occur in every hour of the day. The raw dataset is likely transactional and will therefore require a developer to summarize the data at the hour level, i.e. grouping, summing, and averaging. Oftentimes developers will also create custom functions in order to describe the trends in the dataset. This process is feature engineering.
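The hourly summarization described above can be sketched with pandas (the transaction schema here is a hypothetical toy, not a real dataset):

```python
import pandas as pd

# Toy point-of-sale transactions (hypothetical schema).
tx = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-03-20 09:05", "2019-03-20 09:40",
        "2019-03-20 10:15", "2019-03-20 10:50", "2019-03-20 10:59",
    ]),
    "amount": [5.0, 12.5, 3.0, 8.0, 4.5],
})

# Feature engineering: summarize the transactional data at the hour
# level by grouping, counting, summing, and averaging.
hourly = (
    tx.set_index("timestamp")
      .resample("H")["amount"]
      .agg(["count", "sum", "mean"])
)
```

Each row of `hourly` is now a machine-learning-ready observation describing one hour of activity.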

Feature selection comes after feature engineering. I may summarize my dataset with 10 different columns that I believe will be useful, but AutoML libraries may select only the best 8 of the 10.

The difference between feature engineering and feature selection is huge. Most libraries will handle common or simple data engineering processes; however, the majority of the time a data engineer will need to manually create the transformations in order to use AutoML libraries.
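As a concrete sketch of feature selection (a generic scikit-learn example, not any particular AutoML library's internals), SelectKBest keeps the k most informative of the engineered columns:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(0, 1, (n, 2))  # columns related to the label
noise = rng.normal(0, 1, (n, 2))        # columns unrelated to the label
y = (informative.sum(axis=1) > 0).astype(int)
X = np.hstack([informative, noise])

# Feature selection: keep the 2 columns most associated with y.
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = selector.get_support()  # boolean mask over the 4 columns
```

The selector correctly keeps the two informative columns and drops the noise, which is exactly the "8 best of 10" behavior described above, at a smaller scale.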

When automated machine learning libraries are used in the development process, the output is usually a dataset containing metadata on the training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. Being able to automatically choose the best model out of many training iterations with different algorithms and feature columns enables us to easily automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time. But with automated machine learning solutions we are able to choose not only the best algorithm, but also the best feature combination and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release, without any code edits. It is awesome how easy this can be!
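The selection loop these libraries automate can be sketched in a few lines of scikit-learn (a simplified illustration of the idea, not any library's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Score several candidate algorithms and keep whichever does
# best on the chosen metric; results doubles as the run metadata.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
results = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best = max(results, key=results.get)
```

Real AutoML libraries layer smarter search, feature selection, and hyperparameter tuning on top, but the deploy-the-winner loop is the same.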

Available Libraries

MLBox, a Python library for automated machine learning. Key features include distributed data processing, robust feature selection, accurate hyperparameter tuning, deep learning support, and model interpretation.

TPOT, an automated machine learning Python library that uses genetic programming to optimize machine learning pipelines. Similar to other automated machine learning libraries, it is built on top of scikit-learn.
The AutoML with TPOT walkthrough is now available.

Auto-sklearn, a Python library that is great for all the scikit-learn developers out there. It sits on top of scikit-learn to automate the hyperparameter and algorithm selection process.

AzureML, an end-to-end platform for machine learning development and deployment. The library enables faster iterations by managing and tracking experiments, and fully supports most Python-based frameworks like PyTorch, TensorFlow, and scikit-learn. The AutoML feature is baked into the platform to make it easy to select your model.
The AutoML with AzureML walkthrough is now available.

Ludwig, a TensorFlow-based platform for deep learning solutions, was released by Uber to enable users with little coding experience. The developer simply needs to provide a training dataset and a configuration file identifying the desired features and labels.

Check out the libraries above! Automated machine learning is fun to play around with and apply to problems. I will be creating demos and walkthroughs of each of these libraries; once public, you will be able to find them on my GitHub.