Azure Machine Learning vs MLFlow

Machine learning and deep learning models are difficult to get into production due to the highly cyclical development process of data science. As a data scientist I am not only experimenting with different hyper-parameters to tune my model, but I am constantly creating or adding new features to my dataset; keeping track of which features, hyper-parameters, and model type were used can be a lot to remember. Additionally, I have to create a model that performs up to our evaluation criteria, otherwise the process of deploying a model is irrelevant. While there seems to be an endless number of tools that can help engineers track and monitor tasks throughout the development process, tools like MLFlow and Azure Machine Learning aim to make this process manageable.

Once a data scientist develops a machine learning model, the code is rarely production ready and package dependencies are usually a mess. Not only do developers need to clean up the model training script, but they typically need to make big changes in how they acquire and clean data. During the training process, data scientists are able to access cold historical data that is easy to acquire and test on. However, once in production we need new code that acquires data as it is created, transforms it as needed, and makes predictions. In addition to a new data acquisition script, data scientists need to write a new scoring script that actually makes predictions and returns them to the application. In the end, data science solutions have a lot of moving parts in development and in production.

For the longest time I would recommend Azure Machine Learning to my clients to track, train, and deploy machine learning solutions. However, as Azure Databricks has become the premier big data and analytics resource in Azure, I have had to adjust that message slightly. Since MLFlow is integrated into Azure Databricks, it has easily become the default platform to manage data science experiments from development to production in a Spark environment; however, I believe that Azure Machine Learning is a viable, and often better, tool choice for data scientists. In the demonstration available on my GitHub I show users how to train and track machine learning models using MLFlow, Azure Machine Learning, and MLFlow’s integration with Azure Machine Learning. If you are looking to deploy models using Azure Databricks and Azure Machine Learning, check out my previous demo available in the same repository.

MLFlow is centered around enhancing the engineer’s ability to track experiments so that they have visibility into performance during both development and production. Managing models is simplified by associating them with specific experiment runs, and MLFlow packages machine learning code so that it is reusable, reproducible, and shareable across an organization. With MLFlow a data scientist is able to execute and compare hundreds of training runs in parallel on any compute platform they wish and deploy those models in any manner they desire. MLFlow supports all programming languages through its REST interface; R, Python, Java, and Scala are supported out of the box. Below is a screenshot of an MLFlow experiment view in Azure Databricks.

If I select one of my experiment runs by clicking on the execution date hyperlink, I can see more details about that specific run. We also have the ability to select multiple runs and compare them.
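To give a feel for the tracking API, here is a minimal sketch of how a run like the ones above gets logged with MLFlow. The experiment name, parameter names, and metric values are illustrative, not taken from my demo:

```python
# Minimal MLFlow tracking sketch; the import is deferred so the snippet
# stands alone even without MLFlow installed.
def log_training_run(params, metrics):
    import mlflow  # requires `pip install mlflow`

    mlflow.set_experiment("titanic-demo")   # illustrative experiment name
    with mlflow.start_run():
        for name, value in params.items():
            mlflow.log_param(name, value)   # hyper-parameters for this run
        for name, value in metrics.items():
            mlflow.log_metric(name, value)  # evaluation results shown in the UI

# Example call: log_training_run({"max_depth": 5}, {"accuracy": 0.83})
```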

Azure Machine Learning is an enterprise-ready tool that integrates seamlessly with your Azure Active Directory and other Azure services. Similar to MLFlow, it allows developers to train models, deploy them with built-in Docker capabilities (please note that you do not have to deploy with Docker), and manage machine learning models in the cloud. Azure Machine Learning is fully compatible with popular packages like PyTorch, TensorFlow, and scikit-learn, and allows developers to train models locally and scale out to the cloud when needed.

At a high level, the Azure Machine Learning service provides a workspace that is the central container for all development and artifact storage. Within a workspace a developer can create experiments, in which all scripts, artifacts, and logging are tracked through experiment runs. The most important aspect of data science is our model, the object that takes in our source data and makes predictions. Azure Machine Learning provides a model registry that tracks and versions our experiment models, making it easier to deploy and audit predictive solutions. One of the most crucial aspects of any machine learning solution is deployment. The Azure Machine Learning service allows developers to package their Python code as a web service Docker container. These Docker images and containers are cataloged in an Azure Container Registry that is associated with the Azure Machine Learning Workspace. This gives data scientists the ability to track a single training run from development into production by capturing all the training criteria, registering our model, building a container, and creating a deployment.

Below is a list of my experiments in a demo Azure Machine Learning Workspace.

By clicking into one of the experiments I am able to see all the runs and view the performance of each run through the values logged using the AzureML Python SDK. Please note that I have the ability to select multiple runs and compare them.
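For reference, logging values like these with the AzureML Python SDK looks roughly like the sketch below (v1 SDK; the workspace config file and experiment name are placeholders):

```python
# Hedged sketch of run logging with the AzureML Python SDK (v1 API);
# the import is deferred so the snippet stands alone.
def log_azureml_run(metrics):
    from azureml.core import Experiment, Workspace  # pip install azureml-sdk

    ws = Workspace.from_config()                 # reads a local config.json
    experiment = Experiment(ws, "titanic-demo")  # illustrative experiment name
    run = experiment.start_logging()
    for name, value in metrics.items():
        run.log(name, value)                     # values appear in the run view
    run.complete()
```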

The above screenshots are very similar to MLFlow; where I believe Azure Machine Learning extends and offers better capabilities is in the Compute, Models, Images, and Deployment tabs of our Azure ML Workspace.

Either programmatically or using the Azure Portal, I am able to create a remote compute target where I can offload my experiment runs from my local laptop and have everything logged and stored in my workspace.

By registering models in my workspace I make them available to create a Docker image and deploy as a web service. Developers can either use the Azure ML SDK or the Azure portal to do so.

Once an image is created I can easily deploy the Docker container anywhere that I can run containers. This can be in Azure, locally, or on the edge! One extremely nice feature built into Azure Machine Learning is the integration with Application Insights, which allows developers to capture telemetry data about the web service and the model in production.

Overall, while MLFlow and Azure Machine Learning are very similar, I typically side with Azure Machine Learning as the more enterprise-ready product that enables developers to deploy solutions faster. However, the cross-validation functionality built into MLFlow, MLlib, and Databricks makes it extremely easy to tune hyper-parameters, while hyper-parameter tuning in Azure Machine Learning is a little more difficult.

One of my favorite features of MLFlow and Azure Machine Learning is the ability to use MLFlow in tandem with Azure Machine Learning, which I highlight in my demo. Generally, I recommend that engineers developing exclusively on Azure Databricks use MLFlow due to the easy integration it provides; however, if there is a subset of solutions being deployed or developed in a non-Spark environment, I would recommend a tool like Azure Machine Learning to centralize all data science experiments in one location. Please check out the example of MLFlow and Azure Machine Learning on Azure Databricks available on my GitHub!
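The integration itself is small: MLFlow’s tracking API is simply pointed at the Azure Machine Learning workspace. A hedged sketch (the workspace config and metric value are placeholders):

```python
# Sketch of using MLFlow's API with Azure Machine Learning as the tracking
# backend; requires the azureml-mlflow package. Imports are deferred so the
# snippet stands alone.
def track_mlflow_in_azureml(experiment_name):
    import mlflow
    from azureml.core import Workspace  # pip install azureml-sdk azureml-mlflow

    ws = Workspace.from_config()
    # Runs logged through mlflow.* now land in the AzureML workspace.
    mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.log_metric("accuracy", 0.83)  # illustrative value
```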

Automated Machine Learning with AutoKeras

In hopes of adding to my AutoML blog series I thought it would be great to touch on automated deep learning libraries as well. When I first started playing around with neural networks I turned to the popular deep learning library Keras for my development. Therefore, I decided that my first automated deep learning library should be AutoKeras!

Deep learning is a subset of machine learning focused on developing predictive models using neural networks, and it allows humans to create solutions for object detection, image classification, speech recognition, and more. One of the most popular deep learning libraries available is Keras, a high-level API that runs on top of TensorFlow, CNTK, and Theano. The main goal of Keras is to enable developers to quickly iterate and develop neural networks on multiple frameworks.

Over the last year or so, there has been a lot of development in the AutoML space, which is why I have been writing so many blogs showing off different libraries. AutoML libraries have mostly focused on traditional machine learning algorithms. Therefore, to take the Keras vision to the next level and increase the speed at which we can create neural networks, the Keras team has been developing the AutoKeras library, which aims to automatically learn the best architecture and hyper-parameters of a neural network to solve your specific need.

Since the library is still in pre-release there are not a ton of resources available when you start building a model with AutoKeras. Most of the examples show off the MNIST dataset, which is built into the Keras library. So while I do show a quick MNIST example in the demo, I also provide one with a custom image dataset that requires the developer to load images as numpy arrays prior to using them as input in the model training.

The demo I have created walks users through the process of:

  • Curating your own image dataset
    • Note that we will be using the FastAI library, which is my favorite deep learning library and runs on top of PyTorch.
    • You can also use the data.zip file available in the GitHub repository.
  • Training a model with Keras
  • Training a model on the MNIST dataset using AutoKeras
  • Training a model with downloaded images using AutoKeras
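The core AutoKeras workflow in those steps boils down to a few lines. The sketch below uses the ImageClassifier API; note that AutoKeras is pre-release and its API has shifted between versions, so treat this as illustrative rather than exact:

```python
# Illustrative AutoKeras sketch; imports are deferred so the snippet stands
# alone. max_trials bounds how many architectures the search will try.
def train_autokeras_mnist(max_trials=1):
    import autokeras as ak                      # pip install autokeras
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    clf = ak.ImageClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(x_train, y_train, epochs=1)         # architecture + weight search
    return clf.evaluate(x_test, y_test)         # [loss, accuracy]
```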

Overall, the AutoKeras library is rough. It does not quite work the way Keras works, which threw me off, and a lot of the built-in functions that make Keras great are not available. I would not recommend using AutoKeras for any real neural network development, but the overall idea of using AutoML with Keras intrigues me greatly. I would recommend monitoring the library as development continues, as it could dramatically improve in the near future. Check out the demo I have provided on GitHub. Please note that I developed the demo on a Linux virtual machine, and that setup varies by environment. Additionally, GPU support will enable faster training times.

Quick Review: Databricks Delta

As the number of data sources grows and the size of that data increases, organizations have moved to building out data lakes in the cloud in order to provide scalable data engineering workflows and predictive analytics to support business solutions. I have worked with several companies to build out these structured data lakes and the solutions that sit on top of them. While data lakes provide a level of scalability, ease of access, and the ability to quickly iterate over solutions, they have always fallen a little short on the structure and reliability that traditional data warehouses provide.

Historically I have recommended that customers apply structure, not rules, to their data lake so that the aggregation and transformation of data is easier for engineers to serve to customers. The recommended structure was usually similar to a lambda architecture, as not all organizations have streaming data, but they would build out their data lake knowing this was a possibility in the future. The flow of data generally followed the process described below:

  • Batch and streaming data sources are aggregated into raw data tables with little to no transforms applied, e.g., streaming log data from a web application or batch loads of application database deltas.
  • Batch and streaming jobs clean, transform, and save the raw data tables to staging tables, executing the minimum number of transforms on a single data source, e.g., tabularizing a JSON file and saving it as a Parquet file without joining any other data, or aggregating granular data.
  • Finally, we aggregate data, join sources, and apply business logic to create our summary tables, e.g., the tables data analysts, data scientists, and engineers ingest for their solutions.

One key to the summary tables is that they are business driven, meaning we create these tables to solve specific problems and to be queried on a regular basis. Additionally, I recently took a Databricks course, and instead of the terms raw, staging, and summary, they used bronze, silver, and gold tables respectively. I now prefer the Databricks terminology over my own.
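In PySpark on Databricks, the bronze/silver/gold flow described above might be sketched like this; the paths, column names, and aggregation are hypothetical placeholders:

```python
# Hypothetical bronze/silver/gold flow on Delta Lake; `spark` is an existing
# SparkSession and the paths and column names are placeholders.
def build_medallion_tables(spark, raw_json_path, lake_root):
    # Bronze: land raw data with little to no transformation.
    bronze = spark.read.json(raw_json_path)
    bronze.write.format("delta").mode("append").save(lake_root + "/bronze/events")

    # Silver: minimal single-source cleanup (tabularize, dedupe, rename).
    silver = (spark.read.format("delta").load(lake_root + "/bronze/events")
              .dropDuplicates()
              .withColumnRenamed("ts", "event_time"))
    silver.write.format("delta").mode("overwrite").save(lake_root + "/silver/events")

    # Gold: business-driven aggregates that analysts and models consume.
    gold = silver.groupBy("event_time").count()
    gold.write.format("delta").mode("overwrite").save(lake_root + "/gold/event_counts")
```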

Delta Lake is an open source project designed to make big data solutions easier and has been mostly developed by Databricks. Data lakes have always worked well; however, since Delta Lake came onto the scene, organizations are able to take advantage of additional features when updating or creating their data lakes.

  • ACID Transactions: Serializable transactions to ensure data integrity.
  • Data Versioning: Delta Lake provides data snapshots, allowing developers to access and revert to earlier versions of data for audits, rollbacks, and reproducing predictive experiments.
  • Open Format: Data is stored in Parquet format, making it easy to convert existing data lakes into Delta Lakes.
  • Unified Batch and Streaming: Combine streaming and batch data sources in a single location, and Delta tables can act as a streaming source as well.
  • Schema Enforcement: Provide and enforce a schema as needed to ensure correct data types and columns.
  • Schema Evolution: Easily change the schema of your data as it evolves over time.

Generally, Delta Lake offers a very similar development and consumption pattern to a typical data lake; however, the items listed above are added features that bring an enterprise level of capability, making the lives of data engineers, analysts, and scientists easier.
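Two of those features, data versioning and schema evolution, are exposed as simple reader and writer options. A sketch (the paths are placeholders):

```python
# Sketches of Delta Lake's versioning and schema-evolution options; `spark`
# is an existing SparkSession and `path` points at a Delta table.
def read_snapshot(spark, path, version):
    # Data versioning: read the table as it looked at an earlier version.
    return spark.read.format("delta").option("versionAsOf", version).load(path)

def append_with_new_columns(df, path):
    # Schema evolution: without mergeSchema, schema enforcement would
    # reject a write that introduces new columns.
    df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```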

As an Azure consultant, Databricks Delta is the big data solution I recommend to my clients. To get started developing a data lake solution with Azure Databricks and Databricks Delta check out the demo provided on my GitHub. We take advantage of traditional cloud storage by using an Azure Data Lake Gen2 to serve as the storage layer on our Delta Lake.

Automated Machine Learning with MLBox

In continuation of my AutoML blog series we will be evaluating the capabilities of MLBox.

What is MLBox?

MLBox is an extremely popular and powerful automated machine learning Python library. As noted in the MLBox documentation, it provides features for:

  • Fast reading of data
  • Distributed data processing
  • Robust feature selection
  • Accurate hyper-parameter tuning
  • State-of-the-art machine learning and deep learning models
  • Model interpretation

MLBox is similar to other automated machine learning libraries in that it does not automate the data science process, but augments a developer's ability to quickly create machine learning models. MLBox simply helps developers create the optimal model and select the best features to make predictions for the label of your choice.

One drawback of the MLBox library is that it doesn't necessarily conform to a data scientist's process; rather, the data scientist has to work the way the library expects. One example is that I will often use three datasets when developing machine learning solutions in an attempt to avoid overfitting: train, validation, and test. Having these three datasets is rather difficult to do with MLBox.

Let's get started using the MLBox library!

Developing with MLBox

Installing MLBox

For this demo we will be using Anaconda virtual environments, and I will be using Visual Studio Code as my IDE. For more information on how to use the Anaconda distribution with Visual Studio Code, check out this blog I wrote. Additionally, I will be developing on a Windows machine; Windows support for MLBox is currently experimental.

We will also need a Linux machine to do our MLBox development; if you do not have one available, you can create one by following these instructions.

  1. First let’s create a new Anaconda Environment.
    conda create -n MLBoxEnv python=3.6
    conda activate MLBoxEnv
  2. Next we will run the following installs.
    pip install setuptools
    pip install mlbox

Training a Model

As with the other AutoML libraries, we will be using the Titanic dataset, where we will use specific features to predict whether or not a passenger survived the catastrophe. For more information about the dataset check out the Kaggle Competition.

  1. Please download the data from the GitHub repository here. Save the file to a data folder in your application directory. Please note that the application directory I will be using is the MLBox directory in my repository.

  2. Now that we have our data and MLBox installed, let's read our datasets into memory and start preparing them for a machine learning algorithm. MLBox has its own Reader class for efficient and distributed reading of data; one key feature of this class is that it expects a list of file paths to your training and test datasets. Interacting with my datasets was slightly foreign at first, but once I learned that the Reader class creates a dictionary object with pandas dataframes and our target (label) column as a pandas series, it was easier to work with.

    from mlbox.preprocessing import *
    from mlbox.optimisation import *
    from mlbox.prediction import *

    # Paths to the training and test files; MLBox identifies which is which.
    train_path = ["./data/titanic_train.csv", "./data/titanic_test.csv"]
    reader = Reader(sep=",", header=0)
    # Returns a dict of pandas DataFrames plus the target column as a Series.
    data = reader.train_test_split(train_path, 'Survived')

    There are a few things worth noting about the train_test_split function. A dataset is only considered to be a test set if there is no label column present; otherwise, it will be merged with the train set. Being able to provide a list of file paths is a nice feature because it allows developers to easily ingest many files at once, which is common with bigger datasets and data lakes. Since the function automatically scans for the target column, there is little work for the developer in identifying a test dataset. Additionally, it determines whether it is a regression or classification problem based on our label and will automatically encode the column as needed.

  3. One really nice feature of MLBox is the ability to automatically remove drift variables. I am by no means an expert on drift, but the idea is that the process or observed behavior may change over time; in turn the data will slowly change, causing the relationships between the features to change as well. MLBox has built-in functionality to deal with this drift. We will use a drift transform.

    data = Drift_thresholder().fit_transform(data)

  4. As with all automated machine learning libraries, the key feature is not necessarily the algorithms but the ability to select the appropriate features and optimal hyper-parameters for the algorithm. Using MLBox's Optimiser class we are able to define a search space to figure out the best set of parameters. Therefore, to optimize we must create a parameter space and select the scoring metric we wish to optimize.

    opt = Optimiser(scoring='accuracy', n_folds=3)
    opt.evaluate(None, data)

    space = {
        'ne__numerical_strategy': {"search": "choice", "space": [0]},
        'ce__strategy': {"search": "choice",
                         "space": ["label_encoding", "random_projection", "entity_embedding"]},
        'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
        'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
    }
    best_params = opt.optimise(space, data, 10)

  5. Next we can use the Predictor class to train a machine learning model.

    model = Predictor().fit_predict(best_params, data)

    The line of code above will create a folder called save so that it can export an sklearn pipeline that you can reuse for model deployment or further validation. Additionally, it provides exports of feature importance, a csv of test predictions, and a target encoder object so that you can map the encoded values back to their original values.

For more information on MLBox, please check out their GitHub repository or the official documentation page. MLBox is a great library to assist data scientists in building a machine learning solution. For a full copy of the demo Python file please refer to my personal GitHub.

Automated Machine Learning with TPOT

For part two of my automated machine learning series I am focusing on TPOT, a Python library that uses genetic programming to optimize data science pipelines. TPOT's success and popularity have grown extraordinarily since its initial commit in late 2015. As of March 20, 2019, TPOT has 286 watchers, 5,441 stars, and 969 forks on GitHub.

TPOT stands for Tree-based Pipeline Optimization Tool, and its goal is to help automate the development of ML pipelines by combining a flexible tree representation of pipelines with stochastic search algorithms to develop the best scikit-learn pipeline possible. Once the best predictive pipeline has been found, TPOT will export the pipeline as Python code so that a data scientist can continue developing from there. In addition to faster development and great models, my experience is that TPOT is a great learning tool for newer data scientists who may want to understand how to develop better models manually.

One advantage of automated machine learning is the ability to automatically retrain different types of models with different parameters and columns as your data changes. This enables a data scientist to provide a solution that is dynamic and intelligent; however, automated machine learning is compute-intensive and time-consuming because it trains several different models. TPOT gives users the ability to export their model to a Python script, avoiding the need to apply automated machine learning to every retraining process and allowing fast retraining of a model with the same high-performing accuracy.
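The fit-then-export workflow looks roughly like the sketch below; the search parameters and output file name are illustrative:

```python
# Hedged TPOT sketch; the import is deferred so the snippet stands alone.
def fit_and_export(X_train, y_train, out_file="tpot_pipeline.py"):
    from tpot import TPOTClassifier  # pip install tpot

    # Genetic search: population_size pipelines evolved over `generations`.
    tpot = TPOTClassifier(generations=5, population_size=20,
                          verbosity=2, random_state=42)
    tpot.fit(X_train, y_train)
    tpot.export(out_file)  # winning scikit-learn pipeline as plain Python
    return tpot
```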

Like most automated machine learning libraries, TPOT helps automate everything but data acquisition, data cleaning, and complex feature engineering for machine learning. TPOT, like scikit-learn, does provide some simple and dynamic feature engineering functions.

In the first part of my automated machine learning series I evaluated the Azure Machine Learning AutoML library. Unlike the end-to-end platform that Azure Machine Learning provides, TPOT is a standalone package meant for developing the best models. In my experience TPOT is an excellent package that can be used in tandem with platforms like Azure ML and MLFlow to not only train the best model, but manage the data science lifecycle.

The best way to familiarize yourself with TPOT is to get started! Check out the demo I have created and the accompanying code on my GitHub. Please note that, staying in line with our Azure Machine Learning AutoML example, we will be using the Titanic dataset, which is also an example solution provided by the TPOT developers. The walkthrough I provide is slightly different than the one they provide, particularly surrounding the one-hot encoding of a variable that I deemed unnecessary.

Auto Machine Learning with Azure Machine Learning

I recently wrote a blog introducing automated machine learning (AutoML). If you have not read it you can check it out here. With a surplus of AutoML libraries in the marketplace, my goal is to provide quick overviews and demos of libraries that I use to develop solutions. In this blog I will focus on the benefits of the Azure Machine Learning service (AML service) and the AutoML capabilities it provides. The AutoML library of Azure Machine Learning is different (not unique) from many other libraries because it also provides a platform to track, train, and deploy your machine learning models.

Azure Machine Learning Service

An Azure Machine Learning Workspace (AML Workspace) is the foundation of developing Python-based predictive solutions, and gives the developer the ability to deploy them as web services in Azure. The AML Workspace allows data scientists to track their experiments, train and retrain their machine learning models, and deploy machine learning solutions as a containerized web service. When an engineer provisions an Azure Machine Learning Workspace, the resources below are also created within the same resource group and are the backbone of Azure Machine Learning.

The Azure Container Registry gives a developer easy integration for creating, storing, and deploying our web services as Docker containers. One added feature is easy, automatic tagging to describe your container and associate it with specific machine learning models.

An Azure Storage account enables fast, dynamic storage of information from our experiments, e.g., models and outputs. After training an initial model using the service, I would recommend manually navigating through the folders. Doing this will give you deeper insight into how the AML Workspace functions. Simply and automatically capturing metadata and outputs from our training procedures is crucial to visibility and performance over time.

When we deploy a web service using the AML Service, we allow the Azure Machine Learning resource to handle all authentication and key generation code. This allows data scientists to focus on developing models instead of writing authentication code. Using Azure Key Vault, the AML Service allows for extremely secure web services that you can expose to external and internal customers. 

Once your secure web service is deployed, Azure Machine Learning integrates seamlessly with Application Insights for all code logging and web service traffic, giving users the ability to monitor the health of the deployed solution.

A key feature for allowing data scientists to scale their solutions is remote compute targets. Remote compute gives developers the ability to easily get their solution off their laptop and into Azure with a familiar IDE and workflow. Remote targets allow developers to pay only for the run time of the experiment, making for a low cost of entry into the cloud analytics space. Additionally, there was a service in Azure called Batch AI, a queuing resource that could handle several jobs at one time. Batch AI was integrated into Azure Machine Learning, allowing data scientists to train many machine learning models in parallel on separate compute resources.

Azure Machine Learning provides data prep capabilities in the form of a “dprep” file, allowing users to package their data transforms into a single line of code. I am not a huge fan of dprep, but it is a capability that makes it easier to handle the data transformations required to score new data in production. Like most platforms, the AML service offers specialized “pipeline” capabilities to connect various machine learning phases with each other, like data acquisition, data preparation, and model training.

In addition to remote compute, Azure Machine Learning enables users to deploy anywhere they can run Docker. Theoretically, one could train a model locally and deploy a model locally (or to another cloud), and use Azure only to track their experiments for a cheap monthly rate. However, I would suggest taking advantage of Azure Kubernetes Service for auto-scaling of your web service to handle upticks in traffic, or Azure Container Instances as a more consistent compute target.

Using Azure Machine Learning’s AutoML

Now it’s time to get to the actual point of this blog: Azure Machine Learning’s AutoML capabilities. In order to use them you will need to pip install `azureml-sdk`. This is the same Python library used to track your experiments in the cloud.

As with any data science project, it starts with data acquisition and exploration. In this phase of development we explore our dataset and identify desired feature columns to use to make predictions. Our goal here is to create a machine learning dataset to predict our label column.

Once we have created our machine learning dataset and identified whether we are going to implement a classification or a regression solution, we can let Azure Machine Learning do the rest of the work to identify the best feature column combination, algorithm, and hyper-parameters. To automatically train a machine learning model using Azure ML, the developer needs to define the settings for the experiment and then submit the experiment for model tuning. Once submitted, the library will iterate through different machine learning algorithms and hyper-parameter settings, following your defined constraints. It chooses the best-fit model by optimizing an accuracy metric. The parameters and settings available to auto-train machine learning models are:

  • iteration_timeout_minutes: Time limit for each iteration. Total runtime = iterations * iteration_timeout_minutes.
  • iterations: Number of iterations. Each iteration produces a machine learning model.
  • primary_metric: Metric to optimize. We will choose the best model based on this value.
  • preprocess: When True, the experiment may automatically preprocess the input data with basic data manipulations.
  • verbosity: Logging level.
  • n_cross_validations: Number of cross-validation splits when validation data is not specified.
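Putting those settings together, a submission looks roughly like the sketch below (v1 SDK; the experiment name and parameter values are illustrative):

```python
# Hedged sketch of submitting an AutoML experiment with the AzureML Python
# SDK (v1 API); imports are deferred so the snippet stands alone.
def submit_automl_experiment(ws, X, y):
    import logging
    from azureml.core import Experiment
    from azureml.train.automl import AutoMLConfig  # pip install azureml-sdk[automl]

    config = AutoMLConfig(task="classification",
                          primary_metric="accuracy",     # best model chosen by this
                          iterations=30,                 # one model per iteration
                          iteration_timeout_minutes=5,
                          preprocess=True,
                          n_cross_validations=3,
                          verbosity=logging.INFO,
                          X=X, y=y)
    run = Experiment(ws, "automl-titanic").submit(config, show_output=True)
    return run.get_output()                              # (best_run, fitted_model)
```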

The output of this process is a dataset containing the metadata on training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. The ability to automatically choose the best model out of many training iterations with different algorithms and feature columns enables us to easily automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time, and the only difference is the dataset the model was trained on. But with automated machine learning solutions we are able to choose not only the best algorithm, but also the best feature combination and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release without any code edits.

My One Complaint

My one complaint is that installing the library is difficult. The documentation states that it works with Python 3.5.2 and up; however, I was unable to get the proper libraries installed and working correctly using a Python 3.6 interpreter. I simply created a Python 3.5.6 interpreter and it worked great! I am not sure if this was an error on my part or Microsoft's, but the AutoML capabilities worked as expected otherwise.

Overall, I think Azure Machine Learning's AutoML works great. It is not groundbreaking or a game changer, but it does exactly as advertised, which is huge in the current landscape of data tools where it seems as if many do not work as expected. Azure ML will run iterations over your dataset to figure out the best model possible, but in the end predictive solutions depend on the correlations in your data. For a more detailed example of Azure Machine Learning's AutoML feature, check out my walkthrough available here.

Automated Machine Learning

Traditionally, the development of predictive solutions is a challenging and time consuming process that requires expert resources in software development, data engineering, and data science. Engineers are required to complete the following tasks in an iterative and cyclical manner.

  1. Preprocess, feature engineer, and clean data
  2. Select appropriate model
  3. Tune Hyperparameters
  4. Analyze Results
  5. Repeat

As the industry has identified the blockers that make the development of machine learning solutions costly, we (as a community) have aimed to automate the process to make it easier and faster to deploy intelligent solutions. Selecting and tuning models can be automated to make analyzing results easier for expert and non-expert developers alike.

Automated machine learning takes a defined dataset with a specific target feature and automatically iterates over that dataset with different algorithms and combinations of input variables to select the best model. The purpose is to make developing these solutions require fewer resources, less domain knowledge, and less time.

How it Works

Most Auto ML libraries available focus on supervised learning problems. If you are unfamiliar, there are two main categories of machine learning.

  • Supervised Learning: you have input variables and output variables, and you apply algorithms to learn the mapping function from inputs to outputs.
  • Unsupervised Learning: you have input variables but no output variables to map them to. The goal is typically to identify trends and patterns in the data.

Note that there is a third category called semi-supervised learning, which is simply a combination of the two categories above, but we will not get into that here.
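The two categories can be contrasted with a quick scikit-learn sketch (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # input variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # output variable (labels)

# Supervised: learn the mapping from inputs X to outputs y.
clf = LogisticRegression().fit(X, y)
train_accuracy = clf.score(X, y)

# Unsupervised: no labels at all; look for structure in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The supervised model can be evaluated against its labels, while the clustering output is simply a grouping of points that still needs human interpretation.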

In order to use auto machine learning, your dataset must already be feature engineered. Meaning, you manually develop transformations to create a machine learning dataset that fits your problem. Most Auto ML libraries have built-in transformation functions covering the most common transformation steps, but in my experience these functions are rarely enough to get data machine learning ready.

Once you have feature engineered your dataset, you simply need to determine the type of algorithm you need. Most supervised learning algorithms can be classified as:

  • Classification: The output variable is one of a set number of outcomes. For example, predicting whether a customer will return to a store is either a “yes” or a “no”. Classification is further broken into binary classification (2 outcomes) and multiclass classification (3 or more outcomes).
  • Regression: The output is a numeric value. For example, predicting the prices of a car or house.

When given an algorithm type, Auto ML libraries will run iterations over your dataset to determine the best combination of features and the best hyperparameters for each algorithm. In turn, they actually train many models and give the engineer the best one.
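Under the hood, that search can be sketched as a loop over candidate algorithms and hyper-parameter grids. The scikit-learn sketch below (synthetic data, illustrative run-metadata fields) records every run and keeps the best model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate algorithms and the hyper-parameter grids to try for each.
search_space = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
]

runs = []  # metadata on every training run, like an Auto ML output dataset
for estimator, params in search_space:
    search = GridSearchCV(estimator, params, cv=3).fit(X_train, y_train)
    runs.append({"model": type(estimator).__name__,
                 "best_params": search.best_params_,
                 "cv_score": search.best_score_,
                 "estimator": search.best_estimator_})

# Pick the winning run by its cross-validated score.
best = max(runs, key=lambda r: r["cv_score"])
test_accuracy = best["estimator"].score(X_test, y_test)
```

Real Auto ML libraries add feature selection, preprocessing, and smarter search strategies on top of this basic loop, but the output is the same idea: a table of runs plus the best model.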
I would like to highlight the difference between engineering columns for machine learning and selecting the appropriate columns for machine learning. For example, let’s assume I want to predict how many point-of-sale transactions will occur in each hour of the day. The raw dataset is likely transactional, and will therefore require a developer to summarize the data at the hour level, i.e. grouping, summing, and averaging. Oftentimes developers will also create custom functions to describe the trends in the dataset. This process is feature engineering.
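That hourly summarization might look like the following pandas sketch (the transaction data is invented for illustration):

```python
import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:05", "2024-01-01 09:40",
        "2024-01-01 10:15", "2024-01-01 10:50", "2024-01-01 10:55",
    ]),
    "amount": [12.50, 8.00, 20.00, 5.25, 9.75],
})

# Group transactional rows to the hour level: counts, sums, and averages
# become the engineered features a model can actually train on.
hourly = (transactions
          .set_index("timestamp")
          .resample("1h")["amount"]
          .agg(["count", "sum", "mean"])
          .rename(columns={"count": "transaction_count",
                           "sum": "total_sales",
                           "mean": "avg_sale"})
          .reset_index())
```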

Feature selection comes after feature engineering. I may summarize my dataset with 10 different columns that I believe will be useful, but Auto ML libraries may select the 8 best columns out of the 10.

The difference between feature engineering and feature selection is huge. Most libraries will handle common or simple data engineering processes, however, the majority of the time a data engineer will need to manually create those transformations in order to use Auto ML libraries.
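The selection step itself is sketched below with scikit-learn’s SelectKBest, keeping the 8 highest-scoring of 10 engineered columns (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in for a feature-engineered dataset with 10 candidate columns.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=6, random_state=0)

# Score every column against the target and keep the 8 best.
selector = SelectKBest(score_func=f_classif, k=8).fit(X, y)
X_selected = selector.transform(X)
kept_columns = selector.get_support(indices=True)  # which 8 survived
```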

When Auto Machine Learning libraries are used in the development process, the output is usually a dataset containing metadata on the training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. Being able to automatically choose the best model out of many training iterations with different algorithms and feature columns enables us to easily automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time. But with Auto Machine Learning solutions we are able to choose not only the best algorithm, but also the best feature combination and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release without any code edits. It is simple, yet remarkably powerful.

Available Libraries

MLBox, a Python library for automated machine learning. Key features include distributed data processing, robust feature selection, accurate hyperparameter tuning, deep learning support, and model interpretation.

TPOT, an automated machine learning Python library that uses genetic programming to optimize machine learning pipelines. Similar to other automated machine learning libraries, it is built on top of scikit-learn.
A walkthrough of AutoML with TPOT is now available.

Auto-sklearn, a Python library that is great for all the scikit-learn developers out there. It sits on top of scikit-learn to automate the hyperparameter and algorithm selection process.

AzureML, an end-to-end platform for machine learning development and deployment. The library enables faster iterations by managing and tracking experiments, and fully supports most Python-based frameworks like PyTorch, TensorFlow, and scikit-learn. The Auto ML feature is baked into the platform to make model selection easy.
A walkthrough of AutoML with AzureML is now available.

Ludwig, a TensorFlow based platform for deep learning solutions was released by Uber to enable users with little coding experience. The developer simply needs to provide a training dataset and a configuration file identifying the features and labels desired.

Check out the libraries above! Automated machine learning is fun to play around with and apply to problems. I will be creating demos and walkthroughs of each of these libraries; once public, you will be able to find them on my GitHub.

Azure Machine Learning Services and Azure Databricks

As a consultant working almost exclusively in Microsoft Azure, developing and deploying artificial intelligence (AI) solutions to suit our clients’ needs is at the core of our business. Predictive solutions need to be easy to implement and must scale as they become business critical. Most organizations have existing applications and processes that they wish to infuse with AI. When deploying intelligence to integrate with existing applications, it needs to be a microservice-type feature that is easy for the application to consume. After trial and error, I have grown to love implementing new features using both the Azure Machine Learning Service (AML Service) and Azure Databricks.

Azure Machine Learning Service is a platform that allows data scientists and data engineers to train, deploy, automate, and manage machine learning models at scale and in the cloud. Developers can build intelligent algorithms into applications and workflows using Python-based libraries. The AML Service is a framework that allows developers to train wherever they choose, then wrap their model as a web service in a docker container and deploy to any container orchestrator they wish!

Azure Databricks is an optimized Apache Spark platform for heavy analytics workloads. It was designed with the founders of Apache Spark, allowing for a natural integration with Azure services. Databricks makes the setup of Spark as easy as a few clicks, allowing organizations to streamline development, and provides an interactive workspace for collaboration between data scientists, data engineers, and business analysts. Developers can enable their business with familiar tools and a distributed processing platform to unlock their data’s secrets.

While Azure Databricks is a great platform to deploy AI Solutions (batch and streaming), I will often use it as the compute for training machine learning models before deploying with the AML Service (web service).

Ways to Implement AI

The most common ways to deploy a machine learning solution are as a:

  • Consumable web service
  • Scheduled batch process
  • Continuously streaming predictions

Many organizations will start with smaller batch processes to support reporting needs, then as the need for application integration and near real-time predictions grow the solution turns into streaming or a web service.

Web Service Implementation

A web service is simply code that can be invoked remotely to execute a specific task. In machine learning solutions, web services are a great way to deploy a predictive model that needs to be consumed by one or more applications. Web services allow for simple integration into new and existing applications.

A major advantage of deploying web services over both batch and streaming solutions is the ability to add near real-time intelligence without changing infrastructure or architecture. Web services allow developers to add a feature to their code without a massive overhaul of current processes; they simply add a new API call to bring those predictions into consumption.
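To make that contract concrete, here is a minimal, dependency-free sketch of the JSON request/response shape such a scoring service typically exposes. The field names and the stand-in scoring rule are invented for illustration:

```python
import json

def score(request_body: str) -> str:
    """Stand-in for the server-side handler: parse features, return predictions."""
    payload = json.loads(request_body)
    # A real service would call model.predict(...) here; we fake a simple rule.
    predictions = [1 if sum(row) > 0 else 0 for row in payload["data"]]
    return json.dumps({"predictions": predictions})

# What the application's API call would send and receive:
request_body = json.dumps({"data": [[0.5, 1.2], [-2.0, 0.3]]})
response = json.loads(score(request_body))
```

From the application’s point of view, adding intelligence is just one more HTTP POST with this JSON body.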

One disadvantage is that predictions can only be made by calling the web service. Therefore, if a developer wishes to have predictions made on a scheduled basis or continuously, there needs to be an outside process to call that web service. However, if an individual is simply trying to make scheduled batch calls, I would recommend using Azure Databricks.

Batch Processing

Batch processing is a technique to transform a dataset all at one time, as opposed to individual data points. Typically this is a large amount of data that has been aggregated over a period of time. The main goal of batch processing is to efficiently work on a bigger window of data that consists of files or records. These processes are usually run in “off” hours so that they do not impact business-critical systems.

Batch processing is extremely effective at unlocking *deep insights* in your data. It allows users to process a large window of data to analyze trends over time and really allow engineers to manipulate and transform data to solve business problems.

As common as batch processing is, there are a few disadvantages to implementing a batch process. Maintaining and debugging a batch process can be difficult; anyone who has tried to debug a complex stored procedure in Microsoft SQL Server will understand this. Another issue that can arise in today’s cloud-first world is the cost of implementing a solution. Batch solutions are great at saving money because the infrastructure can spin up and shut down automatically, since it only needs to be on when the process is running. However, the implementation and knowledge transfer of the solution can often be the first hurdle faced.

By thoughtfully designing and documenting these batch processes, organizations should be able to avoid any issues with these types of solutions.

Stream Processing

Stream processing is the ability to analyze data as it flows from the data source (applications, devices, etc.) to a storage location (relational databases, data lakes, etc.). Because these systems are continuous, large amounts of data do not need to be stored at one time, and the focus is on finding insights in small windows of time. Stream processing is ideal when you wish to track or detect events that are close in time and occur frequently.

The hardest part of implementing a streaming data solution is keeping up with the input data rate: the solution must process data as fast as, or faster than, the rate at which the data sources generate it. If the solution cannot achieve this, it will build a never-ending backlog of data and may run into storage or memory issues. It can also be difficult to plan how data will be accessed after the stream is operated on, and to reduce the number of copies kept in order to optimize storage.

While there are difficulties with a streaming data architecture, it enables engineers to unlock insights as they occur. Meaning, organizations can detect or predict if there is a problem faster than any other method of data processing. Streaming solutions truly enable predictive agility within an organization.

Check out the Walkthrough

Implementing a machine learning solution with Azure Databricks and Azure Machine Learning allows data scientists to easily deploy the same model in several different environments. Azure Databricks is capable of making streaming predictions as data enters the system, as well as running large batch processes. While these two ways are great for unlocking insights from your data, often the best way to incorporate intelligence into an application is by calling a web service. Azure Machine Learning service allows a data scientist to wrap up their model and easily deploy it to an Azure Container Instance. From my experience this is the best and easiest way to integrate intelligence into existing applications and processes!

Check out the walkthrough I created that shows engineers how to train a model on the Databricks platform and deploys that model to AML Service.
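For a sense of what the AML Service wraps into that web service, here is a simplified sketch of the init()/run() scoring-script pattern it uses. The stand-in model below is invented for illustration; a real script would load a registered model (e.g. with joblib) inside init():

```python
# score.py - the shape of a scoring script the AML Service packages into a
# container. init() runs once at startup; run() handles each request.
import json

model = None

def init():
    # In a real deployment this would load the registered model from disk;
    # here a toy function stands in for model.predict.
    global model
    model = lambda rows: [sum(r) for r in rows]

def run(raw_data):
    data = json.loads(raw_data)["data"]
    return json.dumps({"result": model(data)})

# Simulate one request/response cycle the way the service would.
init()
output = json.loads(run(json.dumps({"data": [[1, 2], [3, 4]]})))
```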

Streaming Machine Learning with Azure Databricks

Organizations are beginning to not only benefit from streaming data solutions, but require them to differentiate themselves from their competitors. Real-time reporting, alerts, and predictions are now common requests for businesses of all sizes.

That said, they rarely understand the requirements or implementation details needed to achieve that level of data processing. Streaming data is information that is generated and consumed continuously. This information typically includes many data sources, including log files, point of sale data (in store and online), financial data, and IoT Devices, to name just a few.

Implementation

Fast and Easy

Generating business-changing insights from streaming data can be a difficult process; however, there are quick wins for organizations of all sizes. Microsoft Azure offers an abundance of PaaS or SaaS products that allow users to connect to sources and automate workflows.

With Azure Logic Apps, it is extremely easy to set up data pipelines that extract data from your social media pages, analyze them for sentiment analysis, and alert users when comments or posts need to be addressed. While this may not be a business-changing solution, it gives companies the ability to have a more intimate level of interaction with customers or users than they had before.

Microsoft has provided a simple solution for companies to take advantage of this capability. Using Azure Logic Apps and Microsoft Cognitive Services, one can be alerted of any positive or negative tweets that occur about their company. This is an easy and cost-effective way to implement intelligence into workflows. (Check out the example available here.) Azure Logic Apps connect to a variety of data sources, enabling organizations to obtain a quick win for real-time reporting with a deceptively simple drag-and-drop interface.

Ideal Implementation

From my experience, companies benefit most from custom machine learning solutions that solve a specific business problem using their own data. Creating solutions tailored to solve a problem in a specific environment allows a business to truly take a proactive approach as they incorporate intelligence throughout their organization. However, lack of knowledge is often a barrier for companies when implementing custom and scalable solutions.

Azure Databricks is an optimized Apache Spark platform perfect for data engineering and artificial intelligence solutions. It is an ideal platform for implementing batch or streaming processes on business critical data, and enables developers to create and deploy predictive analytics (machine learning and deep learning) solutions in an easy to use notebook environment.

Initially, organizations may implement their solutions as batch processes on Azure Databricks to save on cloud consumption costs, while planning for the future with a platform that will scale and grow with the needs of the business. Batch processes allow users to save on monthly costs by turning off virtual machines when they are not in use; then, when real-time insight is required, the developer can almost flip a switch to enable streaming data. Deploy cost-effective infrastructure now, with the ability to scale limitlessly as needed in the future.

Below is a common infrastructure diagram I implement with my customers. If streaming is not required then we simply bypass the event hub and write python or scala scripts to connect directly to the data sources.

  1. A number of data sources (devices, applications, databases etc.) that publish information to an Azure Event Hub (or Apache Kafka).
    1. Please note that whatever the data source is, there will always need to be some sort of process or application that collects data and sends it to the Event Hub.
  2. Azure Databricks will write the stream of data as quickly as possible to an unaltered, “raw”, data storage in an Azure Data Lake Store or Azure Blob Storage.
  3. In addition to writing to raw storage, Databricks will be used to cleanse data as needed and stream appropriately to an application database, Power BI, or use Databricks Delta for real-time insights, consumption, and intelligent automated actions. Please note that applications can read directly off an Event Hub as a consumer as well.
  4. Then use Azure Databricks to train a machine learning or deep learning model that can be used to make streaming or batch predictions.
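Steps 2 and 3 above can be illustrated with a tiny, dependency-free simulation; a real implementation would use Spark Structured Streaming reading from an Event Hub, and the event fields below are invented:

```python
from collections import defaultdict

raw_store = []                    # stand-in for the raw data lake zone
hourly_counts = defaultdict(int)  # stand-in for the cleansed/serving layer

def process_event(event):
    raw_store.append(event)                # step 2: persist unaltered, always
    if event.get("value") is not None:     # step 3: cleanse as needed...
        hourly_counts[event["hour"]] += 1  # ...and aggregate for consumption

stream = [
    {"hour": 9, "value": 12.5},
    {"hour": 9, "value": None},   # bad record: kept in raw, dropped downstream
    {"hour": 10, "value": 7.0},
]
for event in stream:
    process_event(event)
```

The key design point survives the simplification: raw storage receives everything, while the cleansed layer only receives records that pass validation.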

Tips to Actually Implement a Solution

When implementing new intelligent solutions with cloud infrastructure, it is likely to require internal business stakeholder buy-in. Therefore, in order to successfully implement a new predictive analytics solution you must:

  1.  Identify a business problem to solve and the stakeholders
  2.  Visualize or surface results to “wow” stakeholders
  3.  Start developing iteratively

If a team attempts to solve too many problems initially by trying to answer all possible questions, it will likely fail to “wow” a business user. Developers will likely spend so much time coding and analyzing the best path forward that they will only have code to show (code is a rather boring deliverable for most business users), and may simply never get past the proof-of-concept or analysis phase.

Business Problem

It is common for companies to start building a solution just to work with newer technology, without a true business problem to solve. This happens most often with organizations that want to start a data lake strategy: their main goal is to develop a data lake so that other business units can take advantage of the sandbox environment for predictive analytics.

I believe a centralized data lake is a great idea for any IT group. However, without a specific business problem, it is difficult to see the true value that a data lake or machine learning solution provides, which in turn can slow adoption. By focusing on solving a single use case first, there will be a reference showing other business units why they should use the enterprise data lake, and the reason for adoption becomes much more tangible.

Wow Stakeholders

There is not a more boring outcome to a business stakeholder than a project resulting in code. Machine learning or deep learning projects must have some type of end product that accurately describes the effectiveness of the solution created. In most machine learning solutions that I implement, I will almost always provide a Power BI Report. This ensures that the model and predictions are tangible because they are shown through visualizations. The business user now has the ability to actually use the predictions and show other internal users the solution.

Iterative Development

The most frustrating part of projects can be the initial planning or analysis phase. Large enterprises will often start a project and get stuck in analysis paralysis. I encourage teams I work with to simply start coding! This does not mean to do zero planning or proof of concepts, but at some point a team has to pick a direction and run with it. Avoid over analyzing various products by picking a small subset of well-known products, analyze them, and go!

Benefits

Streaming data architecture is beneficial in most scenarios where dynamic data is generated on a continual basis. Any industry can benefit from data that’s available almost instantly from the time it was created. Most organizations will begin with simple solutions to collect log data, detect outliers based on set (unintelligent) rules, or provide real-time reporting.

However, these solutions evolve, becoming more sophisticated data processing pipelines that can learn and detect outlier data points as they occur. The true advantage of streaming data is in performing advanced tasks, like machine learning, to take preventive or proactive action.

Processing a data stream effectively generates quick insights, but it does not replace batch processes. Typically, organizations implement both solutions to gain quick, more computationally intensive insights. Streaming data reacts to or anticipates events, while batch processing derives additional insights after the fact.

Batch processing can often require more compute. It’s ideal when time or speed is not a priority. One of the biggest advantages of Azure Databricks is that companies are able to use the same infrastructure for both their workflows!

Batch processing data requires a system to allow data to build up so that it can be processed all at once. This often requires larger compute resources than streaming due to the size of data, which can be a hurdle for most organizations; however, it allows users to aggregate and analyze large amounts of data over a longer period of time. Streaming solutions do less computing, but require machines to be running 100% of the time and typically look at data over a shorter period of time.

Example

I recently created a simple walkthrough of how to implement a streaming data solution on Azure. Check out the walkthrough on GitHub. Please note that an Azure subscription is required.

Conclusion

Organizations of any size can benefit from a streaming solution using Databricks and Azure Data Lake Store. It enables near real-time reporting, as well as providing a sandbox environment for iterative development of intelligent solutions. Azure Databricks and Data Lake Store allow a developer to implement both batch and streaming solutions in a familiar and easy-to-use environment.

3 Keys for Your Organization to Get the Most from AI

Organizations are constantly weighing the cost and benefit of investing in Artificial Intelligence (AI) solutions. Introducing advanced predictive analytics to a company can push them to the bleeding edge of innovation and past their competitors, however, the hurdle is often difficult to get over. But why?

First, understand how we define AI

The term AI can mean several different things; however, the most commonly used definition refers to the idea of intelligent machines, which is in slight contrast to the aspirational machine with human-level intelligence. There are endless ways to implement Artificial Intelligence, but the primary ways are via machine learning and deep learning.

Machine learning uses labeled historical data to train a model to understand patterns and make accurate predictions on new and unlabeled data points.

Deep learning is a subcategory of machine learning revolving around neural networks. While neural networks have been around for decades, they have truly exploded in research and use in the last five to ten years. Deep learning is used to solve problems that require a human such as image recognition, text/speech analytics, and decision making (i.e. game playing).

For the sake of this article, we will use AI as a synonym for machine learning and deep learning, even though AI in general may refer to software and hardware having human-level intelligence which is not achieved using current methodologies.

Let’s focus on 3 key areas to get the most from AI

At 10th Magnitude, our data intelligence community focuses on bringing analytics to solve our customers’ problems through data science, reporting, and big data pipeline projects.

Outside of developing solutions we encourage organizations to focus on the following cultural- and process-oriented areas to truly get the most out of their AI solutions.

Know Your Business Use Case(s) and Collaborate 

The majority of AI applications are powered by machine learning, which is used to solve a very specific problem using data. The first thing that I do with a customer who is new to machine learning is understand and identify all of their business problems. These problems often turn into new use cases for machine learning or deep learning.

Since the use cases are derived from the business itself, the key to creating a successful solution is collaboration between the data science team and business stakeholders. Additionally, stakeholders are likely the individuals who will need to approve the completion or evaluate the success of the developed application.

Therefore, understanding what is needed to solve the problem and then relating the problem back to the data is crucial. Keeping the stakeholder aware of the development cycle allows them to understand the challenges data scientists encounter when creating new predictive analytics workflows.

Additionally, this enables the organization to develop a data-driven culture. Involving business users in the development room gives non-technical folks insight into what is possible, allowing them to spot other areas where AI can be of use.

10th Magnitude believes in the idea of data-driven design, where we use data to solve problems, power applications, and change the way an organization thinks about their business.

Don’t Stop After Development

Developing a machine learning solution is difficult. It is an iterative cycle where individuals go back and forth with the business to understand the problem, gather data, and train models. Developing these solutions takes time; however, once development is done you are only partially finished with the project. More often than not, we see customers give up on a solution after the development portion because the model did not perform as well as hoped or the cost to put it all in production is simply too high. It takes a lot of work to move a solution from a development environment to a production one.

For example, we recently worked with an organization to build a model to detect anomalies for their different pieces of equipment. It took 2-3 weeks to develop the solution and an additional two weeks to set it up in production; we had to move the code to a production workload, build and release pipelines for two environments, model consumption, model monitoring, and more.

As data scientists, we often forget the difficulty and the amount of time it takes to move a solution to production so that the organization is able to see the true benefit.

It is important to keep in mind that empowering an application or workflow with machine learning is about more than just the application. It also gives people the ability to see what is possible with the data.

Automation is Your Friend

Usually, data scientists are not familiar with automated build and release pipelines, but it is a skill that is quickly becoming a requisite to properly participate in the predictive analytics space. DevOps is the process and culture of delivering value to customers in a sustainable manner. As predictive insights grow organically within an organization, individuals need to be available to develop new solutions, not maintain existing ones. Automation is extremely useful in data science projects, specifically for deploying changes to production with automated tests, retraining existing models with new data, and monitoring the performance of those models.

No data scientist should bring “right-click and deploy” predictive solutions to production; unfortunately, that happens more often than one would hope. Using Visual Studio Team Services (VSTS), we enable our customers to version control their code for team collaboration and set them up with automated build and release pipelines to train, test, and deploy their code.

As more data is collected, the solution will need to be retrained on a cadence to keep the model up to date so that it continues to make good predictions. While this may seem like a trivial manual task, the time it takes a data scientist to update a model could be used to create new solutions or enhance old ones.

Often clients will only focus on surfacing results to their end users via reports, applications, or workflows; they forget that they need to build an interface to their solution for themselves.

Data scientists are responsible for maintaining the quality of a solution over time, therefore, the metadata gathered from testing the solution (success criteria, training time etc.) should be stored and visualized to understand the current and historical performance of a solution.
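A minimal sketch of that metadata capture, using only the standard library (the field names are illustrative; a real project might log these runs to MLFlow or the AML Service instead):

```python
import csv, io, time

def log_run(history, model_name, accuracy, train_seconds):
    """Record one training/testing run's metadata for later review."""
    history.append({"model": model_name,
                    "accuracy": accuracy,
                    "train_seconds": train_seconds,
                    "logged_at": time.strftime("%Y-%m-%d")})

history = []
log_run(history, "churn-model", 0.87, 42.0)
log_run(history, "churn-model", 0.83, 40.0)   # a later retraining run

# Surface the history (e.g. to feed a Power BI report) as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=history[0].keys())
writer.writeheader()
writer.writerows(history)

# With history in hand, quality regressions become visible immediately.
latest_drop = history[0]["accuracy"] - history[-1]["accuracy"]
```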

Conclusion

Developing machine learning and deep learning applications is far from easy. Clients often struggle with the amount of effort it takes to create custom solutions, or they get so bogged down in technical details that they forget why their business started on the path to AI in the first place.

So, what are the keys to successfully incorporating AI into your organization? To start: collaboration between the data science team and business stakeholders, understanding the data science process, and deploying solutions using DevOps. This process makes predictive analytics possible for data science teams of all sizes, even as it changes the mindset of the organization as a whole.

If you’re ready to bring AI into your day-to-day, 10th Magnitude has the solutions to incorporate it seamlessly and painlessly, ensuring that you get the benefits without missing a beat.