About ryanchynoweth44

I am a data scientist and data engineer based out of Bellevue, WA. I deploy predictive, prescriptive, and data solutions in the cloud.

Azure Machine Learning vs MLFlow

Machine learning and deep learning models are difficult to get into production because of the highly cyclical development process of data science. As a data scientist I am not only experimenting with different hyper-parameters to tune my model, but constantly creating or adding new features to my dataset; keeping track of which features, hyper-parameters, and type of model were used can be a lot to remember. Additionally, I have to create a model that performs up to our evaluation criteria; otherwise, the process of deploying a model is irrelevant. While there seems to be an endless number of tools that can help engineers track and monitor tasks throughout the development process, tools like MLFlow and Azure Machine Learning aim to make this process manageable.

Once a data scientist develops a machine learning model, the code is rarely production ready and package dependencies are usually a mess. Not only do developers need to clean up the model training script, but they typically need to make big changes in how they acquire and clean data. During the training process, data scientists are able to access cold historical data that is easy to acquire and test on. However, once in production we need new code that acquires data as it is created, transforms it as needed, and makes predictions. In addition to a new data acquisition script, data scientists need to write a new scoring script that actually makes predictions and returns those predictions to the application. In the end, data science solutions have a lot of moving parts in both development and production.

For the longest time I recommended Azure Machine Learning to my clients to track, train, and deploy machine learning solutions. However, as Azure Databricks has become the premier big data and analytics resource in Azure, I have had to adjust that message slightly. Since MLFlow is integrated into Azure Databricks, it has easily become the default platform for managing data science experiments from development to production in a Spark environment; even so, I believe that Azure Machine Learning is a viable, and often better, tool choice for data scientists. In the demonstration available on my GitHub I show users how to train and track machine learning models using MLFlow, Azure Machine Learning, and MLFlow’s integration with Azure Machine Learning. If you are looking to deploy models using Azure Databricks and Azure Machine Learning, check out my previous demo available in the same repository.

MLFlow is centered around enhancing the engineer’s ability to track experiments so that they have visibility into performance during both development and production. Managing models is simplified by associating them with specific experiment runs, and MLFlow packages machine learning code so that it is reusable, reproducible, and shareable across an organization. With MLFlow a data scientist is able to execute and compare hundreds of training runs in parallel on any compute platform they wish and deploy those models in any manner they desire. MLFlow supports all programming languages through its REST interface, while R, Python, Java, and Scala are supported out of the box. Below is a screenshot of an MLFlow experiment view in Azure Databricks.

If I select one of my experiment runs by clicking on the execution date hyperlink, I can see more details about that specific run. We also have the ability to select multiple runs and compare them.

Azure Machine Learning is an enterprise-ready tool that integrates seamlessly with your Azure Active Directory and other Azure services. Similar to MLFlow, it allows developers to train models, deploy them with built-in Docker capabilities (please note that you do not have to deploy with Docker), and manage machine learning models in the cloud. Azure Machine Learning is fully compatible with popular packages like PyTorch, TensorFlow, and scikit-learn, and allows developers to train models locally and scale out to the cloud when needed.

At a high level, the Azure Machine Learning service provides a workspace that is the central container for all development and artifact storage. Within a workspace a developer can create experiments, in which all scripts, artifacts, and logging are tracked through experiment runs. The most important aspect of data science is our model, the object that takes in our source data and makes predictions. Azure Machine Learning provides a model registry that tracks and versions our experiment models, making it easier to deploy and audit predictive solutions. One of the most crucial aspects of any machine learning solution is deployment. The Azure Machine Learning service allows developers to package their Python code as a web service Docker container. These Docker images and containers are cataloged in an Azure Container Registry associated with the Azure Machine Learning workspace. This gives data scientists the ability to track a single training run from development into production by capturing all the training criteria, registering our model, building a container, and creating a deployment.

Below is a list of my experiments in a demo Azure Machine Learning Workspace.

By clicking into one of the experiments I am able to see all the runs and view the performance of each run through the values logged using the AzureML Python SDK. Note that I also have the ability to select multiple runs and compare them.

The above screenshots are very similar to MLFlow. Where I believe Azure Machine Learning extends and offers better capabilities is through the Compute, Models, Images, and Deployment tabs in our Azure ML Workspace.

Either programmatically or through the Azure Portal, I am able to create a remote compute target where I can offload my experiment runs from my local laptop and have everything logged and stored in my workspace.

By registering models in my workspace I make them available to build into a Docker image and deploy as a web service. Developers can use either the Azure ML SDK or the Azure Portal to do so.

Once an image is created I can easily deploy the Docker container anywhere that I can run containers. This can be in Azure, locally, or on the edge! One extremely nice feature built into Azure Machine Learning is the integration with Application Insights, which allows developers to capture telemetry data about the web service and the model in production.

Overall, while MLFlow and Azure Machine Learning are very similar, I typically side with Azure Machine Learning as the more enterprise-ready product that enables developers to deploy solutions faster. However, the cross-validation ability built into MLFlow, MLlib, and Databricks makes it extremely easy to tune hyper-parameters, while hyper-parameter tuning in Azure Machine Learning is a little more difficult.

One of my favorite features of MLFlow and Azure Machine Learning is the ability to use MLFlow in conjunction with Azure Machine Learning, which I highlight in my demo. Generally, I recommend that engineers developing exclusively on Azure Databricks use MLFlow because of the easy integration it provides; however, if a subset of solutions is being deployed or developed in a non-Spark environment, I would recommend a tool like Azure Machine Learning to centralize all data science experiments in one location. Please check out the example of MLFlow and Azure Machine Learning on Azure Databricks available on my GitHub!

Automated Machine Learning with AutoKeras

In hopes of adding to my AutoML blog series I thought it would be great to touch on automated deep learning libraries as well. When I first started playing around with neural networks I turned to the popular deep learning library Keras for my development. Therefore, I decided that my first automated deep learning library should be AutoKeras!

Deep learning is a subset of machine learning focused on developing predictive models using neural networks, and it allows humans to create solutions for object detection, image classification, speech recognition, and more. One of the most popular deep learning libraries available is Keras, a high-level API that runs on top of TensorFlow, CNTK, and Theano. The main goal of Keras is to enable developers to quickly iterate and develop neural networks on multiple frameworks.

Over the last year or so there has been a lot of development in the AutoML space, which is why I have been writing so many blogs showing off different libraries. AutoML libraries have mostly focused on traditional machine learning algorithms. Therefore, to take the Keras vision to the next level and increase the speed at which we can create neural networks, the Keras team has been developing the AutoKeras library, which aims to automatically learn the best architecture and hyper-parameters of a neural network to solve your specific need.

Since the library is still in pre-release there are not a ton of resources available when you start building a model with AutoKeras. Most of the examples show off the MNIST dataset, which is built into the Keras library. So while I do show a quick MNIST example in the demo, I also provide one with a custom image dataset that requires the developer to load images as numpy arrays prior to using them as input in model training.

The demo I have created walks users through the process of:

  • Curating your own image dataset
    • Note that we will be using the FastAI library, which is my favorite deep learning library and runs on top of PyTorch.
    • You can also use the data.zip file available in the GitHub repository.
  • Training a model with Keras
  • Training a model on the MNIST dataset using AutoKeras
  • Training a model with the downloaded images using AutoKeras

Overall, the AutoKeras library is rough. It does not quite work the way Keras does, which threw me off a bit, and a lot of the built-in functions that make Keras great are not available. I would not recommend using AutoKeras for any real neural network development, but the overall idea of using AutoML with Keras intrigues me greatly. I would recommend monitoring the library as development continues, as it could dramatically improve in the near future. Check out the demo I have provided on GitHub. Please note that I developed the demo on a Linux virtual machine, and that setup steps vary by environment. Additionally, GPU support will enable faster training times.

Quick Review: Databricks Delta

As the number of data sources grow and the size of that data increases, organizations have moved to building out data lakes in the cloud in order to provide scalable data engineering workflows and predictive analytics to support business solutions. I have worked with several companies to build out these structured data lakes and the solutions that sit on top of them. While data lakes provide a level of scalability, ease of access, and ability to quickly iterate over solutions, they have always fallen a little short on the structure and reliability that traditional data warehouses have provided.

Historically I have recommended that customers apply structure, not rules, to their data lake so that the aggregation and transformation of data is easier for engineers to serve to customers. The recommended structure was usually similar to a lambda architecture, as not all organizations have streaming data, but they would build out their data lake knowing this was a possibility in the future. The flow of data generally followed the process described below:

  • Batch and streaming data sources are aggregated into raw data tables with little to no transforms applied, e.g. streaming log data from a web application or batch loads of application database deltas.
  • Batch and streaming jobs clean and transform the raw data tables and save them to staging tables, executing the minimum number of transforms on a single data source, e.g. tabularizing a JSON file and saving it as a Parquet file without joining any other data, or aggregating granular data.
  • Finally, we aggregate data, join sources, and apply business logic to create our summary tables, i.e. the tables that data analysts, data scientists, and engineers ingest for their solutions.

One key to the summary tables is that they are business driven, meaning that we create these data tables to solve specific problems and to be queried on a regular basis. Additionally, I recently took a Databricks course, and instead of the terms raw, staging, and summary, they used bronze, silver, and gold tables respectively. I now prefer the Databricks terminology over my own.

Delta Lake is an open source project designed to make big data solutions easier and has been mostly developed by Databricks. Data lakes have always worked well, however, since Delta Lake came onto the scene, organizations are able to take advantage of additional features when updating or creating their data lakes.

  • ACID Transactions: Serializable transactions to ensure data integrity.
  • Data Versioning: Delta Lake provides data snapshots, allowing developers to access and revert to earlier versions of data for audits, rollbacks, and reproducing predictive experiments.
  • Open Format: Data is stored in Parquet format, making it easy to convert existing data lakes into Delta Lakes.
  • Unified Batch and Streaming: Combine streaming and batch data sources into a single location, and Delta tables can act as a streaming source as well.
  • Schema Enforcement: Provide and enforce a schema as needed to ensure correct data types and columns.
  • Schema Evolution: Easily change the schema of your data as it evolves over time.

Generally, Delta Lake offers a very similar development and consumption pattern as a typical data lake, however, the items listed above are added features that bring an enterprise level of capabilities that make the lives of data engineers, analysts, and scientists easier.

As an Azure consultant, Databricks Delta is the big data solution I recommend to my clients. To get started developing a data lake solution with Azure Databricks and Databricks Delta check out the demo provided on my GitHub. We take advantage of traditional cloud storage by using an Azure Data Lake Gen2 to serve as the storage layer on our Delta Lake.

Data Pipelines Using Apache Airflow

I previously wrote a blog and demo discussing how and why data engineers should deploy pipelines using containers. One slight disadvantage to deploying data pipeline containers is that managing, monitoring, and scheduling these activities can be a bit of a pain. One of the most popular tools for solving this is Apache Airflow. Apache Airflow is a platform to programmatically develop, schedule, and monitor workflows. Workflows are defined as code, making them easy to maintain, test, deploy, and collaborate on across a team.

At the core of Apache Airflow are workflows represented as Directed Acyclic Graphs (DAGs), written mainly as Python or Bash commands. DAGs are made up of tasks that can be scheduled on a specific cadence and monitored using the built-in Airflow webserver, with an interface that looks like the following:

Generally, I recommend two methods of using Airflow for monitoring and scheduling purposes with containers in Azure.

  1. Deploying pipelines as DAGs
  2. Deploying pipelines as RESTful web services

Developing your data pipelines as DAGs makes it easy to deploy and set a schedule for your jobs. Engineers write a data pipeline Python script to extract, transform, or move data, and a second script that imports the data pipeline into a DAG to be run on a specific cadence. An example of this would be the hello world example I have provided. While the development and integration of data pipelines in Azure is easier when they are created as DAGs, it requires the developer to deploy all their pipelines to the same Azure Container Instance or Kubernetes cluster.

Deploying data pipelines as RESTful web services allows developers to decouple scheduling from the data pipeline by deploying a web service separate from your Apache Airflow deployment. Separate deployments simply require a developer to write a DAG that calls your web service on the schedule you wish. This is a great way to offload the compute and memory required from your Airflow server as well. The one drawback is that this adds a little more work to handle web service secrets, but once it is handled it is easy to repeat and use across all your data pipelines. An example of this can be found in my RESTful deployment example. While the Azure Machine Learning service is geared toward deploying machine learning models as a web service, it can be used to deploy data pipelines as well, allowing the developer to offload the authentication and security work required when developing a web service.

Overall, I have seen organizations develop home-grown scheduling and monitoring techniques in order to capture all the metadata required to ensure their data pipelines are running properly. Apache Airflow makes this process easy by offering a great built-in user interface to visualize your data pipelines, and it provides a database for developers to build additional reporting as needed.

Check out the demo I created walking engineers through the development and deployment of data pipelines in Azure using Apache Airflow!

Power BI for the Enterprise

All data projects come down to consumption. How do you get your historical, predictive, and prescriptive analytic solutions into the hands of your users? I have worked on a wide variety of projects where we have infused intelligence into applications, automated systems, and reporting. Many organizations require an internal analytics strategy centered around reporting; therefore, I would like the focus of this blog to address the number one reason why enterprise reporting rollouts fail.

Report creators fail to provide a consumption layer that fits the desired use of the report. As an Azure consultant, I will focus on Power BI rollouts and how important it is to understand the four types of Power BI users: The Developer, The Power User, The Data Query, and The Quick Answer. Please note that these types of users are not exclusive, as a single individual can fall into any number of these categories.

The Power BI Developer

This individual creates, manages, and provides knowledge transfer on the report. The developer will love drilling into and cross filtering the report to find new information because they know how to push Power BI to its limits and will include as much functionality in the report as possible by default. However, this individual is not necessarily the intended business owner or end user of the report.

A Power BI developer knows the product extremely well, and their main responsibility is to create and manage reports for business users to support the organization. These employees are not necessarily easy to find, as the analytical skillset is uncommon in the marketplace, making their time valuable. Therefore, the report developer must understand the type of end user they are delivering the report to. There is nothing more frustrating than a developer creating a report that is too complex for the user, who simply discards the report after a few uses. End users discarding reports due to complexity is the biggest blocker when it comes to implementing an organizational analytics solution, and it can be avoided by Power BI developers.

The Power User

A Power User is someone who uses a report to make strategic decisions for the organization. They understand the product well enough to create a few simple reports if the data model is provided, and they are able to understand the cross-filter and drill-down capabilities so that they can use the tool to answer new questions and discover insights.

From experience, these users are desired in an organization but rarely found. It is difficult to find an individual who knows Power BI well enough to use all of its capabilities but is not a Power BI Developer. Therefore, most of the people who fall into this category are the ones who are actually creating the report as well.

As a developer, if you have a Power User consuming the report then include as many dynamic visualizations and capabilities as possible. The Power User loves finding insights and will spend great lengths of time understanding the data you provide.

The Data Query

The most common user of Power BI is the Excel user who says they want to learn Power BI but doesn’t put the effort into understanding it. Instead, they use Power BI as a query interface to export data into Excel for their own analysis. This is extremely common and is a great way to utilize Power BI. Organizations typically shake their heads at an individual who uses Power BI as a data acquisition tool, but I believe that getting data into the hands of users is the number one goal of an analytics strategy, and this is a great way to provide specific data to users.

As a developer, if you have an individual simply querying for data then you should focus on providing simple data visualizations and lots of data tables. The visualizations will give them a quick look at trends but the tables will provide them all the information they need to complete their analysis.

The Quick Answer

Another common use of Power BI is to get quick, high-level answers about a dataset. These individuals want to spend as little time as possible getting the information they need so that they can make intelligent decisions.

As a developer, you will need to know the exact questions this individual wants answered and create simple visuals that answer those questions. The visuals can be dynamic like bar charts and maps, but typically summary numbers are sufficient. These reports are typically provided in a dashboard using the Power BI Service.

Conclusion

Understanding your business users’ capabilities and needs for data consumption determines how successful your analytics deployment is. All the users described above are present in every organization and are crucial to day-to-day business. Creating consumable data interfaces rests on the developer, so understand what people need and good luck!

Automated Machine Learning with MLBox

In continuation of my AutoML blog series, we will be evaluating the capabilities of MLBox.

What is MLBox?

MLBox is an extremely popular and powerful automated machine learning python library. As noted by the MLBox Documentation, it provides features for:

  • Fast reading of data
  • Distributed data processing
  • Robust feature selection
  • Accurate hyper-parameter tuning
  • State-of-the-art machine learning and deep learning models
  • Model interpretation

MLBox is similar to other automated machine learning libraries in that it does not automate the data science process, but augments a developer’s ability to quickly create machine learning models. MLBox simply helps developers create the optimal model and select the best features to make predictions for the label of your choice.

One drawback of the MLBox library is that it doesn’t necessarily conform to a data scientist’s process; rather, the data scientist has to work the way the library expects. One example is that I will often use three datasets when developing machine learning solutions in an attempt to avoid overfitting: train, validation, and test. Having these three datasets is rather difficult with MLBox.

Let’s get started using the MLBox library!

Developing with MLBox

Installing MLBox

For this demo we will be using Anaconda virtual environments, and I will be using Visual Studio Code as my IDE. For more information on how to use the Anaconda distribution with Visual Studio Code, check out this blog I wrote. Additionally, I will be developing on a Windows machine, where MLBox support is currently an experimental release.

We will also need a Linux machine to do our MLBox development; if you do not have one available, you can create one by following these instructions.

  1. First let’s create a new Anaconda Environment.
    conda create -n MLBoxEnv python=3.6
    conda activate MLBoxEnv
  2. Next we will run the following installs.
    pip install setuptools
    pip install mlbox

Training a Model

As with the other AutoML libraries, we will be using the Titanic dataset, where we will use specific features to predict whether or not a passenger survived the catastrophe. For more information about the dataset check out the Kaggle Competition.

  1. Please download the data from the GitHub repository here. Save the file to a data folder in your application directory. Please note that the application directory I will be using is the TPOT directory in my repository.

  2. Now that we have our data and MLBox installed, let’s read our datasets into memory and start preparing them for a machine learning algorithm. MLBox has its own Reader class for efficient and distributed reading of data; one key feature of this class is that it expects a list of file paths to your training and test datasets. Interacting with my datasets was slightly foreign at first, but once I learned that the Reader class creates a dictionary object with pandas dataframes and our target (label) column as a pandas series, it became easier to work with.

    from mlbox.preprocessing import *
    from mlbox.optimisation import *
    from mlbox.prediction import *
    train_path = ["./data/titanic_train.csv", "./data/titanic_test.csv"]
    reader = Reader(sep=",", header=0)
    data = reader.train_test_split(train_path, 'Survived')

    There are a few things worth noting about the train_test_split function. A dataset is only considered a test set if there is no label column present; otherwise, it will be merged with the train set. Being able to provide a list of file paths is a nice feature because it allows developers to easily ingest many files at once, which is common with bigger datasets and data lakes. Since the function automatically scans for the target column, there is little work for the developer to do to even identify a test dataset. Additionally, it determines whether it is a regression or classification problem based on our label and will automatically encode the column as needed.

  3. One really nice feature of MLBox is the ability to automatically remove drift variables. I am by no means an expert on drift, but the idea is that process or observation behavior may change over time. In turn, the data will slowly change, causing the relationships between features to change as well. MLBox has built-in functionality to deal with this drift. We will use a drift transform.

    data = Drift_thresholder().fit_transform(data)

  4. As with all automated machine learning libraries, the key feature is not necessarily the algorithms but the ability to select the appropriate features and optimal hyper-parameters for the algorithm. Using MLBox’s Optimiser class we are able to search a dimensional space to figure out the best set of parameters. Therefore, to optimize we must create a parameter space and select the scoring metric we wish to optimize.

    opt = Optimiser(scoring='accuracy', n_folds=3)
    opt.evaluate(None, data)

    space = {
        'ne__numerical_strategy': {"search": "choice", "space": [0]},
        'ce__strategy': {"search": "choice", "space": ["label_encoding", "random_projection", "entity_embedding"]},
        'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
        'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
    }
    best_params = opt.optimise(space, data, 10)

  5. Next we can use the Predictor class to train a machine learning model.

    model = Predictor().fit_predict(best_params, data)

    The line of code above will create a folder called save so that it can export an sklearn pipeline that you can reuse for model deployment or further validation. Additionally, it provides exports of feature importance, a CSV of test predictions, and a target encoder object so that you can map the encoded values back to their original values.

For more information on MLBox, please check out their GitHub repository or the official documentation page. MLBox is a great library to assist data scientists in building a machine learning solution. For a full copy of the demo Python file, please refer to my personal GitHub.

Linux Development from a Windows Guy

As a data scientist and data engineer, I get a lot of comments from peers on the fact that I prefer to develop on a Windows machine compared to Mac or Linux. I have always really liked my Windows machines, and for the longest time I stuck with them even when specific machine learning libraries weren’t supported on Windows. However, about a year ago I finally gave in and started using a Linux distribution for about half of my data science work because it simply became too difficult to avoid unsupported libraries. Additionally, I was deploying a lot of Python code using Docker, which ended up running on a Linux distribution.

While it was time to start using Linux, I wanted to keep using Windows for my day-to-day work; therefore, I decided to create a Hyper-V VM. To be completely honest, there are a ton of resources on the internet that walk you through setting up a Linux Hyper-V VM on Windows (and probably better than this one), but I am writing a demo of a popular AutoML library, MLBox, which is not yet supported on Windows, so this will serve as the first step of that demo.

Creating a Linux VM

  1. My favorite way to develop on Linux is to create a Hyper-V VM on my local desktop. To enable Hyper-V on your Windows 10 machine, search for “Turn Windows features on or off” by opening your start menu.
  2. Now scroll down to find “Hyper-V” and check the box next to it to enable.
  3. Now that Hyper-V is enabled, we can create a virtual machine on our computer. First we will need to download a Linux distribution; I prefer Ubuntu. Note that this is a large download (~2GB), so it can take some time depending on your network speed.

    Once you have the `.iso` file we can create a virtual machine. In your start menu search for “Hyper-V Manager”

  4. In the Hyper-V Manager, navigate to “New” > “Virtual Machine…”. This will launch the setup wizard.
  5. For the most part the wizard defaults will be acceptable. The first menu will have you provide the name of your virtual machine.
  6. The second menu will have you select the generation of the VM. We will want to use “Generation 1”.
  7. Third, we will need to allocate memory for the machine. The default of 1024 MB of memory is fine, and we will also check the box to use “Dynamic Memory”.
  8. Next we will need to configure network access for the virtual machine. Simply select “Default Switch”.
  9. Next we have the option to specify where we want to store our virtual machine hard disk. It is easiest to simply use the default locations. Note that the name of the hard disk will be determined by what you named your VM earlier.
  10. Now we simply need to select the Ubuntu `.iso` file we downloaded, and click Finish.
  11. This will launch the Ubuntu setup menu; simply follow the instructions to create the virtual machine with your username and password. Now you have an Ubuntu machine to develop your data science solutions on! I would recommend downloading and installing the Anaconda distribution of Python.

Automated Machine Learning with TPOT

For part two of my automated machine learning series I am focusing on TPOT, a Python library that uses genetic programming to optimize data science pipelines. TPOT’s success and popularity have grown extraordinarily since its initial commit in late 2015. As of March 20, 2019, TPOT has 286 watchers, 5,441 stars, and 969 forks on GitHub.

TPOT stands for Tree-Based Pipeline Optimization Tool. Its goal is to help automate the development of ML pipelines by combining a flexible tree representation of pipelines with stochastic search algorithms to find the best scikit-learn pipeline possible. Once the best predictive pipeline has been found, TPOT will export the pipeline as Python code so that a data scientist can continue developing from there. In addition to faster development and great models, my experience with TPOT is that it is a great learning tool for newer data scientists who want to understand how to develop better models manually.
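
An exported pipeline is just ordinary scikit-learn code that you can edit and retrain. The sketch below shows the general shape of such a pipeline; the dataset and the chosen steps are illustrative stand-ins, not actual TPOT output:

```python
# Illustrative shape of a TPOT-exported pipeline; the preprocessing and
# estimator steps here were chosen by hand for the example, not by TPOT.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; TPOT's export instead loads your training data from a file
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The exported pipeline is a normal scikit-learn Pipeline object
exported_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
exported_pipeline.fit(X_train, y_train)
print("holdout accuracy:", exported_pipeline.score(X_test, y_test))
```

Because the export is plain Python, retraining on fresh data is as simple as re-running the script.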

One advantage of automated machine learning is the ability to automatically retrain different types of models with different parameters and columns as your data changes. This enables a data scientist to provide a solution that is dynamic and intelligent; however, automated machine learning is compute intensive and time consuming because it trains several different models. TPOT gives users the ability to export their model to a Python script, avoiding the need to apply automated machine learning to every retraining process and allowing fast retraining of a model with the same high-performing accuracy.

Like most automated machine learning libraries, TPOT helps automate everything except data acquisition, data cleaning, and complex feature engineering. TPOT, like scikit-learn, does provide some simple and dynamic feature engineering functions.

In the first part of my automated machine learning series I evaluated the Azure Machine Learning AutoML library. Unlike the end-to-end platform that Azure Machine Learning provides, TPOT is a standalone package meant for developing the best models. In my experience TPOT is an excellent package that can be used in conjunction with platforms like Azure ML and MLFlow to not only train the best model, but also manage the data science lifecycle.

The best way to familiarize yourself with TPOT is to get started! Check out the demo I have created and the accompanying code on my GitHub. Note that, staying in line with our Azure Machine Learning AutoML example, we will be using the Titanic dataset, which is also an example solution provided by the TPOT developers. The walkthrough I provide is slightly different from theirs, particularly surrounding the one-hot encoding of a variable that I deemed unnecessary.

Data Analytics, Data Engineering, and Containers

Implementing scalable and manageable data solutions in the cloud can be difficult. Organizations need to develop a strategy that not only succeeds technically but also fits their team’s persona. There are a number of Platform as a Service (PaaS) and Software as a Service (SaaS) products that make it easy to connect to, transform, and move data in your network. However, the surplus of tools can make it difficult to figure out which ones to use, and often these tools can only do a fraction of what an engineer can do with a scripting language. Many of the engineers I work with love functional languages when working with data. My preferred data language is Python; however, there can be a barrier when moving from a local desktop to the cloud. When developing data pipelines using a language like Python, I recommend using Docker containers.

Historically, it has not been a simple task to deploy code to different environments and have it run reliably. This issue arises most when a data scientist or data engineer moves code from local development to a test or production environment. Containers consist of their own run-time environment and contain all the required dependencies, eliminating environment variability at deployment. Containers make it easy to develop in the same environment as production and remove a lot of risk when deploying.

Creating Data Pipeline Containers

My preferred Python distribution is Anaconda because of how easy it is to create and use different virtual environments, allowing me to ensure that there are no Python or dependency conflicts when working on different solutions. Virtual environments are extremely popular with Python developers, so the transition to deploying with containers should feel familiar. If you are unfamiliar with Anaconda virtual environments, check out this separate blog post where I talk about best practices and how to use these environments when working with Visual Studio Code.

Data pipelines always start with data extraction. As a best practice, the engineer should land their raw data into a data store as quickly as possible. The raw data gives organizations an untouched source, allowing a developer to reprocess data as needed to solve different business problems. Once in the raw data store, the developer will transform and manipulate data as needed. In Azure, my favorite data store to handle raw, transformed, and business data is the Azure Data Lake Store. Below is a general flow diagram of data pipelines where the transformations can be as complicated as machine learning models or as simple as normalizing the data. In this scenario each intermediate pipe could be a container, or the entire data pipeline could be a single container. At each stage the data may be read from a data source or chained from a previous transform; this flexibility is left up to the developer. Containers make versioning and deploying data applications easy because they allow an engineer to develop how they prefer and quickly deploy with a few configuration steps and commands.
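
The chained-transform idea can be sketched in plain Python; the stage functions and the weather records below are hypothetical placeholders for real extract, transform, and load code:

```python
# Hypothetical chained data pipeline: each stage consumes the previous stage's output
def extract_raw():
    # In practice this would pull from an API or database into the raw store
    return [{"city": "Bellevue", "temp_f": 59.0}, {"city": "Seattle", "temp_f": 57.5}]

def transform(records):
    # Example normalization: convert Fahrenheit to Celsius
    return [{**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in records]

def load(records):
    # In practice this would write to the transformed zone of the data lake
    return {r["city"]: r["temp_c"] for r in records}

business_data = load(transform(extract_raw()))
print(business_data)  # {'Bellevue': 15.0, 'Seattle': 14.2}
```

Each function here could live in its own container, or the whole chain could run as a single container, exactly as described above.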

Most engineers prefer to develop locally on their laptops using notebooks (like Jupyter notebooks) or a code editor (like Visual Studio Code). Therefore, when a new data source is identified, engineers should simply start developing locally using an Anaconda environment and iterate on their solution so it can be packaged as a container. If the engineer is using Python to extract data, they will need to track all dependencies in a requirements.txt file and make note of any special installations (like SQL drivers) required to extract data and write it to a raw data lake store. Once the initial development is complete, the engineer needs to get their code ready for deployment! This workflow is ideal for small to medium-sized data sources; the velocity of true big data can be an issue for batch extraction, in which case a streaming solution (i.e. Apache Spark) is preferred.
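
One simple way to capture those dependencies, assuming pip is used inside the active Anaconda environment, is to snapshot them for the container build:

```shell
## From inside the activated environment, snapshot pip dependencies
pip freeze > requirements.txt
```

Note that `pip freeze` only captures Python packages; special installations like SQL drivers still need to be handled separately in the Dockerfile.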

Deploying Data Pipeline Containers in Azure

To set the stage: you are a developer and you have written a Python data extraction application using a virtual environment on your machine. Since you started with a fresh Python interpreter and added requirements as you went, you have compiled a list of the installed libraries, drivers, and other dependencies needed to solve the problem. How do you get from running the extraction on a local machine to the cloud?

First we will create and run a Docker container locally for testing purposes. Then we will deploy the container to Azure Container Instance, the fastest and simplest way to run a container in Azure. Data extractors deployed as containers are usually batch jobs that the developer wants to run on a specific cadence. There are two ways to achieve this CRON-style scheduling: have the application “sleep” after each data extraction, or have a centralized enterprise scheduler (like Apache Airflow) kick off the process as needed. I recommend the latter because it provides a central location to monitor all data pipeline jobs and avoids having to redeploy or make code changes if the developer wishes to change the schedule.

Before deploying a Docker container there are a few steps the engineer should complete:

  1. Create a requirements.txt file in the solution’s root directory
  2. Create a Dockerfile file in the solution’s root directory
  3. Make sure the data extractor is in an “application” folder off the root directory
  4. Write automated tests using the popular pytest Python package. This is not required, but I would recommend it for automated testing; I do not include it in the walkthrough provided.
  5. Build an image locally
  6. Build and run the container locally for testing
  7. Deploy to Azure Container Instance (or Azure Kubernetes Service)

Here is an example requirements.txt file for the sample application available here:

azure-mgmt-resource==1.2.2
azure-mgmt-datalake-store==0.4.0
azure-datalake-store==0.0.19
configparser==3.5.0
requests==2.20.0
pytest==3.5.1

Here is an example Dockerfile that starts with a Python 3.6 image, copies our application into the working directory, and runs our data extraction. In this case we have a Python script, extract_data.py, in the application folder:

FROM python:3.6

RUN mkdir /src
COPY . /src/
WORKDIR /src
RUN pip install -r requirements.txt
CMD [ "python", "./application/extract_data.py" ]

To build an image locally you will need Docker installed. If you do not have it, please download it here; otherwise, make sure Docker is currently running on your machine. Open up a command prompt, navigate to your project’s root directory, and run the following commands:

## Build an image from the current directory 
docker build -t my-image-name .
## Run the container using the newly created image
docker run my-image-name

To deploy the container to Azure Container Instance, you must first create an Azure Container Registry and push your image to the registry. Next you will deploy that image to Azure Container Instance using the Azure CLI. Note that the Azure CLI can be used to automate these deployments in the future, or an engineer can take advantage of Azure DevOps build and release tasks.
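
A sketch of those Azure CLI steps follows; every resource name below is a placeholder you would substitute with your own:

```shell
## Create a resource group and container registry (names are placeholders)
az group create --name my-rg --location westus2
az acr create --resource-group my-rg --name myregistry --sku Basic --admin-enabled true
az acr login --name myregistry

## Tag the locally built image and push it to the registry
docker tag my-image-name myregistry.azurecr.io/my-image-name:v1
docker push myregistry.azurecr.io/my-image-name:v1

## Deploy the pushed image to Azure Container Instance
az container create --resource-group my-rg --name my-extractor \
    --image myregistry.azurecr.io/my-image-name:v1 \
    --registry-login-server myregistry.azurecr.io \
    --registry-username <acr-username> --registry-password <acr-password>
```

The registry username and password can be retrieved with `az acr credential show` once the admin account is enabled.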

Now that you have deployed the container manually to Azure Container Instance, it is important to manage these applications. Data extractors often run on a schedule and will likely require external triggers and monitoring of the data pipelines. Stay tuned for a future blog on how to manage your data containers!

Conclusion

Developing data solutions using containers is an excellent way to manage, orchestrate, and develop scalable analytics and artificial intelligence applications. This walkthrough showed engineers how to create a weather data source extractor, wrap it up as a container, and deploy the container both locally and in the cloud.

Auto Machine Learning with Azure Machine Learning

I recently wrote a blog introducing automated machine learning (AutoML). If you have not read it, you can check it out here. With a surplus of AutoML libraries in the marketplace, my goal is to provide quick overviews and demos of libraries that I use to develop solutions. In this blog I will focus on the benefits of the Azure Machine Learning Service (AML Service) and the AutoML capabilities it provides. The AutoML library of Azure Machine Learning is different (though not unique) from many other libraries because it also provides a platform to track, train, and deploy your machine learning models.

Azure Machine Learning Service

An Azure Machine Learning Workspace (AML Workspace) is the foundation for developing Python-based predictive solutions, and gives the developer the ability to deploy them as web services in Azure. The AML Workspace allows data scientists to track their experiments, train and retrain their machine learning models, and deploy machine learning solutions as containerized web services. When an engineer provisions an Azure Machine Learning Workspace, the resources below are also created within the same resource group, and they are the backbone of Azure Machine Learning.

The Azure Container Registry gives a developer easy integration for creating, storing, and deploying web services as Docker containers. One added feature is easy, automatic tagging to describe your container and associate it with specific machine learning models.

An Azure Storage account enables fast, dynamic storage of information from our experiments (i.e. models and outputs). After training an initial model using the service, I would recommend manually navigating through the folders; doing this will give you deeper insight into how the AML Workspace functions. Simply and automatically capturing metadata and outputs from our training procedures is crucial to visibility into performance over time.

When we deploy a web service using the AML Service, we allow the Azure Machine Learning resource to handle all authentication and key generation code. This allows data scientists to focus on developing models instead of writing authentication code. Using Azure Key Vault, the AML Service allows for extremely secure web services that you can expose to external and internal customers. 

Once your secure web service is deployed, Azure Machine Learning integrates seamlessly with Application Insights for all code logging and web service traffic, giving users the ability to monitor the health of the deployed solution.

A key feature allowing data scientists to scale their solutions is remote compute targets. Remote compute gives developers the ability to easily get their solution off their laptop and into Azure with a familiar IDE and workflow. Remote targets allow developers to pay only for the run time of the experiment, making for a low cost of entry into the cloud analytics space. Additionally, there was a service in Azure called Batch AI that acted as a queuing resource to handle several jobs at one time. Batch AI has been integrated into Azure Machine Learning, allowing data scientists to train many machine learning models in parallel on separate compute resources.

Azure Machine Learning provides data prep capabilities in the form of a “dprep” file, allowing users to package their data transforms into a single line of code. I am not a huge fan of dprep, but it is a capability that makes it easier to handle the data transformations required to score new data in production. Like most platforms, the AML Service offers specialized “pipeline” capabilities to connect various machine learning phases, like data acquisition, data preparation, and model training.

In addition to remote compute, Azure Machine Learning enables users to deploy anywhere they can run Docker. Theoretically, one could train a model locally and deploy it locally (or to another cloud), using Azure only to track experiments for a cheap monthly rate. However, I would suggest taking advantage of Azure Kubernetes Service for auto-scaling your web service to handle upticks in traffic, or Azure Container Instance for a more consistent compute target.

Using Azure Machine Learning’s AutoML

Now it’s time to get to the actual point of this blog: Azure Machine Learning’s AutoML capabilities. In order to use them you will need to pip install `azureml-sdk`. This is the same Python library used to track your experiments in the cloud.

As with any data science project, it starts with data acquisition and exploration. In this phase of development we explore our dataset and identify desired feature columns to use to make predictions. Our goal here is to create a machine learning dataset to predict our label column.

Once we have created our machine learning dataset and identified whether we are going to implement a classification or a regression solution, we can let Azure Machine Learning do the rest of the work to identify the best feature column combination, algorithm, and hyper-parameters. To automatically train a machine learning model using Azure ML, the developer needs to define the settings for the experiment and then submit the experiment for model tuning. Once submitted, the library will iterate through different machine learning algorithms and hyper-parameter settings, following your defined constraints, and choose the best-fit model by optimizing an accuracy metric. The settings available to auto-train machine learning models are:

  • iteration_timeout_minutes: Time limit, in minutes, for each iteration. Maximum total runtime = iterations × iteration_timeout_minutes.
  • iterations: Number of iterations. Each iteration produces a machine learning model.
  • primary_metric: The metric to optimize; the best model is chosen based on this value.
  • preprocess: When True, the experiment may automatically preprocess the input data with basic data manipulations.
  • verbosity: Logging level.
  • n_cross_validations: Number of cross-validation splits when validation data is not specified.
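
These settings are typically collected in a plain dictionary; the values below are illustrative, and the commented-out call shows how the 2019-era `azureml-sdk` AutoMLConfig class consumes them:

```python
import logging

# Example AutoML settings mirroring the options above (values are illustrative)
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 30,
    "primary_metric": "AUC_weighted",
    "preprocess": True,
    "verbosity": logging.INFO,
    "n_cross_validations": 5,
}

# The settings dictionary is then unpacked into the experiment configuration:
#   from azureml.train.automl import AutoMLConfig
#   automl_config = AutoMLConfig(task="classification",
#                                X=X_train, y=y_train,
#                                **automl_settings)
```

The configured experiment is then submitted to the workspace, which runs the iterations and records each model’s metrics.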

The output of this process is a dataset containing the metadata on training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. The ability to automatically choose the best model out of many training iterations with different algorithms and feature columns lets us automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time, and the only difference is the dataset the model was trained on. But with automated machine learning solutions we are able to choose not only the best algorithm, but also the best feature combination and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release without any code edits.

My One Complaint

My one complaint is that installing the library is difficult. The documentation states that it works with Python 3.5.2 and up; however, I was unable to get the proper libraries installed and working correctly using a Python 3.6 interpreter. I simply created a Python 3.5.6 interpreter and it worked great! I am not sure if this was an error on my part or Microsoft’s, but the AutoML capabilities worked as expected otherwise.

Overall, I think Azure Machine Learning’s AutoML works great. It is not groundbreaking or a game changer, but it does exactly as advertised, which is huge in the current landscape of data tools where many do not work as expected. Azure ML will run iterations over your dataset to figure out the best model possible, but in the end predictive solutions depend on the correlation between your data points. For a more detailed example of Azure Machine Learning’s AutoML feature, check out my walkthrough available here.