Quick Review: Databricks Delta

As the number of data sources grows and the size of that data increases, organizations have moved to building out data lakes in the cloud to provide scalable data engineering workflows and predictive analytics that support business solutions. I have worked with several companies to build out these structured data lakes and the solutions that sit on top of them. While data lakes provide scalability, ease of access, and the ability to iterate quickly over solutions, they have always fallen a little short on the structure and reliability that traditional data warehouses provide.

Historically I have recommended that customers apply structure, not rules, to their data lake so that aggregating and transforming data is easier for the engineers who serve it to customers. The recommended structure was usually similar to a lambda architecture: not all organizations have streaming data, but they would build out their data lake knowing it was a possibility in the future. The flow of data generally followed the process described below:

  • Batch and streaming data sources are aggregated into raw data tables with little to no transforms applied, e.g. streaming log data from a web application or batch loading application database deltas.
  • Data in our raw data tables is cleaned, transformed, and saved to staging tables by batch or streaming jobs that execute the minimum number of transforms on a single data source, e.g. we tabularize a JSON file and save it as a Parquet file without joining any other data, or we aggregate granular data.
  • Finally, we aggregate data, join sources, and apply business logic to create our summary tables, i.e. the tables that data analysts, data scientists, and engineers ingest for their solutions.
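
As a rough illustration of the three layers above, here is a minimal sketch in plain Python. The sample events, field names, and function names are hypothetical, and in-memory lists stand in for the raw, staging, and summary tables that would really live in the data lake.

```python
import json
from collections import defaultdict

# Raw layer: events land exactly as they arrived (hypothetical web log lines).
RAW_EVENTS = [
    '{"user": {"id": 1, "country": "US"}, "event": "click", "ms": 120}',
    '{"user": {"id": 2, "country": "CA"}, "event": "view", "ms": 300}',
    '{"user": {"id": 1, "country": "US"}, "event": "view", "ms": 80}',
]


def to_staging(raw_lines):
    """Tabularize a single source: parse the JSON and flatten it, no joins."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            "user_id": record["user"]["id"],
            "country": record["user"]["country"],
            "event": record["event"],
            "duration_ms": record["ms"],
        })
    return rows


def to_summary(staging_rows):
    """Apply business logic: event count and total duration per country."""
    summary = defaultdict(lambda: {"events": 0, "total_ms": 0})
    for row in staging_rows:
        summary[row["country"]]["events"] += 1
        summary[row["country"]]["total_ms"] += row["duration_ms"]
    return dict(summary)


staging = to_staging(RAW_EVENTS)
summary = to_summary(staging)
print(summary)
```

Keeping the raw lines untouched means the staging and summary steps can be rerun at any time with different logic.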

One key to the summary tables is that they are business driven, meaning that we create these tables to solve specific problems and to be queried on a regular basis. Additionally, I recently took a Databricks course, and instead of the terms raw, staging, and summary, they used bronze, silver, and gold tables respectively. I now prefer the Databricks terminology over my own.

Delta Lake is an open source project, developed mostly by Databricks, that is designed to make big data solutions easier. Data lakes have always worked well; however, since Delta Lake came onto the scene, organizations can take advantage of additional features when updating or creating their data lakes.

  • ACID Transactions: Serializable transactions to ensure data integrity.
  • Data Versioning: Delta Lake provides data snapshots, allowing developers to access and revert to earlier versions of data for audits, rollbacks, and reproducing predictive experiments.
  • Open Format: Data is stored in Parquet format, making it easy to convert existing data lakes into Delta Lakes.
  • Unified Batch and Streaming: Combine streaming and batch data sources into a single location. Delta tables can act as a streaming source as well.
  • Schema Enforcement: Provide and enforce a schema as needed to ensure correct data types and columns.
  • Schema Evolution: Easily change the schema of your data as it evolves over time.
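
To make the difference between schema enforcement and schema evolution concrete, here is a toy sketch in plain Python. It is not Delta Lake's API or implementation: the table is just a dictionary, and `append_rows` and its `merge_schema` flag are names I made up for illustration.

```python
def append_rows(table, rows, merge_schema=False):
    """Append rows to a toy table shaped like {"schema": {...}, "rows": [...]}.

    Enforcement: reject rows with unknown columns or wrong types.
    Evolution: when merge_schema is True, add new columns to the schema instead.
    """
    schema = dict(table["schema"])
    for row in rows:
        for column, value in row.items():
            if column not in schema:
                if not merge_schema:
                    raise ValueError(f"unexpected column: {column}")
                schema[column] = type(value)
            elif not isinstance(value, schema[column]):
                raise TypeError(f"{column} expects {schema[column].__name__}")
    # Commit only after every row has passed validation (all or nothing).
    table["schema"] = schema
    table["rows"].extend(rows)
    return table


table = {"schema": {"id": int, "name": str}, "rows": []}
append_rows(table, [{"id": 1, "name": "a"}])

try:
    # Enforcement: the extra "age" column is rejected by default.
    append_rows(table, [{"id": 2, "name": "b", "age": 30}])
except ValueError as error:
    print("rejected:", error)

# Evolution: the same write succeeds once we opt in to changing the schema.
append_rows(table, [{"id": 2, "name": "b", "age": 30}], merge_schema=True)
print(sorted(table["schema"]))
```

The write is validated in full before anything is appended, which loosely mirrors the all-or-nothing behavior that transactions provide.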

Generally, Delta Lake offers a development and consumption pattern very similar to that of a typical data lake; however, the features listed above bring enterprise-level capabilities that make the lives of data engineers, analysts, and scientists easier.

As an Azure consultant, I recommend Databricks Delta as the big data solution for my clients. To get started developing a data lake solution with Azure Databricks and Databricks Delta, check out the demo provided on my GitHub. We take advantage of traditional cloud storage by using Azure Data Lake Storage Gen2 as the storage layer for our Delta Lake.

Data Analytics, Data Engineering, and Containers

Implementing scalable and manageable data solutions in the cloud can be difficult. Organizations need to develop a strategy that not only succeeds technically but also fits their team’s persona. There are a number of Platform as a Service (PaaS) and Software as a Service (SaaS) products that make it easy to connect to, transform, and move data in your network. However, the surplus of tools can make it difficult to figure out which ones to use, and often the tools can only do a fraction of what an engineer can do with a scripting language. Many of the engineers I work with love functional languages when working with data. My preferred data language is Python; however, there can be a barrier when moving from a local desktop to the cloud. When developing data pipelines using a language like Python, I recommend using Docker containers.

Historically, deploying code to different environments and having it run reliably has not been a simple task. This issue arises most when a data scientist or data engineer is moving code from local development to a test or production environment. Containers have their own run-time environment and contain all the required dependencies; therefore, they eliminate environment differences at deployment. Containers make it easy to develop in the same environment as production and eliminate a lot of risk when deploying.

Creating Data Pipeline Containers

My preferred Python distribution is Anaconda because of how easy it is to create and use different virtual environments, allowing me to ensure that there are no Python or dependency conflicts when working on different solutions. Virtual environments are extremely popular with Python developers; therefore, the transition to deploying with containers should feel familiar. If you are unfamiliar with Anaconda virtual environments, check out this separate blog post where I talk about best practices and how to use these environments when working with Visual Studio Code.

Data pipelines always start with data extraction. As a best practice, the engineer should land raw data in a data store as quickly as possible. The raw data gives organizations a source of data that is untouched, allowing a developer to reprocess data as needed to solve different business problems. Once the data is in the raw data store, the developer will transform and manipulate it as needed. In Azure, my favorite data store to handle raw, transformed, and business data is the Azure Data Lake Store. Below is a general flow diagram of data pipelines where the transformations can be as complicated as machine learning models, or as simple as normalizing the data. In this scenario each intermediate pipe could be a container, or the entire data pipeline could be a single container. At each stage of the pipeline the data may be read from a data source or chained from a previous transform. This flexibility is left up to the developer. Containers make versioning and deploying data applications easy because they allow an engineer to develop how they prefer, and quickly deploy with a few configuration steps and commands.

Most engineers prefer to develop locally on their laptops using notebooks (like Jupyter notebooks) or a code editor (like Visual Studio Code). Therefore, when a new data source is identified, engineers should simply start developing locally using an Anaconda environment and iterate on their solution before packaging it up as a container. If the engineer is using Python to extract data, they will need to track all dependencies in a requirements.txt file, and make note of any special installations (like SQL drivers) required to extract data and write it to a raw data lake store. Once the initial development is completed, the engineer will then need to get their code ready for deployment! This workflow is ideal for small to medium size data sources because the velocity of true big data can often be an issue for batch data extraction, and a streaming data solution is preferred (e.g. Apache Spark).
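
Landing raw data can be sketched in a few lines of Python. This is a minimal example under stated assumptions: a local folder stands in for the raw zone of the data lake, and the `land_raw` function, the folder layout, and the `weather` source name are hypothetical choices of mine rather than a prescribed convention.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload, lake_root, source, extracted_at=None):
    """Write one extracted payload, unmodified, to a date-partitioned raw folder."""
    extracted_at = extracted_at or datetime.now(timezone.utc)
    folder = (
        Path(lake_root) / "raw" / source
        / f"{extracted_at:%Y}" / f"{extracted_at:%m}" / f"{extracted_at:%d}"
    )
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{source}_{extracted_at:%Y%m%dT%H%M%S}.json"
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the data lake in this demo.
    with tempfile.TemporaryDirectory() as lake:
        print(land_raw({"temp_f": 71}, lake, "weather"))
```

Partitioning by extraction date keeps every run's output separate, so an earlier extraction can be reprocessed without being overwritten by a later one.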

Deploying Data Pipeline Containers in Azure

To set the stage, you are a developer and you have written a Python data extraction application using a virtual environment on your machine. Since you started with a fresh Python interpreter and added requirements as you went, you have compiled a list of the installed libraries, drivers, and other dependencies needed to solve your problem. How do you get from running the extraction on a local machine to running it in the cloud?

First we will create and run a Docker container locally for testing purposes. Then we will deploy the container to Azure Container Instance, the fastest and simplest way to run a container in Azure. Data extractors that are deployed as containers are usually batch jobs that the developer wants to run on a specific cadence. There are two ways to achieve this cron-style scheduling: have the application “sleep” after each data extraction, or have a centralized enterprise scheduler (like Apache Airflow) kick off the process as needed. I recommend the latter because it provides a central location to monitor all data pipeline jobs, and avoids having to redeploy or make code changes if the developer wishes to change the schedule.
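
The two scheduling options can be contrasted with a short sketch. Everything here is hypothetical: `extract_once` is a placeholder for the real extraction, and the `max_runs` and `sleep` parameters exist only so the loop can be exercised in a demo or test.

```python
import time


def extract_once():
    """Placeholder for one extraction run; returns a short status string."""
    return "extracted"


def run_forever(interval_seconds, max_runs=None, sleep=time.sleep):
    """Option 1: the container schedules itself by sleeping between runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        extract_once()
        runs += 1
        # Skip the final sleep when a run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def main(mode="once", interval_seconds=3600):
    """Option 2 (mode="once"): run a single extraction and exit, so an
    external scheduler decides when the container starts."""
    if mode == "once":
        return extract_once()
    return run_forever(interval_seconds)


if __name__ == "__main__":
    print(main())
```

With the run-once entry point the schedule lives entirely in the scheduler, whereas in the sleep loop the interval is baked into the running container.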

Before a Docker container is ready to deploy, there are a few things the engineer needs to do:

  1. Create a requirements.txt file in the solution’s root directory
  2. Create a Dockerfile file in the solution’s root directory
  3. Make sure the data extractor is in an “application” folder off the root directory
  4. Write automated tests using the popular pytest Python package. This is not required, but I would recommend it for automated testing. I do not include this in the walkthrough that is provided.
  5. Build an image locally
  6. Build and run the container locally for testing
  7. Deploy to Azure Container Instance (or Azure Kubernetes Service)
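
Since the walkthrough does not include tests, here is a minimal sketch of what step 4 could look like. The `normalize_reading` transform and its field names are hypothetical stand-ins for whatever the extractor actually does; the `test_` functions use plain assertions, which pytest discovers and runs by naming convention.

```python
def normalize_reading(raw):
    """Hypothetical transform: keep the fields we need and convert F to C."""
    return {
        "city": raw["city"].strip().title(),
        "temp_c": round((raw["temp_f"] - 32) * 5 / 9, 1),
    }


def test_normalize_reading_converts_units():
    result = normalize_reading({"city": " seattle ", "temp_f": 212})
    assert result == {"city": "Seattle", "temp_c": 100.0}


def test_normalize_reading_requires_temperature():
    try:
        normalize_reading({"city": "Seattle"})
    except KeyError:
        pass
    else:
        raise AssertionError("expected a KeyError for a missing temp_f")
```

In a real project the transform would live in the application folder and the tests in their own file, so that running `pytest` from the root directory picks them up.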

Here is an example requirements.txt file for the sample application available here:

azure-mgmt-resource==1.2.2
azure-mgmt-datalake-store==0.4.0
azure-datalake-store==0.0.19
configparser==3.5.0
requests==2.20.0
pytest==3.5.1

Here is an example Dockerfile that starts with a Python 3.6 image, copies our application into the working directory, installs the requirements, and runs our data extraction. In this case we have a Python script, extract_data.py, in the application folder:

FROM python:3.6

RUN mkdir /src
COPY . /src/
WORKDIR /src
RUN pip install -r requirements.txt
CMD [ "python", "./application/extract_data.py" ]

To build an image locally you will need Docker installed. If you do not have it installed, please download it here; otherwise, make sure that Docker is currently running on your machine. Open up a command prompt, navigate to your project’s root directory, and run the following commands:

## Build an image from the current directory 
docker build -t my-image-name .
## Run the container using the newly created image
docker run my-image-name

To deploy the container to Azure Container Instance, you must first create an Azure Container Registry and push your image to the registry. Next you will need to deploy that image to Azure Container Instance using the Azure CLI. Note that the Azure CLI can be used to automate these deployments in the future, or an engineer can take advantage of Azure DevOps Build and Release tasks.

Now that you have deployed the container manually to Azure Container Instance, it is important to manage these applications. Data extractors often run on a schedule and will therefore likely require external triggers and monitoring. Stay tuned for a future blog on how to manage your data containers!

Conclusion

Developing data solutions using containers is an excellent way to manage, orchestrate, and develop scalable analytics and artificial intelligence applications. This walkthrough takes engineers through the process of creating a weather data source extractor, wrapping it up as a container, and deploying the container both locally and in the cloud.

Auto Machine Learning with Azure Machine Learning

I recently wrote a blog introducing automated machine learning (AutoML). If you have not read it you can check it out here. With a surplus of AutoML libraries in the marketplace, my goal is to provide quick overviews and demos of the libraries that I use to develop solutions. In this blog I will focus on the benefits of the Azure Machine Learning Service (AML Service) and the AutoML capabilities it provides. The AutoML library of Azure Machine Learning is different (though not unique) from many other libraries because it also provides a platform to track, train, and deploy your machine learning models.

Azure Machine Learning Service

An Azure Machine Learning Workspace (AML Workspace) is the foundation for developing Python-based predictive solutions, and gives the developer the ability to deploy them as web services in Azure. The AML Workspace allows data scientists to track their experiments, train and retrain their machine learning models, and deploy machine learning solutions as a containerized web service. When an engineer provisions an Azure Machine Learning Workspace, the resources below are also created within the same resource group, and are the backbone of Azure Machine Learning.

The Azure Container Registry gives a developer easy integration with creating, storing, and deploying our web services as Docker containers. One added feature is the easy and automatic tagging to describe your container and associate the container with specific machine learning models. 

An Azure Storage account enables fast, dynamic storage of information from our experiments, e.g. models and outputs. After training an initial model using the service, I would recommend manually navigating through the folders. Doing this will give you deeper insight into how the AML Workspace functions. Simply and automatically capturing metadata and outputs from our training procedures is crucial to visibility and performance over time.

When we deploy a web service using the AML Service, we allow the Azure Machine Learning resource to handle all authentication and key generation code. This allows data scientists to focus on developing models instead of writing authentication code. Using Azure Key Vault, the AML Service allows for extremely secure web services that you can expose to external and internal customers. 

Once your secure web service is deployed, Azure Machine Learning integrates seamlessly with Application Insights for all code logging and web service traffic, giving users the ability to monitor the health of the deployed solution.

A key feature that allows data scientists to scale their solutions is remote compute targets. Remote compute gives developers the ability to easily get their solution off their laptop and into Azure with a familiar IDE and workflow. The remote targets allow developers to pay only for the run time of the experiment, making for a low cost of entry into the cloud analytics space. Additionally, there was a service in Azure called Batch AI that was a queuing resource to handle several jobs at one time. Batch AI was integrated into Azure Machine Learning, allowing data scientists to train many machine learning models in parallel with separate compute resources.

Azure Machine Learning provides data prep capabilities in the form of a “dprep” file allowing users to package up their data transforms into a single line of code. I am not a huge fan of the dprep but it is a capability that makes it easier to handle the required data transformations to score new data in production. Like most platforms, the AML Service offers specialized “pipeline” capabilities to connect various machine learning phases with each other like data acquisition, data preparation, and model training.  

In addition to remote compute, Azure Machine Learning enables users to deploy anywhere they can run Docker. Theoretically, one could train a model locally, deploy it locally (or to another cloud), and use Azure only to track experiments for a cheap monthly rate. However, I would suggest deploying to Azure Kubernetes Service, which can auto scale your web service to handle upticks in traffic, or to Azure Container Instance for a more consistent compute target.

Using Azure Machine Learning’s AutoML

Now it’s time to get to the actual point of this blog: Azure Machine Learning’s AutoML capabilities. To use them you will need to pip install `azureml-sdk`. This is the same Python library used to track your experiments in the cloud.

As with any data science project, it starts with data acquisition and exploration. In this phase of developing we are exploring our dataset and identifying desired feature columns to use to make predictions. Our goal here is to create a machine learning dataset to predict our label column.

Once we have created our machine learning dataset and identified whether we are going to implement a classification or a regression solution, we can let Azure Machine Learning do the rest of the work to identify the best feature column combination, algorithm, and hyper-parameters. To automatically train a machine learning model using Azure ML the developer will need to define the settings for the experiment, then submit the experiment for model tuning. Once submitted, the library will iterate through different machine learning algorithms and hyper-parameter settings, following your defined constraints. It chooses the best-fit model by optimizing the primary metric you specify. The settings available to auto train machine learning models are:

  • iteration_timeout_minutes: Time limit for each iteration. The maximum total runtime is roughly iterations * iteration_timeout_minutes.
  • iterations: Number of iterations. Each iteration produces a machine learning model.
  • primary_metric: Metric to optimize. We will choose the best model based on this value.
  • preprocess: When True the experiment may auto preprocess the input data with basic data manipulations.
  • verbosity: Logging level.
  • n_cross_validations: Number of cross validation splits when the validation data is not specified.
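
As a sketch, the settings above can be collected in a plain dictionary. The values here are illustrative assumptions rather than recommendations, and `max_runtime_minutes` is a hypothetical helper. Settings like these are typically unpacked as keyword arguments into the SDK's `AutoMLConfig`; that call is omitted so the example runs without the SDK installed.

```python
import logging

# Keys are the settings described above; the values are only examples.
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 30,
    "primary_metric": "accuracy",
    "preprocess": True,
    "verbosity": logging.INFO,
    "n_cross_validations": 5,
}


def max_runtime_minutes(settings):
    """Upper bound on experiment time if every iteration hits its timeout."""
    return settings["iterations"] * settings["iteration_timeout_minutes"]


print(max_runtime_minutes(automl_settings))
```

Working out the upper bound before submitting is a quick sanity check on how long, and therefore how costly, an experiment could be.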

The output of this process is a dataset containing the metadata on training runs and their results. This dataset enables developers to easily choose the best model based on the metrics provided. The ability to automatically choose the best model out of many training iterations with different algorithms and feature columns lets us automate the model selection process for *each* model deployment. With typical machine learning deployments, engineers deploy the same algorithm with the same feature columns each time, and the only difference is the dataset the model was trained on. But with AutoML solutions we are able to choose the best algorithm, feature combination, and hyper-parameters each time. That means we can deploy a decision tree model trained on 4 columns in one release, then deploy a logistic regression model trained on 5 columns in another release without any code edits.
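
Selecting the best model from the run metadata can be sketched in a few lines. The `runs` list below is made-up data in a shape I chose for illustration, not the SDK's actual output format, and `best_run` is a hypothetical helper.

```python
# Made-up run metadata: one entry per AutoML iteration.
runs = [
    {"iteration": 0, "algorithm": "DecisionTree", "features": 4, "accuracy": 0.81},
    {"iteration": 1, "algorithm": "LogisticRegression", "features": 5, "accuracy": 0.86},
    {"iteration": 2, "algorithm": "RandomForest", "features": 4, "accuracy": 0.84},
]


def best_run(run_metadata, primary_metric, higher_is_better=True):
    """Return the run with the best value for the chosen metric."""
    chooser = max if higher_is_better else min
    return chooser(run_metadata, key=lambda run: run[primary_metric])


winner = best_run(runs, "accuracy")
print(winner["algorithm"], winner["features"])
```

Because the selection is data driven, the deployed algorithm can change from one release to the next without editing this code.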

My One Complaint

My one complaint is that installing the library is difficult. The documentation states that it works with Python 3.5.2 and up; however, I was unable to get the proper libraries installed and working correctly using a Python 3.6 interpreter. I simply created a Python 3.5.6 interpreter and it worked great! I am not sure if this was an error on my part or Microsoft’s, but the AutoML capabilities worked as expected otherwise.

Overall, I think Azure Machine Learning’s AutoML works great. It is not ground breaking or a game changer, but it does exactly as advertised, which is huge in the current landscape of data where it seems as if many tools do not work as expected. Azure ML will run iterations over your dataset to figure out the best model possible, but in the end predictive solutions depend on the correlation between your data points. For a more detailed example of Azure Machine Learning’s AutoML feature, check out my walkthrough available here.

Azure Machine Learning Services and Azure Databricks

As a consultant working almost exclusively in Microsoft Azure, developing and deploying artificial intelligence (AI) solutions to suit our clients’ needs is at the core of our business. Predictive solutions need to be easy to implement and must scale as they become business critical. Most organizations have existing applications and processes that they wish to infuse with AI. When deploying intelligence to integrate with existing applications, it needs to be a microservice-type feature that is easy for the application to consume. After trial and error I have grown to love implementing new features using both the Azure Machine Learning Service (AML Service) and Azure Databricks.

Azure Machine Learning Service is a platform that allows data scientists and data engineers to train, deploy, automate, and manage machine learning models at scale and in the cloud. Developers can build intelligent algorithms into applications and workflows using Python-based libraries. The AML Service is a framework that allows developers to train wherever they choose, then wrap their model as a web service in a docker container and deploy to any container orchestrator they wish!

Azure Databricks is an optimized Apache Spark platform for heavy analytics workloads. It was designed with the founders of Apache Spark, allowing for a natural integration with Azure services. Databricks makes the setup of Spark as easy as a few clicks, allowing organizations to streamline development, and provides an interactive workspace for collaboration between data scientists, data engineers, and business analysts. Developers can enable their business with familiar tools and a distributed processing platform to unlock their data’s secrets.

While Azure Databricks is a great platform to deploy AI Solutions (batch and streaming), I will often use it as the compute for training machine learning models before deploying with the AML Service (web service).

Ways to Implement AI

The most common ways to deploy a machine learning solution are as a:

  • Consumable web service
  • Scheduled batch process
  • Continuously streaming predictions

Many organizations will start with smaller batch processes to support reporting needs; then, as the need for application integration and near real-time predictions grows, the solution turns into streaming or a web service.

Web Service Implementation

A web service is simply code that can be invoked remotely to execute a specific task. In machine learning solutions, web services are a great way to deploy a predictive model that needs to be consumed by one or more applications. Web services allow for simple integration into new and existing applications.

A major advantage to deploying web services over both batch and streaming solutions is the ability to add near real-time intelligence without changing infrastructure or architecture. Web services allow developers to simply add a feature to their code without having to do a massive overhaul of the current processes because they simply need to add a new API call to bring those predictions to consumption.
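
To show how small that API call can be, here is a minimal sketch using only the Python standard library. The scoring URL, the `{"data": ...}` payload shape, and the bearer-key header are assumptions for illustration; the real contract depends on how the scoring service was written and deployed.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, api_key=None):
    """Build a POST request carrying the rows to score as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


def score(request, timeout=10):
    """Send the request and decode the JSON response (needs a live service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and key; nothing is sent until score() is called.
request = build_scoring_request(
    "http://localhost:8080/score", [[5.1, 3.5, 1.4, 0.2]], api_key="my-key"
)
print(request.get_method(), request.full_url)
```

Splitting request construction from the network call keeps the integration easy to unit test inside the consuming application.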

One disadvantage is that predictions can only be made by calling the web service. Therefore, if a developer wishes to have predictions made on a scheduled basis or continuously, there needs to be an outside process to call that web service. However, if an individual is simply trying to make scheduled batch calls, I would recommend using Azure Databricks.

Batch Processing

Batch processing is a technique to transform a dataset at one time, as opposed to individual data points. Typically this is a large amount of data that has been aggregated over a period of time. The main goal of batch processing is to efficiently work on a bigger window of data that consists of files or records. These processes are usually run in “off” hours so that they do not impact business critical systems.

Batch processing is extremely effective at unlocking *deep insights* in your data. It allows users to process a large window of data to analyze trends over time, and allows engineers to manipulate and transform data to solve business problems.

As common as batch processing is, there are a few disadvantages to implementing a batch process. Maintaining and debugging a batch process can sometimes be difficult. Anyone who has tried to debug a complex stored procedure in Microsoft SQL Server will understand this difficulty. Another issue that can arise in today’s cloud first world is the cost of implementing a solution. Batch solutions are great at saving money on infrastructure because it can spin up and shut down automatically, since it only needs to be on when the process is running. However, the implementation and knowledge transfer of the solution can often be the first hurdle faced.

By thoughtfully designing and documenting these batch processes, organizations should be able to avoid any issues with these types of solutions.

Stream Processing

Stream processing is the ability to analyze data as it flows from the data source (application, devices, etc.) to a storage location (relational databases, data lakes, etc.). Due to the continuous nature of these systems, large amounts of data do not need to be stored at one time, and the systems focus on finding insights in small windows of time. Stream processing is ideal when you wish to track or detect events that are close in time and occur frequently.
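
The idea of finding insights in small windows of time can be illustrated with a toy tumbling-window counter. This is a plain Python sketch, not how a streaming engine such as Spark implements windows; the event tuples and the function name are hypothetical, and events are assumed to arrive in time order.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) for each window of a time-ordered stream.

    Only the current window is held in memory at any time.
    """
    current_start = None
    counts = Counter()
    for timestamp, name in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            # The stream moved past the current window, so emit and reset it.
            yield current_start, counts
            counts = Counter()
        current_start = window_start
        counts[name] += 1
    if current_start is not None:
        yield current_start, counts


# (seconds since start, event name) pairs standing in for a live stream.
events = [(0, "login"), (3, "error"), (9, "error"), (12, "login"), (31, "error")]
windows = list(tumbling_window_counts(events, window_seconds=10))
print(windows)
```

Each window is emitted as soon as the stream moves past it, so a spike of errors can be detected within one window length rather than after a nightly batch.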

The hardest part of implementing a streaming data solution is keeping up with the input data rate, meaning that the solution must process data as fast as or faster than the data sources generate it. If the solution is unable to achieve this, it will build a never ending backlog of data and may run into storage or memory issues. Having a plan to access data after the stream is operated on, and to reduce the number of copies to optimize storage, can also be difficult.
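
A little arithmetic shows why the processing rate matters so much. The sketch below assumes constant rates, which real streams rarely have, and the function name and numbers are hypothetical.

```python
def backlog_after(seconds, input_rate, processing_rate, starting_backlog=0):
    """Records still waiting after `seconds`, given records-per-second rates."""
    growth = (input_rate - processing_rate) * seconds
    # A backlog cannot go below zero once the processor has caught up.
    return max(0, starting_backlog + growth)


# Falling 100 records/second behind for one hour leaves 360,000 records waiting.
behind = backlog_after(3600, input_rate=1000, processing_rate=900)

# With 200 records/second of headroom, a 100,000 record backlog clears within the hour.
caught_up = backlog_after(
    3600, input_rate=1000, processing_rate=1200, starting_backlog=100000
)
print(behind, caught_up)
```

Even a small shortfall compounds into a large backlog, which is why headroom above the expected input rate is worth planning for.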

While there are difficulties with a streaming data architecture, it enables engineers to unlock insights as they occur. Meaning, organizations can detect or predict if there is a problem faster than any other method of data processing. Streaming solutions truly enable predictive agility within an organization.

Check out the Walkthrough

Implementing a machine learning solution with Azure Databricks and Azure Machine Learning allows data scientists to easily deploy the same model in several different environments. Azure Databricks is capable of making streaming predictions as data enters the system, as well as large batch processes. While these two ways are great for unlocking insights from your data, often the best way to incorporate intelligence into an application is by calling a web service. Azure Machine Learning service allows a data scientist to wrap up their model and easily deploy it to Azure Container Instance. From my experience this is the best and easiest way to integrate intelligence into existing applications and processes!

Check out the walkthrough I created that shows engineers how to train a model on the Databricks platform and deploy that model to AML Service.