Quick Review: Databricks Delta

As the number of data sources grows and the size of that data increases, organizations have moved to building out data lakes in the cloud in order to provide scalable data engineering workflows and predictive analytics to support business solutions. I have worked with several companies to build out these structured data lakes and the solutions that sit on top of them. While data lakes provide a level of scalability, ease of access, and the ability to quickly iterate over solutions, they have always fallen a little short on the structure and reliability that traditional data warehouses provide.

Historically I have recommended that customers apply structure, not rules, to their data lake so that aggregating and transforming data for downstream consumers is easier for engineers. The recommended structure was usually similar to a lambda architecture, as not all organizations have streaming data, but they would build out their data lake knowing streaming was a possibility in the future. The flow of data generally followed the process described below:

  • Batch and streaming data sources are aggregated into raw data tables with little to no transforms applied, e.g. streaming log data from a web application or batch loads of application database deltas.
  • Batch and streaming jobs clean and transform the raw data tables and save the results to staging tables, executing the minimum number of transforms on a single data source, e.g. tabularizing a JSON file and saving it as a Parquet file without joining any other data, or aggregating granular data.
  • Finally, we aggregate data, join sources, and apply business logic to create our summary tables, i.e. the tables data analysts, data scientists, and engineers ingest for their solutions.

One key to the summary tables is that they are business driven, meaning we create these tables to solve specific problems and to be queried on a regular basis. Additionally, I recently took a Databricks course and instead of the terms raw, staging, and summary, they used bronze, silver, and gold tables respectively. I now prefer the Databricks terminology over my own.
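
To make the layering concrete, below is a minimal PySpark sketch of that flow, assuming a SparkSession named spark as in a Databricks notebook; the paths and column names are hypothetical, and in practice each layer would typically be its own scheduled job.

from pyspark.sql import functions as F

# Bronze (raw): land the source data with little to no transformation
raw = spark.read.json("/mnt/datalake/landing/web_logs/")
raw.write.mode("append").parquet("/mnt/datalake/bronze/web_logs")

# Silver (staging): minimal cleaning of a single source, no joins
bronze = spark.read.parquet("/mnt/datalake/bronze/web_logs")
silver = (bronze
          .dropDuplicates(["event_id"])
          .withColumn("event_date", F.to_date("event_timestamp")))
silver.write.mode("overwrite").parquet("/mnt/datalake/silver/web_logs")

# Gold (summary): aggregate and apply business logic for analysts, scientists, and engineers
gold = (silver
        .groupBy("event_date", "page")
        .agg(F.count(F.lit(1)).alias("page_views")))
gold.write.mode("overwrite").parquet("/mnt/datalake/gold/daily_page_views")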

Delta Lake is an open source project, developed mostly by Databricks, designed to make big data solutions easier. Data lakes have always worked well; however, since Delta Lake came onto the scene, organizations are able to take advantage of additional features when updating or creating their data lakes:

  • ACID Transactions: Serializable transactions ensure data integrity across concurrent reads and writes.
  • Data Versioning: Delta Lake provides data snapshots allowing developers to access and revert earlier versions of data for audits, rollbacks, and reproducing predictive experiments.
  • Open Format: Data is stored in Parquet format, making it easy to convert existing data lakes into Delta Lakes.
  • Unified Batch and Streaming: Combine streaming and batch data sources in a single location; a Delta table can act as a streaming source or sink as well.
  • Schema Enforcement: Provide and enforce a schema as needed to ensure correct data types and columns.
  • Schema Evolution: Easily change the schema of your data as it evolves over time.

Generally, Delta Lake offers a development and consumption pattern very similar to a typical data lake; however, the items listed above bring an enterprise level of capability that makes the lives of data engineers, analysts, and scientists easier.
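
As a quick illustration of the versioning and schema features, here is a minimal sketch, again assuming a SparkSession named spark and using a throwaway table path:

# Create a small Delta table, then append rows that introduce a new column
df1 = spark.createDataFrame([(1, "alice")], ["id", "name"])
df1.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

df2 = spark.createDataFrame([(2, "bob", "US")], ["id", "name", "country"])

# Schema enforcement rejects this append by default; mergeSchema opts into
# schema evolution so the new "country" column is added to the table
(df2.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/demo"))

# Data versioning: time travel back to the first write for audits, rollbacks,
# or reproducing a predictive experiment
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo")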

As an Azure consultant, Databricks Delta is the big data solution I recommend to my clients. To get started developing a data lake solution with Azure Databricks and Databricks Delta, check out the demo provided on my GitHub. We take advantage of traditional cloud storage by using Azure Data Lake Storage Gen2 as the storage layer of our Delta Lake.

Data Pipelines Using Apache Airflow

I previously wrote a blog and demo discussing how and why data engineers should deploy pipelines using containers. One slight disadvantage to deploying data pipeline containers is that managing, monitoring, and scheduling these activities can be a bit of a pain. One of the most popular tools for solving this is Apache Airflow. Apache Airflow is a platform to programmatically develop, schedule, and monitor workflows. Workflows are defined as code, making them easy to maintain, test, deploy, and collaborate on across a team.

At the core of Apache Airflow are workflows represented as Directed Acyclic Graphs (DAGs), which are defined in Python and whose tasks typically run Python code or Bash commands. DAGs are made up of tasks that can be scheduled on a specific cadence and monitored using the built-in Airflow webserver interface.

Generally, I recommend two methods of using Airflow for monitoring and scheduling purposes with containers in Azure.

  1. Deploy pipelines as DAGs
  2. Deploy pipelines as RESTful web services

Developing your data pipelines as DAGs makes it easy to deploy and set a schedule for your jobs. Engineers need to write a data pipeline Python script to extract, transform, or move data, and a second script that imports the data pipeline into a DAG to be run on a specific cadence. An example of this would be the hello world example I have provided. While the development and integration of data pipelines in Azure is easier when created as DAGs, it requires the developer to deploy all their pipelines to the same Azure Container Instance or Kubernetes cluster.
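
A rough sketch of this pattern is below; the pipeline function is a stand-in for a real extract/transform script, and the DAG name and schedule are placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_pipeline():
    # Placeholder for the engineer's extract/transform logic; in practice this
    # would be imported from the data pipeline script
    print("extracting and loading data...")

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "hello_world_pipeline",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",   # run once a day
)

run_task = PythonOperator(
    task_id="run_pipeline",
    python_callable=run_pipeline,
    dag=dag,
)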

Deploying data pipelines as RESTful web services allows developers to decouple scheduling from the data pipeline by deploying a web service separate from your Apache Airflow deployment. Separate deployments simply require a developer to write a DAG that calls your web service on the schedule you wish. This is also a great way to offload the compute and memory required from your Airflow server. The one drawback is that this adds a little more work to handle web service secrets, but once that is handled it is easy to repeat and use across all your data pipelines. An example of this can be found with my Restful deployment example. While the Azure Machine Learning Service is geared toward deploying machine learning models as web services, it can be used to deploy data pipelines as well, allowing the developer to offload the authentication and security required when developing a web service.
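
A minimal sketch of the RESTful pattern follows; the service URL and key are placeholders, and in practice the key would be pulled from a secret store such as Azure Key Vault rather than hard-coded in the DAG.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical endpoint and key for a data pipeline deployed as a web service
PIPELINE_URL = "https://my-pipeline-service.azurewebsites.net/api/run"
API_KEY = "<retrieved-from-a-secret-store>"

def trigger_pipeline():
    # The web service does the heavy lifting; the Airflow worker only makes an HTTP call
    response = requests.post(
        PIPELINE_URL,
        headers={"Authorization": "Bearer " + API_KEY},
        timeout=600,
    )
    response.raise_for_status()

dag = DAG(
    "trigger_restful_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 2 * * *",   # every day at 2 AM
)

PythonOperator(
    task_id="call_pipeline_service",
    python_callable=trigger_pipeline,
    dag=dag,
)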

Overall, I have seen organizations develop home-grown scheduling and monitoring techniques in order to capture all the metadata required to ensure their data pipelines are running properly. Apache Airflow makes this process easy by offering a great built-in user interface to visualize your data pipelines, and provides a database for developers to build additional reporting as needed.

Check out the demo I created walking engineers through the development and deployment of data pipelines in Azure using Apache Airflow!

Data Analytics, Data Engineering, and Containers

Implementing scalable and manageable data solutions in the cloud can be difficult. Organizations need to develop a strategy that not only succeeds technically but fits with their team’s persona. There are a number of Platform as a Service (PaaS) and Software as a Service (SaaS) products that make it easy to connect to, transform, and move data in your network. However, the surplus of tools can make it difficult to figure out which ones to use, and often these tools can only do a fraction of what an engineer can do with a scripting language. Many of the engineers I work with love functional languages when working with data. My preferred data language is Python; however, there can be a barrier when moving from a local desktop to the cloud. When developing data pipelines using a language like Python, I recommend using Docker containers.

Historically, it has not been a simple task to deploy code to different environments and have it run reliably. This issue arises most when a data scientist or data engineer is moving code from local development to a test or production environment. Containers provide their own runtime environment and contain all the required dependencies, so they eliminate environment variability at deployment. Containers make it easy to develop in the same environment as production and eliminate a lot of risk when deploying.

Creating Data Pipeline Containers

My preferred Python distribution is Anaconda because of how easy it is to create and use different virtual environments, allowing me to ensure that there are no Python or dependency conflicts when working on different solutions. Virtual environments are extremely popular with Python developers, therefore the transition to deploying with containers should feel familiar. If you are unfamiliar with Anaconda virtual environments, check out this separate blog post where I talk about best practices and how to use these environments when working with Visual Studio Code.

Data pipelines always start with data extraction. As a best practice, the engineer should land raw data in a data store as quickly as possible. The raw data gives organizations a source of data that is untouched, allowing a developer to reprocess data as needed to solve different business problems. Once in the raw data store, the developer will transform and manipulate data as needed. In Azure, my favorite data store to handle raw, transformed, and business data is the Azure Data Lake Store. In a general data pipeline flow, the transformations can be as complicated as machine learning models or as simple as normalizing the data. In this scenario each intermediate pipe could be a container, or the entire data pipeline could be a single container. At each stage the data may be read from a data source or chained from a previous transform. This flexibility is left up to the developer. Containers make versioning and deploying data applications easy because they allow an engineer to develop how they prefer, and quickly deploy with a few configuration steps and commands.

Most engineers prefer to develop locally on their laptops using notebooks (like Jupyter notebooks) or a code editor (like Visual Studio Code). Therefore, when a new data source is identified, engineers should simply start developing locally using an Anaconda environment and iterate over their solution in order to package it up as a container. If the engineer is using Python to extract data, they will need to track all dependencies in a requirements.txt file, and make note of any special installations (like SQL drivers) required to extract data and write it to a raw data lake store. Once the initial development is completed, the engineer will then need to get their code ready for deployment! This workflow is ideal for small to medium size data sources because the velocity of true big data can often be an issue for batch data extraction, and a streaming data solution (e.g. Apache Spark) is preferred.

Deploying Data Pipeline Containers in Azure

To set the stage, you are a developer and you have written a Python data extraction application using a virtual environment on your machine. Since you started with a fresh Python interpreter and added requirements as you went, you have compiled a list of the installed libraries, drivers, and other dependencies needed to solve the problem. How does a developer get from running the extraction on a local machine to the cloud?

First we will create and run a Docker container locally for testing purposes. Then we will deploy the container to Azure Container Instance, the fastest and simplest way to run a container in Azure. Data extractors that are deployed as containers are usually batch jobs that the developer wants to run on a specific cadence. There are two ways to achieve this CRON scheduling: have the application “sleep” after each data extraction, or have a centralized enterprise scheduler (like Apache Airflow) that kicks off the process as needed. I recommend the latter because it allows for a central location to monitor all data pipeline jobs, and avoids having to redeploy or make code changes if the developer wishes to change the schedule.

Before deploying a Docker container there are a few things the engineer must do to get the solution ready:

  1. Create a requirements.txt file in the solution’s root directory
  2. Create a Dockerfile file in the solution’s root directory
  3. Make sure the data extractor is in an “application” folder off the root directory
  4. Write automated tests using the popular pytest Python package. This is not required, but I would recommend it for automated testing; I do not include it in the walkthrough provided.
  5. Build an image locally
  6. Build and run the container locally for testing
  7. Deploy to Azure Container Instance (or Azure Kubernetes Service)

Here is an example requirements.txt file for the sample application available here:

azure-mgmt-resource==1.2.2
azure-mgmt-datalake-store==0.4.0
azure-datalake-store==0.0.19
configparser==3.5.0
requests==2.20.0
pytest==3.5.1

Here is an example Dockerfile that starts with a Python 3.6 image, copies our application into the working directory, and runs our data extraction. In this case we have a Python script, extract_data.py, in the application folder:

FROM python:3.6

RUN mkdir /src
COPY . /src/
WORKDIR /src
RUN pip install -r requirements.txt
CMD [ "python", "./application/extract_data.py" ]

To build an image locally you will need Docker installed. If you do not have it installed please download it here; otherwise, make sure that Docker is currently running on your machine. Open up a command prompt, navigate to your project's root directory, and run the following commands:

## Build an image from the current directory 
docker build -t my-image-name .
## Run the container using the newly created image
docker run my-image-name

To deploy the container to Azure Container Instance, you first must create an Azure Container Registry and push your image to the registry. Next you will need to deploy that image to Azure Container Instance using the Azure CLI. Note that the Azure CLI can be used to automate these deployments in the future, or an engineer can take advantage of Azure DevOps Build and Release tasks.

Now that you have deployed the container manually to Azure Container Instance, it is important to manage these applications. Oftentimes data extractors run on a schedule, and will therefore likely require external triggers and monitoring. Stay tuned for a future blog on how to manage your data containers!

Conclusion

Developing data solutions using containers is an excellent way to manage, orchestrate, and develop scalable analytics and artificial intelligence applications. The walkthrough takes engineers through the process of creating a weather data source extractor, wrapping it up as a container, and deploying the container both locally and in the cloud.

Anaconda Environments in Visual Studio Code

Over the last several years I have made a transition from developing in R to developing in Python for my data science projects. At first I would use Jupyter Notebooks to develop my solutions; however, I missed RStudio. While I am still a huge fan of notebooks, especially Databricks Notebooks, I primarily use Visual Studio Code (VS Code). The Python distribution of choice for me is Anaconda, and I really love being able to have different conda environments for my different solutions so that I can avoid dependency conflicts and start fresh with new predictive solutions. The conda environments also make it extremely easy to track your dependencies since you are starting fresh with a new Python interpreter.

One thing that I initially found difficult with VS Code is the ability to create a conda environment and use it as my python interpreter. There is plenty of documentation available but it took me a little while to figure it out since I had to piece together a few different sources. As a reference I am using a Windows 10 Microsoft Surface Laptop for my local development, and I have Anaconda 4.5.12 installed.

Development Environment

To set up your development environment please make sure that you have both Anaconda and Visual Studio Code installed on your machine (links above). Once installed, please open an “Anaconda Command Prompt”.

The following command creates a new conda environment with python 3.7.

conda create -n myenv python=3.7

To activate the new environment run:

conda activate myenv

You have now created and activated an Anaconda environment. This means that you have two Python interpreters available on your machine: “base” and “myenv”. To run Python in the “myenv” environment, run the code snippet below. Note: to pip or conda install Python libraries, I will typically just use the Anaconda Command Prompt with my desired environment activated.

python
# the "python" command should enable you to run python code from the command line. 
# Now run the following in order to execute python code

a = 1 + 1
print(a)

Please make note of where your Python environment is located on your computer. For example, mine are located at “C:\Users\<username>\Anaconda3\envs\myenv\python.exe”.

So far we have only worked with the Anaconda Command Prompt, so let's open up VS Code. There are two VS Code extensions that you will want installed: “Python” and “Anaconda Extension Pack”. Please note that they are both developed by Microsoft.

To install an extension in VS Code, navigate to the Extensions view, search for “Python” and “Anaconda Extension Pack”, and install them.

We would like to use this new conda environment as our Python interpreter in VS Code. Use the command “CTRL + Shift + P” to open the command palette, then type “>Python: Select Interpreter”. Select or paste the file path of the Python interpreter we created previously, i.e. the path you noted above.

Create a new python script called python_script.py, and paste the following code:

a = 1 + 1
print(a)

To execute this code highlight it or put your cursor on the line, and press “Shift + Enter”.

You are now using a new Anaconda Environment in VS Code!

From Dev to Prod

One advantage of using Anaconda environments to create solutions is that they allow developers to use various versions of Python and install packages without conflicting dependencies. They also make it easy to track what was installed beyond the base Python installation, making it easier to go from development to production.

One of my favorite ways of deploying Python code is using Docker. In order to deploy a Docker container you will need two files: a requirements.txt file that contains the dependencies you wish to pip install, and a Dockerfile that is executed to create a container image. I will touch on the requirements.txt file here, but to learn more about deploying Python code using Docker check out my blog post showing how to deploy data pipelines using containers.

The requirements.txt file is of the format:

configparser==3.5.0
requests==2.20.0
pytest==3.5.1

The best way to manage this file is to simply track manually what you have installed since creating your virtual environment. However, if you lost track and simply do not know what libraries you need, open an Anaconda Command Prompt, activate your environment, and run pip freeze (or pip freeze > requirements.txt to write the list straight to a file). This will list all the libraries and versions that you have installed in your environment. Note that this includes transitive dependencies, meaning that if you installed a library that installs dependent libraries, those will be listed as well. That is not ideal, but it will be fine.

Overall, Anaconda environments make Python development much easier and allow me to quickly prepare my solutions to be deployed to test and production environments.

Streaming Machine Learning with Azure Databricks

Organizations are beginning to not only benefit from streaming data solutions, but require them to differentiate themselves from their competitors. Real-time reporting, alerts, and predictions are now common requests for businesses of all sizes.

That said, they rarely understand the requirements or implementation details needed to achieve that level of data processing. Streaming data is information that is generated and consumed continuously. This information typically includes many data sources, including log files, point of sale data (in store and online), financial data, and IoT Devices, to name just a few.

Implementation

Fast and Easy

Generating business-changing insights from streaming data can be a difficult process; however, there are quick wins for organizations of all sizes. Microsoft Azure offers an abundance of PaaS or SaaS products that allow users to connect to sources and automate workflows.

With Azure Logic Apps, it is extremely easy to set up data pipelines that extract data from your social media pages, run sentiment analysis on it, and alert users when comments or posts need to be addressed. While this may not be a business-changing solution, it gives companies the ability to have a more intimate level of interaction with customers or users than they had before.

Microsoft has provided a simple solution for companies to take advantage of this capability. Using Azure Logic Apps and Microsoft Cognitive Services, one can be alerted of any positive or negative tweets that occur about their company. This is an easy and cost-effective way to implement intelligence into workflows. (Check out the example available here.) Azure Logic Apps connect to a variety of data sources, enabling organizations to obtain a quick win for real-time reporting with a deceptively simple drag-and-drop interface.

Ideal Implementation

From my experience, companies benefit most from custom machine learning solutions that solve a specific business problem using their own data. Creating solutions tailored to solve a problem in a specific environment allows a business to truly take a proactive approach as they incorporate intelligence throughout their organization. However, lack of knowledge is often a barrier for companies when implementing custom and scalable solutions.

Azure Databricks is an optimized Apache Spark platform perfect for data engineering and artificial intelligence solutions. It is an ideal platform for implementing batch or streaming processes on business critical data, and enables developers to create and deploy predictive analytics (machine learning and deep learning) solutions in an easy to use notebook environment.

Initially, organizations may implement their solutions as batch processes on Azure Databricks to save on cloud consumption costs, but plan for the future by using a platform that will scale and grow with the needs of the business. Batch processes allow users to save on monthly costs by turning off virtual machines when they are not in use; then, when real-time insights are required, the developer can almost flip a switch to move to streaming data. Deploy cost-effective infrastructure now, with the ability to scale as needed in the future.
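
To illustrate the “flip a switch” point, here is a minimal PySpark sketch (paths and columns are hypothetical, assuming a SparkSession named spark): the transformation logic stays the same, and only the read and write entry points change.

# Batch today: read, aggregate, overwrite the summary table
events = spark.read.format("delta").load("/mnt/datalake/raw/events")
daily = events.groupBy("event_date").count()
daily.write.format("delta").mode("overwrite").save("/mnt/datalake/gold/daily_counts")

# Streaming later: the same aggregation, now maintained continuously
stream = spark.readStream.format("delta").load("/mnt/datalake/raw/events")
(stream.groupBy("event_date").count()
       .writeStream
       .format("delta")
       .outputMode("complete")
       .option("checkpointLocation", "/mnt/datalake/checkpoints/daily_counts")
       .start("/mnt/datalake/gold/daily_counts"))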

Below is a common architecture I implement with my customers. If streaming is not required, we simply bypass the event hub and write Python or Scala scripts that connect directly to the data sources.

  1. A number of data sources (devices, applications, databases etc.) that publish information to an Azure Event Hub (or Apache Kafka).
    1. Please note that whatever the data source is, there will always need to be some sort of process or application that collects data and sends it to the Event Hub.
  2. Azure Databricks will write the stream of data as quickly as possible to an unaltered, “raw”, data storage in an Azure Data Lake Store or Azure Blob Storage.
  3. In addition to writing to raw storage, Databricks will be used to cleanse data as needed and stream it to an application database, Power BI, or a Databricks Delta table for real-time insights, consumption, and intelligent automated actions. Please note that applications can read directly off an Event Hub as a consumer as well.
  4. Then use Azure Databricks to train a machine learning or deep learning model that can be used to make streaming or batch predictions.
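
As a sketch of steps 1 and 2 above, the following Structured Streaming snippet reads from the Event Hubs Kafka-compatible endpoint and lands the unaltered stream as a Delta table. The namespace, event hub name, and paths are placeholders, and the connection string would normally come from a secret scope rather than being pasted into a notebook.

# Build the SASL config from the Event Hub connection string (Databricks ships a
# shaded Kafka client, hence the "kafkashaded" prefix)
EH_SASL = (
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="$ConnectionString" password="<event-hub-connection-string>";'
)

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "telemetry")                  # event hub name
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config", EH_SASL)
       .load()
       .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

# Step 2: write the stream, unaltered, to "raw" storage as quickly as possible
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/raw_events")
    .start("/mnt/datalake/raw/events"))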

Tips to Actually Implement a Solution

When implementing new intelligent solutions with cloud infrastructure, it is likely that it will require internal business stakeholder buy in. Therefore, in order to successfully implement a new predictive analytics solution you must:

  1.  Identify a business problem to solve and the stakeholders
  2.  Visualize or surface results to “wow” stakeholders
  3.  Start developing iteratively

If a team attempts to solve too many problems initially by trying to answer all possible questions, they will likely fail to “wow” a business user. Developers will likely spend so much time coding and analyzing the best path forward that they will only have code to show (code is a rather boring deliverable for most business users), and may simply never get past the proof of concept or analysis phase.

Business Problem

It is common for companies to simply start creating a solution to work with newer technology without a true business problem they are trying to solve. It happens most often for organizations who want to start a data lake strategy. Their main goal is to develop a data lake so that other business units can take advantage of the sandbox environment for predictive analytics.

I believe a centralized data lake is a great idea for any IT group. However, without a specific business problem, it is difficult to see the true value that a data lake or machine learning solution provides, which in turn can slow adoption. By focusing on solving a single use case first, there will be a reference for other business units on why they should use the enterprise data lake, and the reason for adoption becomes much more tangible.

Wow Stakeholders

There is not a more boring outcome to a business stakeholder than a project resulting in code. Machine learning or deep learning projects must have some type of end product that accurately describes the effectiveness of the solution created. In most machine learning solutions that I implement, I will almost always provide a Power BI Report. This ensures that the model and predictions are tangible because they are shown through visualizations. The business user now has the ability to actually use the predictions and show other internal users the solution.

Iterative Development

The most frustrating part of projects can be the initial planning or analysis phase. Large enterprises will often start a project and get stuck in analysis paralysis. I encourage teams I work with to simply start coding! This does not mean to do zero planning or proof of concepts, but at some point a team has to pick a direction and run with it. Avoid over analyzing various products by picking a small subset of well-known products, analyze them, and go!

Benefits

Streaming data architecture is beneficial in most scenarios where dynamic data is generated on a continual basis. Any industry can benefit from data that’s available almost instantly from the time it was created. Most organizations will begin with simple solutions to collect log data, detect outliers based on set (unintelligent) rules, or provide real-time reporting.

However, these solutions evolve, becoming more sophisticated data processing pipelines that can learn and detect outlier data points as they occur. The true advantage of streaming data is in performing advanced tasks, like machine learning, to take preventive or proactive action.

Processing a data stream effectively generates quick insights, but it does not replace batch processes. Typically, organizations implement both solutions to gain both quick insights and more computationally intensive ones. Streaming data reacts to or anticipates events, while batch processing derives additional insights after the fact.

Batch processing can often require more compute. It’s ideal when time or speed is not a priority. One of the biggest advantages of Azure Databricks is that companies are able to use the same infrastructure for both workflows!

Batch processing data requires a system to allow data to build up so that it can be processed all at once. This often requires larger compute resources than streaming due to the size of data, which can be a hurdle for most organizations; however, it allows users to aggregate and analyze large amounts of data over a longer period of time. Streaming solutions do less computing, but require machines to be running 100% of the time and typically look at data over a shorter period of time.

Example

I recently created a simple walkthrough of how to implement a streaming data solution on Azure. Check out the walkthrough on GitHub. Please note that an Azure subscription is required.

Conclusion

Organizations of any size can benefit from a streaming solution using Databricks and Azure Data Lake Store. It enables near real-time reporting, as well as provides a sandbox environment for iterative development of intelligent solutions. Azure Databricks and Data Lake Store allow a developer to implement both batch and streaming solutions in a familiar and easy-to-use environment.