Automated Machine Learning with AutoKeras

In hopes of adding to my AutoML blog series, I thought it would be great to touch on automated deep learning libraries as well. When I first started playing around with neural networks, I turned to the popular deep learning library Keras for my development. Therefore, I decided that my first automated deep learning library should be AutoKeras!

Deep learning is a subset of machine learning focused on developing predictive models using neural networks; it allows humans to create solutions for object detection, image classification, speech recognition, and more. One of the most popular deep learning libraries available is Keras, a high-level API that runs on top of TensorFlow, CNTK, and Theano. The main goal of Keras is to enable developers to quickly iterate and develop neural networks across multiple frameworks.
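To give a feel for the quick-iteration workflow Keras enables, here is a minimal sketch of a small dense classifier for 28x28 grayscale images. The layer sizes, optimizer, and loss are illustrative choices of mine, not taken from any particular demo:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny fully connected classifier for 28x28 grayscale images.
# Layer sizes and hyper-parameters here are arbitrary, illustrative picks.
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

A few lines give you a compiled, trainable network, which is exactly the fast iteration loop Keras is built for.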

Over the last year or so, there has been a lot of development in the AutoML space, which is why I have been writing so many blogs showing off different libraries. AutoML libraries have mostly focused on traditional machine learning algorithms. Therefore, to take the Keras vision to the next level and increase the speed at which we can create neural networks, the Keras team has been developing the AutoKeras library, which aims to automatically learn the best architecture and hyperparameters of a neural network for your specific need.

Since the library is still in pre-release, there are not a ton of resources available when you start building a model with AutoKeras. Most of the examples show off the MNIST dataset, which is built into the Keras library. So while I do show a quick MNIST example in the demo, I also provide one with a custom image dataset that requires the developer to load images as numpy arrays prior to using them as input for model training.

The demo I have created walks users through the process of:

  • Curating your own image dataset
    • Note that we will be using the FastAI library, which is my favorite deep learning library and runs on top of PyTorch.
    • You can also use the data.zip file available in the GitHub repository.
  • Training a model with Keras
  • Training a model on the MNIST dataset using AutoKeras
  • Training a model with downloaded images using AutoKeras

Overall, the AutoKeras library is rough. It does not work quite the way Keras does, which threw me off, and a lot of the built-in functions that make Keras great are not available. I would not recommend using AutoKeras for any real neural network development yet, but the overall idea of using AutoML with Keras intrigues me greatly. I would recommend monitoring the library as development continues, as it could improve dramatically in the near future. Check out the demo I have provided on GitHub. Please note that I developed the demo on a Linux virtual machine, and that setup steps will vary across environments. Additionally, GPU support will enable faster training times.

Quick Review: Databricks Delta

As the number of data sources grows and the size of that data increases, organizations have moved to building out data lakes in the cloud in order to provide scalable data engineering workflows and predictive analytics to support business solutions. I have worked with several companies to build out these structured data lakes and the solutions that sit on top of them. While data lakes provide a level of scalability, ease of access, and the ability to quickly iterate over solutions, they have always fallen a little short on the structure and reliability that traditional data warehouses have provided.

Historically, I have recommended that customers apply structure, not rules, to their data lake so that the aggregation and transformation of data is easier for engineers to serve to customers. The recommended structure was usually similar to a lambda architecture, as not all organizations have streaming data, but they would build out their data lake knowing this was a possibility in the future. The flow of data generally followed the process described below:

  • Batch and streaming data sources are aggregated into raw data tables with little to no transformation applied, e.g. streaming log data from a web application or batch loads of application database deltas.
  • Batch and streaming jobs clean, transform, and save the raw data to staging tables, executing the minimum number of transforms on a single data source, e.g. tabularizing a JSON file and saving it as a Parquet file without joining any other data, or aggregating granular data.
  • Finally, we aggregate data, join sources, and apply business logic to create our summary tables, i.e. the tables data analysts, data scientists, and engineers ingest for their solutions.
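The three-stage flow above can be sketched end to end. In practice this would run on Spark, but pandas keeps the example self-contained; the records, columns, and metric are invented purely for illustration:

```python
import json

import pandas as pd

# Raw layer: landed records, untouched (e.g. JSON log lines from a web app).
raw_records = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/home", "ms": 340}',
    '{"user": "a", "page": "/cart", "ms": 200}',
]

# Staging layer: tabularize the JSON with minimal, single-source transforms
# (no joins against other sources yet).
staging = pd.DataFrame(json.loads(r) for r in raw_records)

# Summary layer: apply business logic -- here, average latency per page --
# producing the kind of table analysts would query directly.
summary = staging.groupby("page", as_index=False)["ms"].mean()
print(summary)
```

Each layer is persisted as its own table in the lake, so downstream consumers never have to re-parse the raw data.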

One key to the summary tables is that they are business driven, meaning we create these tables to solve specific problems and to be queried on a regular basis. Additionally, I recently took a Databricks course, and instead of the terms raw, staging, and summary, they used bronze, silver, and gold tables respectively. I now prefer the Databricks terminology over my own.

Delta Lake is an open source project, developed mostly by Databricks, designed to make big data solutions easier. Data lakes have always worked well; however, since Delta Lake came onto the scene, organizations have been able to take advantage of additional features when updating or creating their data lakes.

  • ACID Transactions: Serializable transactions to ensure data integrity.
  • Data Versioning: Delta Lake provides data snapshots allowing developers to access and revert earlier versions of data for audits, rollbacks, and reproducing predictive experiments.
  • Open Format: Data is stored in Parquet format, making it easy to convert existing data lakes into Delta Lakes.
  • Unified Batch and Streaming: Combine streaming and batch data sources into a single location, and Delta tables can act as a streaming source as well.
  • Schema Enforcement: Provide and enforce a schema as needed to ensure correct data types and columns.
  • Schema Evolution: Easily change the schema of your data as it evolves over time.

Generally, Delta Lake offers a development and consumption pattern very similar to a typical data lake; however, the features listed above bring an enterprise level of capability that makes the lives of data engineers, analysts, and scientists easier.

As an Azure consultant, Databricks Delta is the big data solution I recommend to my clients. To get started developing a data lake solution with Azure Databricks and Databricks Delta, check out the demo provided on my GitHub. We take advantage of traditional cloud storage by using an Azure Data Lake Storage Gen2 account as the storage layer for our Delta Lake.

Data Pipelines Using Apache Airflow

I previously wrote a blog and demo discussing how and why data engineers should deploy pipelines using containers. One slight disadvantage of deploying data pipeline containers is that managing, monitoring, and scheduling these activities can be a bit of a pain. One of the most popular tools for solving this is Apache Airflow, a platform to programmatically develop, schedule, and monitor workflows. Workflows are defined as code, making them easy to maintain, test, deploy, and collaborate on across a team.

At the core of Apache Airflow are workflows represented as Directed Acyclic Graphs (DAGs), which are written in Python and whose tasks commonly run Python or Bash commands. DAGs are made up of tasks that can be scheduled on a specific cadence and monitored using the built-in Airflow webserver interface.

Generally, I recommend two methods of using Airflow for monitoring and scheduling purposes with containers in Azure.

  1. Deploy data pipelines as DAGs
  2. Deploy data pipelines as RESTful web services

Developing your data pipelines as DAGs makes it easy to deploy and set a schedule for your jobs. Engineers write a Python script to extract, transform, or move data, and a second script that imports the data pipeline into a DAG to be run on a specific cadence. An example of this is the hello world example I have provided. While the development and integration of data pipelines in Azure is easier when they are created as DAGs, this approach requires the developer to deploy all of their pipelines to the same Azure Container Instance or Kubernetes cluster.

Deploying data pipelines as RESTful web services allows developers to decouple scheduling from the data pipeline by deploying a web service separate from your Apache Airflow deployment. Separate deployments simply require a developer to write a DAG that calls the web service on the schedule you wish. This is a great way to offload the compute and memory required from your Airflow server as well. The one drawback is that this adds a little more work to handle web service secrets, but once that is handled it is easy to repeat across all your data pipelines. An example of this can be found in my RESTful deployment example. While the Azure Machine Learning service is geared toward deploying machine learning models as web services, it can be used to deploy data pipelines as well, allowing the developer to offload the authentication and security work required when developing a web service.
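The DAG-side call to such a web service can be as small as one function. The endpoint URL, token, and payload below are hypothetical; an Azure Machine Learning deployment would supply its own scoring URI and key:

```python
import requests


def trigger_pipeline(url: str, token: str, payload: dict) -> int:
    """POST to a deployed data-pipeline web service and return the HTTP status.

    The url, token, and payload are placeholders -- substitute the scoring
    URI and key of whatever web service hosts your pipeline.
    """
    response = requests.post(
        url,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    response.raise_for_status()
    return response.status_code
```

Inside Airflow this function would be wrapped in a task, so the schedule lives in the DAG while the compute and secrets live behind the web service.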

Overall, I have seen organizations develop home-grown scheduling and monitoring techniques in order to capture all the metadata required to ensure their data pipelines are running properly. Apache Airflow makes this process easy by offering a great built-in user interface to visualize your data pipelines, and it provides a database that developers can use to build additional reporting as needed.

Check out the demo I created walking engineers through the development and deployment of data pipelines in Azure using Apache Airflow!