Power BI for the Enterprise

All data projects come down to consumption. How do you get your historical, predictive, and prescriptive analytic solutions into the hands of your users? I have worked on a wide variety of projects where we have infused intelligence into applications, automated systems, and reporting. Many organizations require an internal analytics strategy centered around reporting; therefore, I would like this blog to address the number one reason why enterprise reporting rollouts fail.

Report creators fail to provide a consumption layer that fits the intended use of the report. As an Azure consultant, I will focus on Power BI rollouts and why it is important to understand the four types of Power BI users: The Developer, The Power User, The Data Query, and The Quick Answer. Please note that these types are not exclusive; a single individual can fall into any number of these categories.

The Power BI Developer

This individual creates, manages, and provides knowledge transfer on the report. The developer loves drilling into and cross-filtering the report to find new information; they know how to push Power BI to its limits and will include as much functionality in the report as possible by default. However, this individual is not necessarily the intended business owner or end user of the report.

A Power BI developer knows the product extremely well, and their main responsibility is to create and manage reports that support the organization's business users. These employees are not easy to find, as the analytical skillset is uncommon in the marketplace, making their time valuable. Therefore, the report developer must understand the type of end user they are delivering the report to. There is nothing more frustrating than a developer building a report that is too complex for its audience, who then discard it after a few uses. End users discarding reports due to complexity is the biggest blocker to implementing an organizational analytics solution, and it is one the Power BI developer can avoid.

The Power User

A Power User is someone who uses a report to make strategic decisions for the organization. They understand the product well enough to create a few simple reports if the data model is provided, and they are able to use the cross-filter and drill-down capabilities to answer new questions and discover insights.

From experience, these users are highly desired in an organization but rarely found. It is difficult to find an individual who knows Power BI well enough to use all of its capabilities but is not a Power BI Developer. Therefore, most of the people who fall into this category are the ones who are creating the reports as well.

As a developer, if you have a Power User consuming the report, include as many dynamic visualizations and capabilities as possible. The Power User loves finding insights and will spend a great deal of time understanding the data you provide.

The Data Query

The most common user of Power BI is the Excel user who says they want to learn Power BI but doesn't put in the effort to understand it. Instead, they use Power BI as a query interface to export data into Excel for their own analysis. This is extremely common and is a perfectly good way to utilize Power BI. Organizations typically shake their heads at an individual who uses Power BI as a data acquisition tool, but I believe that getting data into the hands of users is the number one goal of an analytics strategy, and this is a great way to provide specific data to users.

As a developer, if you have an individual simply querying for data then you should focus on providing simple data visualizations and lots of data tables. The visualizations will give them a quick look at trends but the tables will provide them all the information they need to complete their analysis.

The Quick Answer

Another common use of Power BI is getting quick, high-level answers about a dataset. These individuals want to spend as little time as possible getting the information they need so that they can make intelligent decisions.

As a developer, you will need to know the exact questions this individual wants answered and create simple visuals that answer those questions. The visuals can be dynamic like bar charts and maps, but typically summary numbers are sufficient. These reports are typically provided in a dashboard using the Power BI Service.

Conclusion

Understanding your business users' capabilities and data consumption needs determines how successful your analytics deployment will be. All the users described above are present in every organization and are crucial to day-to-day business. Creating consumable data interfaces rests on the developer, so understand what people need, and good luck!

Automated Machine Learning with MLBox

In continuation of my AutoML Blog Series, we will be evaluating the capabilities of MLBox.

What is MLBox?

MLBox is an extremely popular and powerful automated machine learning python library. As noted by the MLBox Documentation, it provides features for:

  • Fast reading of data
  • Distributed data processing
  • Robust feature selection
  • Accurate hyper-parameter tuning
  • State-of-the-art machine learning and deep learning models
  • Model interpretation

MLBox is similar to other automated machine learning libraries in that it does not automate the entire data science process, but augments a developer's ability to quickly create machine learning models. MLBox helps developers select the best features and build the optimal model to make predictions for the label of your choice.

One drawback of the MLBox library is that it doesn't necessarily conform to a data scientist's process; rather, the data scientist has to work the way the library expects. For example, I will often use three datasets when developing machine learning solutions in an attempt to avoid overfitting: train, validation, and test. Maintaining these three datasets is rather difficult to do with MLBox.

Let's get started using the MLBox library!

Developing with MLBox

Installing MLBox

For this demo we will be using Anaconda virtual environments, and I will be using Visual Studio Code as my IDE. For more information on how to use the Anaconda distribution with Visual Studio Code, check out this blog I wrote. I normally develop on a Windows machine, but MLBox's Windows support is currently experimental.

Therefore, we will need a Linux machine to do our MLBox development; if you do not have one available, you can create one by following these instructions.

  1. First let’s create a new Anaconda Environment.
    conda create -n MLBoxEnv python=3.6
    conda activate MLBoxEnv
  2. Next we will run the following installs.
    pip install setuptools
    pip install mlbox

Training a Model

As with the other AutoML libraries, we will be using the Titanic dataset, where we will use specific features to predict whether or not a passenger survived the catastrophe. For more information about the dataset, check out the Kaggle Competition.

  1. Please download the data from the GitHub repository here. Save the file to a data folder in your application directory. Please note that the application directory I will be using is the TPOT directory in my repository.

  2. Now that we have our data and MLBox installed, let's read our datasets into memory and start preparing them for a machine learning algorithm. MLBox has its own Reader class for efficient and distributed reading of data; one key feature of this class is that it expects a list of file paths to your training and test datasets. Interacting with my datasets was slightly foreign at first, but once I learned that the Reader class creates a dictionary object containing pandas DataFrames and our target (label) column as a pandas Series, it became easier to work with.

    from mlbox.preprocessing import *
    from mlbox.optimisation import *
    from mlbox.prediction import *
    train_path = ["./data/titanic_train.csv", "./data/titanic_test.csv"]
    reader = Reader(sep=",", header=0)
    data = reader.train_test_split(train_path, 'Survived')

    There are a few things worth noting about the train_test_split function. A dataset is only considered a test set if no label column is present; otherwise, it will be merged into the train set. Being able to provide a list of file paths is a nice feature because it allows developers to easily ingest many files at once, which is common with bigger datasets and data lakes. Since the function automatically scans for the target column, there is little work for the developer to do to even identify a test dataset. Additionally, it determines whether the problem is regression or classification based on our label and will automatically encode the column as needed.
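To make that convention concrete, here is a minimal sketch in plain pandas (not MLBox code; the function name and dictionary keys are illustrative) of how frames can be routed to train or test based on whether the label column is present:

```python
import pandas as pd

def split_by_target(frames, target):
    """Merge frames that contain `target` into a train set; the rest form the test set."""
    train_parts, test_parts = [], []
    for df in frames:
        (train_parts if target in df.columns else test_parts).append(df)
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(test_parts, ignore_index=True) if test_parts else pd.DataFrame()
    # Mimic the dictionary shape described above: DataFrames plus a target Series.
    return {"train": train.drop(columns=[target]),
            "test": test,
            "target": train[target]}

frames = [pd.DataFrame({"Age": [22, 38], "Survived": [0, 1]}),
          pd.DataFrame({"Age": [26]})]  # no 'Survived' column -> treated as test
data = split_by_target(frames, "Survived")
```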

  3. One really nice feature of MLBox is the ability to automatically remove drift variables. I am by no means an expert at explaining drift, but the idea is that the process or observed behavior may change over time. In turn, the data slowly changes, and with it the relationships between the features. MLBox has built-in functionality to deal with this drift; we will use a drift transform.

    data = Drift_thresholder().fit_transform(data)
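The underlying idea can be sketched with plain scikit-learn: label each row by whether it came from the train or test set, and see how well a classifier can tell the two apart. An AUC near 0.5 means the sets are indistinguishable (no drift); an AUC well above 0.5 flags drifted features. This is only an illustration of the concept, not MLBox's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
train_X = rng.normal(0.0, 1.0, size=(200, 3))  # original distribution
test_X = rng.normal(0.5, 1.0, size=(200, 3))   # shifted distribution (drift)

# Label rows by their origin and measure how separable the two sets are.
X = np.vstack([train_X, test_X])
y = np.array([0] * 200 + [1] * 200)
auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                      cv=3, scoring="roc_auc").mean()
```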

  4. As with all automated machine learning libraries, the key feature is not necessarily the algorithms themselves but the ability to select the appropriate features and optimal hyper-parameters for them. Using MLBox's Optimiser class, we are able to define a search space and figure out the best set of parameters. Therefore, to optimize we must create a parameter space and select the scoring metric we wish to optimize.

    opt = Optimiser(scoring='accuracy', n_folds=3)
    opt.evaluate(None, data)

    space = {
        'ne__numerical_strategy': {"search": "choice", "space": [0]},
        'ce__strategy': {"search": "choice",
                         "space": ["label_encoding", "random_projection", "entity_embedding"]},
        'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
        'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
    }
    best_params = opt.optimise(space, data, 10)
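Each key in the space is prefixed by the pipeline step it tunes (as I read the docs: `ne__` for missing-value handling, `ce__` for categorical encoding, `fs__` for feature selection, `est__` for the estimator), and each value declares a search type with its candidate space. As a rough illustration (using the stdlib `random` module, not MLBox's optimiser), drawing one candidate configuration from such a space looks like:

```python
import random

def sample_params(space, seed=None):
    """Draw one candidate configuration from a {'search', 'space'} style dictionary."""
    rnd = random.Random(seed)
    params = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            params[name] = rnd.choice(spec["space"])   # pick one discrete option
        elif spec["search"] == "uniform":
            low, high = spec["space"]
            params[name] = rnd.uniform(low, high)      # sample a continuous value
    return params

space = {
    'ce__strategy': {"search": "choice", "space": ["label_encoding", "entity_embedding"]},
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]},
}
candidate = sample_params(space, seed=42)
```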

  5. Next we can use the Predictor class to train a machine learning model.

    model = Predictor().fit_predict(best_params, data)

    The line of code above will create a folder called save and export an sklearn pipeline that you can reuse for model deployment or further validation. Additionally, it exports feature importances, a CSV of test predictions, and a target encoder object so that you can map the encoded values back to their original values.
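Reloading an exported pipeline follows the standard joblib pattern. The pipeline below is a hand-built stand-in, and the file name is hypothetical; it only illustrates the save/load round trip you would use with the real exported artifact:

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for an exported pipeline (not MLBox's actual artifact).
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

path = os.path.join(tempfile.gettempdir(), "saved_pipeline.joblib")
joblib.dump(pipeline, path)       # what an export step produces
restored = joblib.load(path)      # reload for deployment or further validation
pred = restored.predict([[2.5]])
```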

For more information on MLBox, please check out their GitHub repository or the official documentation page. MLBox is a great library to assist data scientists in building a machine learning solution. For a full copy of the demo Python file, please refer to my personal GitHub.

Linux Development from a Windows Guy

As a data scientist and data engineer, I get a lot of comments from peers on the fact that I prefer to develop on a Windows machine compared to Mac or Linux. I have always really liked my Windows machines, and for the longest time I stuck with them even when specific machine learning libraries weren't supported on Windows. However, about a year ago I finally gave in and started using a Linux distribution for about half of my data science work because it simply became too difficult to avoid unsupported libraries. Additionally, I was deploying a lot of Python code using Docker, which ended up running on a Linux distribution.

While it was time to start using Linux, I wanted to keep using Windows for my day-to-day work, so I decided to create a Hyper-V VM. To be completely honest, there are a ton of resources on the internet that walk you through setting up a Linux Hyper-V VM on Windows (and probably better than this one), but I am writing a demo of a popular AutoML library, MLBox, which is not yet supported on Windows, so this will serve as the first step of that demo.

Creating a Linux VM

  1. My favorite way to develop on Linux is to create a Hyper-V VM on my local desktop. To enable Hyper-V on your Windows 10 machine, search for “Turn Windows features on or off” in your start menu.
  2. Now scroll down to find “Hyper-V” and check the box next to it to enable.
  3. Now that Hyper-V is enabled, we can create a virtual machine on our computer. First we will need to download a Linux distribution; I prefer Ubuntu. Note that this is a large download (~2 GB), so it can take some time depending on your network speed.

    Once you have the `.iso` file, we can create a virtual machine. In your start menu, search for “Hyper-V Manager”.

  4. In the Hyper-V Manager, navigate to “New” > “Virtual Machine…”. This will launch the setup wizard.
  5. For the most part, the wizard defaults will be acceptable. The first menu will have you provide the name of your virtual machine.
  6. The second menu will have you select the generation of the VM. We will want to use “Generation 1”.
  7. Third, we will need to allocate memory for the machine. The default of 1024 MB is fine, and we will also check the box to use “Dynamic Memory”.
  8. Next we will need to configure network access for the virtual machine. Simply select “Default Switch”.
  9. Next we have the option to specify where we want to store our virtual machine hard disk. It is easiest to simply use the default locations. Note that the name of the hard disk defaults to the name you gave your VM earlier.
  10. Now we simply need to select our Ubuntu `.iso` file we downloaded, and click Finish.
  11. This will launch the Ubuntu setup menu; simply follow the instructions to set up the virtual machine with your username and password. Now you have an Ubuntu machine to develop your data science solutions on! I would recommend downloading and installing the Anaconda distribution of Python.