November 26, 2019

Enhancing Digital Twins Part 2: Setting up Azure Databricks for Predictive Maintenance

By

Theta

In this post, we will elaborate upon Azure Databricks, and explain how we set it up to prepare for Predictive Maintenance of our digital twin’s corresponding real-life assets.

This is the second post in a 4-part series about displaying Predictive Maintenance insights in Digital Twins.

In part 1 of this series, we introduced the concept of Predictive Maintenance in Digital Twins and discussed the tools we used to make predictions.

Read Part 1: Harnessing data and Predictive Maintenance to enhance Digital Twins.

What is Azure Databricks?

Azure Databricks is an Apache Spark-based analytics service available as part of the Microsoft Azure platform. It was created in collaboration between Microsoft and Databricks, the company founded by the creators of Apache Spark, and aims to ease collaboration between business analysts, data scientists and data engineers on an integrated Azure platform. Like Apache Spark, Azure Databricks is capable of machine learning, graph processing and fast processing of big data thanks to its distributed computing framework. That makes it well suited to conducting Predictive Maintenance on our warehouse equipment assets in the cloud.

Data source

We need historical data to make warehouse equipment asset failure predictions. We could have taken the time to record IoT readings from warehouse equipment assets, but for the purposes of this trial we've adapted a Predictive Maintenance dataset we had on hand to represent that historical data.

The original dataset contained the columns 'team' and 'provider', which we replaced with 'branch' and 'machine' respectively, in keeping with the warehouse asset scenario.

We also changed every mention of 'team' to 'branch' in each row, added an id column and assigned a different machine to each of the old 'provider' values, 1 through 4:

  • Provider1: Forklift
  • Provider2: Cooling Unit
  • Provider3: Aircon Unit
  • Provider4: L.B.Door Motor
modified dataset predictive maintenance

You can download our modified dataset.
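If you are curious what that remapping looks like in code, here is a minimal R sketch of the changes described above. It is illustrative rather than our exact script: the file names are placeholders, and it assumes the original 'provider' column holds the literal values Provider1 to Provider4.

# Sketch of the dataset remapping described above (file names are placeholders)
raw <- read.csv("maintenance_original.csv", stringsAsFactors = FALSE)

# Map the old 'provider' values onto warehouse machines
machine_lookup <- c(Provider1 = "Forklift",
                    Provider2 = "Cooling Unit",
                    Provider3 = "Aircon Unit",
                    Provider4 = "L.B.Door Motor")

modified <- data.frame(
  id      = seq_len(nrow(raw)),                                   # new id column
  branch  = gsub("team", "branch", raw$team, ignore.case = TRUE), # 'team' values become 'branch'
  machine = unname(machine_lookup[raw$provider]),                 # 'provider' becomes 'machine'
  raw[, !(names(raw) %in% c("team", "provider")), drop = FALSE],  # keep the remaining columns
  stringsAsFactors = FALSE
)

write.csv(modified, "maintenancedata.csv", row.names = FALSE)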

Although our Predictive Maintenance dataset isn't big enough to require Databricks, we thought it would be worthwhile investigating Databricks anyway, with future applications in mind.

Azure Databricks set up – and troubleshooting errors

To get started with Azure Databricks, you first have to create an Azure Databricks resource through the Azure portal. When we first attempted to create one, however, we got the following error:

"This subscription is not registered to use the namespace ‘Microsoft.Databricks’. See https://aka.ms/rps-not-found for how to register subscriptions. (Code: MissingSubscriptionRegistration)"

This error indicated we needed to register Microsoft.Databricks under 'Resource providers' for our Azure subscription; once we had done so, the resource was created successfully.
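If you prefer the command line, the provider registration can also be done with the Azure CLI. These are standard Azure CLI commands rather than anything specific to our setup:

# Register the Databricks resource provider for the current subscription
az provider register --namespace Microsoft.Databricks

# Check the registration status (should eventually report "Registered")
az provider show --namespace Microsoft.Databricks --query registrationState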

If you run into other problems, we found this resource useful in detailing solutions for common issues.

Once created, you should be able to launch the Databricks web portal from the 'Overview' tab of the Azure Databricks workspace in the Azure portal.

databricks workspace predictive maintenance

You will be prompted to sign in using Azure Active Directory. Once you are signed in, you should be greeted by this:

azure databricks web portal

Azure Databricks cluster set up

Before we create a notebook, we first have to create a cluster from the 'Clusters' tab in the Databricks web portal. A cluster provides the compute that hosts and runs everything Databricks does, from notebooks to jobs.

azure databricks cluster predictive maintenance

Here is how we chose to configure ours:

configuring the cluster predictive maintenance

Of the two cluster modes on offer, 'Standard' was ideal for us: it is intended for a small number of users and supports Python and R, among other languages. 'High Concurrency' was more than we needed, as we didn't have to accommodate multiple concurrent users. For a similar reason, we chose not to auto-scale the number of workers based on workload.

We also chose to halve the amount of time to wait before terminating an idle cluster, as leaving a cluster idle but running can be costly.

The other configurations were general suggestions by Databricks.

So far we've walked you through creating a cluster in the UI view, but you can also switch to the JSON view at the top right and paste in the configuration below to match ours.

{
    "num_workers": 1,
    "cluster_name": "test_pdm",
    "spark_version": "5.5.x-scala2.11",
    "spark_conf": {},
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 60,
    "init_scripts": []
}

Once you’ve configured the cluster, you can create it. You will notice the cluster is initially created as an ‘Interactive’ cluster.
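As an aside, if you have the Databricks CLI installed and configured against your workspace, the same JSON, saved to a local file (the file name below is just an example), can be used to create the cluster from the command line:

# Create a cluster from the JSON configuration shown above
databricks clusters create --json-file cluster_config.json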

interactive clusters predictive maintenance

Please note that the cluster will start itself up on creation, and it's not uncommon for this to take some time. The 'State' column will indicate whether it's 'Running' or 'Pending'. We will keep ours running for the next few steps.

State Pending vs State Running

Clusters come in two types, interactive and automated. Automated clusters are created to run scheduled jobs and can be triggered periodically without manual intervention, while interactive clusters are better suited to collaborative, ad-hoc analysis. As we didn't require constant re-analysis of our data, we chose to keep ours as an interactive cluster. You can move to an automated cluster through the 'Jobs' tab, where jobs are assigned to an 'Automated' cluster. A job is simply a scheduled run of a Databricks notebook or JAR (Java Archive file).

Jobs Tab
How to Reconfigure Cluster to Automated
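For context, if you did want a notebook re-run on a schedule, a job definition submitted through the Jobs API or the Databricks CLI ('databricks jobs create --json-file job.json') looks roughly like the JSON below. The notebook path, cron expression and timezone are placeholders, not settings we actually used:

{
    "name": "pdm_scheduled_run",
    "new_cluster": {
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1
    },
    "notebook_task": {
        "notebook_path": "/Shared/predictive_maintenance"
    },
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "Pacific/Auckland"
    }
}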

Azure Databricks notebook set up

You will also have to upload a dataset to analyze within your notebook. This is done from the 'Data' tab in the Databricks web portal. The dataset has to be associated with the cluster you are currently running; you will not be able to add a new dataset without a running cluster. First, navigate to 'Add Data'.

azure databricks notebook

If your dataset is less than 5TB, either upload a file using 'Upload File' or create your own table within Databricks using 'Other Data Sources' or 'DBFS'. If your dataset exceeds 5TB, you will have to host it in a distributed file store such as Hadoop's HDFS. We chose to upload our Predictive Maintenance dataset 'maintenancedata.csv', and it was saved to '/FileStore/tables/maintenancedata.csv'. Keep a note of where your dataset is saved, as we will use its path to read from it later.

create new table predictive maintenance
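For reference, once a notebook is attached to the running cluster (we create ours in the next step), reading the uploaded file back is a one-liner. Here is a minimal SparkR sketch, assuming the default upload location shown above:

library(SparkR)

# Read the uploaded CSV from DBFS into a Spark DataFrame
maintenance <- read.df("/FileStore/tables/maintenancedata.csv",
                       source = "csv", header = "true", inferSchema = "true")

# Plain R can also reach the same file through the /dbfs mount:
# maintenance_local <- read.csv("/dbfs/FileStore/tables/maintenancedata.csv")

display(maintenance)  # Databricks helper that renders the DataFrame as a table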

You also have the option of further editing your dataset in the Databricks’ UI or notebook. We chose not to.

Finally, you can create your notebook through the 'Workspace' tab. You can either create a globally accessible notebook or one under a 'Users' scope, which will require you to set 'Permissions' to grant access to others. We created a global notebook and set the primary language to R, as our analyses will be done in R. Setting a primary language won't stop you from using other languages within your notebook, so don't be too concerned about which one you choose.

create new notebook predictive maintenance

The cluster you created must be running before the notebook can execute anything. We recommend terminating clusters whenever you don't need them.
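If you have the Databricks CLI set up, an idle interactive cluster can also be terminated from the command line; the cluster ID below is a placeholder for your own:

# Terminate (but do not permanently delete) a cluster
databricks clusters delete --cluster-id <your-cluster-id>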

With your notebook all set up, you are ready to start analysing your Predictive Maintenance dataset in Azure Databricks.

We found setting up an Azure Databricks resource straightforward, although navigating the Azure Databricks web portal takes a little getting used to. We suspect this becomes less of an issue as you grow more familiar with the UI.

In our next post, we will detail at length the R analyses we conducted on our dataset within Databricks to produce a Predictive Maintenance report we can use in our digital twin.