November 27, 2019

Enhancing Digital Twins Part 3: Predictive Maintenance with Azure Databricks



In this post, we detail the R analyses we conducted on our Predictive Maintenance dataset from part 2, to help our digital twin projects make more informed maintenance suggestions.

Part 3 of a 4-part series about displaying Predictive Maintenance insights in Digital Twins.

In part 1 of this series, we introduced the concept of Predictive Maintenance in digital twins. In part 2 we showed how we set up Databricks to prepare for Predictive Maintenance analyses.

1. Notebook analysis setup (R)

We’d created our notebook with its primary language set to R, but we will also make use of Python in part 4 to write our R analysis results to Azure Blob Storage.  

To change the language of a Databricks cell to Scala, SQL, Python, or R, prefix the cell with ‘%’ followed by the language name.


%python  # or %r, %scala, %sql

You will not need to prefix the cell with anything if you are using your primary language.  

As we’d stated in part 1, we plan to use Apache Spark to handle all our dataset operations. To use Apache Spark in Databricks, we will need to establish a connection with ‘sparklyr’. If we were working in RStudio, we would first need to install the ‘sparklyr’ package, but we’ve found that most CRAN packages come preinstalled in Databricks notebooks, so only loading is necessary.


# load Apache Spark's R interface through the ‘sparklyr’ package 

library(sparklyr, quietly = TRUE) 

# then establish a Databricks connection to Apache Spark using 'sparklyr' 

sc <- spark_connect(method = "databricks") 

# The connection is assigned to var 'sc' 

We will reserve Cmd 1 for loading packages and initializations. Insert a new cell to continue.

Add new Cell

In the new cell, Cmd 2, we will read in our Predictive Maintenance dataset from Databricks’ FileStore, which we uploaded in part 2. Our dataset is located at ‘/FileStore/tables/maintenancedata.csv’.  


# Read our csv file into a Spark dataframe using spark_read_csv
# The first parameter must be a reference to our spark connection variable, 'sc'

sdf <- spark_read_csv(sc, name = "test_maintenance", path = "/FileStore/tables/maintenancedata.csv", memory = FALSE, infer_schema = TRUE) 

To run the notebook cells, click ‘Run All’ from the top toolbar. You will know all the cells succeeded if no error tracebacks appear underneath them. Once everything succeeds, the notebook is initialised and ready for further analyses.

Run All Cells

We will continue to split each block of analysis into separate cells for better separation of concerns and readability.

Before we move on to conducting Predictive Maintenance, you might like to load all the packages we ended up using.

Load all relevant packages
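As a sketch, the consolidated loading cell might look like the following, assuming these are all the packages used in the analyses below (sparklyr, dplyr, survival, and GGally):

```r
# Cmd 1: load every package used in the analyses below
library(sparklyr, quietly = TRUE)  # Apache Spark interface
library(dplyr, quietly = TRUE)     # pipes and data manipulation verbs
library(survival, quietly = TRUE)  # survfit() and survreg()
library(GGally, quietly = TRUE)    # ggsurv() survival plots
```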

Notice that the ‘GGally’ package is throwing an error. It is not preinstalled in Databricks, so we are unable to load it. We will have to install it on the cluster that houses our notebook: head to your cluster, navigate to ‘Libraries’, click ‘Install New’, and install the CRAN package GGally.

Install new package

Once installed, loading the GGally package should no longer be an issue.

2. Predictive Maintenance analyses (R)

We will now run through a few of the analyses we used to construct our Predictive Maintenance report.

   2.1. We used boxplots to study the distribution of variables based on a 5-number summary.


# Create a subset ('broken_set') of your Spark dataframe where broken == 1 for each row 
# The subset is created using the dplyr package pipe '%>%', which can be read as 'Then';
#(ensure the dplyr package is loaded in Cmd 1)
# The pipe can be thought of like: you take the sdf dataframe, 'Then' you filter the data by broken == 1. 

broken_set <- sdf %>% filter(broken == 1)

# The following 3 boxplots will be positioned according to layout() 

layout(matrix(c(1,1,2,2, 
                3,3,3,3), ncol = 4, byrow = TRUE))  

# For the boxplot() command, a quantitative variable is connected to a qualitative variable using the tilde ‘~’;
#You must then specify the dataset as an additional argument to the function. 

boxplot(lifetime~broken, data=broken_set, main="Broken machines", xlab="", ylab="Lifetime", col="#357EC7") 

boxplot(lifetime~branch, data=broken_set, main="Per Branch", xlab="branch", ylab="Lifetime", col="#357EC7") 

boxplot(lifetime~machine, data=broken_set, main="Per Machine", xlab="machine", ylab="Lifetime", col="#357EC7") 

The first graph above shows the statistical distribution of a machine’s lifetime. It reveals that, on average, machines break after 80 weeks, with most breaking between 60 and 100 weeks. The second and third graphs show that branch C and the Aircon units tend to break several weeks before the others.

   2.2.0. A survival plot that implements the Kaplan-Meier (KM) method helped us to estimate the time it takes for a machine to break.  


# To calculate a survival analysis, we will need to use the survival package's survfit() function;
#(ensure the 'survival' package is loaded in Cmd 1)
# survfit(survivalObject, dataset), to add a grouping to the survivalObject, add it to the right of the tilde ‘~’, use ~1 if there is none. 

maintenance_graph <- survfit(Surv(lifetime,broken) ~ 1, data = sdf) 


ggsurv(maintenance_graph, cens.size = 1, cens.shape = 4) 

print(maintenance_graph) indicates there are 1000 machines, of which 397 are broken. The median survival is 80 weeks.

   2.2.1.  Another survival plot that implements the Kaplan-Meier (KM) method helped us to estimate the time it takes for a machine to break, grouped by branch.


maintenance_graph2 <- survfit(Surv(lifetime,broken) ~ branch, data = sdf) 

ggsurv(maintenance_graph2, cens.size = 1, cens.shape = 4) 

print(maintenance_graph2) showed:
branch A – 336 machines, of which 123 are broken. The median survival is 80 weeks.
branch B – 356 machines, of which 150 are broken. The median survival is 80 weeks.
branch C – 308 machines, of which 124 are broken. The median survival is 81 weeks.

   2.2.2.  Another survival plot that implements the Kaplan-Meier (KM) method helped us to estimate the time it takes for a machine to break, grouped by machine.  


maintenance_graph3 <- survfit(Surv(lifetime,broken) ~ machine, data = sdf) 

ggsurv(maintenance_graph3, cens.size = 1, cens.shape = 4) 

print(maintenance_graph3) showed:
Aircon Unit – 242 machines, of which 114 are broken. The median survival is 65 weeks.
Cooling Unit – 266 machines, of which 91 are broken. The median survival is 92 weeks.
Forklift – 254 machines, of which 116 are broken. The median survival is 80 weeks.
L.B.Door Motor – 238 machines, of which 76 are broken. The median survival is 88 weeks.

   2.3. We performed a survival regression analysis to estimate the relationships between the independent and dependent variables.


# Perform a Survival regression analysis (SRA) on only Branch A
# To do an SRA, we will need to use the survival package's survreg() function;
#(ensure the 'survival' package is loaded in Cmd 1)  

# Filter the data for only rows where branch = A  

branchA_set <- sdf %>% filter(branch == "Branch A") 

# Extract the columns 'lifetime' and 'broken' as vectors to force column values to be of 1 type, else Surv() will fail 

lifetime = pull(branchA_set, lifetime) 

broken = pull(branchA_set, broken) 

# Choose lifetime and broken as dependent variables to be used in the survival regression model. 

dependantvars <- Surv(lifetime, broken) 

# Create model (use the gaussian method) 

survreg <- survreg(dependantvars~pressure+moisture+temperature+machine, dist="gaussian",data=branchA_set) 



 2.3.0. Using our survival regression analysis, we can now predict which machine will break next and therefore prioritise maintenance on these machines to avoid a break in business operations.


# p = percentile = 0.5 = expected median
# Analyse Branch A only 
# Ebreak : Estimate of how long a machine will live 

Ebreak<-predict(survreg, newdata=branchA_set, type="quantile", p=.5) 

# Extract the column 'lifetime' as a vector to force column values to be of 1 type 

lifetime <- pull(branchA_set, lifetime)

# Make forecast dataset, coerce columns from branchA_set to all be 1 type using pull() 

Forecast <- data.frame(Ebreak) 

# Lifetime: How long a machine has lived 

Forecast$lifetime = lifetime 

Forecast$broken = pull(branchA_set, broken) 

Forecast$id = pull(branchA_set, id) # Also, we attach the ID of the machines. 

# Compute the Expected Remaining Lifetime of each machine (RemainingLT) 

Forecast$RemainingLT <- Forecast$Ebreak - Forecast$lifetime 

# Order the elements by Expected Remaining Lifetime 

Forecast <- Forecast %>% arrange(RemainingLT) 

# Keep only those that are not broken yet 

ActionsPriority <- Forecast %>% filter(broken == 0) 


print(ActionsPriority) showed:  
18 machines that should be changed this month, ordered by remaining lifetime, listed with their ids.

   2.3.1. We might want to change the machines with less than a week of remaining lifetime, and re-run our program every few days to catch the next machines due to break. Here we have a data table that can be automatically emailed to the manager every week.  
The machines are all classified into 3 classes based on RemainingLT: urgent, medium, and good.


ActionsPriority$class <- cut(ActionsPriority$RemainingLT, c(-10,1,4,1000)) 

levels(ActionsPriority$class) <- c('urgent', 'medium', 'good') 

summary(select(ActionsPriority,-c(id)))  #leave out the id column 

As shown above, 4 machines are in urgent need of maintenance.
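To build the data table for that weekly email, one option (a sketch, assuming the column names from our Forecast dataset) is to keep only the urgent machines, already ordered by how little lifetime they have left:

```r
# Sketch: pull out the machines needing immediate attention,
# keeping just their ids and expected remaining lifetimes
urgent_machines <- ActionsPriority %>%
  filter(class == 'urgent') %>%
  select(id, RemainingLT)

print(urgent_machines)
```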

3. Predictive Maintenance results (R)

Now that we have enough information for our digital twin, we will initialise some variables which will be used to populate a JSON body.



# Outputs for the digital twin (Warehouse scenario = Branch A) 
# Get the information for the asset maintenance overview board 
# Total asset count 

totalBranchA <- branchA_set %>% tally()  

# Total broken asset count 

total_Broken_branchA <- branchA_set %>% filter(broken == 1) %>% tally() 

# Total working asset count 

total_Working_branchA <- branchA_set %>% filter(broken == 0) %>% tally() 

# This extracts values from the tally result table by column name 'n' 

totalBranchA <- pull(totalBranchA, n)
total_Broken_branchA <- pull(total_Broken_branchA, n)  
total_Working_branchA <-pull(total_Working_branchA, n)  

# Maintenance_class_counts = count(ActionsPriority, class) 
# To get each count separately 

urgent <- length(which(ActionsPriority$class=='urgent')) 
medium <- length(which(ActionsPriority$class=='medium'))
good <- length(which(ActionsPriority$class=='good')) 
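As a sketch of how these variables might populate the JSON body (assuming the ‘jsonlite’ package is available on the cluster; the field names here are illustrative placeholders):

```r
# Collect the outputs into a named list mirroring the JSON structure
report <- list(
  branch = "Branch A",
  totalAssets = totalBranchA,
  brokenAssets = total_Broken_branchA,
  workingAssets = total_Working_branchA,
  maintenanceClasses = list(urgent = urgent, medium = medium, good = good)
)

# Serialise to a JSON string; auto_unbox keeps scalar values from
# being wrapped in one-element arrays
json_body <- jsonlite::toJSON(report, auto_unbox = TRUE, pretty = TRUE)
```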

We will commit this JSON body to Azure Blob Storage in part 4.

In this post, we discussed how we used Apache Spark through CRAN’s sparklyr package to process our dataset in the cloud, in R, to produce a Predictive Maintenance report we can access from our digital twin project. We hope we’ve given you some inspiration on the analyses you can conduct in your own Predictive Maintenance reports and how to do this with Databricks.  

In our final post, we will walk through how we committed our Predictive Maintenance report to Azure Blob Storage, extending the capabilities of our digital twin.