Data science focuses on extracting knowledge or insights from structured or unstructured data by adopting techniques and theories from mathematics, statistics, information science and computer science. It can bring new perspectives to data.
The decreasing cost of storage devices has led to an exceptional growth in data volumes. Large datasets are generated at an incredibly high speed and in various formats. Data science provides the techniques needed to extract meaningful insights from this data, from building predictive or prescriptive models to discovering hidden patterns.
Data science vs data analysis
Data analysis is not new - people have been using statistics and related techniques to analyse data for some time. The main difference between data analysis and data science is explanation versus prediction.
Data analysis deals with explaining a phenomenon by extracting interesting patterns from individual data sets using well-formatted queries. Data science, on the other hand, aims to discover and extract knowledge that can be used to make decisions and predictions, not just to explain what’s going on. Figure 1 shows the process that is usually adopted for a typical data science activity. The entire process usually needs domain knowledge, mathematics, statistics, software development and machine learning skills.
Figure 1: Foundational methodology for data science (Yash Mehta, 2015)
The modelling stage in this figure relates to creating predictive models which are used for performing predictive analytics.
Predictive analytics deals with extracting meaningful information from data using data mining, statistical modelling and machine learning techniques. This information is used to predict unobserved/unknown patterns in the past, present or future. Instead of providing insight about why a particular event happened it focuses on providing foresight about when a particular event will happen. A predictive model is usually a mathematical equation representing relationships between various variables that contributed towards a specific outcome. One example of a predictive model would be a customer churn model, which identifies factors that contribute towards customer turnover.
In an analytics maturity model (Figure 2) predictive analytics is the third level in the curve. As we go higher and towards the right, the customer value increases with increased sophistication in techniques.
Figure 2: Maturity curve of analytics (Boxever.com, 2015)
Stages of predictive analytics
The stages involved in predictive analytics are shown in Figure 3. The process starts with defining a clear question and objective for the activity. Data cleaning, merging and quality assurance are done at the data preparation stage. Modelling is the process of actually creating a predictive model. It also involves verifying and testing the model results. A model can be deployed in the cloud as a web service. It is also very important to continuously update the model according to the latest data.
Figure 3: Predictive Analytics Process
The methods used for predictive modelling can be broadly grouped into regression and machine learning based approaches.
Methods of predictive modelling
Regression based techniques
Regression analysis is a technique used in statistical modelling which deals with estimating the relationship among variables. These techniques focus on representing interaction between different variables in the form of a mathematical model.
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n$$
The above equation represents a simple linear regression model for $n$ data points. The equation contains one independent variable (predictor), $x_i$, one dependent variable (prediction or response), $y_i$, two parameters, $\beta_0$ (intercept) and $\beta_1$ (coefficient), and an error term, $\varepsilon_i$.
Let us take an example dataset containing heights and weights of university students. We want to find out whether there is any relationship between these two variables. Simple linear regression reveals that there is: weight tends to increase with height. To determine whether this relationship is statistically significant we look at the p-value. For this relationship we get a p-value of 0.0187, which is below the 0.05 significance level, i.e. if there were no real relationship, there would be less than a two percent chance of observing an association at least this strong.
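The fitting step in this example can be sketched in a few lines of Python. This is a minimal illustration using ordinary least squares with only the standard library. The height and weight values are made up for demonstration and are not the data from Table 1, and the function name `fit_simple_linear_regression` is hypothetical.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Illustrative values only (assumed units: inches and pounds).
heights = [60, 62, 64, 66, 68, 70, 72]
weights = [110, 120, 128, 140, 152, 160, 171]

intercept, slope = fit_simple_linear_regression(heights, weights)
print(f"weight = {intercept:.2f} + {slope:.3f} * height")

# Use the fitted line to predict the weight for a new height.
predicted_weight = intercept + slope * 67
print(f"predicted weight at height 67: {predicted_weight:.1f}")
```

A positive slope corresponds to weight increasing with height. The p-value for the slope comes from a t-test on the slope estimate, which is usually obtained from a statistics library rather than computed by hand, so it is left out of this sketch.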
Table 1: University students' height and weight data set
Figure 4: Regression line for university students' height and weight data set
Table 2: University students' height and weight labelled data set
The other variations of regression analysis include:
- Discrete choice models
- Logistic regression
- Time series models
- Multivariate adaptive regression splines
In cases where the underlying relationships between variables are very complex and it is hard to build a mathematical model to represent them, machine learning techniques are used. These techniques learn from a training data set and encode the relationships among the attributes of a data set (Table 1).
Classification is another approach used for predictive analytics using machine learning techniques. Classification deals with automatically labelling data points into various categories. For training a classification model a labelled data set (Table 2) is used instead. The last column of this table contains the label assigned to each data point. A model can be tested once training is completed. In the testing stage a new data point (e.g. X [64,145]) is automatically assigned label A, as shown in Figure 5.
Figure 5: Height vs weight data set
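The labelling step described above can be sketched as follows. The article does not say which classifier produced Figure 5, so this sketch assumes a k-nearest-neighbours classifier as one simple option. The training points are invented for illustration and are not the contents of Table 2, and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
from math import dist


def knn_predict(training_points, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(training_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the labelled points by Euclidean distance to the query and keep k.
    nearest = sorted(training_points, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Illustrative labelled data: ((height, weight), label). Values are made up.
training = [
    ((60, 115), "A"), ((62, 125), "A"), ((63, 135), "A"), ((65, 150), "A"),
    ((68, 180), "B"), ((70, 190), "B"), ((71, 200), "B"), ((73, 215), "B"),
]

label = knn_predict(training, (64, 145), k=3)
print(f"X [64,145] is assigned label {label}")
```

With these made-up points, the three nearest neighbours of X all carry label A. Because weight spans a wider numeric range than height, it dominates the raw Euclidean distance; in practice the features would usually be rescaled before measuring distance.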
Machine learning techniques include:
Once a predictive model is built it is important to quantify its performance. Table 3 depicts the approaches commonly used to assess the performance of a predictive model.
Table 3: Performance measures for a predictive model
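As a rough illustration of quantifying performance, the sketch below computes four measures commonly used for classifiers: accuracy, precision, recall and F1 score. Whether these match the measures listed in Table 3 is an assumption, since the table contents are not reproduced here. The label lists are invented and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(actual, predicted, positive):
    """Compute accuracy, precision, recall and F1 for one class treated as positive."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative true labels and model predictions for a held-out test set.
actual = ["A", "A", "A", "B", "B", "B", "A", "B"]
predicted = ["A", "A", "B", "B", "B", "A", "A", "B"]

metrics = classification_metrics(actual, predicted, positive="A")
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

These measures apply to classification models. For regression models, error measures such as mean squared error are typically used instead.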
Sample business cases for predictive analytics
Prescriptive analytics is the final stage of the maturity of analytics curve shown in Figure 2. Once we know what will happen using predictive analytics, it is important to know the right course of action in the light of these predictions.
Figure 6: Prescriptive analytics uses hybrid data sources to predict, prescribe and adapt (Modaniel, n.d.)
For football game planning software, prescriptive analytics can be used to answer the following questions:
- When is it most likely that a particular player will be injured? And who could be the best replacement?
- When and how does the opponent have the best chance of scoring a goal? What strategy should we use for deciding players’ locations?
- How does the selection of a particular player affect the success rate?
Figure 7: A football planning software
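The step from prediction to prescription can be sketched as a simple decision rule: take the predicted outcome probabilities for each candidate action and recommend the action with the highest expected value. Everything in this sketch is hypothetical, including the player names, the probabilities (which would come from a predictive model) and the helper names. The payoffs use the usual league convention of 3 points for a win and 1 for a draw.

```python
POINTS = {"win": 3, "draw": 1, "loss": 0}


def expected_points(outcome_probabilities):
    """Expected league points for one option, given predicted outcome probabilities."""
    return sum(POINTS[outcome] * p for outcome, p in outcome_probabilities.items())


def recommend(options):
    """Return (best_option, expected_points) across all candidate options."""
    if not options:
        raise ValueError("at least one option is required")
    scored = {name: expected_points(probs) for name, probs in options.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Made-up predictions for three possible replacement players.
predicted = {
    "player X": {"win": 0.50, "draw": 0.10, "loss": 0.40},
    "player Y": {"win": 0.45, "draw": 0.30, "loss": 0.25},
    "player Z": {"win": 0.40, "draw": 0.35, "loss": 0.25},
}

best_option, best_value = recommend(predicted)
print(f"recommended: {best_option} (expected points {best_value:.2f})")
```

Note that the option with the highest win probability is not necessarily the one recommended, because the rule weighs all outcomes by their payoff. A real prescriptive system would also account for constraints and adapt as new data arrives.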
Data science has opened new avenues for analysing data. Its tools and techniques extend beyond traditional BI to show us insights never known before. Technologies like predictive and prescriptive analytics are capable of changing the way we understand our business and can help us make smarter decisions.