March 30, 2021

Is biased data leading you down the wrong path?



Data bias is a big problem. With the sheer quantity of data now available and an increasing reliance on data- and insights-led decision-making, mistaking 'the truth you want to see' for 'truth' is a significant risk.

What is bias in a 'data project' context?

Bias happens when we lean in a certain direction, either in favour or against something.  

For data projects, it means we're interfering with the true outcome and therefore making business decisions based on false conclusions. It could be the data itself that's biased or the person analysing it. Most of the time it's both, so it's often a tricky problem to solve.

I could list many ways that data bias can impact projects but, ultimately, we're interested in overcoming bias when conducting data projects. In this article, I'll outline some of the frequently encountered types and how to minimise biases that occur.

The bias challenge

Without going into too much detail about the psychology behind this phenomenon – although, it's a fascinating topic if you have the time – there are a couple of things to be aware of:

  1. Reducing bias is an iterative process. Searching for where it occurs should happen continually.  
  2. The answer to "Is my data biased?" is likely yes. The question we need to ask ourselves is: "How can we minimise the bias?"

Having acceptance and awareness that bias will impact your outcomes is always a good start.

Artificial intelligence bias; machine learning bias – a thought-provoking area!

We know that most data sets used in AI/ML-based models have some form of bias. If we're teaching a machine rules based on biased information, there's a risk of enabling discriminatory behaviour. There are real-life examples of this happening that you've probably read about. As the use of AI and ML grows, these biases risk becoming more entrenched. Solving algorithmic bias, addressing human bias and enlisting governing bodies are some of the interesting discussion points around this.

Top 3 types of data bias to look out for

In my role as a Data Scientist, I've seen when and where biases commonly, and usually inadvertently, occur. Below, I'll talk about 3 common types of bias that impact data projects. For each, I've provided a fictional business example and recommendations for minimising this bias.

1. Sampling bias

What is it?

Sampling bias occurs when the sample data you collect and prepare for modelling does not accurately represent the full data set, i.e. some groups are over-represented or under-represented in your sample.

Why is this a problem?

Business decisions are based on insights from the underlying data. When the sample misrepresents that data, the decisions based on it are likely to be flawed.

A business example –

Let's say an online retail store is building a recommendation model to improve online sales. It looks at the item purchased by the customer and suggests other items that may interest them.

First, let's assume that this retailer originally had customers only in New Zealand but has since started serving overseas customers, who accounted for 70% of sales in the past year. The domestic customer data is hosted on-premises, while the overseas data is hosted in the cloud.

When training the recommendation model, the business only considers domestic data. In this example, let's say the domestic data was chosen because it was easier to access due to permissions. However, in doing so, they've ignored the overseas data and created a sampling bias that only considers NZ data. As a side note: selecting data from just one source often happens unknowingly in these types of situations.
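One way to catch this kind of gap early is to compare the make-up of the training data with a reference you trust. The snippet below is a minimal sketch: the function name, the 500-row training set and the use of the 70/30 sales split as the reference are all assumptions made for illustration, not part of any real system.

```python
from collections import Counter

def representation_gaps(sample_labels, population_shares):
    """Return (share in sample - expected share) for each group.

    A positive gap means the group is over-represented in the sample,
    a negative gap means it is under-represented.
    """
    counts = Counter(sample_labels)
    total = len(sample_labels)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Assumption: share of sales is used as the reference distribution.
population_shares = {"NZ": 0.30, "Overseas": 0.70}

# Hypothetical training set built from the on-premises (domestic) data only.
training_labels = ["NZ"] * 500

gaps = representation_gaps(training_labels, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%}")
```

A large gap for any group is a prompt to go back and check which sources fed the training data.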

How can you minimise this type of bias?

Identify the key elements of the data and use stratified random sampling. In the example above, you'd pick the country for stratification and use all the databases to source the sample. This article from Investopedia is a good starting point to learn about stratified random sampling.
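To make the idea concrete, here is a minimal sketch of proportional stratified random sampling using only the Python standard library. The order records, the `country` field and the 300/700 split are hypothetical and simply mirror the example above. In practice you would probably use your data platform's own sampling tools.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, fraction, seed=0):
    """Draw the same fraction of records at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[record[stratum_key]].append(record)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        # Take at least one record so small strata are never dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: domestic orders (on-premises) and overseas orders (cloud).
on_prem_orders = [{"order_id": i, "country": "NZ"} for i in range(300)]
cloud_orders = [{"order_id": 300 + i, "country": "Overseas"} for i in range(700)]
all_orders = on_prem_orders + cloud_orders  # source the sample from every database

sample = stratified_sample(all_orders, "country", fraction=0.1)
counts = {c: sum(1 for r in sample if r["country"] == c) for c in ("NZ", "Overseas")}
print(counts)  # {'NZ': 30, 'Overseas': 70}
```

Because every stratum is sampled at the same rate, the sample keeps the 30/70 split of the combined data rather than reflecting whichever source was easiest to reach.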

2. Survivorship bias

What is it?

Survivorship bias occurs when you apply selection criteria and only work with the data that passes them. It means you knowingly or unknowingly exclude a subset of data. For example, you may choose to analyse customers who made at least one transaction over the past year. This inadvertently excludes all the inactive customers.

Why is this a problem?

While it's important to learn from what worked in the past, it's equally important to learn what did not work so well.

A business example –

Let's say a business wants to increase employee engagement. The HR team sends out a survey to the employees, asking them to select which activities or perks they value the most, e.g. family days, team lunches, a better coffee machine, etc. They observe that 80% of the respondents voted for a better coffee machine.

Most of the employees who responded to the survey were from the finance team who work exclusively from the office, meaning that the business has not fully understood what all other employees enjoy the most. The results are biased.

How can you minimise this type of bias?

Identify a method that allows you to include a broader set of data. In the example above, you would keep the survey questionnaire open for a longer time and send reminder emails specifically to departments where response rates are significantly lower than the enterprise average response rate.
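A simple check like the one below can show which departments to chase. This is only a sketch: the headcounts, response counts and the 10-percentage-point tolerance used to define "significantly lower" are assumptions I've made up for illustration.

```python
def departments_needing_reminders(headcount, responses, tolerance=0.10):
    """Flag departments whose response rate trails the enterprise average.

    `tolerance` is an assumed threshold: a department is flagged when its
    rate is more than this many points below the overall response rate.
    """
    overall_rate = sum(responses.values()) / sum(headcount.values())
    flagged = []
    for dept in sorted(headcount):
        rate = responses.get(dept, 0) / headcount[dept]
        if rate < overall_rate - tolerance:
            flagged.append(dept)
    return overall_rate, flagged

# Hypothetical survey figures.
headcount = {"Finance": 40, "Engineering": 100, "Sales": 60}
responses = {"Finance": 36, "Engineering": 20, "Sales": 24}

overall_rate, flagged = departments_needing_reminders(headcount, responses)
print(f"Enterprise response rate: {overall_rate:.0%}")
print("Send reminders to:", flagged)
```

With these made-up numbers, Finance responds at 90% against an enterprise average of 40%, so its preferences would dominate the results unless the lagging departments are followed up.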

3. Confirmation bias

We tend to favour information that confirms what we believe. You might prefer, search for and remember evidence that aligns with your beliefs. It can also mean:

  1. Misunderstanding due to incorrect beliefs.
  2. Ignoring opportunities to test beliefs.
  3. Explaining away data that doesn't fit with our beliefs. I'm sure we can all attest to this!

There's some interesting research around Wason's Selection Task that points to this human tendency, although the theory has developed somewhat since then.

Why is this a problem?

You may tend to look for data that supports your hypothesis and risk overlooking the data that contradicts it.

A business example –

Let's say a product development team is considering adding new features to the product, and they believe that a particular feature is more important than others.

The team decides to do some online research to see what features make the most sense for similar products. They start by searching for the usefulness of this new feature.

What are the implications of this? It's more likely that the search engine will start to recommend more articles that speak about the usefulness of that feature.

And the result? It could mislead the team into believing that this is the best feature to have due to bias occurring during the research stage.

How can you minimise this type of bias?

Where possible, perform Null Hypothesis testing. In the example above, your Null Hypothesis would be "Feature X is not more important than other features for Product A." You then look for evidence strong enough to reject it, rather than evidence that agrees with you. If you're performing an online search, you will also want to search for reasons why it might not be a good feature.
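If the team has survey data rather than search results, the Null Hypothesis can be tested directly. The sketch below uses a one-sided two-proportion z-test, which is just one reasonable choice of test. It assumes two independent groups of 120 users, each asked whether one feature is a "must have". All of the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Test H0: proportion A <= proportion B against H1: A > B.

    Assumes two independent samples that are large enough for the
    normal approximation to hold.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value

# Hypothetical results: 66 of 120 users rated Feature X a "must have",
# compared with 54 of 120 for a competing feature.
z, p_value = one_sided_two_proportion_z_test(66, 120, 54, 120)

alpha = 0.05  # assumed significance level
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("Reject the Null Hypothesis: Feature X looks more important.")
else:
    print("Not enough evidence to reject the Null Hypothesis.")
```

With these made-up numbers, Feature X looks ahead at 55% versus 45%, yet the test does not reject the Null Hypothesis at the 5% level. That is exactly the kind of result confirmation bias would tempt a team to wave away.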

Key takeaways

  1. Be mindful of seeing what you want to see. Having the ability to look through a different lens is highly valuable.
  2. Test your assumptions and learn from experimenting.
  3. It's not about the best; it's about balance.

Taking the necessary steps to reduce bias in your data projects is highly recommended. It'll help you achieve more accurate results and stop you from wandering down the wrong path and losing sight of your goal.  

Having challenges around data bias? Talk to us.
If you want to apply data science methodology to your existing data projects, we can help with that too.  

Author: Sidharth Macherla

Sid is a Data Scientist at Theta. He has extensive experience working in multiple industry domains such as Banking, Financial Services and Retail, and is a Microsoft Certified Solutions Associate in Machine Learning.
