Statistical Bias Types explained – part 2 (with examples)

Tomi Mester
August 28, 2017

It’s time to continue our discourse about Statistical Bias Types. This is part 2 – if you missed part 1, read it here: Statistical Bias Types part 1. In the previous article I introduced 5 ways (not) to get biased during the data collection/sampling phase of your research. Now I’ll focus on what can (but shouldn’t) go wrong during the analysis and presentation phases.

Statistical Bias #6: Omitted Variable Bias

Omitted Variable Bias occurs when you leave one or more important variables out of your model. This issue comes up especially often in Predictive Analytics.

Everyday example of Omitted Variable Bias:

Imagine a grocery store. You are finished with shopping and you want to pay. There are 3 lines and you want to pick the one where you have to spend the least time. So you check which one is the shortest and queue up there. Murphy’s Law: the other line is going much faster. Your prediction failed – maybe because you have omitted an important variable, namely how packed the carts were in the different lines. This mistake cost you 5 more minutes in line…
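
The grocery store example can be sketched in a few lines of Python. Everything here is made up for illustration: the three lines, the cart sizes and the timing constants are assumptions, not measurements. The naive prediction looks only at how many carts are waiting, while the fuller one also uses how packed those carts are.

```python
from typing import NamedTuple

# All numbers below are hypothetical, chosen only to illustrate the idea.
SECONDS_PER_ITEM = 3       # assumed scanning time per item
SECONDS_PER_PAYMENT = 30   # assumed payment time per customer


class Line(NamedTuple):
    name: str
    carts: int            # how many customers are waiting
    items_per_cart: int   # the variable that is easy to omit


def naive_wait(line: Line) -> int:
    """Prediction that omits cart size: only the number of carts counts."""
    return line.carts


def full_wait(line: Line) -> int:
    """Predicted wait in seconds, also using how packed the carts are."""
    per_customer = line.items_per_cart * SECONDS_PER_ITEM + SECONDS_PER_PAYMENT
    return line.carts * per_customer


lines = [Line("A", 3, 40), Line("B", 5, 10), Line("C", 6, 12)]

naive_choice = min(lines, key=naive_wait)    # shortest line
better_choice = min(lines, key=full_wait)    # fastest line under the assumptions
extra_seconds = full_wait(naive_choice) - full_wait(better_choice)

print("Shortest line:", naive_choice.name)
print("Fastest line:", better_choice.name)
print("Seconds lost by omitting cart size:", extra_seconds)
```

With these invented numbers the shortest line is not the fastest one, which is exactly the kind of miss an omitted variable produces.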

Data analytics example of Omitted Variable Bias:

In real-life data projects you can lose much more than 5 minutes with wrong predictions. Here’s an example:

It’s quite common that online businesses want to predict possible user churn so they can act beforehand. Let’s say you have been monitoring all user activity on your product, and based on your own data, you built up a model that predicts whether a user will cancel her subscription in one week – with 75% accuracy. Nice job! But the next day you see that a big chunk of users are cancelling their subscription without any warning from your model. What just happened? In this hypothetical scenario a strong competitor entered your market and offered the same solution you have, but for half the price. Of course, this is something your model wasn’t ready for. The presence of the competitor is an omitted variable in this case. In fact it’s a variable that’s almost impossible to prepare any predictive models for.
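
Here is a minimal simulation of that scenario, not a real churn model. The "model" is a single hypothetical rule (low activity means churn), and every probability below is an assumption picked for illustration. The competitor is a variable the rule never sees.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N_USERS = 10_000


def simulate_week(competitor_present: bool) -> float:
    """Return the accuracy of a simple activity-based churn rule for one week."""
    correct = 0
    for _ in range(N_USERS):
        low_activity = random.random() < 0.3   # assumed share of inactive users
        # Assumed churn probabilities driven by activity alone.
        churn_probability = 0.8 if low_activity else 0.1
        churned = random.random() < churn_probability
        # The omitted variable: a cheaper competitor lures away half of
        # the users who would otherwise have stayed (assumption).
        if competitor_present and not churned:
            churned = random.random() < 0.5
        predicted_churn = low_activity         # the "model" sees activity only
        correct += predicted_churn == churned
    return correct / N_USERS


accuracy_before = simulate_week(competitor_present=False)
accuracy_after = simulate_week(competitor_present=True)

print(f"Accuracy before the competitor: {accuracy_before:.2f}")
print(f"Accuracy after the competitor:  {accuracy_after:.2f}")
```

The rule itself does not change between the two runs. Only the world does, and the accuracy drops because the new cause of churn is invisible to the model.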

Note: Contemporary Predictive Analytics models work pretty much on the principle of “what happened in the past will happen in the future.” This makes these models very vulnerable. If something new is happening on the market, it’s often not calculated in the predictions and it causes major inaccuracy. The bottom line is: don’t expect a predictive model to be accurate for more than 1 or 2 years.

Statistical Bias #7: Cause-effect Bias

Our brain is wired to see causation everywhere that correlation shows up.
Cause-effect bias is usually not mentioned as a classic statistical bias, but I wanted to include it on this list because many decision makers (business/marketing managers) are not aware of it. Even those who are aware of it (including me) have to remind themselves from time to time: correlation does not imply causation.

Everyday example of Cause-Effect Bias:

Here’s my favorite example: kids who had tutors in high school eventually got worse grades than the kids who didn’t. I intentionally phrased this in a misleading way. The point is that even though you see a correlation between bad grades and tutoring, the tutoring wasn’t the cause of the bad grades. The bad grades were the reason the tutors were needed.

Data analytics example of Cause-Effect Bias:

You have a new loyalty program! You see that the customers who signed up for that loyalty program are spending 5 times more money in your e-commerce store than those who didn’t. Is the loyalty program successful? Maybe, but we don’t know that for sure. It’s also possible that only the more committed (in other words, loyal) customers are interested in the loyalty program in the first place, and they would have spent 5 times more anyway. (See more here: self-selection bias.)

Unfortunately the only way to crack the correlation vs. causation issue is to run experiments. While it’s easy to A/B test your loyalty program online, it’s a bit more difficult to tell half of the kids who perform badly at school that they can’t have tutors because of scientific research. But let this be the problem of the social economists.
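
To see why the experiment matters, here is a small simulation of the loyalty program case. It is a sketch under loud assumptions: the spending figures, the share of committed customers and the sign-up probabilities are all invented, and the program is given zero real effect on purpose. The only thing that differs between the two runs is who ends up in the program: customers choosing for themselves versus a coin flip, as in an A/B test.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is reproducible

N_CUSTOMERS = 20_000
PROGRAM_EFFECT = 0.0  # assumption: the program itself changes nothing


def base_spend(committed: bool) -> float:
    """Hypothetical spend; committed customers spend far more on their own."""
    return random.gauss(500, 50) if committed else random.gauss(100, 20)


def spend_ratio(randomized: bool) -> float:
    """Average spend of members divided by average spend of non-members."""
    members, non_members = [], []
    for _ in range(N_CUSTOMERS):
        committed = random.random() < 0.2
        if randomized:
            # A/B test: a coin flip decides, commitment plays no role.
            joins = random.random() < 0.5
        else:
            # Self-selection: committed customers sign up far more often.
            joins = random.random() < (0.8 if committed else 0.05)
        spend = base_spend(committed) + (PROGRAM_EFFECT if joins else 0.0)
        (members if joins else non_members).append(spend)
    return mean(members) / mean(non_members)


observational_ratio = spend_ratio(randomized=False)
experimental_ratio = spend_ratio(randomized=True)

print(f"Members vs. non-members, self-selected: {observational_ratio:.2f}x")
print(f"Members vs. non-members, randomized:    {experimental_ratio:.2f}x")
```

Even though the program does nothing by construction, the self-selected comparison shows members spending several times more. The randomized comparison lands close to 1x, because random assignment spreads the committed customers evenly across both groups.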

Statistical Bias #8: Funding Bias

I briefly mentioned Funding Bias (sometimes called sponsorship bias) in Statistical Bias Types part 1. It happens when the results of a scientific study are biased in a way that supports the financial sponsor of the research.

Everyday example of Funding Bias:

I won’t name any particular industry here, but I think we all know what I’m talking about. Any time you are watching “documentaries,” when you are reading the “news,” when you are checking “research results” – try to first make sure that you are consuming content by independent creators who are not biased by their sponsors’ expectations.

Data analytics example of Funding Bias:

If you are working for a company as a Data Scientist or Analyst, you are getting your money from that company – so in a sense, it’s your sponsor. Now, of course you want to deliver good news to make your “sponsors” happy. Let’s imagine a game development company. A data analyst might feel really bad for reporting that the new game that everybody’s been working on for the last 3 months looks like a huge failure. But keep this in mind, and train your colleagues too: as a data scientist or analyst, you are not getting paid to deliver good news. You are getting paid to deliver accurate, useful and actionable information. Was the new product a failure? It’s OK, but make sure that everyone can learn from the data that you have collected during the test phase, so the new version can be better!

Statistical Bias #9: Cognitive Bias

Cognitive biases are related to human perception, so they form a much broader category. But they are connected to statistical biases too! They can have a huge effect on how you should present and interpret the data.

Examples of Cognitive Bias:

For cognitive biases I’m gonna lump together the everyday and the data science examples. So here are the 4 most important cognitive bias types:

Hindsight bias.
Even the greatest findings seem trivial when you look back at them a few days later. The outcome feels so logical that you think you should have known it the whole time. When you are presenting the results of your 1-month data analysis project, there will always be someone in the room who says: “I was gonna say the very same thing during the last meeting…” My suggestion: smile inside and keep the comment “of course, but then why didn’t you?” to yourself.
Confirmation bias.
A variation on the previous one, but a bit more dangerous. Confirmation bias happens when a decision maker has strong preconceptions and listens only to the part of your presentation that confirms their beliefs, completely missing the rest. Suggestion: always have a one-sentence takeaway for your presentations that’s impossible to miss even if someone’s eyes are covered by preconceptions. (Also feel free to point out possible confirmation biases and send over this article. ;-))
Belief bias.
When someone is so sure about their own gut feelings that they ignore the results of a data research project. Suggestion: ehh… hustle. I’ve given more details here: Data-resistance – how to evangelize the data driven mindset?
Curse of knowledge.
When you assume someone has the same background knowledge that you do. It’s especially important to be aware of this bias when you are presenting your data projects to non-data-minded people. Keep in mind that business managers don’t necessarily have phrases like “statistically significant,” “multiple regression,” or “least squares estimates” in their mental dictionary, so try to communicate using their words (e.g. “statistically significant” = “pretty damn sure”).

There are many, many more cognitive bias types, but I’ll limit my article to these four most important ones. If you want to learn more, take a look at this Wikipedia article: List of Cognitive Biases.

How not to be biased?

Now that we have learned about all the important statistical bias types, the only question left is: how can we overcome them? How can we ultimately avoid being biased? In next week’s article I’ll write about that and give you some practical advice! UPDATE: here’s the new article: How not to be biased?

If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.

Cheers,
Tomi Mester