Statistical Bias Types explained (with examples) – part 1

Humans are stupid.
We all are, because our brains were made that way. The most obvious evidence of this built-in stupidity is the different biases our brains produce. Even so, we can at least be a bit smarter than average if we are aware of them. This is a data blog, so in this article I'll focus only on the most important statistical bias types – but I promise that even if you are not an aspiring data professional (yet), you will profit a lot from this write-up. For ease of understanding, I'll provide two examples for each statistical bias type: an everyday one and a more online-analytics-related one!

And just to make this clear: biased statistics are bad statistics. Everything I describe here is to help you avoid the same mistakes that some of the less smart "researcher" folks make from time to time.

The most important statistical bias types

There is a long list of statistical bias types. I'll cover those that can affect your job as a data scientist or analyst the most. These are:

  1. Selection bias
  2. Self-selection bias
  3. Recall bias
  4. Observer bias
  5. Survivorship bias
  6. Omitted variable bias
  7. Cause-effect bias
  8. Funding bias
  9. Cognitive bias

Statistical bias #1: Selection bias

[Figures: proper random sampling vs. selection bias]

Selection bias occurs when you select your sample or your data wrong. Usually this means accidentally working with a specific subset of your audience instead of the whole, so your sample is not representative of the whole population. There are many underlying reasons, but by far the most typical one I see: collecting and working only with data that is easy to access.
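
To see how much damage this can do, here's a minimal Python sketch (with made-up numbers, not real data) comparing a proper random sample with an easy-to-access "convenience" sample drawn from one loud subgroup:

```python
# A minimal simulation of selection bias (illustrative numbers, not real data).
# The population has a loud, easy-to-reach subgroup whose opinion differs
# from everyone else's; sampling only them skews the estimate badly.
import random

random.seed(42)

population = (
    [random.gauss(6.5, 1.5) for _ in range(30_000)]    # active online posters
    + [random.gauss(4.0, 1.5) for _ in range(70_000)]  # everyone else
)
true_mean = sum(population) / len(population)

# Proper random sampling: every member has the same chance of selection.
random_sample = random.sample(population, 1_000)

# Selection bias: we only reach the easy-to-access subgroup.
convenience_sample = random.sample(population[:30_000], 1_000)

print(f"true population mean:    {true_mean:.2f}")
print(f"random sample mean:      {sum(random_sample) / len(random_sample):.2f}")
print(f"convenience sample mean: {sum(convenience_sample) / len(convenience_sample):.2f}")
```

The random sample lands close to the true mean; the convenience sample reproduces only the subgroup's opinion.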


Everyday example of selection bias:

Please answer this question: What’s people’s overall opinion about Donald Trump’s presidency?
Most people have an immediate and very "educated" answer to that. Unfortunately, for many of them the top source of information is their Facebook feed. That's a very bad and sad practice, because what they see there does not reflect public opinion – it's only their friends' opinions. (In fact, it's even narrower, because they see only the opinions of friends who are active and post to Facebook – so most probably extroverted 25-35-year-olds are overrepresented.) That's classic selection bias: easy-to-access data, but only for a very specific, unrepresentative subset of the whole population.

Note 1: I do recommend blocking your Facebook feed for many reasons, but mostly so it doesn't make you narrow-minded: FB News Feed Eradicator!
Note 2: If you want to read another classic selection bias story, check how Literary Digest made a similar mistake (also referred to as undercoverage bias) ~80 years ago!

Online analytics related example of selection bias:

Another example of selection bias is when you send out a survey to your newsletter subscribers – asking what new product they would pay for. Of course, interacting with your audience is important (I send out surveys to my newsletter subscribers sometimes too), but when you analyze these survey results, you should be aware that your newsletter subscribers do not represent your potential paying audience.

There might be a bunch of people who are willing to pay for your products but are not part of your newsletter list. And on the other hand, there might be a lot of people on your list who would never spend money on your products – they are around just to get notified about your free stuff. And that's only one of the reasons (see the rest below) why surveying is simply the worst research method. By the way, for this particular example, I'd suggest doing fake door testing instead!

Statistical bias #2: Self-Selection bias

Self-selection bias is a subcategory of selection bias. If you let the subjects of your analyses/research select themselves, less proactive people will be excluded. The bigger issue is that self-selection is a specific behaviour – one that implies other specific behaviours – thus this sample does not represent the entire population.

Everyday example of self-selection bias:

Any type of polling/surveying. E.g. when you want to research successful entrepreneurs' behaviour with surveys, your results will be skewed for sure. Why? Because truly successful people most probably don't have the time/motivation to answer – or even look at – random surveys. So 99% of your answers will come from entrepreneurs who think they are successful, but in fact are not. In this specific case, I'd rather try to lure people who are proven to be successful into face-to-face interviews.

Online analytics related example of self-selection bias:

Say you have an online product – and a knowledge base for it with 100+ how-to-use-the-product kind of articles. Let's find out how good your knowledge base is by comparing the users who read at least 1 article from it to the users who didn't. We find that the article-reader users are 50% more active in terms of product usage than the non-readers. The knowledge base performs great! Or does it? In fact, we don't know, because the article-readers are a special subset of your whole population: they might have a higher commitment to your product, and that might be the reason for their interest in your knowledge base. In other words, they have "selected themselves" into the reader group. This self-selection bias leads to a classic correlation/causation dilemma that you can never solve by observational data research – only by A/B testing.
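
If you want to see this effect with your own eyes, here's a toy simulation (all numbers invented) where a hidden "commitment" level drives both article-reading and product usage – and reading itself has zero causal effect:

```python
# A toy simulation of self-selection bias (all numbers are made up).
# A hidden "commitment" level drives BOTH reading knowledge-base articles
# AND product usage -- reading itself has ZERO causal effect on usage here.
import random

random.seed(1)

users = []
for _ in range(10_000):
    commitment = random.random()                        # hidden confounder, 0..1
    reads_articles = random.random() < commitment       # committed users self-select into reading
    usage = 10 + 20 * commitment + random.gauss(0, 3)   # usage depends on commitment only
    users.append((reads_articles, usage))

readers = [usage for reads, usage in users if reads]
non_readers = [usage for reads, usage in users if not reads]

print(f"readers' avg usage:     {sum(readers) / len(readers):.1f}")
print(f"non-readers' avg usage: {sum(non_readers) / len(non_readers):.1f}")
# Readers look far more active -- purely because of who chose to read.
```

The readers come out noticeably more active even though reading did nothing – exactly the trap described above.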

Statistical bias #3: Recall bias

Recall bias is another common error in interview/survey situations: the respondent doesn't remember things correctly. It's not about bad or good memory – humans have selective memory by default. After a few years certain things stay, others fade. That's normal, but it makes research much more difficult.

Everyday example of recall bias:

How was that vacation 3 years ago? Awesome, right? Looking back, we tend to forget the bad things and remember only the good ones. Although it doesn't help us to objectively evaluate different memories, I'm pretty sure our brain is like that for a good reason.

Online analytics related example of recall bias:

I hold data workshops from time to time. I usually send out feedback forms afterwards, so I can make the workshops better and better based on the participants' feedback. I usually send them the day after the workshop, but in one particular case I completely forgot and sent the form a week later. Looking at the comments I got, that was my most successful workshop of all time. Except that it's not necessarily true. It's more likely that recall bias kicked in pretty hard. One week after the workshop, none of the attendees would recall whether the coffee was cold or whether I over-explained a slide here or there. They remembered only the good things. Not that I wasn't happy about their good feedback, but if the coffee was cold, I would want to know about it – to get it fixed for next time…
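
A quick way to convince yourself: here's a toy model (pure assumption, not my actual workshop data) in which negative details fade from memory faster than positive ones, so a delayed survey returns rosier scores:

```python
# A toy model of recall bias in feedback surveys (assumptions, not real data).
# Each attendee holds 5 positive and 5 negative impressions; negative details
# fade faster with time, so a delayed survey returns rosier average scores.
import random

random.seed(7)

def recalled_score(days_delay):
    positives = [random.uniform(7, 10) for _ in range(5)]   # good moments
    negatives = [random.uniform(1, 4) for _ in range(5)]    # cold coffee etc.
    # assumption: the chance of recalling a negative detail decays with delay
    recall_p = max(0.1, 1.0 - 0.12 * days_delay)
    remembered = positives + [s for s in negatives if random.random() < recall_p]
    return sum(remembered) / len(remembered)

for delay in (1, 7):
    scores = [recalled_score(delay) for _ in range(1_000)]
    print(f"survey sent {delay} day(s) later -> avg score {sum(scores) / len(scores):.2f}")
```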

Statistical bias #4: Observer bias

Observer bias happens when the researcher subconsciously projects his/her expectations onto the research. It can come in many forms, e.g. (unintentionally) influencing the participants (in interviews and surveys) or doing some serious cherry picking (focusing on the statistics that support our hypothesis rather than the statistics that don't).
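
Before the examples, here's a small demo of the cherry-picking flavor (hypothetical metrics, pure noise): if you track enough unrelated metrics, the "best" one will look like a real effect even when nothing is going on:

```python
# A tiny cherry-picking demo: 20 hypothetical metrics, all pure noise.
# There is NO real difference between "control" and "treatment" anywhere,
# yet the best-looking metric will seem to support our hypothesis.
import random

random.seed(3)

n_metrics = 20
sample_size = 100

diffs = []
for metric_id in range(n_metrics):
    control = [random.gauss(0, 1) for _ in range(sample_size)]
    treatment = [random.gauss(0, 1) for _ in range(sample_size)]  # same distribution!
    diff = sum(treatment) / sample_size - sum(control) / sample_size
    diffs.append((diff, metric_id))

best_diff, best_id = max(diffs)
print(f"best of {n_metrics} noise metrics: #{best_id} 'improved' by {best_diff:+.3f}")
# An honest report would show ALL 20 metrics, not just the flattering one.
```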

Everyday example of observer bias:

Fake news! 🙂 It takes a very thorough and principled investigative journalist to be OK with rejecting her own hypothesis at the publication phase. E.g. if a journalist spends a month on an investigation to prove that the local crime rate is high because of careless police officers, most probably she will find a way to prove it – leaving aside the counterarguments and any serious statistical considerations.

Add to that other common journalism-related statistical biases, like funding bias (studies tend to support the financial sponsors' interests) or publication bias (faking or exaggerating research results to get published), and you'll see why I've concluded that reading any type of online media will never get me closer to any sort of truth about our world. So I'd rather suggest consuming trustworthy statistics than online media – or even better: find trustworthy raw data and do your own analyses to learn a "truer truth".

Online analytics related example of observer bias:

Observer bias can affect online research as well, e.g. when you are running usability tests. As a user researcher, you know your product very well (and maybe you like it too), so subconsciously you might have expectations. If you are a pro user experience researcher, you will know how not to influence your testers with your questions – but if you are new to the field, make sure you spend enough time preparing good, unbiased questions and scenarios. Maybe consider hiring a professional UX consultant to help.

Note: in my workshop feedback example, observer bias can occur if I send out the survey right after the workshop. Participants might be under the influence of the personal encounter – and they might not want to "hurt my feelings" with negative feedback. Workshop feedback forms should be sent 1 day after the workshop itself.

Statistical bias #5: Survivorship bias

Survivorship bias is a statistical bias type where the researcher focuses only on the part of the data set that already went through some kind of pre-selection process – and misses the data points that fell off during this process (because they are not visible anymore).

Everyday example of survivorship bias:

One of the most interesting stories of statistical bias: falling cats. There was a study written in 1987 about cats falling out of buildings. It stated that cats falling from greater heights had fewer injuries than cats falling from lower ones. Odd. The authors explained the phenomenon with terminal velocity: cats falling from higher than six stories reach their maximum velocity during the fall, so they start to relax and prepare for landing – and that's why they don't injure themselves that badly.

As ridiculous as it sounds, this theory turned out to be just as mistaken. 20 years later, the Straight Dope column pointed out that the cats falling from higher than six stories were more likely to have died, so people didn't take them to the veterinarian – they were simply never registered and never became part of the study. And the cats that fell from higher but survived were simply the lucky ones, which is why they had fewer injuries. Survivorship bias – literally. (I feel sorry for the cats, though.)
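
You can replay the fallacy in a few lines of Python (all numbers invented): true injuries keep growing with height, but once the fatal cases vanish from the vet's records, the recorded averages stop tracking reality:

```python
# A toy reconstruction of the falling-cats fallacy (all numbers invented).
# True injury severity grows with fall height, but fatal cases never reach
# the vet -- so the *recorded* averages stop tracking reality up high.
import random

random.seed(9)

all_cats = {}   # stories fallen -> every cat's injury score
recorded = {}   # stories fallen -> only the injuries registered at the vet
for _ in range(20_000):
    stories = random.randint(1, 12)
    injury = stories * random.uniform(0.5, 1.5)   # more height, worse injuries
    all_cats.setdefault(stories, []).append(injury)
    if injury < 9:   # assumption: above this severity the cat dies -> never registered
        recorded.setdefault(stories, []).append(injury)

print("stories | true avg injury | recorded avg injury")
for s in sorted(all_cats):
    true_avg = sum(all_cats[s]) / len(all_cats[s])
    rec = recorded.get(s, [])
    rec_avg = sum(rec) / len(rec) if rec else float("nan")
    print(f"{s:7d} | {true_avg:15.1f} | {rec_avg:19.1f}")
```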

Online analytics related example of survivorship bias:

Reading case studies. Case studies are super useful for giving you inspiration and ideas for your new projects. But remind yourself all the time that only success stories get published! You will never hear about the stories where someone used the exact same methods but failed.

Not so long ago I read a bunch of articles about exit-intent pop-ups. Every article declared that exit-intent pop-ups are great and brought +30%, +40%, +200% more newsletter subscriptions. In fact, they work pretty decently on my website too… But let's take a break for a moment. Does that mean exit-intent pop-ups will work for everyone? Isn't it possible that those who tested exit-intent pop-ups and found that they actually hurt the user experience, the brand or the page load time simply didn't write an article about the bad experience? Of course it's possible – nobody likes to write about unsuccessful experiment results… The point is: if you read a case study, think about it, research it and test it – and decide based on hard evidence whether it's the right solution for you or not.
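
Here's the same filter in code form (hypothetical numbers): even if pop-up lifts are centered on zero across sites, the case studies you actually get to read look spectacular:

```python
# A quick sketch of the case-study filter (hypothetical numbers).
# True pop-up "lifts" are centered on zero across 1,000 sites, but only
# big wins (> +20%) get written up as case studies.
import random

random.seed(5)

true_lifts = [random.gauss(0, 15) for _ in range(1_000)]   # % lift per site
published = [lift for lift in true_lifts if lift > 20]     # only successes get blogged

print(f"avg lift across ALL sites:          {sum(true_lifts) / len(true_lifts):+.1f}%")
print(f"avg lift in PUBLISHED case studies: {sum(published) / len(published):+.1f}% "
      f"({len(published)} of {len(true_lifts)} sites)")
```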

4 more statistical bias types and some suggestions to avoid them…

This is just the beginning! Next week I'll continue this article with 4 more statistical bias types that every data scientist and analyst should know about. And the week after, I'll give you some practical suggestions on how to overcome these!
UPDATE: here’s Statistical Bias Types Explained – part 2

Stick with me and subscribe to my weekly newsletter (no spam, just 100% useful data content)! And if you have any comments, let me know below!

Cheers,
Tomi
