Correlation: What is it? How to calculate it? .corr() in pandas

Exploratory data analysis (EDA), machine learning projects, economic and financial analysis, scientific research, and even individual newspaper articles on all sorts of topics involve examining the correlation between variables.

  • But what is correlation?
  • How do we use it?
  • Can we measure it?
  • Can we visualize it?
  • What is causation?
  • How can it help your business?

You’ll find the answers to all those questions in this article!


The author of this article is Levente Kulcsar from Sweden. He creates awesome data science content on his Twitter account.

What is correlation?

According to Wikipedia: “Correlation refers to the degree to which a pair of variables are linearly related.” [1]

In plain English: correlation is a measure of a statistical relationship between two sets of data. 

Let’s call those two datasets X and Y for a little example.

Variables X and Y are positively correlated if:

  • high values of X go with high values of Y
  • low values of X go with low values of Y

Variables X and Y are negatively correlated if:

  • high values of X go with low values of Y
  • low values of X go with high values of Y
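As a quick illustration of the two definitions above, here is a minimal sketch with made-up toy values (they are not taken from any real dataset). The variable names `y_up` and `y_down` are just labels chosen for this example.

```python
import pandas as pd

# Made-up toy values, purely for illustration.
x = pd.Series([1, 2, 3, 4, 5, 6])
y_up = pd.Series([10, 12, 15, 19, 20, 26])    # tends to rise as x rises
y_down = pd.Series([30, 27, 21, 20, 14, 11])  # tends to fall as x rises

# Series.corr() uses Pearson's coefficient by default.
r_up = x.corr(y_up)
r_down = x.corr(y_down)

print(f"x vs y_up:   {r_up:.2f}")    # positive sign: high goes with high
print(f"x vs y_down: {r_down:.2f}")  # negative sign: high goes with low
```

The sign of the result tells us the direction: positive for the first pair, negative for the second.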

Note: correlation does not imply causation. In other words, just because two things are correlated, it doesn’t necessarily mean that one causes the other. More on this later.

Here are visualizations of correlations. (Stay tuned, we will learn how to create these scatterplots!)

positive correlation

In this example the two variables are skill and attacking. It is clearly visible that high skill values go with high attacking values, so they are positively correlated.

The next scatterplot is a visual presentation of a negative correlation (not so strong however):

negative correlation

In this case high height_cm values go with low movement values.

How can we measure correlation?

To measure correlation, we usually use the Pearson correlation coefficient, which gives an estimate of the correlation between two variables.

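To compute Pearson’s coefficient, we multiply the deviations from the mean for X by those for Y, sum the products, and divide by n − 1 times the product of the standard deviations. Here is the formula: [2]

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$$

Here \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations of X and Y.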

Note: as always – it’s important to understand how you calculate Pearson’s coefficient – but luckily, it’s implemented in pandas, so you don’t have to type the whole formula into Python all the time, you can just call the right function… more about that later.
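To make the calculation concrete, here is a minimal sketch that types the formula out once and compares the result with pandas. The helper name `pearson_r` and the toy values are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r: sum of deviation products / ((n - 1) * s_x * s_y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Multiply the deviations from the mean, pair by pair.
    deviation_products = (x - x.mean()) * (y - y.mean())
    # ddof=1 gives the sample standard deviation, which matches the (n - 1).
    return deviation_products.sum() / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Made-up toy values, purely for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 15, 19, 20, 26]

manual = pearson_r(x, y)
from_pandas = pd.Series(x).corr(pd.Series(y))

print(f"manual: {manual:.4f}")
print(f"pandas: {from_pandas:.4f}")
```

Both numbers should agree up to floating-point rounding, which is a handy sanity check that we understand what the library does.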

Pearson’s correlation coefficient is good to measure linear correlation. 

Wait! Do we have nonlinear correlation as well? Yes, we do, so it’s time to define the difference.

  • Linear correlation: The correlation is linear if the ratio of change is constant. [3] In the simplest (proportional) case, if we double X, Y doubles as well.
  • Nonlinear correlation: If the ratio of change is not constant, we are facing nonlinear correlation. [3] To measure a nonlinear but monotonic relationship (Y consistently rises or consistently falls as X rises), we can use Spearman’s rank correlation coefficient. More on this here: [4]
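Here is a small sketch of the difference, assuming a made-up relationship Y = exp(X). Spearman’s coefficient is Pearson’s coefficient computed on the ranks of the values, so the example calculates it that way instead of calling a dedicated function.

```python
import numpy as np
import pandas as pd

# Made-up example: y always grows with x, but not along a straight line.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

# Pearson works on the raw values and measures linear association.
pearson = x.corr(y)

# Spearman = Pearson on the ranks, so only the ordering of the values matters.
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Because Y rises every time X rises, the ranks line up perfectly and Spearman’s coefficient is 1, while Pearson’s coefficient stays clearly below 1.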

So, back to linear correlation and Pearson’s coefficient. The coefficient always has a value between −1 and 1:

  • -1 means perfect negative linear correlation
  • +1 means perfect positive linear correlation
  • 0 means no linear dependency between variables.

A few examples from a Wikipedia article:

source: https://en.wikipedia.org/wiki/Correlation#/media/File:Correlation_examples2.svg

What does Pearson’s correlation coefficient tell us?

  • the “noisiness” of the relationship,
  • the direction of the relationship

What does the coefficient not tell us?

  • The slope of the relationship
  • Whether there is a relationship that is not linear. (E.g. in the image from the Wikipedia article above, the bottom row clearly shows some kind of relationship, but since those relationships are not linear, Pearson’s correlation coefficient cannot capture them.)
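Both limitations can be checked in a few lines. This is a minimal sketch on made-up, noise-free data: two straight lines with very different slopes, and a parabola.

```python
import numpy as np

# Made-up x values, symmetric around zero.
x = np.linspace(-5, 5, 101)

# Two perfectly linear relationships with very different slopes.
r_gentle = np.corrcoef(x, 0.1 * x)[0, 1]
r_steep = np.corrcoef(x, 10 * x)[0, 1]

# A perfect but nonlinear relationship: y is fully determined by x.
r_parabola = np.corrcoef(x, x ** 2)[0, 1]

print(f"slope 0.1: r = {r_gentle:.2f}")
print(f"slope 10:  r = {r_steep:.2f}")
print(f"y = x^2:   r = {abs(r_parabola):.2f}")
```

The two lines get the same coefficient even though one is a hundred times steeper, and the parabola gets a coefficient of practically zero even though Y depends entirely on X.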

Correlation vs Causation

It is important to understand that if two values are correlated it doesn’t mean that one causes the other. 

Correlation does not imply causation – as they say.

It only means that X and Y move together. But this correlation can be due to:

  1. Causation
  2. Third variable
  3. Coincidence

What is causation?

Causation means that there is a cause-and-effect link between X and Y. The result of this link is that if a change in X occurs, a change in Y will occur as well.

A really simple example:

(Generally) when someone exercises more, they will gain more muscle.

But when we think about causation we need to be careful, because some problems can emerge.

Third variable problem

For example, we usually see a positive correlation between shark attacks and ice cream sales. Can we conclude that one causes the other? Of course not. Ice cream sales don’t cause shark attacks, and vice versa.

Instead, a third variable enters the conversation: temperature. 

When it’s warmer out, more people buy ice cream and more people swim in the ocean. [5]

This is a typical example of the third variable problem. The third variable problem means that X and Y are correlated, but a third variable Z causes the changes in both X and Y.
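We can imitate this situation with a small simulation. Everything below is synthetic: the numbers, the column names and the helper `residuals` are assumptions made for this sketch, not real measurements. Temperature drives both ice cream sales and the number of swimmers, and the two never influence each other directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Synthetic data: Z (temperature) drives both X and Y.
temperature = rng.normal(20, 5, n)
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream_sales": 10 * temperature + rng.normal(0, 10, n),
    "swimmers": 3 * temperature + rng.normal(0, 5, n),
})

# X and Y look strongly correlated...
raw_r = df["ice_cream_sales"].corr(df["swimmers"])

def residuals(y, z):
    """What is left of y after removing a straight-line fit on z."""
    slope, intercept = np.polyfit(z, y, 1)
    return y - (slope * z + intercept)

# ...but once the effect of temperature is removed, almost nothing remains.
adjusted_r = residuals(df["ice_cream_sales"], df["temperature"]).corr(
    residuals(df["swimmers"], df["temperature"])
)

print(f"raw correlation:                  {raw_r:.2f}")
print(f"after accounting for temperature: {adjusted_r:.2f}")
```

The raw correlation is high, while the correlation of the leftovers is close to zero, which is what we would expect when a third variable explains the link.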

Directionality Problem

Another thing we need to consider is the direction of the relationship. 

Suppose we observe that aggressive people watch lots of violence on TV.

But does violence on TV make them aggressive? Or are they aggressive, and hence watch violence on TV?

We cannot tell for sure.

The directionality problem means that we know X and Y are correlated and we assume that there is a link between them, but we don’t know whether X causes Y or Y causes X.

Spurious Correlation

A spurious correlation is when two variables are related through a hidden third variable or simply by coincidence. [7]

You can find some funny examples of spurious correlations here: [6]

source: https://tylervigen.com/spurious-correlations

Correlation in Pandas

Now it is time to code!

First we need to import packages and our data. In this exercise we will use Kaggle’s FIFA 22 top 650 players. 

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('../input/top-650-fifa-22-players-simplified/Top_650_FIFA.csv')

This dataset contains player details from the well-known soccer computer game. We will mainly focus on their skills, such as power, mentality, passing, shooting, etc. Each player has a rating out of 100 in these categories.

Note: you can learn Pandas basics and how to load a dataset into pandas, here: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

Correlation matrix – How to use .corr()

The easiest way to check the correlation between variables is to use the .corr() method.

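data.corr() will give us the correlation matrix for the dataset. (In pandas 2.0 and later, this call raises an error if the DataFrame contains non-numeric columns, such as player names, unless you pass numeric_only=True: data.corr(numeric_only=True).) Here is a small sample from the big table: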

Note: If you want to learn in detail how to read this matrix, check out this article: [8]
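Here is a minimal sketch of how .corr() treats text columns, using a tiny made-up DataFrame (the names and ratings are invented for illustration, they are not from the FIFA dataset). The numeric_only parameter is available in recent pandas versions.

```python
import pandas as pd

# Tiny made-up table with one text column and two numeric columns.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "pace": [90, 85, 70, 60],
    "passing": [70, 75, 80, 88],
})

# Option 1: ask pandas to skip the non-numeric columns.
corr_all = toy.corr(numeric_only=True)

# Option 2: select the numeric columns yourself before calling .corr().
corr_selected = toy[["pace", "passing"]].corr()

print(corr_all)
```

Both options return the same 2 × 2 matrix. Selecting the columns explicitly has the extra benefit that it documents which variables we actually care about.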

We will use only some of the columns for better understanding. Also, columns like the index (Unnamed: 0) and club_jersey_number are not relevant to us. We do not anticipate any connection between a jersey number and the player’s skills.

We will define a variable with column names and apply .corr() only on those columns:

columns = ['age', 'height_cm', 'weight_kg', 'skill_moves',
'pace','shooting','passing',
'dribbling','defending','physic',
'attacking','skill','movement','power']
data[columns].corr()

Again, here is part of the table:

correlation matrix 2

Note: by default, .corr() uses Pearson’s coefficient; we can change that by defining the method inside the parentheses. Use method='spearman' to get Spearman’s coefficient, which also picks up nonlinear (monotonic) correlation.
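To see what the method parameter changes, here is a short sketch on a made-up DataFrame. The column names are hypothetical; x_cubed is simply x raised to the third power, so the relationship is perfectly monotonic but not linear.

```python
import pandas as pd

# Made-up values, purely for illustration.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "x_cubed": [1, 8, 27, 64, 125, 216],
    "other": [3, 1, 4, 1, 5, 9],
})

pearson = toy.corr()                     # default: method='pearson'
spearman = toy.corr(method="spearman")   # rank-based alternative

print(pearson.round(2))
print(spearman.round(2))
```

For the x and x_cubed pair, Spearman’s coefficient is 1 because the ordering of the values matches perfectly, while Pearson’s coefficient is lower because the points do not lie on a straight line.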

Coloring the correlation matrix (so it’s easier to read)

Since the matrix contains many numbers, it is hard to read. For better understanding, we can add some coloring. 

In this example I used a gradient background called coolwarm, by adding .style.background_gradient(cmap='coolwarm') to the end of the code defined earlier.

The result for:

data[columns].corr().style.background_gradient(cmap='coolwarm')

will be something like this:

correlation matrix heatmap

From the table presented this way, you can immediately find the negative and positive correlations.

Using these colors it is also easy to spot that the correlation matrix contains every value twice. It is mirrored on the diagonal. 
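Since every pair appears twice, it can be handy to list each pair only once, strongest relationship first. The following is a minimal sketch on synthetic data; the columns a to d are made up, and b is deliberately built from a so that one pair stands out.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic columns: b is almost a copy of a, c and d are unrelated noise.
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
    "d": rng.normal(size=200),
})
corr = toy.corr()

# The matrix equals its own transpose, i.e. it is mirrored across the diagonal.
is_symmetric = np.allclose(corr.values, corr.values.T)

# combinations() yields each unordered pair of columns exactly once.
pairs = [(c1, c2, corr.loc[c1, c2]) for c1, c2 in combinations(corr.columns, 2)]
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for c1, c2, r in pairs:
    print(f"{c1} vs {c2}: {r:+.2f}")
```

With four columns we get six unique pairs instead of sixteen cells, and the strongest pair ends up at the top of the list.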

To declutter the table even further, we will use seaborn and a mask.

Note: for a better understanding of how we use the mask in this example, see [9].

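import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
corrmat = data[columns].corr()
# Boolean mask: cells where the mask is True are hidden.
# We hide the upper triangle (including the diagonal), because the matrix
# is symmetric and those cells only repeat the lower triangle.
mask = np.triu(np.ones_like(corrmat, dtype=bool))
sns.heatmap(corrmat,
            vmax=1, vmin=-1,                        # fix the color scale to the full range of r
            annot=True, annot_kws={'fontsize': 7},  # print the coefficient in every cell
            mask=mask,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))
plt.show()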
seaborn corr pandas heatmap
corr matrix heatmap filtered pandas seaborn


Scatterplots

We can visualize a pair of variables and check if they are correlated or not on scatter plots as well.

In Pandas we just need to use .plot.scatter() and define our X and Y variables:

data.plot.scatter(x='attacking',y='skill')
positive correlation

Note: Did you notice that this is the chart that we have already discussed at the beginning? 

We know from the matrix that the correlation coefficient for the two variables is 0.95, so they are strongly, positively correlated.

We just need to change the x and y variable names to recreate the example chart for negative correlation. (corr coefficient is -0.7):

data.plot.scatter(x='movement',y='height_cm')
negative correlation

What about no correlation? What does that look like?

Here is an example (the correlation coefficient is 0.1):

data.plot.scatter(x='passing',y='pace')
scatter plot corr

You can also use seaborn to visualize more than one pair of variables at once.

Calling sns.pairplot() will create a matrix of scatterplots.

More on pairplots here [10].

columns = ['age', 'height_cm', 'weight_kg', 'movement','pace']
sns.pairplot(data[columns])
correlation matrix visualized scatter plots histograms

How can correlation help your business?

Correlation is widely used in real-life decision making. You will find it in marketing, finance, sales, and many other domains.

A few benefits:

  • Pattern recognition. In the big data world looking at millions of rows of raw data will not tell you anything about the business. Using existing information for better decision making will be crucial in the future. It can reveal new business opportunities, give insights about existing processes, and help to communicate clearly. Recognizing patterns is one of the main goals of data science and correlation analysis can help with that.
  • Financial decision making – investment decisions. Diversifying is essential. Investing in negatively correlated sectors can help you mitigate risk. 
    For example: if the airline industry is negatively correlated with the social media industry, the investor may choose to invest in a social media stock. If a negative event affects one of those industries, the other sector will be a safer place for the money. [11]
  • Projections. If a company finds a positive correlation between two variables and has predictions for one of them, it can try to make predictions for the other as well.
    For example: Company X finds a positive correlation between the number of tourists in city Y and its sales. A 10% rise in visitors is predicted for city Y in the coming year, so Company X can anticipate an increase in sales as well. Of course, when it comes to predictions, one should always consider the above-mentioned correlation-causation issue.
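The diversification point from the list above can be put into numbers with the standard two-asset portfolio variance formula: w1²σ1² + w2²σ2² + 2·w1·w2·ρ·σ1·σ2, where ρ is the correlation between the two assets. The weights and volatilities below are made-up values and the function name `portfolio_volatility` is chosen for this sketch; it is an illustration, not investment advice.

```python
import math

def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    return math.sqrt(variance)

# Hypothetical setup: half the money in each asset, both with 20% volatility.
results = {}
for rho in (0.8, 0.0, -0.8):
    results[rho] = portfolio_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f}: portfolio volatility {results[rho]:.1%}")
```

With identical assets and weights, the only thing that changes between the three lines is the correlation, and the lower it is, the lower the volatility of the combined portfolio.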

All of the above-mentioned activities can enhance decision-making, reduce risk, and reveal new opportunities through correlation.

Cheers,
Levi Kulcsar

Sources

[1]: https://en.wikipedia.org/wiki/Correlation

[2]: Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck (O’Reilly).

[3]: https://www.emathzone.com/tutorials/basic-statistics/linear-and-non-linear-correlation.html

[4]: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

[5]: https://www.statology.org/third-variable-problem/

[6]: http://www.tylervigen.com/spurious-correlations

[7]: https://www.scribbr.com/methodology/correlation-vs-causation/

[8]: https://www.statology.org/how-to-read-a-correlation-matrix/

[9]: https://www.kdnuggets.com/2019/07/annotated-heatmaps-correlation-matrix.html#

[10]: https://twitter.com/levikul09/status/1542051235510902784?s=20&t=vPWeG5_Yhi3AJ7RDo4ZsiA

[11]: https://www.investopedia.com/terms/c/correlation.asp
