Statistics For Data Science Course

All About Covariance, Correlation and Causation

Are Two Events Related?

In data science, we constantly explore patterns and relationships among events. Let’s ask some questions.
  • Is the sex ratio of a district related to its literacy rate?
  • Is the property rate affected by the distance from the railway station?
  • Does advertising a product affect its sales?

As these questions show, when designing a model we want to know whether the output is affected by a change in the input and, if so, by how much. Let’s try to answer these questions.

What is Covariance?

Let’s say we are studying the relationship between two variables X and Y. In this case, there are two questions before us.
  • When X gets bigger, does Y get bigger, or does it get smaller? (direction)
  • Does Y get a lot bigger/smaller, or just a little bit? (strength)
Covariance answers the first question. It is unreliable for the second, because its magnitude depends on the units in which X and Y are measured. Correlation answers both questions: it starts from the covariance and rescales it by the standard deviations of the two variables.

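Covariance measures the direction of the linear relationship between two random variables. It does this by averaging the product of each variable’s deviation from its own mean: the product is positive when both variables are on the same side of their means and negative when they are on opposite sides.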

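Mathematically, the covariance of a population of N paired observations (x_i, y_i) is given by

$$\operatorname{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$

where \mu_X and \mu_Y are the population means of X and Y.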

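To make the definition concrete, here is a minimal sketch that computes the population covariance as the average product of deviations from the means. The function name population_covariance and the numbers are illustrative assumptions, not part of the course material.

```python
import numpy as np

def population_covariance(x, y):
    """Population covariance: mean of the products of deviations from the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    return float(np.mean((x - x.mean()) * (y - y.mean())))

# Illustrative values only (not from any real dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = population_covariance(x, y)
print(cov_xy)  # 4.0

# Cross-check against NumPy; bias=True divides by N (population formula)
print(np.cov(x, y, bias=True)[0, 1])
```

The hand-written function and np.cov with bias=True should agree, since both divide by N rather than N - 1.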

Covariance Calculation Using Excel

In Excel (Office 365), we use the following functions to calculate covariance.
  • For population data, COVARIANCE.P(array X, array Y)
  • For sample data, COVARIANCE.S(array X, array Y)

Covariance Calculation Using Python

In Python, we can calculate the covariance using the pandas cov() function as below.

import numpy as np
import pandas as pd

# Set a seed so the example is reproducible
np.random.seed(42)
df = pd.DataFrame(np.random.randint(low=0, high=20, size=(5, 2)),
                  columns=['X', 'Y'])
# DataFrame.cov() returns the sample covariance matrix (it divides by N - 1)
df[['X', 'Y']].cov()
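
Excel offers separate functions for population and sample covariance. As a rough sketch of the same distinction in NumPy, the ddof argument of np.cov controls the divisor (N - ddof). The values below are made up for illustration.

```python
import numpy as np

# Illustrative values only
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# ddof=1 divides by N - 1 (sample covariance, the NumPy default)
sample_cov = np.cov(x, y, ddof=1)[0, 1]

# ddof=0 divides by N (population covariance)
population_cov = np.cov(x, y, ddof=0)[0, 1]

print(sample_cov, population_cov)
```

With only four observations the two results differ noticeably; the gap shrinks as the number of observations grows.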

Interpretation of Covariance

  • A positive value of covariance means that the variables are positively related.
  • A negative value of covariance means that the variables are negatively related.

The following diagram explains the covariance.

[Figure: covariance interpretation]
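
A small sketch of how the sign of the covariance reflects the direction of the relationship. The variable names and values are hypothetical and chosen only so that one pair rises together and the other moves in opposite directions.

```python
import numpy as np

# Hypothetical values for illustration
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([52, 58, 61, 70, 74])   # rises as hours_studied rises
hours_of_tv = np.array([9, 7, 6, 4, 2])       # falls as hours_studied rises

cov_positive = np.cov(hours_studied, exam_score)[0, 1]
cov_negative = np.cov(hours_studied, hours_of_tv)[0, 1]

print(cov_positive > 0)  # True: the variables move in the same direction
print(cov_negative < 0)  # True: the variables move in opposite directions
```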

Issues with Covariance

Covariance suffers from two major drawbacks. Its value is expressed in the product of the units of the two variables, which makes it hard to interpret, and it changes whenever the scale of those units changes. These issues are summarized below:

  • Problem of units – The magnitude of the covariance grows with the magnitude of the X and Y values. The same data recorded in different units (for example, Meters versus centimeters) gives a different covariance.
  • Problem of scale – How do we compare a dataset in Rupees with a dataset in Meters? A covariance that looks weak in one dataset may count as strong in another dataset with a different scale.

To overcome the problem of scale and problem of units, we need to have a metric which is independent of these variations. This is where correlation comes in.
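
The unit dependence is easy to see in a short sketch. Here the same hypothetical heights are expressed in meters and in centimeters; the covariance with weight changes by a factor of about 100, while the correlation coefficient (computed with np.corrcoef) stays the same.

```python
import numpy as np

# Hypothetical measurements for illustration
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])    # heights in meters
weight_kg = np.array([50.0, 58.0, 66.0, 71.0, 85.0])   # weights in kilograms

height_cm = height_m * 100  # the same heights in centimeters

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_cm / cov_m)    # approximately 100: covariance depends on the unit
print(corr_m, corr_cm)   # the two correlations match: correlation does not
```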

What is a Correlation?

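Correlation is covariance normalized by the standard deviations of the two variables, which makes it unit-free. One of the most popular correlation measures is Pearson’s correlation. It is defined in terms of the correlation coefficient given by

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where \sigma_X and \sigma_Y are the standard deviations of X and Y.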

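As a minimal sketch of this normalization, the function below divides the covariance by the product of the two standard deviations and compares the result with np.corrcoef. The function name pearson_r and the data are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return float(cov_xy / (x.std() * y.std()))          # np.std defaults to ddof=0

# Illustrative values only
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 6, 5]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # the two values agree up to floating-point rounding
```

The population formulas are used for both the covariance and the standard deviations; using the sample versions (N - 1) for both would give the same r, because the divisor cancels.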

Interpretation of Pearson’s correlation coefficient

Pearson’s correlation coefficient describes both the direction and the strength of the linear relationship between two variables. Its value ranges from -1 to +1, where

  • -1 = total negative linear correlation
  • 0 = no linear correlation
  • +1 = total positive linear correlation
The following diagrams and tables depict the strength of correlation based on Pearson’s coefficient values.

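Note that a coefficient of 0 means no linear correlation, not no relationship at all. The sketch below uses constructed data where y is completely determined by x, yet Pearson’s coefficient is zero because the relationship is not linear.

```python
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5 (symmetric around zero)
y = x ** 2             # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-12)  # True: no linear correlation despite full dependence
```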

Calculation of Pearson’s Correlation Coefficient using Excel

To calculate Pearson’s correlation coefficient in Excel, the following function can be used:

PEARSON(X array, Y array)

Calculation of Pearson’s Correlation Coefficient using Python

Pearson’s correlation coefficient can be calculated as follows in Python. In the code below, we have determined the correlation between literacy rate and sex ratio from data of 640 Indian districts.

from scipy.stats import pearsonr
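
The district dataset itself is not reproduced here, so the following is only a sketch of how the call might look. The lists literacy_rate and sex_ratio hold made-up values for a handful of hypothetical districts, not the real data.

```python
from scipy.stats import pearsonr

# Made-up values for illustration only; these are NOT the district data
literacy_rate = [61.2, 68.5, 72.1, 75.4, 80.3, 84.9, 88.0]   # percent
sex_ratio = [912, 925, 931, 940, 952, 948, 967]               # females per 1000 males

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(literacy_rate, sex_ratio)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

With real data, the two columns of the dataset would be passed in place of these lists.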

Correlation does not mean Causation

The correlation coefficient only tells us whether two variables are linearly related. It does not tell us that a change in one variable is caused by a change in the other; the relationship may or may not be causal. For example, Ram’s age increases over time, and his brother Shyam’s age increases over the same period. The two ages are correlated, but neither causes the other: both are driven by the passage of time.
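
The age example can be sketched in a few lines. The birth years below are assumptions made up for illustration; the point is that the two ages are perfectly correlated although neither causes the other.

```python
import numpy as np

years = np.arange(2000, 2011)
ram_age = years - 1990     # assumes Ram was born in 1990 (hypothetical)
shyam_age = years - 1994   # assumes Shyam was born in 1994 (hypothetical)

r = np.corrcoef(ram_age, shyam_age)[0, 1]
print(np.isclose(r, 1.0))  # True: perfect correlation, but no causal link
```

Both series are driven by a third factor, the calendar year, which is why they move together.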

The following video explains these concepts in detail.