- 1 Are Two Events Related?
- 2 What is Covariance?
- 3 Interpretation of Covariance
- 4 Issues with Covariance?
- 5 What is a Correlation?
- 6 Interpretation of Pearson’s correlation coefficient
- 7 Calculation of Pearson’s Correlation Coefficient using Excel
- 8 Calculation of Pearson’s Correlation Coefficient using Python
- 9 Correlation does not mean Causation
Are Two Events Related?
- Is the sex ratio of a district related to its literacy rate?
- Is the property rate affected by the distance from the railway station?
- Does advertising a product affect its sales?
As these questions suggest, when designing a model we want to know whether the output is affected by a change in the input and, if so, by how much. Let’s try to answer these questions.
What is Covariance?
- When X gets bigger, does Y get bigger, or does it get smaller? (direction)
- Does Y get a lot bigger/smaller, or just a little bit? (strength)
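Covariance measures the directional relationship between two random variables. It does this by comparing how each variable deviates from its own mean: if X and Y tend to be above (or below) their means at the same time, the covariance is positive; if one tends to be above its mean when the other is below, the covariance is negative.
Mathematically, the covariance of a population is given as below.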
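For reference, the standard textbook definitions can be written as follows. The notation is chosen here for illustration: N is the population size, n the sample size, \mu_X and \mu_Y the population means, and \bar{x} and \bar{y} the sample means. The first expression is the population covariance, the second the sample covariance.

```latex
\operatorname{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)
\qquad
s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```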
Covariance Calculation Using Excel
- While working with population data, use COVARIANCE.P(array X, array Y)
- While working with sample data, use COVARIANCE.S(array X, array Y)
In Python, we can calculate the covariance using the cov() method of a pandas DataFrame, as below.
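import pandas as pd
import numpy as np
# Setting a seed so the example is reproducible (the seed value is arbitrary)
np.random.seed(0)
df = pd.DataFrame(np.random.randint(low=0, high=20, size=(5, 2)),
                  columns=['X', 'Y'])
# Covariance matrix of X and Y (pandas computes the sample covariance, dividing by n - 1)
print(df.cov())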
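To make the difference between the population and sample versions concrete, here is a minimal pure-Python sketch. The function name `covariance` and its `sample` flag are illustrative choices, not part of any library, and the input values are made up.

```python
def covariance(x, y, sample=False):
    """Covariance of two equal-length sequences.

    sample=False divides by n (population, like COVARIANCE.P);
    sample=True divides by n - 1 (sample, like COVARIANCE.S).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the respective means
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - 1 if sample else n)


# Made-up illustrative values
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
pop_cov = covariance(x, y)
sample_cov = covariance(x, y, sample=True)
print(pop_cov, sample_cov)
```

The sample version is always slightly larger in magnitude than the population version because it divides by n - 1 instead of n; the gap shrinks as the number of observations grows.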
Interpretation of Covariance
- A positive value of covariance means that the variables are positively related: they tend to move in the same direction.
- A negative value of covariance means that the variables are negatively related: they tend to move in opposite directions.
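The two cases above can be checked on a small made-up example. This sketch assumes NumPy is available and uses np.cov, which returns the covariance matrix of its inputs; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_rising = 2 * x          # grows as x grows
y_falling = 12 - 2 * x    # shrinks as x grows

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
cov_rising = np.cov(x, y_rising)[0, 1]
cov_falling = np.cov(x, y_falling)[0, 1]
print(cov_rising, cov_falling)
```

The first value comes out positive and the second negative, matching the sign rule in the list above.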
The following diagram explains the covariance.
Issues with Covariance?
Covariance suffers from two major drawbacks. Its value is expressed in the product of the units of the two variables, which makes it hard to interpret on its own, and it changes whenever the scale of measurement changes. These issues are summarized below:
- Problem of units – The magnitude of covariance depends on the units in which X and Y are measured. The same data expressed in smaller units (for example, centimetres instead of metres) gives larger numbers and therefore a larger covariance.
- Problem of scale – How do we compare a dataset in Rupees with a dataset in Meters? A covariance that looks weak in one dataset may correspond to a strong relationship in another dataset measured on a different scale.
To overcome the problem of scale and problem of units, we need to have a metric which is independent of these variations. This is where correlation comes in.
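A small sketch can illustrate the point, assuming NumPy and entirely made-up height and weight values. Converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

# Made-up heights and weights for five people
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 60.0, 67.0, 75.0, 88.0])
height_cm = height_m * 100  # same data, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)    # covariance scales with the unit
print(corr_m, corr_cm)  # correlation does not
```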
What is a Correlation?
Interpretation of Pearson’s correlation coefficient
Pearson’s correlation coefficient describes the direction as well as the strength of the linear relationship between two variables. Its value ranges from -1 to +1, where
- -1 = total negative linear correlation
- 0 = no linear correlation
- +1 = total positive linear correlation
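For reference, the standard definition of Pearson’s coefficient divides the covariance by the product of the two standard deviations. This cancels the units and bounds the result between -1 and +1. The notation below is chosen here for illustration, with \sigma_X and \sigma_Y the standard deviations and \bar{x}, \bar{y} the means.

```latex
r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```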
Calculation of Pearson’s Correlation Coefficient using Excel
To calculate Pearson’s correlation coefficient in Excel, the following formula can be used:
PEARSON(X array, Y array)
Calculation of Pearson’s Correlation Coefficient using Python
Pearson’s correlation coefficient can be calculated in Python as follows. In the code below, we determine the correlation between literacy rate and sex ratio using data from 640 Indian districts.
from scipy.stats import pearsonr
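The district dataset itself is not reproduced here, so the following is only a minimal sketch on made-up numbers. The variable names `literacy_rate` and `sex_ratio` and all of their values are hypothetical; they show how pearsonr is called, not the actual result for the 640 districts.

```python
from scipy.stats import pearsonr

# Hypothetical values for six districts (NOT the real census data)
literacy_rate = [61.2, 68.5, 72.1, 75.4, 80.3, 88.7]   # percent
sex_ratio = [905, 918, 930, 927, 951, 968]             # females per 1000 males

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(literacy_rate, sex_ratio)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

With real data, the two columns would typically be read from a file into a pandas DataFrame and passed to pearsonr in the same way.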