Statistics For Data Science Course

Kolmogorov-Smirnov Test

Do two data distributions differ?

In statistical analysis, we often need to check whether two distributions are alike or different and, if they differ, how significant that difference is. For example, we may want to know whether the behavior of two groups before and after medication differs significantly. In other instances, we may want to know whether a sample comes from a known distribution, e.g., whether the sample data follow a normal distribution or not. The Kolmogorov-Smirnov test (KS-test) is a handy tool for studying such differences, as the following sections show.

What is the Kolmogorov-Smirnov test (KS-test)?

The Kolmogorov-Smirnov test is a nonparametric test of whether two data distributions differ significantly. Because it makes no assumption about the underlying data distribution, it can be a better choice than tests such as the t-test when the assumptions of those tests (for example, normality) do not hold. In the KS test, we define the null and alternate hypotheses as below.
  • H0: The two samples are drawn from the same population distribution
  • H1: The two samples are not drawn from the same population distribution
To test these hypotheses, we use the KS-test statistic, which is explained below.

KS-test Statistic

The KS test statistic measures the maximum absolute difference between the empirical cumulative distribution functions (ECDFs) of the two distributions under study. We can define the ECDF as below.

F_{n}(x) = \frac{k}{n}

where k = 1, 2, 3, …, n is the number of observations less than or equal to x, and x takes the values x1, x2, …, xn, sorted in ascending order. If we want to know whether a data distribution with ECDF Fn(x) differs from a known distribution (normal, uniform, etc.) with cumulative distribution function (CDF) F(x), then the KS-test statistic is given as below.

D_{n} = \sup_{x} \left| F_{n}(x) - F(x) \right|
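To make the definition concrete, here is a minimal sketch that computes the one-sample KS statistic directly from the ECDF. The function name `ks_statistic`, the seed, and the sample size are illustrative assumptions, not part of the original article. Because the ECDF jumps at every data point, the largest gap is checked just before and at each sorted observation.

```python
import numpy as np
from scipy import stats


def ks_statistic(sample, cdf):
    """Largest absolute gap between the ECDF of `sample` and a reference CDF.

    `ks_statistic` is a hypothetical helper written for illustration.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f = cdf(x)                            # reference CDF F(x) at each sorted point
    ecdf_at = np.arange(1, n + 1) / n     # ECDF value at x_k, i.e. k/n
    ecdf_before = np.arange(0, n) / n     # ECDF value just below x_k, i.e. (k-1)/n
    d_plus = np.max(ecdf_at - f)          # ECDF above the reference CDF
    d_minus = np.max(f - ecdf_before)     # ECDF below the reference CDF
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
sample = rng.normal(size=200)             # assumed sample: 200 standard-normal draws
d_manual = ks_statistic(sample, stats.norm.cdf)
print(f"KS statistic computed by hand: {d_manual:.4f}")
```

The value should agree with the statistic reported by `scipy.stats.kstest(sample, "norm")`, which is a convenient way to check the hand computation.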


The calculation of the KS test statistic is depicted in the following figure.

Usually, the KS test is carried out using software packages, and the corresponding p-value determines whether the null hypothesis is rejected (i.e., the two distributions are significantly different) or not. A low p-value (say, less than 0.05) indicates that the two data distributions are significantly different.
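The decision rule above can be sketched for the two-sample case with `scipy.stats.ks_2samp`. The two groups here are synthetic data invented for illustration (normal samples whose means differ by one standard deviation), and the 0.05 threshold is the conventional choice mentioned above rather than a fixed rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                        # arbitrary seed
group_a = rng.normal(loc=0.0, scale=1.0, size=500)     # assumed "before" group
group_b = rng.normal(loc=1.0, scale=1.0, size=500)     # assumed "after" group, shifted mean

result = stats.ks_2samp(group_a, group_b)              # two-sided two-sample KS test
alpha = 0.05                                           # significance level
reject_null = result.pvalue < alpha

print(f"statistic = {result.statistic:.4f}, p-value = {result.pvalue:.3g}")
print("Reject H0: the samples differ" if reject_null else "Fail to reject H0")
```

With a shift this large and 500 observations per group, the test should reject the null hypothesis; with smaller shifts or smaller samples it may not.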

KS-test example in Python

Let’s say we want to know whether a random sample differs significantly from the normal distribution or not. We will use the scipy.stats library of Python for this purpose. In the following example, as the p-value is less than 0.05, we will conclude that both the distributions are different.
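A minimal sketch of such an example is given here. The article's original code is not available, so the choice of a uniform sample on [0, 1], the sample size, and the seed are assumptions made for illustration; `scipy.stats.kstest` with `"norm"` compares the sample against the standard normal distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                        # arbitrary seed
sample = rng.uniform(low=0.0, high=1.0, size=1000)    # assumed "random" sample

# One-sample KS test against the standard normal CDF (mean 0, std 1).
result = stats.kstest(sample, "norm")

print(f"statistic = {result.statistic:.4f}, p-value = {result.pvalue:.3g}")
if result.pvalue < 0.05:
    print("The sample differs significantly from the normal distribution.")
else:
    print("No significant difference from the normal distribution detected.")
```

Note that `"norm"` refers to the standard normal; to test against a normal distribution with a different mean or standard deviation, the parameters can be passed through the `args` argument of `kstest`.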

KS test is explained in the following video.
