
# Feature Selection and Information Gain

## What is Feature Selection?

Feature selection is a technique used in machine learning to select the most relevant subset of the available features in a dataset. It reduces the noise in the data and can make the prediction or classification more accurate.

It may seem that more features always lead to a better model. This is not always the case. More information comes with more noise. The following graph shows typical classifier behavior as the number of features (variables) increases.

Often there are many irrelevant features in a dataset that carry little information about the target. These features, when not dealt with, can fool machine learning algorithms. This is sometimes also referred to as the curse of dimensionality. To mitigate it, we need to rank the features by how much they affect the target and keep the most useful ones. This process is called feature selection.

In this article, we will use the Information Gain method to perform feature selection.

## What is Information Entropy?

Let’s recall that a more probable event has less information content (it is unsurprising), and a less probable event has more information content (it is surprising). Shannon gave a formula for the amount of information that captures these facts. The amount of information in an event x is given as,

$I\left(x\right)=-{\mathrm{log}}_{2}p\left(x\right)$
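As a quick illustration of the formula above, here is a minimal Python sketch. The function name `information_content` is an assumption made for this example and does not come from any library.

```python
import math

def information_content(p: float) -> float:
    """Return I(x) = -log2 p(x) in bits for an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)

# The less probable the event, the more information it carries.
print(information_content(0.5))    # 1.0
print(information_content(0.25))   # 2.0
print(information_content(0.125))  # 3.0
```

Halving the probability of an event adds exactly one bit of information.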

Entropy measures the disorder of a system. A system with high entropy is unpredictable and more disordered, so its outcomes are more surprising on average. On the other hand, a system with low entropy is highly predictable and has less disorder. In the context of data science, information entropy measures how unpredictable a data distribution is.
The following diagrams represent the entropy of some random data distributions.

Mathematically, Entropy is defined as

$H\left(X\right) = -\displaystyle \sum_i p_i \, \log_2 (p_i)$

where $p_i$ is the probability that X takes its i-th value, and the sum runs over all the different values that X can take.

Let’s formulate a simple problem to understand the theory of Information Gain.

Problem: given the following dataset, we want to predict Y from the input X, where

• X = College Major
• Y = Likes “Harry Potter”

From this data, let’s find the Entropy of X and Y. X takes three values,

1. Math – 4 times, p(Math) = 0.5
2. History – 2 times, p(History) = 0.25
3. CS – 2 times, p(CS) = 0.25

Hence,

$H\left(X\right)=-0.5\,{\mathrm{log}}_{2}\left(0.5\right)-0.25\,{\mathrm{log}}_{2}\left(0.25\right)-0.25\,{\mathrm{log}}_{2}\left(0.25\right)=0.5+0.5+0.5=1.5$

Similarly, H(Y) = 1, which is the entropy of a variable that is evenly split between two values.

Accordingly, the distribution of X has more disorder than that of Y.
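The entropy calculation can be checked with a short Python sketch. The `entropy` helper is an assumption written for this example. The `majors` list follows the counts given in the text, while the `likes` list is a hypothetical even split of "Yes" and "No" answers chosen to be consistent with H(Y) = 1.

```python
import math
from collections import Counter

def entropy(values):
    """Return H = -sum(p_i * log2(p_i)) over the distinct values, in bits."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# X: counts taken from the text (Math 4, History 2, CS 2).
majors = ["Math"] * 4 + ["History"] * 2 + ["CS"] * 2

# Y: assumed to be an even Yes/No split, which gives H(Y) = 1.
likes = ["Yes"] * 4 + ["No"] * 4

print(entropy(majors))  # 1.5
print(entropy(likes))   # 1.0
```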

## What is Conditional Entropy?

In the previous example, we may ask the following question.
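• What is the entropy of Y when X is already known, H(Y/X)?

This is the conditional entropy of Y given X (often written H(Y|X)). It is the average of the specific conditional entropies H(Y/X=v), weighted by the probability of each value v of X, and is given as,

$H\left(Y/X\right)=\sum _{v}P\left(X=v\right)\,H\left(Y/X=v\right)$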

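From the previous example, to calculate the conditional entropy H(Y/X), we can make the following table.

| Value v of X | P(X = v) | H(Y/X = v) |
|---|---|---|
| Math | 0.5 | 1 |
| History | 0.25 | 0 |
| CS | 0.25 | 0 |

From the above table, we can calculate the conditional entropy as,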

H(Y/X) = 0.5*1 + 0.25*0 +0.25*0 = 0.5
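The weighted average above can be sketched in Python. The eight (major, answer) rows below are an assumption: they are one hypothetical dataset that matches the counts and entropies used in the text, with Math students evenly split, and the History and CS groups each giving a single answer (which group answers "Yes" is a guess). The helper names are also made up for this example.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Return the entropy of a list of values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """Return H(Y/X) = sum over v of P(X=v) * H(Y/X=v)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)  # collect the Y values seen for each value of X
    n = len(xs)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

# Hypothetical rows, consistent with the numbers in the text.
rows = [
    ("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
    ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes"),
]
majors = [x for x, _ in rows]
likes = [y for _, y in rows]

print(conditional_entropy(majors, likes))  # 0.5
```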

Now, let’s define information gain in the next section.

## What is Information Gain?

In simple terms, information gain is the amount of entropy (disorder) we remove by knowing an input feature beforehand. Mathematically, information gain is defined as,

IG(Y/X) = H(Y) – H(Y/X)

The higher the information gain, the more entropy is removed, and the more information the variable X carries about Y.
In our example, IG is given as,

IG(Y/X) = 1 - 0.5 = 0.5
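The definition can be tried out with the following Python sketch. The dataset is a hypothetical one chosen to match the numbers in this article, and `campus` is an invented constant feature added only to show that a feature which tells us nothing about Y has zero information gain.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Return the entropy of a list of values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(xs, ys):
    """Return IG(Y/X) = H(Y) - H(Y/X)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    h_y_given_x = sum((len(g) / n) * entropy(g) for g in groups.values())
    return entropy(ys) - h_y_given_x

# Hypothetical data, consistent with the example in the text.
majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
likes = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
campus = ["Main"] * 8  # invented feature with a single value

print(information_gain(majors, likes))  # 0.5
print(information_gain(campus, likes))  # 0.0
```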

## Feature Selection and Information Gain

In our example, we had only one feature X and the output label Y. In a real scenario, we will have numerous features X1, X2, X3, …, Xn. In that case, we determine the information gain for each of the features, IG(X1), IG(X2), and so on. We then rank the features in descending order of their information gains, decide on a threshold, and pass all the features above the threshold to the machine learning algorithm.

The Information Gain method is also used in the decision tree algorithm to decide the splitting criteria. There are many advantages and disadvantages of feature selection using the Information Gain method, which I will discuss in another article.
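The ranking procedure described above can be sketched as follows. Everything specific here is an assumption made for illustration: the feature names `year` and `campus`, their values, and the threshold of 0.1 are invented, and only the `major` column follows the example in this article.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Return the entropy of a list of values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(xs, ys):
    """Return IG(Y/X) = H(Y) - H(Y/X)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    return entropy(ys) - sum((len(g) / n) * entropy(g) for g in groups.values())

# Hypothetical dataset: three candidate features and one target.
features = {
    "major": ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"],
    "year": ["1", "1", "2", "2", "1", "2", "1", "2"],
    "campus": ["Main"] * 8,
}
target = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

THRESHOLD = 0.1  # illustrative cut-off, to be tuned for a real problem

# Compute the gain of every feature and rank them in descending order.
gains = {name: information_gain(col, target) for name, col in features.items()}
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Keep only the features whose gain is above the threshold.
selected = [name for name, gain in ranked if gain > THRESHOLD]

for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
print(selected)  # ['major', 'year']
```

In practice the threshold is a judgment call, and it is common to keep the top k features instead of using a fixed cut-off.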