What is Feature Selection?
Do more features mean more information?
Datasets often contain many irrelevant features that carry little information. Left in place, these features can mislead machine learning algorithms; the difficulty of learning in such high-dimensional spaces is sometimes referred to as the curse of dimensionality. To mitigate it, we rank the features by how strongly they affect the target and keep the most informative ones. This process is called feature selection.
In this article, we will use the Information Gain method to perform feature selection.
What is Information Entropy?
Mathematically, Entropy is defined as
H\left(X\right) = -\displaystyle \sum_i p_i \, \log_2 (p_i)
Where the sum runs over the distinct values that X can take, and p_i is the probability of the i-th value.
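The formula above translates directly into a few lines of Python. This is a minimal sketch (the function name `entropy` is my own, not from any library), using only the standard library:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution
    given as a list of probabilities that sum to 1."""
    # Skip zero-probability values: by convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Sanity check: a fair coin is maximally uncertain for two outcomes.
print(entropy([0.5, 0.5]))  # 1.0 bit
```

Note the `if p > 0` guard: `log2(0)` is undefined, but the limit of p·log₂(p) as p → 0 is 0, so zero-probability outcomes contribute nothing to the entropy.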
Let’s formulate a simple problem to understand the theory of Information Gain.
Problem- Given the following dataset, we want to predict Y, and we have input X. Where
- X = College Major
- Y = Likes, “Harry Potter.”
From this data, let’s find the entropy of X and Y. X takes three values:
- Math – 4 times, p(Math) = 0.5
- History – 2 times, p(History) = 0.25
- CS – 2 times, p(CS) = 0.25
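Plugging these three probabilities into the entropy formula gives H(X) = 1.5 bits. A quick check, reusing the same standard-library computation as above:

```python
import math

# Probabilities of the three majors from the dataset above.
p_major = {"Math": 0.5, "History": 0.25, "CS": 0.25}

# H(X) = -sum_i p_i * log2(p_i)
h_x = -sum(p * math.log2(p) for p in p_major.values())
print(h_x)  # 1.5
```

Each of the two quarter-probability values contributes 0.25 · log₂(4) = 0.5 bits, and the half-probability value contributes 0.5 · log₂(2) = 0.5 bits, for 1.5 bits in total.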
What is Conditional Entropy?
- What is the entropy of Y when X is already known, written H(Y|X)?
From the above figure, we can calculate the conditional entropy as a weighted average of the entropy of Y within each group of X:

H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
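The calculation can be sketched in Python as a weighted sum over the values of X. The per-group entropies (1 bit for Math, 0 for History and CS) are taken from the calculation above; the dictionary names are my own:

```python
# p(x) for each major, from the dataset.
p_x = {"Math": 0.5, "History": 0.25, "CS": 0.25}

# H(Y | X = x): entropy of Y within each group of X,
# taken from the worked calculation (1 bit for Math, 0 otherwise).
h_y_given = {"Math": 1.0, "History": 0.0, "CS": 0.0}

# H(Y|X) = sum_x p(x) * H(Y | X = x)
h_y_given_x = sum(p_x[x] * h_y_given[x] for x in p_x)
print(h_y_given_x)  # 0.5
```

Intuitively, knowing the major removes all uncertainty for History and CS students and none for Math students, so the remaining uncertainty is half a bit.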
Now, let’s define information gain in the next section.