Entropy
What is entropy?
Entropy is a core concept in Information Theory. It is defined as the average level of ‘information’, ‘surprise’, or ‘uncertainty’ inherent in the possible outcome of a random variable. For example, a coin flip has two possibilities so we can expect it to have lower entropy than a dice, which has six possibilities. Below, we define the equation for Entropy, sometimes called Shannon’s Entropy:
\[H(X) = -\Sigma_{i=1}^{n} P(X_{i}) logP(X_{i})\]Here is a simple function for calculating entropy. Let’s use it to compare between a coin and a dice.
entropy <- function(prob){
s = -sum(prob * log2(prob))
return(s).
cat('\n The entropy of a fair coin toss is: ',entropy(rep(1/2,2)))
##
## The entropy of a fair coin toss is: 1
cat('\n The entropy of a fair dice roll is: ',entropy(rep(1/6,6)))
##
## The entropy of a fair dice roll is: 2.584963
As expected, the coin toss has lower entropy because there are fewer possibilities compared to the dice roll.
Let’s expand our definition of entropy a bit more by conceptualizing its relationship to information. In Information Theory, we can think of information as being stored in or transmitted as variables. We can obtain information from a variable just by observing its value. The entropy of any variable is the “amount of information” contained within the variable. However, the amount of information is not determined by the number of different values. Instead, it is determined in proportion to the average level of ‘uncertainty’.
For instance, we can take a book and see that a book with more pages could have more information than a book with less pages. However, if you have already ready the larger book, then there is no level of ‘uncertainty’ and therefore, no amount of information. Likewise, the information in a variable is tied to the amount of uncertainy or surprise. Another way of thinking of it is that lower probability events have more information, amd higher probability events have less information.
Exploring entropy in mtcars
Let’s use the mtcars
data set to test to explore Entropy.
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can calulate the entropy of the engine type. Another way of stating the question is asking how pure is this data set with respect to engine type? If the data set only have 4 cylinder cars, then it is 100% pure.
freq <- table(mtcars$cyl)/length(mtcars$cyl) #Find the frequency for each type of engine
cyl_prob <- as.data.frame(freq)[,2]
e0 <- entropy(cyl_prob)
cat('\n The entropy is:', e0)
##
## The entropy is: 1.530994
Since the entropy is non-zero, it is impure and not 100% for any type of engine. Let’s subset the data and test it with only 4 cylinder cars.
mtcars_subset <- mtcars[mtcars$cyl == '4',]
freq <- table(mtcars_subset$cyl)/length(mtcars_subset$cyl) #Find the frequency for each type of engine
cyl_prob <- as.data.frame(freq)[,2]
cat('\n The entropy is:', entropy(cyl_prob))
##
## The entropy is: 0
The entropy is zero because we know within this subset that there are only 4 cylinder cars. Suppose we want to know if we can reduce entropy by segmenting the data. We can use Information Gain to measure how much “information” a specified feature gives us about the data set.
Information gain
Information gain is defined as the reduction of entropy due to a change in the dataset. It is defined as follows:
\[IS(X, Y)= H(X) - H(X|Y)\] \[H(X|Y) =\Sigma_{i=1}^{n} P(Y_{i}) H(X|Y_{i})\]Let’s take a look at an example. Here we will split the data into four
bins and calculate the min()
and max()
for MPG.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- mtcars
data$bin<-cut(data[,'mpg'], breaks=4, labels=c(1:4))
summary_data <- data %>% group_by(bin) %>% summarise(
n=length(cyl),
min=min(mpg),
max=max(mpg).
summary_data
## # A tibble: 4 × 4
## bin n min max
## <fct> <int> <dbl> <dbl>
## 1 1 10 10.4 15.8
## 2 2 13 16.4 21.5
## 3 3 5 22.8 27.3
## 4 4 4 30.4 33.9
Next, we will calculate the entropy for engine cylinder after segmentation and we will also calculate the probability for choosing a bin.
cyl_prob <- data %>% group_by(bin) %>% count(cyl) %>% summarise(e = entropy(n/sum(n))) #Find the entropy of cylinder for each bin
summary_data <- right_join(summary_data, cyl_prob, c('bin'))
summary_data$p <- summary_data$n/nrow(data) #calculate probability for choosing a bin
summary_data
## # A tibble: 4 × 6
## bin n min max e p
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 10 10.4 15.8 0 0.312
## 2 2 13 16.4 21.5 1.42 0.406
## 3 3 5 22.8 27.3 0 0.156
## 4 4 4 30.4 33.9 0 0.125
Finally, we find our information gain.
IG <- e0-sum(summary_data$p*summary_data$e) #find the information gain
cat('\n The entropy is:', IG)
##
## The entropy is: 0.954299
What does this all mean? To start, we were able to reduce entropy after segmentation. We also learn that some of the bins have an entropy of zero. This suggests that after segmentation, there is a decrease in amount of uncertainty. This means that by selecting any one of the bins, we may have a 100% chance to have cars with single type of engine. Let’s have a look at the data again.
data %>% group_by(bin, cyl) %>% summarise(n=length(cyl),
min=min(mpg),
max=max(mpg))
## `summarise()` has grouped output by 'bin'. You can override using the `.groups` argument.
## # A tibble: 6 × 5
## # Groups: bin [4]
## bin cyl n min max
## <fct> <dbl> <int> <dbl> <dbl>
## 1 1 8 10 10.4 15.8
## 2 2 4 2 21.4 21.5
## 3 2 6 7 17.8 21.4
## 4 2 8 4 16.4 19.2
## 5 3 4 5 22.8 27.3
## 6 4 4 4 30.4 33.9
As you can see, bin 1, 3, and 4 only contain cars with a single type of engine. So this means that we are improving entropy over the complete data set by segmenting on MPG. In other words, if we were to draw a random car and knew the MPG, we have a good idea of how many cylinders the car has. So if given a car that had a MPG of 30.5, than we have a 100% chance of it being a 4 cylinder car.