Entropy

Entropy is a core concept in Information Theory. It is defined as the average level of ‘information’, ‘surprise’, or ‘uncertainty’ inherent in the possible outcome of a random variable. For example, a coin flip has two possibilities so we can expect it to have lower entropy than a dice, which has six possibilities. Below, we define the equation for Entropy, sometimes called Shannon’s Entropy:

\[ H(X) = - \Sigma_{i=1}^{n} P(X_{i}) logP(X_{i})\] Here is a simple function for calculating entropy. Let’s use it to compare between a coin and a dice.

entropy <- function(prob){
  s = -sum(prob * log2(prob))
  return(s)
}
cat('\n The entropy of a fair coin toss is: ',entropy(rep(1/2,2)))
## 
##  The entropy of a fair coin toss is:  1
cat('\n The entropy of a fair dice roll is: ',entropy(rep(1/6,6)))
## 
##  The entropy of a fair dice roll is:  2.584963

As expected, the coin toss has lower entropy because there are fewer possibilities compared to the dice roll.

Let’s expand our definition of entropy a bit more by conceptualizing its relationship to information. In Information Theory, we can think of information as being stored in or transmitted as variables. We can obtain information from a variable just by observing its value. The entropy of any variable is the “amount of information” contained within the variable. However, the amount of information is not determined by the number of different values. Instead, it is determined in proportion to the average level of ‘uncertainty’.

For instance, we can take a book and see that a book with more pages could have more information than a book with less pages. However, if you have already ready the larger book, then there is no level of ‘uncertainty’ and therefore, no amount of information. Likewise, the information in a variable is tied to the amount of uncertainy or surprise. Another way of thinking of it is that lower probability events have more information, amd higher probability events have less information.

Let’s use the mtcars data set to test to explore Entropy.

data("mtcars")
head(mtcars)

We can calulate the entropy of the engine type. Another way of stating the question is asking how pure is this data set with respect to engine type? If the data set only have 4 cylinder cars, then it is 100% pure.

freq <- table(mtcars$cyl)/length(mtcars$cyl) #Find the frequency for each type of engine
cyl_prob <- as.data.frame(freq)[,2]
e0 <- entropy(cyl_prob) 
cat('\n The entropy is:', e0)
## 
##  The entropy is: 1.530994

Since the entropy is non-zero, it is impure and not 100% for any type of engine. Let’s subset the data and test it with only 4 cylinder cars.

mtcars_subset <- mtcars[mtcars$cyl == '4',]
freq <- table(mtcars_subset$cyl)/length(mtcars_subset$cyl) #Find the frequency for each type of engine
cyl_prob <- as.data.frame(freq)[,2]
cat('\n The entropy is:', entropy(cyl_prob) )
## 
##  The entropy is: 0

The entropy is zero because we know within this subset that there are only 4 cylinder cars. Suppose we want to know if we can reduce entropy by segmenting the data. We can use Information Gain to measure how much “information” a specified feature gives us about the data set.

Information gain is defined as the reduction of entropy due to a change in the dataset. It is defined as follows:

\[ IS(X, Y)= H(X) - H(X|Y) \]

\[H(X|Y) = \Sigma_{i=1}^{n} P(Y_{i}) H(X|Y_{i})\]

Let’s take a look at an example. Here we will split the data into four bins and calculate the min() and max() for MPG.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- mtcars
data$bin<-cut(data[,'mpg'], breaks=4, labels=c(1:4))
summary_data <- data %>% group_by(bin) %>% summarise(
                 n=length(cyl),
                 min=min(mpg),
                 max=max(mpg)
                 )
## `summarise()` ungrouping output (override with `.groups` argument)
summary_data

Next, we will calculate the entropy for engine cylinder after segmentation and we will also calculate the probability for choosing a bin.

cyl_prob <- data %>% group_by(bin) %>% count(cyl) %>% summarise(e = entropy(n/sum(n)))  #Find the entropy of cylinder for each bin
## `summarise()` ungrouping output (override with `.groups` argument)
summary_data <- right_join(summary_data, cyl_prob, c('bin'))
summary_data$p <- summary_data$n/nrow(data) #calculate probability for choosing a bin
summary_data

Finally, we find our information gain.

IG <- e0-sum(summary_data$p*summary_data$e) #find the information gain
cat('\n The entropy is:', IG)
## 
##  The entropy is: 0.954299

What does this all mean? To start, we were able to reduce entropy after segmentation. We also learn that some of the bins have an entropy of zero. This suggests that after segmentation, there is a decrease in amount of uncertainty. This means that by selecting any one of the bins, we may have a 100% chance to have cars with single type of engine. Let’s have a look at the data again.

data %>% group_by(bin, cyl) %>% summarise(n=length(cyl),
                 min=min(mpg),
                 max=max(mpg))
## `summarise()` regrouping output by 'bin' (override with `.groups` argument)

As you can see, bin 1, 3, and 4 only contain cars with a single type of engine. So this means that we are improving entropy over the complete data set by segmenting on MPG. In other words, if we were to draw a random car and knew the MPG, we have a good idea of how many cylinders the car has. So if given a car that had a MPG of 30.5, than we have a 100% chance of it being a 4 cylinder car.

Additional Information:

https://rstudio-pubs-static.s3.amazonaws.com/455435_30729e265f7a4d049400d03a18e218db.html

https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4

https://arxiv.org/pdf/1405.2061.pdf

Written on April 6, 2021

[ tutorial  machine-learning  R  ]