Exploratory Data Analysis in R


In this post, we will be analyzing data as if we were looking at it for the first time. This is called ‘Exploratory Data Analysis’. We will visualize the data and gather insights from the iris data set.

Exploring the data

First, let’s load up the data and have a brief look at it. Using class(), we can see that the data set is a data.frame.

data(iris)
class(iris)

## [1] "data.frame"

Using head(), we can take a look at the first 6 rows. This may be helpful when analyzing large data sets.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
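As a small sketch of the same idea, head() accepts an n argument that controls how many rows are returned, and tail() does the same from the end of the data frame. The object names below are chosen only for illustration.

```r
data(iris)

# head() returns 6 rows unless n is given
default_head <- head(iris)
first_three <- head(iris, n = 3)

# tail() works the same way from the end of the data frame
last_three <- tail(iris, n = 3)

nrow(default_head)  # number of rows shown by default
print(first_three)
print(last_three)
```

Looking at both ends is a quick way to notice whether the rows are sorted by some column, as they are here by Species.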

There are 4 columns made up of numbers and one that looks like text. The fifth column, Species, suggests that each row holds the measurements for a single sample of that species. We can also use dim() to find the full dimensions of the data.

dim(iris)

## [1] 150   5

This tells us that there are 150 rows and 5 columns in this data frame. It is also good practice to use the str function to briefly look at the structure of the data.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Again, this confirms our earlier impression that there are 4 columns of numeric values. We can also see that the fifth column is actually a factor with 3 levels. Using summary(), we can get some insight into the distribution of the data.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
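summary() describes each column across all 150 rows. Since the species are the groups of interest, a per-species summary can be more informative. The following is a minimal sketch using base R's aggregate() and tapply(); the object names are illustrative.

```r
data(iris)

# Mean of every numeric column within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# The same idea for a single column, returning a named vector
sepal_length_medians <- tapply(iris$Sepal.Length, iris$Species, median)
print(sepal_length_medians)
```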

Data Visualization

We can start by looking at how Sepal.Length is distributed across the three species. We use the ggplot2 library and create a scatter plot with geom_point().

library(ggplot2)
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, color = Species)) +
  geom_point()
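Because Species is categorical, many points land on top of each other in this plot. One possible variation, sketched here, is geom_jitter(), which spreads the points horizontally. The width and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Jitter horizontally only (height = 0), so the measured values stay exact
p_jitter <- ggplot(iris, mapping = aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.7)

print(p_jitter)
```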

It is tough to see the distribution in this manner, but it does look like there may be differences between the species. Using geom_histogram(), we can build a distribution based on the number of observations in each bin. By default, the x-axis is divided into 30 equally spaced bins covering the range from the minimum to the maximum Sepal.Length. This method works well for continuous variables.

ggplot(iris, mapping = aes(x = Sepal.Length,  fill = Species)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
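The message above is a reminder that 30 bins is only a default. As a sketch, the bin width can be set explicitly. The value 0.2 is an illustrative choice, not a recommendation. Setting position = "identity" with some transparency overlays the species instead of stacking them, which makes the individual shapes easier to compare.

```r
library(ggplot2)

# binwidth = 0.2 is an illustrative choice; adjust it to the data at hand
p_hist <- ggplot(iris, mapping = aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, position = "identity", alpha = 0.5)

print(p_hist)
```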

Let’s try using a box plot, also called box and whiskers, to look at the distribution of Sepal.Length between the three species. A box plot summarizes the data with five numbers:

Minimum
First Quartile
Median
Third Quartile
Maximum

In ggplot2, the rectangle spans the first to the third quartile of the data set, a distance known as the interquartile range (IQR). The line in the middle denotes the median. The whiskers extend to the most extreme observations that lie within 1.5 times the IQR of the box edges, so they reach the minimum and maximum only when there are no outliers. Observations farther than 1.5 times the IQR from either edge of the box are drawn as individual points.
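To make these quantities concrete, here is a minimal base R sketch that computes them by hand for the Sepal.Length of virginica. It assumes the default quantile() definition, and the variable names are illustrative.

```r
data(iris)

virginica <- iris$Sepal.Length[iris$Species == "virginica"]

# Quartiles and interquartile range
q <- quantile(virginica, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(virginica)

# Fences: points beyond these are drawn individually as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low <- min(virginica[virginica >= lower_fence])
whisker_high <- max(virginica[virginica <= upper_fence])
outliers <- virginica[virginica < lower_fence | virginica > upper_fence]

print(q)
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```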

ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) + geom_boxplot()

We can also plot multiple graphs using facet_wrap(), but first we need to reshape the data frame into the long format.

library(reshape2)

iris_melt <-  melt(iris, id = 'Species') #reshaping the dataframe
head(iris_melt)

##   Species     variable value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

Now the data is in a long format, where each row provides a sample identifier (Species), a variable such as Sepal.Length, and the corresponding value. We can use facet_wrap() to generate multiple plots.
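As a quick sanity check on the reshaping, the sketch below spells out the arguments of melt() and confirms the expected size of the result. The column names measurement and cm are illustrative choices and are not used elsewhere in this post.

```r
library(reshape2)

# id.vars is kept as-is; every other column is stacked into name/value pairs
iris_long <- melt(iris,
                  id.vars = "Species",
                  variable.name = "measurement",  # illustrative column name
                  value.name = "cm")              # illustrative column name

# 150 flowers x 4 measurements should give 600 rows
dim(iris_long)
table(iris_long$measurement)
```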

ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() + 
  facet_wrap(~variable)
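All four panels share one y-axis by default, which compresses the variables with smaller values. One possible variation, shown here as a sketch, is to let each panel choose its own y-axis with scales = "free_y". The data frame is rebuilt so that the example runs on its own.

```r
library(ggplot2)
library(reshape2)

iris_melt <- melt(iris, id.vars = "Species")

# scales = "free_y" gives every facet its own y-axis range
p_free <- ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free_y")

print(p_free)
```

The trade-off is that box heights are no longer comparable across panels, so a shared axis is still the better choice when comparing variables with each other.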

Correlations

Interestingly, it looks like there is a correlation for some of the variables. We can explore this by generating a correlation matrix. First, we will exclude the fifth column since that is non-numeric. Next, we will use the function cor().

iris_cor <- cor(iris[,-5],iris[,-5])
head(iris_cor)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
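A few related calls may be useful here. The sketch below shows that passing the numeric columns once gives the same matrix, that a rank-based (Spearman) correlation can be requested through the method argument, and that cor.test() tests a single pair. The object names are illustrative.

```r
data(iris)
iris_num <- iris[, -5]  # drop the Species column

# Passing the data once is equivalent to passing it as both x and y
pearson <- cor(iris_num)

# Rank-based alternative that is less sensitive to outliers
spearman <- cor(iris_num, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))

# Test whether the correlation of one pair differs from zero
print(cor.test(iris$Petal.Length, iris$Petal.Width))
```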

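It seems that Sepal.Length, Petal.Length, and Petal.Width are strongly positively correlated with one another, while Sepal.Width is only weakly to moderately negatively correlated with the rest. Let’s visualize this with ggplot(). We will need to reshape our data first. We will also add a new color scale since the default color scale isn’t too useful.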

iris_cor_melt <- melt(iris_cor) #reshaping the data

ggplot(iris_cor_melt, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(color = 'white') + #geom_tile generates tiles which are colored based on a value
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limits = c(-1,1), space = "Lab")

It doesn’t look too pretty. We can write a helper function that reorders the matrix by correlation distance. This way, we can easily see which variables are highly correlated.

reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  # Return the matrix with rows and columns in clustering order
  cormat[hc$order, hc$order]
}

iris_cor_ordered <- reorder_cormat(iris_cor)
head(iris_cor_ordered)

##              Sepal.Width Sepal.Length Petal.Length Petal.Width
## Sepal.Width    1.0000000   -0.1175698   -0.4284401  -0.3661259
## Sepal.Length  -0.1175698    1.0000000    0.8717538   0.8179411
## Petal.Length  -0.4284401    0.8717538    1.0000000   0.9628654
## Petal.Width   -0.3661259    0.8179411    0.9628654   1.0000000
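To see what the helper does internally, the sketch below breaks it into steps. The transformation (1 - r) / 2 maps a correlation of 1 to a distance of 0 and a correlation of -1 to a distance of 1, so strongly correlated variables end up close together. The names dist_mat, dd, and hc are illustrative.

```r
data(iris)
iris_cor <- cor(iris[, -5])

# (1 - r) / 2 maps r = 1 to distance 0 and r = -1 to distance 1
dist_mat <- (1 - iris_cor) / 2
dd <- as.dist(dist_mat)

# hclust() uses complete linkage unless another method is given
hc <- hclust(dd)

print(round(dist_mat, 2))
print(hc$labels[hc$order])  # variable order used for the reordered matrix
```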

Now we reshape the data and use ggplot to visualize the ordered correlation matrix.

iris_cor_ordered_melt <- melt(iris_cor_ordered, na.rm = TRUE)

ggplot(iris_cor_ordered_melt, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(color = 'white') +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limits = c(-1,1), space = "Lab")

This can also be done using pheatmap with the added benefit of pre-built ordering and cluster trees.

library(pheatmap)

pheatmap(iris_cor, 
         color = colorRampPalette(c('blue', 'white', 'red'))(100),
         breaks = seq(-1,1, length.out = 101)) #breaks must be one element longer than color

Statistical Testing

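Based on the plots, it looks like virginica has the longest Sepal.Length. We can test this using basic statistics. Let’s perform a simple two-sample t-test between virginica and setosa. Note that t.test() runs Welch’s t-test by default, which does not assume equal variances, rather than the classic Student’s t-test. We will use the reshaped data frame and remove the versicolor species.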

iris_subset_melt <- subset(iris_melt, subset = Species == 'virginica' | Species == 'setosa' ) #removing versicolor so we have the two groups we are comparing
head(iris_subset_melt)

##   Species     variable value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

Let’s do some more data wrangling to set up our data. Essentially, we will create two numeric vectors corresponding to Sepal.Length for both species.

setosa_sepal_length <- subset(iris_subset_melt, 
                              subset = 
                                variable == 'Sepal.Length' & 
                                Species == 'setosa')$value  #selecting only values for setosa

virginica_sepal_length <- subset(iris_subset_melt, 
                                 subset = 
                                   variable == 'Sepal.Length' & 
                                   Species == 'virginica')$value #selecting only values for virginica

With that, we are ready to perform the two-sample t-test. Essentially, we are comparing the means of a single variable between two groups.

t.test(setosa_sepal_length, virginica_sepal_length)

## 
##  Welch Two Sample t-test
## 
## data:  setosa_sepal_length and virginica_sepal_length
## t = -15.386, df = 76.516, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.78676 -1.37724
## sample estimates:
## mean of x mean of y 
##     5.006     6.588
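The printed report is convenient to read, but the individual numbers can also be pulled out of the returned object. The sketch below does that, and also runs the pooled-variance (Student’s) version with var.equal = TRUE for comparison. The vectors are rebuilt directly from iris so that the example runs on its own; the object names are illustrative.

```r
data(iris)
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

welch <- t.test(setosa, virginica)                      # default: unequal variances
student <- t.test(setosa, virginica, var.equal = TRUE)  # pooled variance

# Components of the result object
print(welch$p.value)
print(welch$conf.int)
print(welch$estimate)

# Degrees of freedom differ between the two versions of the test
print(c(welch = unname(welch$parameter), student = unname(student$parameter)))
```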

The box plot already suggested that the two species differ in Sepal.Length. Here the p-value is less than 2.2e-16, far below the typical alpha of 0.05. Therefore, we reject the null hypothesis that there is no difference in means.

If we want to continue performing more comparisons, we will run into the multiple testing problem. In other words, if we keep testing different variables, we will eventually find a difference by chance. Since we accept a 5% chance of incorrectly rejecting a true null hypothesis in each test, performing 100 comparisons where no real difference exists will produce about 5 false positives on average. A simple way to correct for this is the Bonferroni correction, which divides alpha by the total number of comparisons. There are additional methods, which you can read more about here.
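A small simulation can illustrate the problem. In this sketch, every test compares two samples drawn from the same normal distribution, so any significant result is a false positive. The seed, the sample size of 20, and the variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_tests <- 100
alpha <- 0.05

# Every test compares two samples from the same distribution (the null is true)
p_values <- replicate(n_tests, t.test(rnorm(20), rnorm(20))$p.value)

# False positives without correction and with the Bonferroni threshold
false_pos_raw <- sum(p_values < alpha)
false_pos_bonf <- sum(p_values < alpha / n_tests)

# Chance of at least one false positive across independent tests
fwer <- 1 - (1 - alpha)^n_tests

print(false_pos_raw)
print(false_pos_bonf)
print(fwer)
```

The uncorrected count varies from run to run around 5, while the Bonferroni threshold makes even a single false positive unlikely, at the cost of less power to detect real differences.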

To compare multiple groups, we will perform a one-way ANOVA. While a t-test compares only two groups, an ANOVA can compare three or more. Note that an ANOVA only tests whether at least one group mean differs from the others. A post-hoc test with multiple comparison correction is needed to find which pairwise comparisons are statistically different.

iris_anova <- aov(Sepal.Length ~ Species, data = iris)
summary(iris_anova)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
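The same one-way ANOVA can be run for each of the four measurements. The sketch below loops over the numeric columns and collects the F statistic and p-value from each summary table. The names measurements and anova_results are illustrative. Keep in mind that running several tests like this is itself a case of multiple testing.

```r
data(iris)
measurements <- names(iris)[1:4]

# Fit one ANOVA per measurement and keep the F statistic and p-value
anova_results <- t(sapply(measurements, function(m) {
  fit <- aov(iris[[m]] ~ iris$Species)
  tab <- summary(fit)[[1]]  # the ANOVA table as a data frame
  c(F_value = tab[["F value"]][1], p_value = tab[["Pr(>F)"]][1])
}))

print(anova_results)
```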

There is a difference between Species on Sepal.Length alone, and it is significant with a p-value < 2e-16. However, we do not know in which direction or between which Species. We can use a pairwise t-test to find out. First, we will try without any multiple testing correction.

pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'none')

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris$Sepal.Length and iris$Species 
## 
##            setosa  versicolor
## versicolor 8.8e-16 -         
## virginica  < 2e-16 2.8e-09   
## 
## P value adjustment method: none

Now we can try with the Bonferroni correction.

pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'bonf')

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris$Sepal.Length and iris$Species 
## 
##            setosa  versicolor
## versicolor 2.6e-15 -         
## virginica  < 2e-16 8.3e-09   
## 
## P value adjustment method: bonferroni

The conclusions are unchanged: the adjusted p-values are three times larger, but all remain far below 0.05. This will not always be the case. See here.
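To see how a correction changes the numbers, the sketch below extracts the unadjusted pairwise p-values and applies p.adjust() with two methods side by side. Holm is included only as an example of an alternative to Bonferroni; the object names are illustrative.

```r
data(iris)

# Unadjusted pairwise p-values as a matrix (unused cells are NA)
raw <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                       p.adjust.method = "none")$p.value

# Keep the three actual comparisons
p_raw <- raw[!is.na(raw)]

adjusted <- cbind(
  none = p_raw,
  holm = p.adjust(p_raw, method = "holm"),
  bonferroni = p.adjust(p_raw, method = "bonferroni")
)

print(adjusted)
```

With three comparisons, the Bonferroni column is simply the raw p-value multiplied by 3 (capped at 1), and Holm is never more conservative than Bonferroni.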

If we wanted to plot the resulting p-values, then we can use the package ggpubr.

library(ggpubr)

ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
  geom_boxplot() + 
  stat_compare_means()

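As you can see, stat_compare_means() prints a single global p-value on the plot. With more than two groups, it uses a Kruskal-Wallis test by default; pass method = "anova" to show the ANOVA p-value instead. Either way, the global test only tells you that there is a difference, not between which pair of Species. To perform the pairwise comparisons, you have to manually generate a list of pairwise comparisons.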

my_comparisons <- list(c('setosa','virginica'), c('setosa','versicolor'), c('versicolor','virginica'))

ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
  geom_boxplot() + 
  stat_compare_means(comparisons = my_comparisons)
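Writing the list by hand becomes tedious with more groups. As a sketch, combn() can generate every pair from the factor levels. The pairwise tests drawn by stat_compare_means() default to the Wilcoxon test, so method = "t.test" is set here to match the t-tests used earlier. The names all_pairs and p_pairs are illustrative.

```r
library(ggplot2)
library(ggpubr)

# Every unordered pair of species, as a list of character vectors
all_pairs <- combn(levels(iris$Species), 2, simplify = FALSE)
print(all_pairs)

p_pairs <- ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
  geom_boxplot() +
  stat_compare_means(comparisons = all_pairs, method = "t.test")

print(p_pairs)
```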

We can repeat this for all variables as well. It can get a bit messy, so we will switch the label from the p-value to “p.signif”, which uses the following symbols:

ns: p > 0.05
*: p <= 0.05
**: p <= 0.01
***: p <= 0.001
****: p <= 0.0001

ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() + 
  facet_wrap(~variable) + 
  stat_compare_means(comparisons = my_comparisons, label = 'p.signif')

Additional Resources


Danh Truong
