Exploratory Data Analysis in R
In this post, we will be analyzing data as if we were looking at it for
the first time. This is called ‘Exploratory Data Analysis’. We will
visualize the data and gather insights from the iris
data set.
Exploring the data
First, let’s load up the data and have a brief look at it. Using
class()
, we can see that the data set is a data.frame
.
data(iris)
class(iris)
## [1] "data.frame"
Using head()
, we can take a look at the first 5 rows. This may be
helpful when analyzing large data sets.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
There are 4 columns that are made up of numbers and one with strings.
The fifth column Species
suggests that each row corresponds to a data
from a single sample of that species. We can also use dim()
to find
the full dimensions of the data.
dim(iris)
## [1] 150 5
This tells us that there are 150 rows and 5 columns in this data frame.
It is also good practice to use the str
function to briefly look at
the structure of the data.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Again, this confirmed our brief look earlier that there are 4 columns of
numeric values. We can also see that the fifth column is actually a
Factor
with 3 levels. Using summmary()
, we can acquire some insight
on the data distribution.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Data Visualization
We can start by looking at the data point distribution of Sepal.Length
between the three species. We use the library ggplot2
and create a
scatter plot with geom_point()
library(ggplot2)
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, color = Species)) +
geom_point()
It is tough to see the distribution in this manner, but it does look
like there may be differences between the Species
. Using
geom_histrogram()
we can build a distribution based on the number of
observation in each bin. In this case, the x-axis will be divided into
30 equally spaced bins with values between the minimum and maximum
Sepal.Length
. This method works well for continuous variables.
ggplot(iris, mapping = aes(x = Sepal.Length, fill = Species)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s try using a box plot, also called box and whiskers, to look at the
distribution of Sepal.Length
between the three species. A box plot
summarizes the data with five numbers:
- Minimum
- First Quartile
- Median
- Third Quartile
- Maximum
Typically, the rectangle spans the first and third quartile of the data set, which is also known as the interquartile range (IQR). The line in the middle denotes the median, and the whiskers above and below denote the maximum and minimum respectively. Outliers are also shown as data points that fall 1.5 times the IQR from either edge of the box.
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) + geom_boxplot()
We can also plot multiple graphs using facet_wrap
, but first we need
to reshape the data frame in the long
format.
library(reshape2)
iris_melt <- melt(iris, id = 'Species') #reshaping the dataframe
head(iris_melt)
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
Now the data is in a long
format where each row provides a sample id
Species
, a variable, like Sepal.Length
, and the corresponding value.
We can use facet_wrap
to generate multiple plots.
ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
geom_boxplot() +
facet_wrap(~variable)
Correlations
Interestingly, it looks like there is a correlation for some of the
variables. We can explore this by generating a correlation matrix.
First, we will exclude the fifth column since that is non-numeric. Next,
we will use the function cor()
.
iris_cor <- cor(iris[,-5],iris[,-5])
head(iris_cor)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
It seems that all the variables except for Sepal.Width
correlate.
Let’s visualize this with ggplot()
. We will need to reshape our data
first. We will also add a new color scale since the default color scale
isn’t too useful.
iris_cor_melt <- melt(iris_cor) #reshaping the data
ggplot(iris_cor_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = 'white') + #geom_tile generates tiles which are colored based on a value
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab")
It doesn’t look too pretty. We can generate a helper function to reorder the matrix by correlation distance. This way we can easily see the variables that highly correlate.
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
iris_cor_ordered <- reorder_cormat(iris_cor)
head(iris_cor_ordered)
## Sepal.Width Sepal.Length Petal.Length Petal.Width
## Sepal.Width 1.0000000 -0.1175698 -0.4284401 -0.3661259
## Sepal.Length -0.1175698 1.0000000 0.8717538 0.8179411
## Petal.Length -0.4284401 0.8717538 1.0000000 0.9628654
## Petal.Width -0.3661259 0.8179411 0.9628654 1.0000000
Now we reshape the data and use ggplot
to visualize the ordered
correlation matrix.
iris_cor_ordered_melt <- melt(iris_cor_ordered, na.rm = TRUE)
ggplot(iris_cor_ordered_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = 'white') +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab")
This can also be done using pheatmap
with the added benefit of
pre-built ordering and cluster trees.
library(pheatmap)
pheatmap(iris_cor,
color = colorRampPalette(c('blue', 'white', 'red'))(100),
breaks = seq(-1,1, length.out = 100))
Statistical Testing
Based on the data, it looks like virginica
has the longest
Sepal.Length
. We can test for this using basic statistics. Let’s
perform a simple Student’s t-Test between virginica
and setosa
. We
will use the reshaped data frame and remove the versicolor
species.
iris_subset_melt <- subset(iris_melt, subset = Species == 'virginica' | Species == 'setosa' ) #removing versicolor so we have the two groups we are comparing
head(iris_subset_melt)
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
Let’s do some more data wrangling to set up our data. Essentially, we
will create two numeric vectors corresponding to Sepal.Length
for both
species.
setosa_sepal_length = subset(iris_subset_melt,
subset =
variable == 'Sepal.Length' &
Species == 'setosa')$value #selecting only values for setosa
virginica_sepal_length = subset(iris_subset_melt,
subset =
variable == 'Sepal.Length' &
Species == 'virginica')$value #selecting only values for virginica
With that, we are ready to perform the Student’s t-Test for two samples. Essentially, we are comparing the means of two groups with a single variable.
t.test(setosa_sepal_length, virginica_sepal_length)
##
## Welch Two Sample t-test
##
## data: setosa_sepal_length and virginica_sepal_length
## t = -15.386, df = 76.516, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.78676 -1.37724
## sample estimates:
## mean of x mean of y
## 5.006 6.588
The graph already suggested that the two species were different based on
the Petal.Length
. Here the p-value
is less than 2.2e-16, where a
typical alpha is 0.05. Therefore, the null hypothesis where there is no
difference is rejected.
If we want to continue performing more comparisons, we will run into the multiple testing problem. In other words, if we keep testing different variables, we will eventually find a difference. Since, we are expecting a 5% chance of incorrectly rejecting the null hypothesis, then performing 100 multiple comparisons will result in 5 incorrect rejections or false positives. A simple way to correct for this is Bonferroni correction, which just divides the alpha by the number of total comparisons. There are additional methods, which you can read more here.
To compare multiple groups, we will perform a One-Way ANOVA. While t-tests compare only two groups, an ANOVA can compare three or more groups. One more note is that an ANOVA only tests if one or more mean is different. Post-hoc comparison test with multiple comparison correction will be needed to find which pairwise comparison was statically different.
iris_anova <- aov(Sepal.Width ~ Species, data = iris)
summary(iris_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a difference between Species
just on Sepal.Length
alone. It
is significant with a p-value <2e-16. However, we do not know which
direction or between which Species
. We can use a pairwise t-test to
find out. First, we will try without any multiple testing correction.
pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'none')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Length and iris$Species
##
## setosa versicolor
## versicolor 8.8e-16 -
## virginica < 2e-16 2.8e-09
##
## P value adjustment method: none
Now with we can try with Bonferroni correction.
pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'bonf')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Length and iris$Species
##
## setosa versicolor
## versicolor 2.6e-15 -
## virginica < 2e-16 8.3e-09
##
## P value adjustment method: bonferroni
The result is still the same, but it may not always be the case. See here.
If we wanted to plot the resulting p-values, then we can use the package
ggpubr
.
library(ggpubr)
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
geom_boxplot() +
stat_compare_means()
As you can see, we can plot the p-value from an ANOVA directly onto the
plot, which only tells you that there is a difference but not between
which pair of Species
. To perform the pairwise comparisons, you have
to manually generate a list
of pairwise comparisons.
my_comparisons = list(c('setosa','virginica'), c('setosa','versicolor'), c('versicolor','virginica'))
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
geom_boxplot() +
stat_compare_means(comparisons = my_comparisons)
We can also repeat this with all variables as well. It can get a bit
messy so we will switch the p-value to “p.signif”. This essentially uses
the *
symbol for the following:
- ns: p > 0.05
- *: p <= 0.05
- **: p <= 0.01
- ***: p <= 0.001
- ****: p <= 0.0001
ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
geom_boxplot() +
facet_wrap(~variable) +
stat_compare_means(comparisons = my_comparisons, label = 'p.signif')