Principal Component Analysis
In this post, I will be discussing Principal Component Analysis (PCA) and how it can be used in data sets to understand variation with many different variables. Many real-world data sets are often in high dimensions. In other words, they contain so many variables that it is difficult to distinguish a trend.
PCA is a type of dimensional reduction method that can assist us by identifying which variables contribute to the variation within the data. It is similar to fitting a best fit line onto a 2D graph, but in this case there may be more than three-dimensions of data. It is formulated in the language of linear algebra and is an often used technique in data science and bioinformatics.
library(ggplot2) library(cowplot) # required to arrange multiple plots in a grid library(dplyr)
## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
Exploring the data
We will be using the often cited
iris data set. As you can see, there
are more than five dimensions of data.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa
Let us try to observe some patterns with different combinations of variables.
p1 <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point() p2 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) + geom_point() p3 <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) + geom_point() p4 <- ggplot(iris, aes(x=Sepal.Width, y=Petal.Width, color=Species)) + geom_point() plot_grid(p1, p2, p3, p4, labels = "AUTO")
As you can see from the graphs, petal length, petal width, and perhaps sepal length contribute the most to the differences between the three species.
Let’s try using PCA as a better method for detecting such patterns.
First, we will only include numerical data. We will pass our data set
(excluding the Species column) into the function
scale will be set to true to scale the data to 0 mean and unit
iris.pca <- prcomp(iris[,1:4], center = TRUE, scale. = TRUE) summary(iris.pca)
## Importance of components: ## PC1 PC2 PC3 PC4 ## Standard deviation 1.7084 0.9560 0.38309 0.14393 ## Proportion of Variance 0.7296 0.2285 0.03669 0.00518 ## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
The main results are that Principal Components 1 and 2 explains the majority of the variance within the data set. These values are derived from the eigenvectors and eigenvalues from the covariance matrix.
To better understand, you will need a basic understanding of eigenvectors and eigenvalues. To simply put, an eigenvector is a vector whose direction remains unchanged after a linear transformation. In that regard, an eigenvector provides direction or orientation of the data, like 90 or 45 degrees. The change after a linear transformation is described as the eigenvalue. In other words, the variance or spread of the data is explained by the eigenvalue. In addition, an n-th dimensional matrix will have the same n eigenvectors/eigenvalues. Therefore, each principal component is an eigenvector and PC1 has the highest eigenvalue (highest variance).
We can observe the results of the linear transformation with
## PC1 PC2 PC3 PC4 ## [1,] -2.257141 -0.4784238 0.12727962 0.024087508 ## [2,] -2.074013 0.6718827 0.23382552 0.102662845 ## [3,] -2.356335 0.3407664 -0.04405390 0.028282305 ## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340 ## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870 ## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
Let’s add back in the species data.
pca_data <- data.frame(iris.pca$x, Species=iris$Species) head(pca_data)
## PC1 PC2 PC3 PC4 Species ## 1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa ## 2 -2.074013 0.6718827 0.23382552 0.102662845 setosa ## 3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa ## 4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa ## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa ## 6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
Next, we will plot the first two principal components as these would explain the highest variation within the data.
ggplot(pca_data, aes(x=PC1, y=PC2, color=Species)) + geom_point()
We can observe that setosa separates further from versicolor and
virginica. To understand which variables contribute to this variation in
the data, we will call
## PC1 PC2 PC3 PC4 ## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863 ## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096 ## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492 ## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
Here we can see that
contributes quite a bit to PC1 compared to
Sepal.Width contributes greatly to PC2.
Therefore, based on the above graph, we can infer that variance in the
x-axis is due to
variance in the y-axis is due to
Sepal.Width. Running the following
code will show this concept as vectors.
# capture the rotation matrix in a data frame rotation_data <- data.frame(iris.pca$rotation, variable=row.names(iris.pca$rotation)) # define a pleasing arrow style arrow_style <- arrow(length = unit(0.05, "inches"), type = "closed") # now plot, using geom_segment() for arrows and geom_text for labels ggplot(rotation_data) + geom_segment(aes(xend=PC1, yend=PC2), x=0, y=0, arrow=arrow_style) + geom_text(aes(x=PC1, y=PC2, label=variable), hjust=0, size=3, color='blue') + xlim(-1.,1.25) + ylim(-1.,1.) + coord_fixed() # fix aspect ratio to 1:1
As you can see, before performing a PCA, we used four graphs to observe the conclusion that petal length, petal width, and sepal length contributed the most to the variance in species. With PCA, we could easily observe this with the PCA plot of PC1 vs PC2. In addition, we used a simple dataset to illustrate this. Many real-world data sets consists of much higher dimensions.