Since becoming a postdoc at MD Anderson Cancer Center, I have had more failures than successes. Although the countless failures seemed daunting, I used them as opportunities to learn and grow, turning those failures into the experience I needed to succeed. As I reached the end of my tenure as a postdoc, I reflected on the keys to success that marked turning points in my training, which I want to share to help improve the postdoctoral experience.
A postdoctoral position can be short or long but is ultimately temporary. When starting a postdoc, working with your mentor or postdoctoral office to define the goals you want to achieve during your time as a postdoc is essential. In addition, you will want to define measurable success metrics so that you can track whether you are progressing on your goals. One way to stay accountable is to keep a calendar of when your goals should be accomplished; recording deadlines makes you feel more responsible for progressing on time.
Communication is crucial in your career, whether you stay in academia or continue elsewhere. When communicating, you must consider many audiences, such as your mentor, colleagues, the scientific community, and laypeople. Communication can take many shapes and forms, such as verbal presentations, PowerPoint slides, graphical abstracts, grant applications, and scientific papers. These aspects are crucial to consider when trying to deliver your message. Consider taking writing courses, using your institute’s scientific editing resources, and participating in journal clubs or symposiums, since these activities will give you more exposure to different audiences and help you improve your communication skills.
Being able to deliver your message and ensuring that your audience is engaged takes significant work. Re-using the same abstract or presentation may seem the easiest, but it may not be the most effective. Different audiences will connect to varying styles of communication. Although you may have already prepared an excellent presentation, there is always room for improvement to better engage with your audience and effectively deliver your message.
Networking may seem like a trivial task, but doing it effectively can make a significant impact on your career. You may think otherwise, but regardless of your career trajectory, you will always require the involvement of other folks in your career. Whether it is a letter of recommendation or hearing about a new job, networking enables you to be in a strategic position to take advantage of opportunities when they arise.
As a postdoc, some of the best networking opportunities are at poster symposiums and lunches with the speakers. At functions where people are more willing to talk, you can take the opportunity to have organic conversations that could lead to solving a complex problem you’ve had or to a new scientific collaboration. If you do not have the confidence to strike up a conversation, you can try to be the presenter at poster symposiums or seminars. Once you share your exciting research results, I am sure many folks will walk up to you to network.
As a postdoc, there are institutional resources that you can take advantage of to boost your career development. Once you step on campus, the first thing you should do is find and learn what the Office of Postdoctoral Affairs offers. This includes career advice, funding sources, teaching opportunities, networking events, etc. The Office of Postdoctoral Affairs supports you and your career development. It is a free resource that only wants the best for you. If, for some reason, there is no Office of Postdoctoral Affairs, the National Postdoctoral Association offers similar resources that are accessible to organizational members.
Another great resource to jumpstart your career is the postdoctoral association. This group of postdocs works to support and represent the interests of postdocs. This group also facilitates professional enrichment, career development, and networking, fostering a sense of community among postdocs.
Participating in the events will surely be beneficial for your training as a postdoc. However, if something is still missing from your postdoctoral training experience, consider joining as a leader to enact the change you want to see. This will give you leadership experience and show prospective recruiters you can manage and complete non-research projects.
Your time as a postdoc is temporary, but the resources you have at hand for career development will be abundant and have a lasting impact on your trajectory. You should strive to take advantage of workshops offered by the Office of Postdoctoral Affairs or through your postdoctoral associations. In addition, when the opportunity arises in your lab to introduce a new technology, take the time to thoroughly learn it, as you may become the resident expert in it. These are all opportunities that you can leverage for your next career move.
In addition to learning new technologies and research-related skills, make sure to spend time on soft, or transferable, skills. These include communication, time management, project management, budgeting, emotional intelligence, teamwork, etc. MD Anderson Cancer Center offers many courses and workshops on these skills, as they are commonplace in the working environment.
Regardless of whether your mentor fully funds your position, it is always a great idea to pursue additional funding through your institution, foundations, or government agencies like the NCI. If you are awarded funding, the money allocated for your stipend can be used elsewhere, while you demonstrate that you can win competitive grants.
Even if you are not awarded the funding, the brainstorming, planning, and writing of a proposal application are essential skills to gain. Preparing a proposal incorporates many critical skills needed in the workplace, such as budgeting, written and visual communication, time management, and creativity.
During your postdoctoral training, you will have opportunities to attend different types of conferences – large and small. Do not shy away from either since both can benefit your career.
Larger conferences often involve many folks and can span various topics. It is a great place to interact with folks from research topics that are tangential or different from yours. You can learn about other issues and incorporate these new ideas into your research. Also, at these conferences, vendors and pharmaceutical companies will be in attendance. These are chances to network to discuss potential new technologies and job opportunities.
Smaller conferences are more intimate. The topics are narrower, but there are more opportunities to discuss with collaborators and experts in the field. You can make a name for yourself at these types of conferences.
Mental health is often overlooked, especially as a postdoc. You will be under pressure to publish, write grants, and present research. Learning proper ways to de-stress and saying “no” to things are essential for caring for your mental health. Taking time off to spend with friends and family is necessary. Exercising is one of the best ways to take your mind off things and de-stress. Another option is to complete tasks that have been sitting on your to-do list for a while. Most important is to develop a sense of when to take a break.
The most important key to success is that the postdoctoral training is for you. It is not for anyone else but you. The only outcome that matters is that you know more about yourself and the world than when you started your postdoc.
There will be projects you need to complete where you may need to learn a new technology. If your primary mentor does not know how to teach you these skills, do not use this as an excuse to give up. Try to find a way by learning from other mentors or colleagues, using a core facility service, or adapting a different technology.
My webpage, like most others hosted through GitHub Pages, is based on Jekyll, which cannot parse the math even after conversion. To handle this, I came across a post by Fong Chun Chan that gives some insight. Essentially, you need to protect your LaTeX equations with HTML tags in the R Markdown file, so that when you perform the conversion, they are kept intact. Afterwards, you can remove the HTML tags and MathJax will interpret the equations and render them correctly.
First, we need to add the MathJax script to your website. For Jekyll, you add it to `_includes/head.html`. I use the Minimal Mistakes theme, so if you have that, you can add it to `_layouts/single.html`.
<script id="MathJax-script" async
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
This enables the MathJax JavaScript library so it can parse the math equations and render them on your website.
In R markdown, a typical display equation would be:
$$ y = mx + b $$
y = m**x + b
As you can see, it does not render properly. It should look like the one below:
\[y = mx + b\]

To solve this, we can add HTML tags prior to the knitr conversion, so that knitr will not touch the equations, and then remove them later so that MathJax can parse them. I found a script that can do just that. With the tags added, the equation will look like this prior to markdown conversion:
<pre>$$ y = mx + b $$</pre>
I have modified the script below. You can download it here.
library(rmarkdown)
library(dplyr)
library(stringr)
# Get the filename given as an argument in the shell.
args <- commandArgs(TRUE)
filename <- args[1]
# Check that it's a .Rmd file.
if(!grepl(".Rmd", filename, fixed = TRUE)) {
stop("You must specify a .Rmd file.")
}
tempfile <- sub('.Rmd', '_deleteme.Rmd', filename)
mdtempfile <- sub('.Rmd', '_deleteme.md', filename)
mdfile <- sub('.Rmd', '.md', filename)
read_lines <- readLines(filename)
# add <pre> tags around $$...$$ display equations
read_lines <- gsub("(\\${2}(.+?)\\${2})", "<pre>\\1</pre>", read_lines)
writeLines(read_lines, tempfile)
rmarkdown::render(tempfile, output_format = "md_document"
, output_file = mdtempfile)
read_lines <- readLines(mdtempfile)
# remove the <pre> tags
sel <- grepl("<pre>", read_lines)
read_lines[sel] <- str_replace(read_lines[sel], '<pre>', '') %>%
str_replace('</pre>', '')
# remove multiple spaces from lists
sel <- grepl("[[:space:]]{2}\\${1}(.+?)\\${1}", read_lines)
read_lines[sel] <- gsub('\\s+', ' ', read_lines[sel])
# add correct path for files
read_lines <- gsub('files/', "/files/", read_lines) # adds the correct path for output files; change this to wherever your files live
writeLines(read_lines, mdfile)
# delete temp files
unlink(mdtempfile)
unlink(tempfile)
Place the script in the same folder as your R markdown file. Then run the following in your terminal:
Rscript --vanilla r2jekyll.R your_RMarkdownFile.Rmd
Before running, if you generate output files or figures, make sure to add the following to the top of your R markdown notebook.
Now, for inline math equations: MathJax does not handle these unless you configure it properly. Simply add the following configuration right before the MathJax script:
<script>
MathJax = {
tex: {
inlineMath: [['$', '$'], ['\\(', '\\)']]
}
};
</script>
<script id="MathJax-script" async
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
R is a language and environment for statistical computing and graphics. It is widely used for a variety of statistical analyses (e.g., linear and nonlinear modeling, classical statistical tests, clustering). We can also use R for data visualization and producing figures for presentations. R is freely available and has a large collection of packages for different tools. Overall, R is a powerful tool that can handle big data, perform statistics, and visualize results, while enabling end-users to easily develop tools and packages for customized functionality.
R itself, also called “base R”, is an interpreted computer language that comes with a terminal interface, but ease of use and additional functionality come from RStudio, an open-source Integrated Development Environment (IDE). Rather than a terminal, RStudio provides a platform-agnostic graphical user interface that integrates additional packages, project management, version control, and notebooks.
R is maintained by the R Core Team and distributed through the Comprehensive R Archive Network (CRAN), which also hosts a repository of packages that extend R’s base functionality. To get started, base R can be installed on Linux, macOS, or Windows from the CRAN homepage.
To use R from the command line, open a terminal window, type `R`, and press enter. You will see some text denoting the R version and some helpful functions. To quit, type `q()` and press enter.
Rather than working in a terminal, most people use RStudio, much as many use Microsoft Word to prepare documents. Download RStudio for free here and follow the installation instructions. Once installed, open RStudio, typically by clicking on the RStudio icon.
Once you open RStudio, you will see four main panes. Each will contain different information. You can see my screen below.
Starting from the top left pane and going from left to right:

- Source: where you edit scripts and notebooks and send code to the console with the Run command.
- Console: where code is executed and output is printed.
- Environment/History: the variables you have defined and the commands you have run.
- Files/Plots/Packages/Help: a file browser, plot viewer, package manager, and documentation viewer.

When working with R, it is always a good idea to create a new project directory. This will help you keep your files and data organized by project.
To get started:

1. Open RStudio.
2. Go to the File menu and select New Project.
3. In the New Project window, choose New Directory, then New Project. Name your new directory. You can use a name like Intro-to-R, and then “Create the project as subdirectory of:” a location of your choice.
4. Click on Create Project.
5. The project should open automatically in RStudio.
We can view our working directory by using the function getwd()
getwd()
## [1] "/Users/dtruong4/Desktop/Intro-to-R"
Your working directory will be the location where R will automatically look for files. If you want to find files in a different location, you will either need to provide the full path or type the path in relation to the working directory. Files that you output will automatically save into your working directory unless a path is provided.
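As a quick sketch (using a throwaway file name invented for this example), relative and absolute paths behave like this:

```r
# Write a small example file into the working directory (hypothetical file name)
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)

# A relative path is resolved against getwd()
df_rel <- read.csv("example.csv")

# An absolute path points to the same file explicitly
df_abs <- read.csv(file.path(getwd(), "example.csv"))

identical(df_rel, df_abs)  # TRUE
```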
To organize your working directory, it is highly recommended to create sub-folders like `data/` or `results/`. You can do so in the Files tab by selecting New Folder.
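These sub-folders can also be created from the R console with `dir.create()`:

```r
# Create the recommended sub-folders inside the working directory
dir.create("data", showWarnings = FALSE)
dir.create("results", showWarnings = FALSE)
dir.exists(c("data", "results"))  # TRUE TRUE
```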
Below are some of the most common math calculations that can be done in R.
Operation | Symbol |
---|---|
Addition | a + b |
Subtraction | a - b |
Multiplication | a * b |
Division | a / b |
Exponent | a ^ b |
Remainder | a %% b |
Integer Division | a %/% b |
For instance, here is an example of addition.
5 + 3
## [1] 8
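The other operators from the table work the same way; for instance, remainder and integer division:

```r
7 %% 3   # remainder of 7 divided by 3: 1
7 %/% 3  # integer division: 2
2 ^ 3    # exponent: 8
```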
R has several pre-built functions. For instance, we can use `sum()` instead of the `+` symbol for addition.
sum(1,3) #This gives the sum of 1 and 3
## [1] 4
We can also call the R documentation for a function by typing `?` before the function name. The documentation will show up under the Help tab and contains the description, usage, and arguments for the function.
?sum #Find the R Documentation for sum()
We can see that `sum()` can return the sum of more than two values.
sum(1,2,3,4,5)
## [1] 15
In addition to the basic calculations, there are many common pre-built math functions in R.
Operation | Function |
---|---|
Square root | sqrt() |
Logarithm | log() |
Logarithm, base 10 | log10() |
Exponential | exp() |
Summation | sum() |
Round | round() |
Mean | mean() |
Median | median() |
Minimum | min() |
Maximum | max() |
#What is the square root of 4?
sqrt(4)
## [1] 2
#Can we round 3.14?
round(3.14)
## [1] 3
A key aspect of programming is defining variables. We store data or values as variables so that we can use them in other functions or recall them later. This saves time by storing a result rather than re-calculating it. R has two assignment operators for defining variables: `<-` and `=`. The operator `<-` can be used anywhere, whereas the operator `=` is only allowed at the top level.
x <- 1
#What is in `x`?
x
## [1] 1
y = 15.3
#What is in `y`?
y
## [1] 15.3
#What is in x + y?
x + y
## [1] 16.3
We can also define variables within functions.
sqrt(y <- 5)
## [1] 2.236068
#What is in `y`?
y
## [1] 5
However, we cannot use `=` to do so. This is because functions already have pre-defined argument names that the function is looking for. In this case, `sqrt()` is looking for the argument `x`, not `y`.
sqrt(y = 5)
## Error in sqrt(y = 5): supplied argument name 'y' does not match 'x'
Instead, we can do the following:
sqrt(x = y <- 5) #Here we define y as 5 and pass y into x
## [1] 2.236068
sqrt(x = 5) #Here we pass 5 into x
## [1] 2.236068
Importantly, when defining variable names, use an informative name. This lets you and others reviewing the code know how the variable is used. For instance, we used `x` and `y` in the examples, but their meaning is unknown. Names like `country_population` or `room_capacity` define a variable much better.
There are different types of data in R, which can be stored as a variable. Below is a table of some of the most commonly used data types.
Data Type | Definition | Example
---|---|---
numeric | Any number value | 3.14
integer | Any whole number value | 42L
character | Any number of characters defined within quotation marks | "Hello world!"
logical | A value of TRUE or FALSE | TRUE
factor | A categorical type of data with a fixed set of levels | factor(c("Male", "Female"))
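The factor example in the table can be created with the `factor()` function, which stores categorical data with a defined set of levels:

```r
# A factor with two levels, specified explicitly
sex <- factor(c("Male", "Male", "Male", "Female", "Female"),
              levels = c("Male", "Female"))
sex
#> [1] Male   Male   Male   Female Female
#> Levels: Male Female
```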
The function `class()` can be used to find out the type of data you are dealing with.
x <- 3
class(x)
## [1] "numeric"
x <- TRUE
class(x)
## [1] "logical"
Vectors are a data structure in R containing one or more values. In fact, you may have noticed the `[1]` in the output of `x`. This indicates that it is a vector of length 1.
length(x) #This function gives you the length of a vector
## [1] 1
We use the function `c()` to define a vector with multiple elements. The `c` stands for combine.
my_first_vector <- c(1,2,3,4,5) #We can also do this with the following, 1:5 instead of c()
my_first_vector
## [1] 1 2 3 4 5
We can add more elements to the same vector.
my_first_vector <- c(my_first_vector, 6,7)
my_first_vector
## [1] 1 2 3 4 5 6 7
We can call specific elements of a vector using a process called indexing. Basically, we subset specific elements of a vector for further analysis by giving the position we want in brackets `[]` after the vector.
#Let's take out the 3rd element
my_first_vector[3]
## [1] 3
What if we wanted to select multiple elements? We can use another vector with the positions we want.
#Let's take out the 2nd and 4th elements
my_first_vector[c(2,4)]
## [1] 2 4
We can also do the opposite and select all elements except one or more by using `-`.
#Let's keep all but the 5th element
my_first_vector[-5]
## [1] 1 2 3 4 6 7
#Let's keep all but the 1st and 3rd elements
my_first_vector[-c(1,3)]
## [1] 2 4 5 6 7
Importantly, R functions are typically vectorized. This means that the function will perform its operation on all elements of the vector without having to loop through for each element.
my_first_vector
## [1] 1 2 3 4 5 6 7
my_first_vector * 2
## [1] 2 4 6 8 10 12 14
We can also test some of the math functions we listed above.
mean(my_first_vector) #This gives the mean of a numeric vector
## [1] 4
min(my_first_vector) #This gives the minimum numeric value in a vector
## [1] 1
max(my_first_vector) #This gives the maximum numeric value in a vector
## [1] 7
In R, there are relational operators that compare values between two variables. Typically, these are numerical equalities or inequalities. The result of a comparison is a Boolean (logical) value.
Operator | Description
---|---
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
== | equal to
!= | not equal to
%in% | is ‘in’ a given vector
#Is 3 greater than 5?
3 > 5
## [1] FALSE
We can use the `%in%` operator to determine whether a given value is in a vector.
#Is 5 in 1 through 10?
5 %in% 1:10
## [1] TRUE
#Is 'a' in c('a','b','c')?
'a' %in% c('a','b','c')
## [1] TRUE
There are also logical operators which connect two or more expressions depending on the meaning of the operator. These are typically combined with relational operators.
Operator | Description
---|---
\| | OR
& | AND
! | NOT
#Is 3 greater than 1 and 5?
3 > 1 & 3 > 5
## [1] FALSE
#Is 3 greater than 1 or 5?
3 > 1 | 3 > 5
## [1] TRUE
We can build custom functions to perform operations that are not pre-built in base R. Custom functions can include pre-built or other custom functions. In fact, many pre-built functions are functions of other pre-built functions.
my_function <- function(arg1, arg2,...){
statements
return(object)
}
Let’s make a function that finds the average of a numeric vector.
average <- function(numeric_vector){
out <- sum(numeric_vector)/length(numeric_vector)
return(out)
}
Once we have created our function, you will see it under the Functions section in the Environment tab.
Let’s test our function below.
average(c(1,2,3))
## [1] 2
What happens if we put a non-numeric vector?
average(c('a', 'b', 'c'))
## Error in sum(numeric_vector): invalid 'type' (character) of argument
While R has an error message for some cases, you can also build your own error handling with a custom message.
average <- function(numeric_vector){
if (class(numeric_vector) != 'numeric') #condition to throw the error
out <- 'This is not a numeric vector' #The custom error message
else
out <- sum(numeric_vector)/length(numeric_vector)
return(out)
}
average(c('a', 'b', 'c'))
## [1] "This is not a numeric vector"
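A more idiomatic alternative (a sketch using base R’s `stop()` and `is.numeric()`, rather than returning the message as a value) signals a real error that calling code can catch:

```r
average <- function(numeric_vector){
  if (!is.numeric(numeric_vector)) {
    stop("This is not a numeric vector")  # halts execution with an error
  }
  sum(numeric_vector) / length(numeric_vector)
}

average(c(1, 2, 3))  # 2
```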
We can also view the code of any function by typing its name without the `()`. This can help you understand how a function works, especially if it was created by someone else.
average
## function(numeric_vector){
## if (class(numeric_vector) != 'numeric') #condition to throw the error
## out <- 'This is not a numeric vector' #The custom error message
## else
## out <- sum(numeric_vector)/length(numeric_vector)
## return(out)
## }
- Save your code in a `.Rmd` or `.r` file, and use `#` for commenting. This will aid the reproducibility of your code for yourself and others.
- Avoid uninformative variable names like `x`, `y`, or similar.

The goal of this algorithm is to find the optimal division of `n` observations into `k` clusters, so that the total squared distance of the group members to the cluster centroid is minimized.
The K-means algorithm attempts to do the following:

1. Define a `k` number of clusters.
2. Randomly select `k` observations to be the initial centroids.
3. Assign each observation to its nearest centroid.
4. Recompute each centroid as the mean of the observations assigned to it.
5. Repeat steps 3 and 4 until a maximum number of iterations is reached or the assignments stop changing.

We can develop a simple K-means function using the above algorithm. Here we have a data set `USArrests`, which contains statistics for arrests per 100,000 residents in each state for murder, assault, and rape, along with the percentage of the population living in urban areas.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
The data is then scaled to standardize the values.
USArrests_scaled <- scale(USArrests)
head(USArrests_scaled)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
We can use principal component analysis to generate a low-dimensional representation of the data.
pca_USArrests <- prcomp(USArrests_scaled, scale. = F)
pca_USArrests_df <- as.data.frame(pca_USArrests$x) %>%
dplyr::select(PC1, PC2) %>%
cbind(States = rownames(USArrests))
ggplot(pca_USArrests_df, aes(x = PC1, y = PC2)) +
geom_text(aes(label = States)) +
theme_classic()
We can already see possible clusters or groupings of states with similar statistics. Let’s start with an easy `k` value of 2. We initialize `k` and select the initial centroids.
k = 2
centroids = sample.int(dim(USArrests_scaled)[1], k) #randomly select k integers from 1 to the length of the data.
centroid_points = USArrests_scaled[centroids,] %>% as.matrix() #use the selected integers as indices and select for them in the data frame.
centroid_points
## Murder Assault UrbanPop Rape
## Oregon -0.6630682 -0.1411127 0.1008652 0.8613783
## Tennessee 1.2425641 0.2068693 -0.4518209 0.6051428
Next, we use a distance metric to compare each observation to the centroids. This results in a matrix that gauges dissimilarity. Observations that are further from a centroid are less likely to be part of that cluster. The choice of distance metric will affect how the clusters form. Here we choose the Euclidean distance:
\[d_{euc}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}\]

where:

$n$ is the number of features

$x_i$ is the value of the observation

$y_i$ is the value of the centroid
dataPoints <- as.matrix( USArrests_scaled)
dist_mat <- matrix(0, nrow = nrow(dataPoints), ncol = k) #initialize an empty matrix
for (j in 1:k)
{
for (i in 1:nrow(dataPoints))
{
dist_mat[i,j] = sqrt(sum((dataPoints[i,1:ncol(dataPoints)] - centroid_points[j,1:ncol(centroid_points)])^2))
}
}
head(dist_mat)
## [,1] [,2]
## [1,] 2.370568 0.8407489
## [2,] 2.699070 2.3362541
## [3,] 2.000866 2.2989846
## [4,] 1.847763 1.4254486
## [5,] 2.657402 3.0119267
## [6,] 1.533198 2.1972111
The cluster for each observation is chosen as the centroid with the smallest distance to the observation. We can use the `which.min()` function.
cluster = factor(apply(dist_mat, 1, which.min)) #selects the column index with the smallest distance
head(cluster)
## [1] 2 2 1 2 1 1
## Levels: 1 2
Recall that we are minimizing the squared Euclidean distances between each observation and its assigned centroid. This is also the within-cluster sum of squares (WCSS).

\[W(C_j) = \sum_{x_i \in C_j}(x_i-\mu_j)^2\]

We define the total within-cluster sum of squares (total WCSS), which measures the compactness of the clustering. Minimizing this value results in tighter clusters.

\[\text{total WCSS} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j}(x_i-\mu_j)^2\]

dist_mat_cluster <- list()
for(i in 1:k){
dist_mat_cluster[[i]] <- dist_mat[which(cluster == i),i]^2
}
within_cluster_ss <- unlist(lapply(dist_mat_cluster, sum))
cat('Within-cluster sum of squares:')
## Within-cluster sum of squares:
within_cluster_ss
## [1] 133.74617 54.87014
total_WCSS = sum(within_cluster_ss)
cat('\nTotal within-cluster sum of squares:', total_WCSS)
##
## Total within-cluster sum of squares: 188.6163
Using the PCA graph, we can observe how our clusters look and where the initial centroids are located.
pca_USArrests_df <- as.data.frame(pca_USArrests$x) %>%
dplyr::select(PC1, PC2) %>%
cbind(States = rownames(USArrests)) %>%
cbind(Clusters = cluster)
centroid_coord <- predict(pca_USArrests, centroid_points) %>% as.data.frame() # project the centroids into PC space
rownames(centroid_coord) <- c(1:k) # label the centroids by cluster number
ggplot(pca_USArrests_df, aes(x = PC1, y = PC2)) +
geom_text(aes(label = States, color = Clusters)) +
geom_point(data = centroid_coord, # plotting the centroids
mapping = aes(x = PC1, y = PC2, color = rownames(centroid_coord)),
size = 3) +
theme_classic()
As you can see, the clustering is not great yet, since we have only initialized the algorithm by randomly selecting `k` observations as centroids. The next step is to form new centroids and iteratively re-assign observations to clusters until a maximum number of iterations is reached or observations are no longer re-assigned to different clusters.
We can generate new centroid values by taking the mean of each feature across all observations that are part of the cluster.
new_centroid = USArrests_scaled %>%
as.data.frame() %>%
cbind(Clusters = cluster) %>%
group_by(Clusters) %>%
summarise_all(mean)
new_centroid
## # A tibble: 2 × 5
## Clusters Murder Assault UrbanPop Rape
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.629 -0.517 0.0620 -0.328
## 2 2 1.12 0.919 -0.110 0.584
centroid_points = new_centroid[,-1] %>% as.matrix()
Rather than repeating the code over and over, we can write a function that will do it for us.
k_means_ <- function(df, k, iters){
#initialize random centroids
centroids = sample.int(dim(df)[1], k)
centroid_points = df[centroids,] %>% as.matrix()
dataPoints <- as.matrix(df)
#initialize WCSS
within_cluster_ss <- c()
for (iter in 1:iters){ # loop over iterations
dist_mat <- matrix(0, nrow = nrow(dataPoints), ncol = k)
for (j in 1:k)
{
for (i in 1:nrow(dataPoints))
{
dist_mat[i,j] = sqrt(sum((dataPoints[i,1:ncol(dataPoints)] - centroid_points[j,1:ncol(centroid_points)])^2))
}
}
cluster = factor(apply(dist_mat, 1, which.min))
dist_mat_cluster <- list()
for(i in 1:k){
dist_mat_cluster[[i]] <- dist_mat[which(cluster == i),i]^2
}
within_cluster_ss_temp <- unlist(lapply(dist_mat_cluster, sum))
within_cluster_ss <- append(within_cluster_ss, within_cluster_ss_temp)
new_centroid = df %>%
as.data.frame() %>%
cbind(Clusters = cluster) %>%
group_by(Clusters) %>%
summarise_all(mean)
centroid_points = new_centroid[,-1] %>% as.matrix()
}
within_cluster_ss <- t(array(within_cluster_ss, dim = c(k, iters)))
return(list(Cluster = cluster,
WCSS = within_cluster_ss))
}
We use the same parameters as before and pass our variables into our new function `k_means_()`.
iters = 10
k = 2
USArrests_scaled <- scale(USArrests)
k_means <- k_means_(USArrests_scaled, k, iters)
k_means
## $Cluster
## [1] 2 2 2 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 2 1 2 1 2 2 1 1 2 1 1 2 2 2 1 1 1 1 1
## [39] 1 2 1 2 2 1 1 1 1 1 1 1
## Levels: 1 2
##
## $WCSS
## [,1] [,2]
## [1,] 147.75655 82.14443
## [2,] 64.39358 50.06885
## [3,] 56.22017 46.82608
## [4,] 56.11445 46.74796
## [5,] 56.11445 46.74796
## [6,] 56.11445 46.74796
## [7,] 56.11445 46.74796
## [8,] 56.11445 46.74796
## [9,] 56.11445 46.74796
## [10,] 56.11445 46.74796
The total WCSS decreases quickly and then plateaus well before the maximum number of iterations, indicating that the algorithm has converged.
df <- data.frame(Total_WCSS = rowSums(k_means$WCSS),
                 iter = c(1:iters))
ggplot(df, aes(x = iter, y = Total_WCSS)) +
geom_line() + labs(x = 'Iteration', y = 'Total WCSS') +
theme_classic()
Using the PCA graph, we can observe how our clusters look after reaching the maximum number of iterations.
pca_USArrests_df <- as.data.frame(pca_USArrests$x) %>%
dplyr::select(PC1, PC2) %>%
cbind(States = rownames(USArrests)) %>%
cbind(Clusters = k_means$Cluster)
ggplot(pca_USArrests_df, aes(x = PC1, y = PC2)) +
geom_text(aes(label = States, color = Clusters)) +
theme_classic()
We can also observe the clustering with a scatter plot of two features, like `UrbanPop` and `Murder`.
USArrests_df <- USArrests_scaled %>%
as.data.frame() %>%
cbind(States = rownames(USArrests)) %>%
cbind(Clusters = k_means$Cluster)
ggplot(USArrests_df, aes(x = UrbanPop, y = Murder)) +
geom_text(aes(label = States, color = Clusters)) +
theme_classic()
Initially, we looked at 2 possible clusters. We can test out different numbers for `k`. Let’s repeat the process with 2, 3, 4, and 5 clusters. The results are below:
k_means_test <- lapply(c(2:5), function(k) {k_means_(USArrests_scaled, k, iters)})
cluster_list <- lapply(k_means_test, function(x) x[[1]])
names(cluster_list) <- paste('k =',c(2:5))
cluster_list_df <- do.call(cbind, cluster_list)
pca_USArrests_df <- as.data.frame(pca_USArrests$x) %>%
dplyr::select(PC1, PC2) %>%
cbind(States = rownames(USArrests)) %>%
cbind(cluster_list_df) %>%
pivot_longer(cols = names(cluster_list))
ggplot(pca_USArrests_df, aes(x = PC1, y = PC2)) +
geom_point(aes(shape = factor(value), color = factor(value))) +
facet_wrap(~name) +
labs(color = "Cluster", shape = "Cluster")
Of course, we can continue testing additional values of k. However, it may be more advantageous to determine the optimal k based on the total within-cluster sum of squares. Recall that this value must be minimized to find the optimal cluster assignments. For instance, we can vary k from 1 to 10, compute the total within-cluster sum of squares for each k, and plot the two against each other.
k_means_test <- lapply(c(1:10), function(k) {k_means_(USArrests_scaled, k, iters)})
WCSS_list <- lapply(k_means_test, function(x) x[[2]][iters,])
total_WCSS_list <- lapply(WCSS_list, sum)
df <- data.frame(Y = unlist(total_WCSS_list), X = c(1:10))
ggplot(df, aes(x = X, y = Y)) +
geom_line() +
geom_point() +
labs(x = 'k clusters', y = 'Total WCSS') +
scale_x_continuous(breaks = c(1:10)) +
theme_classic()
The optimal k looks to be either 4 or 5. As you can see, k-means clustering is simple and quick. One caveat is choosing the number of clusters. Another is the random initialization of centroids, which can change the final clusters between runs and slow down the algorithm on very large data sets. One possible improvement is to generate several different sets of initial centroids and select the set that yields the smallest total within-cluster sum of squares.
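Base R’s built-in kmeans() already implements this improvement through its nstart argument, which runs the algorithm from several random sets of initial centroids and keeps the solution with the smallest total within-cluster sum of squares. A minimal sketch (using scale(USArrests) directly so the snippet stands alone, rather than the USArrests_scaled object from earlier):

```r
# run k-means from 25 random initializations and keep the best solution
set.seed(1)
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)

km$tot.withinss   # total within-cluster sum of squares of the best run
table(km$cluster) # cluster sizes
```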
Conditional probability is the likelihood of an event or outcome occurring based on the occurrence of a previous event or outcome. Below is the equation for conditional probability.
$P(A|B) = \frac{P(A \cap B) }{P(B)}$
where:
P(A ∩ B) is the probability that event A and event B both occur.
P(B) is the probability that event B occurs.
For example, consider the probability of receiving a raise given that you completed your work on time this year. Looks like working hard does pay off. Conditional probability implies there is a relationship between the two events: the probability of receiving a raise depends on a previous result, in this case, completing your work on time.
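As a minimal numeric sketch with made-up counts (hypothetical, not from any real survey), we can compute P(raise | on time) straight from the definition:

```r
# hypothetical counts out of 100 employees
n_total             <- 100
n_on_time           <- 60 # completed work on time (event B)
n_on_time_and_raise <- 45 # completed work on time AND received a raise (A and B)

P_B       <- n_on_time / n_total           # P(B)
P_A_and_B <- n_on_time_and_raise / n_total # P(A intersect B)

P_A_given_B <- P_A_and_B / P_B # P(A|B): probability of a raise given on-time work
P_A_given_B                    # 0.75
```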
Let’s take a look at a survey that asked male and female students what their favorite pastimes are. Below is a function, rng_survey(), prepared to generate the survey answers.
#helper function for our survey
rng_survey <- function(max, n){
#max is the maximum number of surveys for a participant group (ie. Male)
#n is the number of possible answers
total = 0
while (total != max){
x <- sample(1:max, n, replace = TRUE)
total = sum(x)
}
return(x)
}
#male = rng_survey(150,4)
#female = rng_survey(150,4)
#cat("male answers:", male, "\n")
#cat("female answers:", female, "\n")
#male answers: 7 19 26 98
#female answers: 7 69 40 34
Using the function, we create a data frame with the survey answers.
#create data frame for the survey responses
df <- data.frame(gender=rep(c('Male', 'Female'), each=150),
sport=rep(c('Exercise', 'Cooking', 'Reading', 'Television',
'Exercise', 'Cooking', 'Reading', 'Television'),
times=c(7, 19, 26, 98, 7, 69, 40, 34 )))
df
## gender sport
## 1 Male Exercise
## 2 Male Exercise
## 3 Male Exercise
## 4 Male Exercise
## 5 Male Exercise
## 6 Male Exercise
## 7 Male Exercise
## 8 Male Cooking
## ... (output truncated; 300 rows in total, 150 Male followed by 150 Female)
We convert the data frame to a table.
#create two-way table from data frame
survey_data <- addmargins(table(df$gender, df$sport))
survey_data
##
## Cooking Exercise Reading Television Sum
## Female 69 7 40 34 150
## Male 19 7 26 98 150
## Sum 88 14 66 132 300
We can extract information from our table by calling a row and a column. For instance, let’s ask for the number of males that prefer cooking.
survey_data['Male', 'Cooking']
## [1] 19
Now we can ask the probability of being male given that they prefer cooking. We know that the probability of being male is 0.5. We can calculate the rest from the table.
P_male = 0.5
P_cooking_male = survey_data['Male', 'Cooking'] / survey_data['Male', 'Sum'] #P(cooking | male): proportion of males who prefer cooking
P_male_P_cooking_male = P_male * P_cooking_male #P(male and cooking)
P_cooking = survey_data['Sum', 'Cooking'] / survey_data['Sum', 'Sum'] #P(cooking): proportion of all respondents who prefer cooking
P_male_cooking = P_male_P_cooking_male / P_cooking
P_male_cooking #probability of being male given they prefer cooking
## [1] 0.2159091
Alternatively, we can use the table to easily answer the same problem.
survey_data['Male', 'Cooking'] / survey_data['Sum', 'Cooking']
## [1] 0.2159091
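prop.table() can compute these conditional probabilities for the whole table at once. Normalizing each column of the raw two-way table (before addmargins() is applied) with margin = 2 gives P(gender | pastime); this assumes the df data frame built above:

```r
# column-wise proportions of the raw two-way table: P(gender | pastime)
cond_probs <- prop.table(table(df$gender, df$sport), margin = 2)
cond_probs['Male', 'Cooking'] # the same 19/88 answer as above
```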
Next, we can ask the probability of being female given that they prefer reading.
survey_data['Female', 'Reading'] / survey_data['Sum', 'Reading']
## [1] 0.6060606
Suppose we have the same survey, but gender was accidentally not recorded. How could we solve for the probability of being male given that they prefer cooking? Let us assume that from a previous survey we knew that 12.67% of males prefer to cook; that is, the probability of preferring to cook given that they are male is 0.1267. We can use something called Bayes’ Theorem,
$P(A|B) = \frac{P(B|A) P(A) }{P(B)}$.
Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that may be related to the event.
Knowing that the equation for conditional probability is
$P(A|B) = \frac{P(A \cap B) }{P(B)}$
then
P(A ∩ B) = P(A|B)P(B)
and
P(A ∩ B) = P(B|A)P(A).
Substituting the second expression for P(A ∩ B) into the conditional probability equation yields
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
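As a quick numeric sanity check with made-up probabilities (not the survey data), both expressions for P(A ∩ B) give the same conditional probability, confirming the substitution:

```r
# hypothetical probabilities for two events A and B
P_A_and_B <- 0.12
P_A       <- 0.3
P_B       <- 0.4

P_A_given_B <- P_A_and_B / P_B # conditional probability definition
P_B_given_A <- P_A_and_B / P_A

# Bayes' Theorem recovers P(A|B) from P(B|A)
P_B_given_A * P_A / P_B # equals P_A_given_B
```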
#view table
survey_data
##
## Cooking Exercise Reading Television Sum
## Female 69 7 40 34 150
## Male 19 7 26 98 150
## Sum 88 14 66 132 300
P_male = 0.5
P_cooking_male = 0.1267 #probability of preferring to cook given that they are male
P_cooking = survey_data['Sum', 'Cooking'] / survey_data['Sum', 'Sum'] #probability of male and female that prefers cooking
P_male_cooking = (P_cooking_male * P_male) / P_cooking
P_male_cooking #probability of being male given they prefer cooking
## [1] 0.2159659
Let’s try to solve for the reverse, the probability of preferring to cook given that they are male, using Bayes’ Theorem.
P_male_cooking * P_cooking / P_male
## [1] 0.1267
Is there a way to solve for the probability of preferring to cook given that they are female using Bayes’ Theorem?
P_female = 1 - P_male #We have a binary choice and the probabilities sum up to 1
P_female_cooking = 1 - P_male_cooking
P_female_cooking * P_cooking / P_female
## [1] 0.4599667
Looking back at the table, we can verify this result directly from the counts.
survey_data['Female', 'Cooking'] / survey_data['Female', 'Sum']
## [1] 0.46
Let’s explore the built-in iris data set. First, let’s load up the data and have a brief look at it. Using class(), we can see that the data set is a data.frame.
data(iris)
class(iris)
## [1] "data.frame"
Using head(), we can take a look at the first 6 rows. This may be helpful when analyzing large data sets.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
There are 4 columns made up of numbers and one with strings. The fifth column, Species, suggests that each row corresponds to data from a single sample of that species. We can also use dim() to find the full dimensions of the data.
dim(iris)
## [1] 150 5
This tells us that there are 150 rows and 5 columns in this data frame.
It is also good practice to use the str() function to briefly look at the structure of the data.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Again, this confirms our brief look earlier: there are 4 columns of numeric values. We can also see that the fifth column is actually a Factor with 3 levels. Using summary(), we can acquire some insight into the data distribution.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We can start by looking at the distribution of Sepal.Length data points between the three species. We use the library ggplot2 and create a scatter plot with geom_point().
library(ggplot2)
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, color = Species)) +
geom_point()
It is tough to see the distribution in this manner, but it does look like there may be differences between the Species. Using geom_histogram(), we can build a distribution based on the number of observations in each bin. In this case, the x-axis will be divided into 30 equally spaced bins with values between the minimum and maximum Sepal.Length. This method works well for continuous variables.
ggplot(iris, mapping = aes(x = Sepal.Length, fill = Species)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s try using a box plot, also called box and whiskers, to look at the distribution of Sepal.Length between the three species. A box plot summarizes the data with five numbers: the minimum, first quartile, median, third quartile, and maximum. The rectangle spans the first to third quartile of the data, a range known as the interquartile range (IQR). The line in the middle denotes the median, and the whiskers extend to the smallest and largest values within 1.5 times the IQR of the box edges. Data points that fall beyond the whiskers are drawn individually as outliers.
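These five summary values can be computed directly with base R’s fivenum(), Tukey’s five-number summary, which closely matches the quartiles reported by summary() earlier:

```r
# minimum, lower hinge, median, upper hinge, maximum of Sepal.Length
fivenum(iris$Sepal.Length)
```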
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) + geom_boxplot()
We can also plot multiple graphs using facet_wrap(), but first we need to reshape the data frame into the long format.
library(reshape2)
iris_melt <- melt(iris, id = 'Species') #reshaping the dataframe
head(iris_melt)
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
Now the data is in a long format, where each row provides a sample id (Species), a variable, like Sepal.Length, and the corresponding value. We can use facet_wrap() to generate multiple plots.
ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
geom_boxplot() +
facet_wrap(~variable)
Interestingly, it looks like there is a correlation between some of the variables. We can explore this by generating a correlation matrix. First, we will exclude the fifth column since it is non-numeric. Next, we will use the function cor(), which computes the full correlation matrix when given a single data frame.
iris_cor <- cor(iris[,-5])
head(iris_cor)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
It seems that all the variables except for Sepal.Width correlate. Let’s visualize this with ggplot(). We will need to reshape our data first. We will also add a new color scale since the default color scale isn’t too useful.
iris_cor_melt <- melt(iris_cor) #reshaping the data
ggplot(iris_cor_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = 'white') + #geom_tile generates tiles which are colored based on a value
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab")
It doesn’t look too pretty. We can generate a helper function to reorder the matrix by correlation distance. This way we can easily see the variables that highly correlate.
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1 - cormat) / 2)
hc <- hclust(dd)
cormat[hc$order, hc$order]
}
iris_cor_ordered <- reorder_cormat(iris_cor)
head(iris_cor_ordered)
## Sepal.Width Sepal.Length Petal.Length Petal.Width
## Sepal.Width 1.0000000 -0.1175698 -0.4284401 -0.3661259
## Sepal.Length -0.1175698 1.0000000 0.8717538 0.8179411
## Petal.Length -0.4284401 0.8717538 1.0000000 0.9628654
## Petal.Width -0.3661259 0.8179411 0.9628654 1.0000000
Now we reshape the data and use ggplot() to visualize the ordered correlation matrix.
iris_cor_ordered_melt <- melt(iris_cor_ordered, na.rm = TRUE)
ggplot(iris_cor_ordered_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = 'white') +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab")
This can also be done using pheatmap, with the added benefit of pre-built ordering and cluster trees.
library(pheatmap)
pheatmap(iris_cor,
color = colorRampPalette(c('blue', 'white', 'red'))(100),
breaks = seq(-1,1, length.out = 100))
Based on the data, it looks like virginica has the longest Sepal.Length. We can test for this using basic statistics. Let’s perform a simple Student’s t-Test between virginica and setosa. We will use the reshaped data frame and remove the versicolor species.
iris_subset_melt <- subset(iris_melt, subset = Species == 'virginica' | Species == 'setosa' ) #removing versicolor so we have the two groups we are comparing
head(iris_subset_melt)
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
Let’s do some more data wrangling to set up our data. Essentially, we will create two numeric vectors corresponding to Sepal.Length, one for each species.
setosa_sepal_length = subset(iris_subset_melt,
subset =
variable == 'Sepal.Length' &
Species == 'setosa')$value #selecting only values for setosa
virginica_sepal_length = subset(iris_subset_melt,
subset =
variable == 'Sepal.Length' &
Species == 'virginica')$value #selecting only values for virginica
With that, we are ready to perform the Student’s t-Test for two samples. Essentially, we are comparing the means of two groups with a single variable.
t.test(setosa_sepal_length, virginica_sepal_length)
##
## Welch Two Sample t-test
##
## data: setosa_sepal_length and virginica_sepal_length
## t = -15.386, df = 76.516, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.78676 -1.37724
## sample estimates:
## mean of x mean of y
## 5.006 6.588
The graph already suggested that the two species were different based on Sepal.Length. Here the p-value is less than 2.2e-16, well below a typical alpha of 0.05. Therefore, the null hypothesis that there is no difference is rejected.
If we continue performing more comparisons, we will run into the multiple testing problem. In other words, if we keep testing different variables, we will eventually find a difference by chance. Since we are accepting a 5% chance of incorrectly rejecting the null hypothesis, performing 100 comparisons will result in about 5 incorrect rejections, or false positives. A simple way to correct for this is the Bonferroni correction, which divides the alpha by the total number of comparisons. Additional correction methods exist as well.
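An equivalent formulation, used by base R’s p.adjust(), multiplies each p-value by the number of comparisons (capping at 1) instead of dividing the alpha. A minimal sketch with made-up p-values:

```r
# three hypothetical raw p-values
p_raw <- c(0.01, 0.04, 0.30)
p.adjust(p_raw, method = 'bonferroni') # 0.03 0.12 0.90
```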
To compare multiple groups, we will perform a One-Way ANOVA. While t-tests compare only two groups, an ANOVA can compare three or more. Note that an ANOVA only tests whether at least one mean is different. A post-hoc comparison test with multiple comparison correction will be needed to find which pairwise comparisons are statistically different.
iris_anova <- aov(Sepal.Width ~ Species, data = iris)
summary(iris_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
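One standard post-hoc test is Tukey’s Honest Significant Difference, available in base R as TukeyHSD(). It runs all pairwise comparisons on a fitted ANOVA with a built-in multiplicity adjustment; for example, applied to the iris_anova model above:

```r
# pairwise species comparisons of Sepal.Width with adjusted p-values
iris_anova <- aov(Sepal.Width ~ Species, data = iris)
TukeyHSD(iris_anova)
```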
There is a difference in Sepal.Width between the Species, significant with a p-value < 2e-16. However, we do not know the direction, or between which Species. We can use a pairwise t-test to find out (here applied to Sepal.Length). First, we will try without any multiple testing correction.
pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'none')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Length and iris$Species
##
## setosa versicolor
## versicolor 8.8e-16 -
## virginica < 2e-16 2.8e-09
##
## P value adjustment method: none
Now we can try with Bonferroni correction.
pairwise.t.test(x = iris$Sepal.Length, g = iris$Species, p.adj = 'bonf')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Length and iris$Species
##
## setosa versicolor
## versicolor 2.6e-15 -
## virginica < 2e-16 8.3e-09
##
## P value adjustment method: bonferroni
The result is still the same here, but that may not always be the case.
If we want to plot the resulting p-values, we can use the package ggpubr.
library(ggpubr)
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
geom_boxplot() +
stat_compare_means()
As you can see, we can plot the p-value from an ANOVA directly onto the plot, but this only tells you that there is a difference, not between which pair of Species. To perform the pairwise comparisons, you have to manually generate a list of pairwise comparisons.
my_comparisons = list(c('setosa','virginica'), c('setosa','versicolor'), c('versicolor','virginica'))
ggplot(iris, mapping = aes(y = Sepal.Length, x = Species, fill = Species)) +
geom_boxplot() +
stat_compare_means(comparisons = my_comparisons)
We can also repeat this with all variables. It can get a bit messy, so we will switch the p-value label to “p.signif”. This essentially uses the * symbol to denote significance levels: ns (p > 0.05), * (p <= 0.05), ** (p <= 0.01), *** (p <= 0.001), and **** (p <= 0.0001).
ggplot(iris_melt, aes(x = Species, y = value, fill = Species)) +
geom_boxplot() +
facet_wrap(~variable) +
stat_compare_means(comparisons = my_comparisons, label = 'p.signif')
Text mining is the process of discovering new and latent features within a body of text. It uses Natural Language Processing (NLP) to enable computers to digest human language, and it has many real-world uses, such as sentiment analysis and spam filtering.
In this post, we will use the package tidytext, which is part of the tidyverse. tidytext uses tidy data principles to wrangle and visualize text data. We will process some text and perhaps discover new information. First, make sure you have the following libraries installed: tidytext, dplyr, ggplot2, stringi, textclean, forcats, and tidyr.
#install.packages(c('dplyr','ggplot2','stringi','forcats', 'tidytext','textclean', 'tidyr'))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(textclean)
library(tidytext)
library(forcats)
library(tidyr)
We can gather data by mining text from Wikipedia. Let’s import nine articles with topics including math, sports, and fantasy books. The specific topics will be important later when we perform text analytics. We will use the function readLines() to connect directly to a Wikipedia URL and read all lines of text.
wiki <- "http://en.wikipedia.org/wiki/"
titles <- c("Integral", "Derivative",
"Calculus", "Football", "Soccer", "Basketball",
"A_Song_of_Ice_and_Fire", "The_Lord_of_the_Rings", "His_Dark_Materials")
articles <- character(length(titles))
for (i in 1:length(titles)) {
articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i]), warn = F), col = " ")
}
length(articles)
## [1] 9
Now that we have imported our articles, we will refer to them as documents. The documents are represented as a large character vector with a length of 9. For each document, we need to separate each word and then count them. To do so, we use a process called tokenization. While a computer reads a string as a series of characters, humans read it as a sentence of words. To reproduce this for computers, we can split the string using a space delimiter, " ", breaking the sentence into words, or in this case, tokens. Let’s take a look at an example. Here we have a random sentence extracted from one of our documents.
text <- c('A Song of Ice and Fire is a series of epic fantasy novels by the American novelist and screenwriter George R. R. Martin. He began the first volume of the series, A Game of Thrones, in 1991, and it was published in 1996.')
We can use the following code to tokenize this text.
tokenize_text <- (unlist(lapply(text, function (x) strsplit(x, split = ' ' )))) #' ' is used as the space delimiter
tokenize_text
## [1] "A" "Song" "of" "Ice" "and"
## [6] "Fire" "is" "a" "series" "of"
## [11] "epic" "fantasy" "novels" "by" "the"
## [16] "American" "novelist" "and" "screenwriter" "George"
## [21] "R." "R." "Martin." "He" "began"
## [26] "the" "first" "volume" "of" "the"
## [31] "series," "A" "Game" "of" "Thrones,"
## [36] "in" "1991," "and" "it" "was"
## [41] "published" "in" "1996."
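Notice that the naive split leaves punctuation attached (“Martin.”, “series,”) and preserves case (“A” vs. “a”), so the same word can be counted as several different tokens. A minimal base R sketch of the extra normalization a real tokenizer performs, lowercasing and stripping punctuation before splitting (unnest_tokens() will do this for us below):

```r
text  <- 'He began the first volume of the series, A Game of Thrones, in 1991.'
clean <- tolower(gsub('[[:punct:]]', '', text)) # lowercase and drop punctuation

# split on spaces into normalized tokens
tokens <- unlist(strsplit(clean, ' ', fixed = TRUE))
head(tokens)
```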
Fortunately, tokenization is built into the tidytext package. First, we will do some text cleaning by removing common HTML tags with the function replace_html(). Since HTML tags are not useful information for us, we can discard them. Then we will convert our large character vector into a data frame, and follow this with unnest_tokens() to tokenize the documents.
articles <- replace_html(articles, symbol = TRUE)
articles <- data.frame('articles' = articles)
articles$titles <- titles
tokenized <- articles %>% unnest_tokens(word, articles)
head(tokenized)
If you haven’t noticed already, there are still some nonsense words left over from HTML code. We could create a new database of additional HTML code to remove, but for now this is okay: later we will weight the important words more than the unimportant ones, which should reduce the influence of the HTML leftovers. Next, let’s count how many times each word appears in each document using the count() function.
tokenized %>%
count(titles, word, sort = TRUE) %>%
head()
As you can see, the top words are “the” and “of”, which are not important for understanding the hidden features of the documents. This set of words is called stop words. They are commonly used words in any language, unimportant for NLP, and are removed to focus on the meaningful words. We will load a generic stop_words dataset and filter these words out of our data frame. Keep in mind that depending on the context of the documents, you may need a different set of stop words.
data(stop_words)
tokenized <- tokenized %>%
anti_join(stop_words) #this filters out the stop words
## Joining, by = "word"
document_words <- tokenized %>%
count(titles, word, sort = TRUE)
head(document_words)
Now we can see important words that make sense for each of our documents, like the word football appearing the most for the Wikipedia article “Football”. Let’s take a look at a list of the most common words in “Football”.
document_words[document_words$titles == 'Football',] %>%
filter(n > 50) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
Let’s try to contextualize this data by ranking words based on how often they appear within the document, instead of looking at the absolute count. This is called term frequency (TF), and it gives the frequency of the word in each document. The TF is defined as \(tf(t,d)\): \[tf(t,d) = \frac{f_{t,d}}{\Sigma_{t' \in d}f_{t',d}}\] where \(f_{t,d}\) is the raw count of term \(t\) in document \(d\), and the denominator is the total count of all terms in the document. Each document will have its own TF. Let’s calculate it by first finding the total number of words that appear per document.
total_words <- document_words %>%
group_by(titles) %>%
summarize(total = sum(n))
document_words <- left_join(document_words, total_words)
## Joining, by = "titles"
head(document_words)
Now we can rank each word by dividing the raw count of the word by the total number of words in the document.
freq_by_rank <- document_words %>%
group_by(titles) %>%
mutate(rank = row_number(),
`term frequency` = n/total) %>%
ungroup()
head(freq_by_rank)
However, rare words that may be important can be lost when using TF alone, since TF is influenced by the length of the document. Here we introduce inverse document frequency (IDF), which weighs rare words across all documents by dividing the total number of documents by the number of documents containing the term. We add a \(+1\) term to the denominator to prevent division by 0. \[idf(t,D) = log \frac{N}{|d \in D: t \in d| + 1}\] We can combine this with TF to produce the TF-IDF score, so a term’s frequency is weighted down when the term shows up in many other documents. \[tfidf(t,d,D) = tf(t,d) \times idf(t,D)\]
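To make the formulas concrete, here is a hand-rolled sketch on a toy corpus (three made-up documents, not our Wikipedia articles), following the +1 form of the IDF above. Note that tidytext’s bind_tf_idf() uses a plain ln(N/n) variant without the +1, so its numbers differ slightly:

```r
# toy corpus: three documents as word vectors
docs <- list(d1 = c('cat', 'sat', 'cat'),
             d2 = c('dog', 'sat'),
             d3 = c('cat', 'dog', 'dog'))
term <- 'cat'

# term frequency of 'cat' in d1: raw count over total words in d1
tf <- sum(docs$d1 == term) / length(docs$d1) # 2/3

# inverse document frequency with the +1 term from the formula above
N   <- length(docs)
n_t <- sum(sapply(docs, function(d) term %in% d)) # 2 documents contain 'cat'
idf <- log(N / (n_t + 1))                         # log(3/3) = 0

tf * idf # tf-idf: a word present in most documents is weighted down to 0
```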
document_tf_idf <- document_words %>%
bind_tf_idf(word, titles, n)
head(document_tf_idf)
We can see that the token 93 in A_Song_of_Ice_and_Fire had a very high term frequency. However, when we weight it to produce the TF-IDF score, it is reduced to a value of 0. Now let’s take a look at the top 5 words per document.
library(forcats)
document_tf_idf %>%
group_by(titles) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = titles)) +
geom_col(show.legend = FALSE) +
facet_wrap(~titles, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL)