Introduction to R and RStudio
What is R?
R is a language and environment for statistical computing and graphics. It is widely used for a variety of statistical analysis (i.e., linear and nonlinear modeling, classical statistical tests, clustering, etc.). We can also use R for data visualization and producing figures for presentations. R is freely available and has a large collection of developed packages of different tools. Overall, R is a powerful tool that can be used to handle big data, perform statistics, and visualize results, while enabling end-users to easily develop tools and packages for customized functionality.
What is RStudio?
While R, also called “base R”, itself is an interpreted computer language and comes with a terminal, ease of use and more functionality is introduced when using RStudio, an open source Integrated Development Environment (IDE). Rather than using a terminal, RStudio provides a graphical user interface that is platform agnostic and integrates additional packages, project management, version control, and notebooks.
Getting started with R and RStudio
The Comprehensive R Archive Network or CRAN actively maintains and develops R. It also hosts a repository of R packages that enhances the base functionality of R. To get started, base R can be installed on either Linux, macOS, or Windows by going to the CRAN homepage.
To use R, you need to open a terminal window and type R
and press
enter. You will see some text denoting the R version and some helpful
functions. To quit, type q()
and press enter.
Rather than work in a terminal, RStudio is the way to use R like how many use Microsoft Word to prepare documents. Download RStudio for free here and follow the installation instructions. Once installed, proceed to opening RStudio, typically done by clicking on the RStudio icon.
Once you open RStudio, you will see four main panes. Each will contain different information. You can see my screen below.
Starting from the top left pane and going from left to right, we have the descriptions of each pane below:
- Source Editor: This pane is where you can write R scripts or
notebooks. You can write in other programming languages if you
wanted to as well. Each document will have its own tab. Code written
here can be run with the
Run
command. - Environment: This pane displays objects, variables, and functions that are generated in your R session. There is also a history of all code that was run.
- Console: This pane is where you can type commands and interactively run R code. The output will display in the console.
- Files, Plots, Packages, Help, Viewer: This pane has several tabs
that are important.
- Files: This tab shows the structure and content of a directory on your computer. This could be your working directory or a directory that you manually navigated to.
- Plots: This tab will display plots or figures as an output from the console.
- Packages: This tab contains a list of all packages that are installed. Packages that are loaded in your R session will have a checked box.
- Help: This tab reveals the help pages for an R package or function.
- Viewer: This tab shows compiled R Markdown documents.
Starting a new project in RStudio
When working with R, it is always a good idea to create a new project directory. This will help you keep your files and data organized by project.
To get started: 1. Open RStudio 2. Go to the File menu and select New Project. 3. In the New Project window, choose New Directory. Then, choose New Project. Name your new directory. You can use a name like, Intro-to-R, and then “Create the project as subdirectory of:” in a location of your choice. 4. Click on Create Project. 5. The project should open up automatically in Rstudio.
We can view our working directory by using the function getwd()
getwd()
## [1] "/Users/dtruong4/Desktop/Intro-to-R"
Your working directory will be the location where R will automatically look for files. If you want to find files in a different location, you will either need to provide the full path or type the path in relation to the working directory. Files that you output will automatically save into your working directory unless a path is provided.
To organize your working directory, it is highly recommended to generate
sub-folders like data/
or results/
. You can do so in the Files tab
and select New Folder
.
Basic math calculations in R
Below are some of the most common math calculations that can be done in R.
Operation | Symbol |
---|---|
Addition | a + b |
Subtraction | a - b |
Multiplication | a * b |
Division | a / b |
Exponent | a ^ b |
Remainder | a %% b |
Integer Division | a %/% b |
For instance, here is an example of addition.
5 + 3
## [1] 8
Functions in R
R has several pre-built functions. For instance, we can use sum()
instead of the +
symbol for addition.
sum(1,3) #This gives the sum of 1 and 3
## [1] 4
We can also call the R documentation for a function by using ?
before the function name. The documentation will show up under the Help
tab. It contains information regarding description, usage, and arguments
for a function.
?sum #Find the R Documentation for sum()
We can see that sum()
can return the sum of more than two values.
sum(1,2,3,4,5)
## [1] 15
Math functions in R
In addition to the basic calculations, there are many common pre-built math functions in R.
Operation | Function |
---|---|
Square root | sqrt() |
Logarithm | log() |
Logarithm, base 10 | log10() |
Exponential | exp() |
Summation | sum() |
Round | round() |
Mean | mean() |
Median | median() |
Minimum | min() |
Maximum | max() |
#What is the square root of 4?
sqrt(4)
## [1] 2
#Can we round 3.14?
round(3.14)
## [1] 3
Defining variables
A key aspect of programming is defining variables. We store data or
values as variables so that we can use in other functions or recall it
at a later time. It allows us to save time by storing the data and not
having to re-calculate it again. R has two assignment operators for
defining variables: <-
and =
. The operator <-
can be used
anywhere, whereas the operator =
is only allowed at the top level.
x <- 1
#What is in `x`?
x
## [1] 1
y = 15.3
#What is in `y`?
y
## [1] 15.3
#What is in x + y?
x + y
## [1] 16.3
We can also define variables within functions.
sqrt(y <- 5)
## [1] 2.236068
#What is in `y`?
y
## [1] 5
However, we cannot use the =
to do so. This is because functions
already have pre-defined arguments that the function is looking for. In
this case, sqrt()
is looking for the argument x
, not y
.
sqrt(y = 5)
## Error in sqrt(y = 5): supplied argument name 'y' does not match 'x'
Instead, we can do the following:
sqrt(x = y <- 5) #Here we define y as 5 and pass y into x
## [1] 2.236068
sqrt(x = 5) #Here we pass 5 into x
## [1] 2.236068
Importantly, when defining variable names, ensure that you use an
informative name. This enables yourself and others when reviewing code
to know how the variable was used. For instance, we used x
and y
in
the examples, but their meaning is unknown. Something like
country_population
or room_capacity
provides better definition for a
variable.
R Data Types
There are different types of data in R, which can be stored as a variable. Below is a table of some of the most commonly used data types.
Data Type | Definition | Example |
---|---|---|
numeric | Any number value | 3.14 |
integer | Any whole number value | 42 |
character | Any number of ASCII characters | "Hello world!" |
defined within quotation marks | ||
logical | A value of TRUE or FALSE |
TRUE |
factor | A categorical type of data | #> [1] Male Male Male Female Female |
#> Levels: Male Female |
The function class()
can be used to find out the type of data that you
are dealing with.
x <- 3
class(x)
## [1] "numeric"
x <- TRUE
class(x)
## [1] "logical"
Vectors
Vectors are a data structure in R containing one or more values. In
fact, you may have noticed a [1]
in the output of x
. This indicates
that it is a vector of length 1
.
length(x) #This function gives you the length of a vector
## [1] 1
We use the function c()
to define a vector with multiple elements. The
c
stands for combine.
my_first_vector <- c(1,2,3,4,5) #We can also do this with the following, 1:5 instead of c()
my_first_vector
## [1] 1 2 3 4 5
We can add more elements to the same vector.
my_first_vector <- c(my_first_vector, 6,7)
my_first_vector
## [1] 1 2 3 4 5 6 7
We can call specific elements in a vector by using a process called
indexing. Basically, we can subset specific elements of a vector for
further analysis. We do this by defining which position we want in
brackets []
after the vector.
#Let's take out the 3rd element
my_first_vector[3]
## [1] 3
What if we wanted to select multiple elements? We can use another vector with the positions we want.
#Let's take out the 2nd and 4th elements
my_first_vector[c(2,4)]
## [1] 2 4
We can also do the opposite and select all elements but a single or
multiple element by using -
.
#Let's keep all but the 5th element
my_first_vector[-5]
## [1] 1 2 3 4 6 7
#Let's keep all but the 1st and 3rd elements
my_first_vector[-c(1,3)]
## [1] 2 4 5 6 7
Importantly, R functions are typically vectorized. This means that the function will perform its operation on all elements of the vector without having to loop through for each element.
my_first_vector
## [1] 1 2 3 4 5 6 7
my_first_vector * 2
## [1] 2 4 6 8 10 12 14
We can also test some of math functions we listed above.
mean(my_first_vector) #This gives the mean of a numeric vector
## [1] 4
min(my_first_vector) #This gives the minimum numeric value in a vector
## [1] 1
max(my_first_vector) #This gives the maximmum numeric value in a vector
## [1] 7
Relational and Logical Operators
In R, there are relational operators that compare values between two variables. Typically, these are numerical equalities or inequalities. The result of comparison is a Boolean value.
Operator | Description |
---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
equal to |
!= |
not equal to |
%in% |
is ‘in’ a given vector |
#Is 3 greater than 5?
3 > 5
## [1] FALSE
We can use the %in%
operator to determine if a given value is in a
vector.
#Is 5 in 1 through 10?
5 %in% 1:10
## [1] TRUE
#Is 'a' in c('a','b','c')?
'a' %in% c('a','b','c')
## [1] TRUE
There are also logical operators which connect two or more expressions depending on the meaning of the operator. These are typically combined with relational operators.
Operator | Description |
---|---|
| |
OR |
& |
AND |
! |
NOT |
#Is 3 greater than 1 and 5?
3 > 1 & 3 > 5
## [1] FALSE
#Is 3 greater than 1 or 5?
3 > 1 | 3 > 5
## [1] TRUE
User-Written Functions
We can build custom functions to perform operations that are not pre-built in base R. Custom functions can include pre-built or other custom functions. In fact, many pre-built functions are functions of other pre-built functions.
my_function <- function(arg1, arg2,...){
statements
return(object)
}
Let’s make a function that finds the average of a numeric vector.
average <- function(numeric_vector){
out <- sum(numeric_vector)/length(numeric_vector)
return(out)
}
Once we have created our function, you will see it under the Functions section in the Environment tab.
Let’s test our function below.
average(c(1,2,3))
## [1] 2
What happens if we put a non-numeric vector?
average(c('a', 'b', 'c'))
## Error in sum(numeric_vector): invalid 'type' (character) of argument
While R has an error message for some cases, we can also build your own error function with a custom message.
average <- function(numeric_vector){
if (class(numeric_vector) != 'numeric') #condition to throw the error
out <- 'This is not a numeric vector' #The custom error message
else
out <- sum(numeric_vector)/length(numeric_vector)
return(out)
}
average(c('a', 'b', 'c'))
## [1] "This is not a numeric vector"
We can also view the code of any function by typing it without the ()
.
This can help you understand how a function works, especially if it
created by someone else.
average
## function(numeric_vector){
## if (class(numeric_vector) != 'numeric') #condition to throw the error
## out <- 'This is not a numeric vector' #The custom error message
## else
## out <- sum(numeric_vector)/length(numeric_vector)
## return(out)
## }
Best Practices
- Document your code, thoughts, and decision making. Keep this in a
.Rmd
or.r
file. Use#
for commenting. This will aid in reproducible of your code for yourself and others. - Create and work inside an R project. This helps with code and data organization.
- Use informative naming for variables. Try to stay from
x
,y
, or similar. - Do create functions to simplify your code.
- Practice and keep learning!
Additional Resources
- Introduction to R and Rstudio: Harvard Bioinformatics
- Introduction to R and RStudio: Alex Lemonade Stand
- Introduction to R and RStudio: Yale CRC