Seurat Guided Clustering Tutorial

While the vignette on the Seurat website already provides good instructions, I will be using this to give additional thoughts and details that could help beginners to Seurat. In addition, I will provide some recommendations on the workflow as well.

Loading the files

The first thing the tutorial asks you to do is download the raw data from 10x Genomics. The raw data contains three key files within it: barcodes.tsv, genes.tsv, and matrix.mtx. You will also notice they are contained in a folder called hg19. That simply means the counts were generated using the hg19 genome (Genome Reference Consortium Human Build 37 (GRCh37)) as the reference transcriptome.

Let’s load in each of the files and take a look at them.

barcodes <- read.table("data/filtered_gene_bc_matrices/hg19/barcodes.tsv")
head(barcodes)
##                 V1
## 1 AAACATACAACCAC-1
## 2 AAACATTGAGCTAC-1
## 3 AAACATTGATCAGC-1
## 4 AAACCGTGCTTCCG-1
## 5 AAACCGTGTATGCG-1
## 6 AAACGCACTGGTAC-1
genes <- read.delim("data/filtered_gene_bc_matrices/hg19/genes.tsv", header=FALSE)
head(genes)
##                V1           V2
## 1 ENSG00000243485   MIR1302-10
## 2 ENSG00000237613      FAM138A
## 3 ENSG00000186092        OR4F5
## 4 ENSG00000238009 RP11-34P13.7
## 5 ENSG00000239945 RP11-34P13.8
## 6 ENSG00000237683   AL627309.1
library(Matrix)
mat <- readMM(file = "data/filtered_gene_bc_matrices/hg19/matrix.mtx")
mat[1:5, 1:10]
## 5 x 10 sparse Matrix of class "dgTMatrix"
##                         
## [1,] . . . . . . . . . .
## [2,] . . . . . . . . . .
## [3,] . . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . . . . . . . . .

You will notice when we take a look at the matrix file, it contains .. These are used when no count is detected rather using a value of 0. This is called a sparse matrix to reduce memory and increase computational speed. It is pretty much standard to work using sparse matrices when dealing with single-cell data.

Generating the Seurat Object

Next, we will generate a Seurat object based on the files we loaded up earlier.

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

Pre-processing the data

Now we will pre-process the data and perform quality control on the cells. There are a couple of metrics that are used within the community.

  • The number of unique genes detected in each individual cell.
    • Low-quality cells or empty droplets will often have very few genes.
    • Sometimes this could be ambient mRNA that is detected.
    • Cell doublets or multiplets may exhibit an aberrantly high gene count. There are some additional packages that can be used to detect doublets.
  • The percentage of reads that map to the mitochondrial genome
    • Low-quality / dying cells often exhibit extensive mitochondrial contamination.
    • This is often calculated by searching for genes containing MT-
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Let’s take a look at the metadata which includes some of the QC metrics.

  • nCount_RNA is the number of unique counts in each cell.
  • nFeature_RNA is the number of unique genes in each cell.
  • percent.mt is the mitochondrial mapping that we just calculated.
head(pbmc@meta.data, 5)
##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

Seurat recommends a threshold for filtering for the QC metrics.

  • Cells are filtereed for unique feature counts over 2,500 or less than 200
  • Cells are filtereed for <5% mitochondrial counts
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

We can take a look and see that the unique counts and feature are correlated. In addition, we see that low counts appear to correlate with high mitochondrial mapping percentage.

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

I find that the density plots provide a better visualization of the distributions in case you may have a bimodal distribution.

# Visualize QC metrics as ridge plots
RidgePlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol =1)
## Picking joint bandwidth of 36.2

## Picking joint bandwidth of 139

## Picking joint bandwidth of 0.153

Let’s use an adaptive threshold rather than fixed threshold. This provides a more elegant method of detecting the thresholds rather than by eye using the graphs. First, we assume that most of the cells are high-quality. Next, we then identify cells that are outliers using the median absolute deviation (MAD) from the median value of each metric for all cells. Outliers will be beyond the MAD threshold and we can specify higher or lower than the threshold.

To identify outliers based on unique genes and counts, we use log-transformed nCount_RNA and nFeature_RNA that are more than 3 MADs above and below the median. We will be using the scater library for this.

library(scater)

qc.nCount_RNA <- isOutlier(pbmc$nCount_RNA, log=TRUE, type="both")
qc.nFeature_RNA  <- isOutlier(pbmc$nFeature_RNA, log=TRUE, type="both")

We can see the thresholds that are identified in both QC metric.

attr(qc.nCount_RNA , "thresholds")
##     lower    higher 
##  802.1223 6012.0706
attr(qc.nFeature_RNA, "thresholds")
##     lower    higher 
##  399.7497 1665.6823

We can also do the same for percent.mt. In this case, we want to remove cells above the MAD.

qc.percent.mt <- isOutlier(pbmc$percent.mt,  type="higher")
attr(qc.percent.mt, "thresholds")
##    lower   higher 
##     -Inf 4.436775

Another reason to use adapative thresholds is if your data contains multiple batches. In this case, you would detect the QC threshold for each batch rather than for your entire dataset. It makes little sense to a single threshold from a data set with samples from multiple batches. While there are no batches in this current data set, you would need to include a batch parameter when using the isOutlier function.

Let’s continue on with the original tutorial and use the thresholds that were fixed but feel free to try the adaptive thresholds.

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Normalizing the data

After filtering out the low quality cells from the dataset, the next step is to normalize the data. By default, Seurat employs a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by diviing by the total expression, multiplies the result by a scale factor (10,000 by default), and then log-transforms the result to obtain the normalized data.

pbmc <- NormalizeData(pbmc)

It is common to identify highly variable features or genes for dimensional reduction. By reducing your analysis to the highly variable genes, you account for most of the biolgical heterogeneity or factors in your data and hopefully ignore a majority of the noise while reducing computational work and time. As such, the highly variable genes should enable us to isolate the real biological signals.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc) + theme(legend.text = element_text(size = 6))
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) + theme(legend.text = element_text(size = 6))
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis

Scaling the data

Next, we will need to scale the data. This is a standard pre-processing step prior dimensional reduction, like PCA, which I discussed in a previous post.

  • All gene expression will be centered to 0.
  • Scales the expression of each gene to have a variance of 1 so all genes have equal contributions

As an additional step, we typically scale the highly variable genes as these are the genes that will be used for dimensional reductiobn.

pbmc <- ScaleData(pbmc, features = VariableFeatures(object = pbmc)) #VariableFeatures is used to call the highly variable genes from the object.
## Centering and scaling data matrix

Perform linear dimensional reduction

PCA is already built into Seurat and can be called the function RunPCA.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE)
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

We can visualize the first two prinicpal components.

DimPlot(pbmc, reduction = "pca")

Let’s inspect the contribution of each of the prinicpal components. Typically only the the principal components containing a majority of the variance is used. This can be estimated by using an ‘elbow plot’ and observing where there is a large drop off in variance.

ElbowPlot(pbmc)

It may be difficult to estimate by visualization so you can use the second derivative which should find the maximum change in the slope. The following code should provide another method for estimating the drop off.

variance <- pbmc@reductions[["pca"]]@stdev
dy <- -diff(range(variance))
dx <- length(variance) - 1
l2 <- sqrt(dx^2 + dy^2)
dx <- dx/l2
dy <- dy/l2
dy0 <- variance - variance[1]
dx0 <- seq_along(variance) - 1
parallel.l2 <- sqrt((dx0 * dx)^2 + (dy0 * dy)^2)
normal.x <- dx0 - dx * parallel.l2
normal.y <- dy0 - dy * parallel.l2
normal.l2 <- sqrt(normal.x^2 + normal.y^2)
below.line <- normal.x < 0 & normal.y < 0
if (!any(below.line)) {
    length(variance)
} else {
    which(below.line)[which.max(normal.l2[below.line])]
}
## [1] 3

While the largest drop off does occur at PC3, we can see that there in additional drop off at around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.

Cluster the cells

To aid in summarizing the data for easier interpretation, scRNA-seq is often clustered to empirically define groups of cells within the data that have similar expression profiles. This generates discrete groupings of cells for the downstream analysis. Seurat uses a graph-based clustering approach. There are additional approaches such as k-means clustering or hierachical clustering.

The major advantage of graph-based clustering compared to the other two methods is its scalability and speed. Simply, Seurat first constructs a KNN graph based on the euclidean distance in PCA space. Each node is a cell that is connected to its nearest neighbors. Edges, which are the lines between the neighbors, are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. A refinement step (Jaccard similarity) is used to refine the edge weights between any two cells based on the shared overlap in their local neighborhoods.

Here we use the first 10 PCs to construct the neighbor graph.

pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph

## Computing SNN

Next, we can apply algorithms to identify “communities” of cells. There are two main community detection algorithm, Louvain and the improved version Leiden. We use the default Louvain while controlling the resolution parameter to adjust the number of clusters.

pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

Visualization of 2D Embeddsings

The high-dimensional space can be embedded in 2D using either tSNE and UMAP to visualize and explore the datasets. These embeddings attempt to summarize all of the data using a 2D graph to organize the data to better interpret the relationships between the cells. Cells that are more similar to one another should localize within the graph. Different types of embeddings will relay different information on the relationship between cells. UMAP is more recommended to be more faithful to the global connectivity of the manifold than tSNE, while tSNE might preserve the local structure more.

# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
# Here we generate a UMAP embedding.
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 13:36:49 UMAP embedding parameters a = 0.9922 b = 1.112

## 13:36:49 Read 2638 rows and found 10 numeric columns

## 13:36:49 Using Annoy for neighbor search, n_neighbors = 30

## 13:36:49 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 13:36:49 Writing NN index file to temp file /var/folders/05/_drvy3j57yb2pndzt041kp9c0n9q3g/T//RtmpLpoZdq/file176bf67fb8bb8
## 13:36:49 Searching Annoy index using 1 thread, search_k = 3000
## 13:36:50 Annoy recall = 100%
## 13:36:50 Commencing smooth kNN distance calibration using 1 thread
## 13:36:51 Initializing from normalized Laplacian + noise
## 13:36:51 Commencing optimization for 500 epochs, with 105132 positive edges
## 13:36:55 Optimization finished
# Here we generate a tSNE embedding.
pbmc <- RunTSNE(pbmc, dims = 1:10)

We can visualize all three of the dimensional reductions that we generated. You will notice that UMAP and tSNE generate better visualizations as they are more adept at interpreting non-linear data. PCA is linear technique and does not capture the non-linear relationship of gene expression profiles very well.

# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
plot1 <- DimPlot(pbmc, reduction = "umap")
plot2 <- DimPlot(pbmc, reduction = "tsne")
plot3 <- DimPlot(pbmc, reduction = "pca")
plot1 +  plot2 + plot3 + plot_layout(nrow = 2)

We can see that cells are clustered closer toether while also providing some global relationship between the clusters within the UMAP embedding. tSNE generates some similar clusters with the same local relationship but the global relationship can not be estimated using tSNE. In addition, tSNE does not scale with larger data sets. The PCA captures highest variation within the first two PCs but does not include information from the additional PCs. You can see here the failure of PCA to resolve the differences between the clusters unlike UMAP and tSNE.

Finding Marker Genes

To interpret our clusters, we can identify the genes and markers that drive separation of the clusters. Seurat canfind these markers via differential expression. By default, FindAllMarkers uses Wilcoxon rank-sum (Mann-Whitney-U) test to find DEs. You can choose which test by modifying the test.use parameter. If you have blocking factors (i.e., batches), you can include it using the latent.vars parameter.

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## # A tibble: 18 x 7
## # Groups:   cluster [9]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 1.96e-107     0.730 0.901 0.594 2.69e-103 0       LDHB    
##  2 1.61e- 82     0.922 0.436 0.11  2.20e- 78 0       CCR7    
##  3 7.95e- 89     0.892 0.981 0.642 1.09e- 84 1       LTB     
##  4 1.85e- 60     0.859 0.422 0.11  2.54e- 56 1       AQP3    
##  5 0.            3.86  0.996 0.215 0.        2       S100A9  
##  6 0.            3.80  0.975 0.121 0.        2       S100A8  
##  7 0.            2.99  0.936 0.041 0.        3       CD79A   
##  8 9.48e-271     2.49  0.622 0.022 1.30e-266 3       TCL1A   
##  9 2.96e-189     2.12  0.985 0.24  4.06e-185 4       CCL5    
## 10 2.57e-158     2.05  0.587 0.059 3.52e-154 4       GZMK    
## 11 3.51e-184     2.30  0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125     2.14  1     0.315 2.78e-121 5       LST1    
## 13 7.95e-269     3.35  0.961 0.068 1.09e-264 6       GZMB    
## 14 3.13e-191     3.69  0.961 0.131 4.30e-187 6       GNLY    
## 15 1.48e-220     2.68  0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21     1.99  1     0.513 2.28e- 17 7       HLA-DPB1
## 17 7.73e-200     5.02  1     0.01  1.06e-195 8       PF4     
## 18 3.68e-110     5.94  1     0.024 5.05e-106 8       PPBP

You can export the markers as a ‘csv’ for analysis outside of R.

write.csv(pbmc.markers, 'pbmc.markers.csv')

We can visualize some of these markers using VlnPlot.

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

We can visualize the localization of these markers in a UMAP embedding by using FeaturePlot.

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A"))

We can generate a dot plot to visualize the markers.

DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A")) + theme(axis.text.x = element_text(angle = 90))

Another common method of visualization is to generate a heatmap. We can use DoHeatmap to see the top 10 markers per cluster.

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
## Warning in DoHeatmap(pbmc, features = top10$gene): The following features were
## omitted as they were not found in the scale.data slot for the RNA assay: CD8A,
## VPREB3, TRAT1, PIK3IP1, PRKCQ-AS1, LEF1, NOSIP, CD3E, CD3D, CCR7, LDHB, RPS3A

Assigning cell type identity to clusters

We can use canonical markers to easily match the unbiased clustering to known cell types based on the table below:

Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 IL7R, S100A4 Memory CD4+
2 CD14, LYZ CD14+ Mono
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

We can also use annotation packages like SingleR to perform automatic annotation. You can find more information here. We initialize the Humany Primary Cell Atlas data.

library(SingleR)
ref <- HumanPrimaryCellAtlasData()
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## snapshotDate(): 2019-10-22

## see ?SingleR and browseVignettes('SingleR') for documentation

## loading from cache

## see ?SingleR and browseVignettes('SingleR') for documentation

## loading from cache

We can see some of the labels within this reference set.

as.data.frame(colData(ref))
##                                              label.main
## GSM112490                                            DC
## GSM112491                                            DC
## GSM112540                                            DC
## GSM112541                                            DC
## GSM112661                                            DC
## GSM112664                                            DC
## GSM112665                                            DC
## GSM112666                                            DC
## GSM112667                                            DC
## GSM112668                                            DC
## GSM112669                                            DC
## GSM112670                                            DC
## GSM116101                           Smooth_muscle_cells
## GSM116102                           Smooth_muscle_cells
## GSM116103                           Smooth_muscle_cells
## GSM116104                           Smooth_muscle_cells
## GSM116105                           Smooth_muscle_cells
## GSM116106                           Smooth_muscle_cells
## GSM119354                              Epithelial_cells
## GSM119357                              Epithelial_cells
## GSM119359                              Epithelial_cells
## GSM119360                              Epithelial_cells
## GSM119361                              Epithelial_cells
## GSM119362                              Epithelial_cells
## GSM119366                              Epithelial_cells
## GSM119369                              Epithelial_cells
## GSM119371                              Epithelial_cells
## GSM119372                              Epithelial_cells
## GSM1209554_HH1763_UI33plus2_201004               B_cell
## GSM1209555_HH1778_u133plus2_211004               B_cell
## GSM1209556_HH1786_U133plus2_091104               B_cell
## GSM1209557_HH1791_u133plus2_251104               B_cell
## GSM1209558_HH1713_u133plus2_011004          Neutrophils
## GSM1209559_HH1712_u133plus2_011004          Neutrophils
## GSM1209560_HH1714_u133plus2_011004          Neutrophils
## GSM1209561_TW1681_u133plus2_061004              T_cells
## GSM1209562_TW1685_u133plus2_061004              T_cells
## GSM1209563_TW1689_u133plus2_061004              T_cells
## GSM1209564_HH1765_UI33plus2_201004              T_cells
## GSM1209565_HH1769_UI33plus2_201004              T_cells
## GSM1209566_HH1770_U133plus2_041104              T_cells
## GSM1209567_HH1774_u133plus2_211004              T_cells
## GSM1209568_HH1775_u133plus2_211004              T_cells
## GSM1209569_HH1780_u133plus2_211004              T_cells
## GSM1209570_HH1788_U133plus2_091104              T_cells
## GSM1209571_HH1792_u133plus2_251104              T_cells
## GSM1209572_HH1793_u133plus2_251104              T_cells
## GSM1209573_TW1678_u133plus2_061004              T_cells
## GSM1209574_TW1682_u133plus2_061004              T_cells
## GSM1209575_TW1686_u133plus2_061004              T_cells
## GSM1209576_TW1690_u133plus2_061004              T_cells
## GSM1209577_TW1675_u133plus2_061004              T_cells
## GSM1209578_TW1679_u133plus2_061004              T_cells
## GSM1209579_TW1683_u133plus2_061004              T_cells
## GSM1209580_TW1687_u133plus2_061004              T_cells
## GSM1209581_TW1676_u133plus2_061004              T_cells
## GSM1209582_TW1680_u133plus2_061004              T_cells
## GSM1209583_TW1684_u133plus2_061004              T_cells
## GSM1209584_TW1688_u133plus2_061004              T_cells
## GSM1209585_HH1762_UI33plus2_201004             Monocyte
## GSM1209586_HH1767_UI33plus2_201004             Monocyte
## GSM1209587_HH1772_u133plus2_211004             Monocyte
## GSM1209588_HH1777_U133plus2_041104             Monocyte
## GSM1209589_HH1785_U133plus2_011204             Monocyte
## GSM1209590_HH1790_U133plus2_091104             Monocyte
## GSM1209591_HH1719_u133plus2_011004         Erythroblast
## GSM1209592_HH1720_u133plus2_011004         Erythroblast
## GSM1209593_HH1721_u133plus2_011004         Erythroblast
## GSM1209594_HH1722_u133plus2_011004         Erythroblast
## GSM1209595_HH1723_u133plus2_011004         Erythroblast
## GSM1209596_HH1716_u133plus2_011004         Erythroblast
## GSM1209597_HH1717_u133plus2_011004         Erythroblast
## GSM1209598_HH1718_u133plus2_011004         Erythroblast
## GSM1209599_HH1715_u133plus2_011004           BM & Prog.
## GSM132919                                            DC
## GSM132921                                            DC
## GSM132922                                            DC
## GSM132923                                            DC
## GSM132924                                            DC
## GSM132925                                            DC
## GSM132926                                            DC
## GSM132927                                            DC
## GSM132928                                            DC
## GSM132929                                            DC
## GSM132930                                            DC
## GSM140244                                      Monocyte
## GSM140245                                      Monocyte
## GSM140246                                      Monocyte
## GSM140247                                      Monocyte
## GSM140248                                      Monocyte
## GSM140249                                      Monocyte
## GSM140953                                            DC
## GSM140968                                            DC
## GSM140969                                            DC
## GSM140970                                            DC
## GSM140971                                            DC
## GSM140973                                            DC

We use our ref reference to annotate each cell in pbmc via the SingleR function.

pbmc.sce <- as.SingleCellExperiment(pbmc) #convert to SingleCellExperiment
pred.pbmc <- SingleR(test = pbmc.sce, ref = ref, assay.type.test=1,
    labels = ref$label.main)

Each row of the output DataFrame contains prediction results for each cell.

pred.pbmc
## DataFrame with 2638 rows and 5 columns
##                                                                      scores
##                                                                    <matrix>
## AAACATACAACCAC-1  0.188643513016495:0.126545677221482:0.148817362386119:...
## AAACATTGAGCTAC-1  0.262440847080309:0.141877125510439:0.173883875462993:...
## AAACATTGATCAGC-1  0.201531694691989:0.102183839968423:0.119175368433024:...
## AAACCGTGCTTCCG-1   0.29535017400362:0.130158307902165:0.157554338627856:...
## AAACCGTGTATGCG-1  0.149838962207997:0.0789937558147814:0.09940263612008:...
## ...                                                                     ...
## TTTCGAACTCTCAT-1   0.32171248767637:0.146440795008671:0.176046214253916:...
## TTTCTACTGAGGCA-1  0.255811947610562:0.175397380908548:0.194313874751393:...
## TTTCTACTTCCTCG-1 0.189605052702718:0.0707210463786043:0.120214202766769:...
## TTTGCATGAGAGGC-1 0.157296302532056:0.0703783076303343:0.109879822603735:...
## TTTGCATGCCTCAC-1  0.192359514330612:0.105920696799686:0.144491410113365:...
##                      first.labels                        tuning.scores
##                       <character>                          <DataFrame>
## AAACATACAACCAC-1          T_cells   0.30197542955515:0.213314539202627
## AAACATTGAGCTAC-1           B_cell 0.326635981507555:0.0220241142719439
## AAACATTGATCAGC-1          T_cells  0.307280332559489:0.109094837818784
## AAACCGTGCTTCCG-1         Monocyte   0.28638593546712:0.236027443382121
## AAACCGTGTATGCG-1          NK_cell  0.299925203850882:0.210705283688696
## ...                           ...                                  ...
## TTTCGAACTCTCAT-1 Pre-B_cell_CD34-  0.259345379219524:0.134352894882063
## TTTCTACTGAGGCA-1 Pre-B_cell_CD34-  0.157335894267966:0.129646880250965
## TTTCTACTTCCTCG-1           B_cell  0.235416822028172:0.173612567558807
## TTTGCATGAGAGGC-1           B_cell  0.228548164914059:0.149524152731392
## TTTGCATGCCTCAC-1          T_cells  0.288418116982557:0.194117507349502
##                            labels    pruned.labels
##                       <character>      <character>
## AAACATACAACCAC-1          T_cells          T_cells
## AAACATTGAGCTAC-1           B_cell           B_cell
## AAACATTGATCAGC-1          T_cells          T_cells
## AAACCGTGCTTCCG-1         Monocyte         Monocyte
## AAACCGTGTATGCG-1          NK_cell          NK_cell
## ...                           ...              ...
## TTTCGAACTCTCAT-1         Monocyte         Monocyte
## TTTCTACTGAGGCA-1 Pre-B_cell_CD34- Pre-B_cell_CD34-
## TTTCTACTTCCTCG-1           B_cell           B_cell
## TTTGCATGAGAGGC-1           B_cell           B_cell
## TTTGCATGCCTCAC-1          T_cells          T_cells
# Summarizing the distribution:
table(pred.pbmc$labels)
## 
##           B_cell              CMP         Monocyte          NK_cell 
##              333                4              615              193 
##        Platelets Pre-B_cell_CD34- Pro-B_cell_CD34+          T_cells 
##               12               65                5             1411

Let’s see how it compared to our manual annotations. First, we will add the predicted labels as new metadata.

pbmc <- AddMetaData(pbmc, pred.pbmc$labels, col.name = 'SingleR.Annotations')
plot1 <- DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend() + ggtitle('Manual Annotations')
plot2 <- DimPlot(pbmc, reduction = "umap", group.by = 'SingleR.Annotations', label = TRUE, pt.size = 0.5) + NoLegend() + ggtitle('SingleR Annotations')
plot1 + plot2

We can see that the predicted labels and manually anotated labels match up pretty well. Your choice of reference data and parameters can be used to further fine-tine the predicted labels. Here we used a bulk reference data set but a single-cell reference data that is well-annotated set could be used as well.

Finally, always make sure to save your data.

saveRDS(pbmc, file = "pbmc3k_final.rds")
Written on July 2, 2020

[ single-cell  tutorials  bioinformatics  R  ]