Getting data ready

Input data

FR-Match is a cell type matching method for two independently conducted scRNAseq experiments, namely, query and reference. For each experiment, FR-Match takes in the following input data:

Reference dataset
- A cell-by-gene expression matrix
- Cell type cluster membership
- Informative marker genes for the reference cell types
Query dataset
- A cell-by-gene expression matrix
- Cell type cluster membership

There are many pieces of data information needed for various scRNAseq data analyses. We choose to use the SingleCellExperiment class to organize the input data, which is a convenient container for scRNAseq data. For instructions on how to construct a SingleCellExperiment object, please see An introduction to the SingleCellExperiment class.

For an FR-Match data object, the following data are essential:

a gene expression data matrix (assay())
cell type cluster labels (colData())
NS-Forest or your own marker genes for the reference dataset (rowData())

In addition, metadata (@metadata) such as F-beta score and cluster order are not essential for the core matching algorithm, but these information will facilitate visualization and other customized analysis tools provided in this package.

An example data object

In this package, we include an example data object sce.example. For more details, please see help("sce.example"). A quick look at the data object below.

data(sce.example)
sce.example
#> class: SingleCellExperiment 
#> dim: 16497 865 
#> metadata(3): cluster_marker_info f_score cluster_order
#> assays(1): logcounts
#> rownames(16497): A1CF A2M ... ZZEF1 ZZZ3
#> rowData names(1): marker_gene
#> colnames(865): A01_1_Nuclei_NeuNP_H200_1025_MTG_layer1_BCH9
#>   A01_BCH3_1NeuNP_H200.1030_MTG_Layer_1 ...
#>   P11_1_Nuclei_NeuNN_H200_1025_MTG_layer1_BCH9
#>   P11_1_Nuclei_NeuNN_H200_1030_MTG_layer1_BCH8
#> colData names(1): cluster_membership
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

The naming convention and data structure are listed below.

Naming convention of the data object.

Layer 1 and full MTG data

Here, we introduce two datasets to be used in this vignette. Both datasets are generated by the Allen Institute for Brain Science using Smart-seq2 protocol for single nucleus RNA sequencing. The first is the pre-containerized dataset in the data object sce.example, which is sampled from a single layer (layer 1) of cerebral cortex from the middle temporal gyrus (MTG) region of human brain (Boldog et al., 2018). In the layer 1 dataset, there are 16487 genes and 865 cells; 15 cell type clusters are identified and labeled by the authors.

## rename the data object
sce.layer1 <- sce.example
## cell type clusters and cluster sizes
knitr::kable(table(colData(sce.layer1)$cluster_membership), col.names=c("Cluster", "Size"))

Cluster	Size
e1_e299_SLC17A7_L5b_Cdh13	299
g1_g48_GLI3_Astro_Gja1	48
g2_g27_APBB1IP_Micro_Ctss	27
g3_g18_GPNMB_OPC_Pdgfra	18
g4_g9_MOG_Oligo_Opalin	9
i10_i16_TSPAN12_Vip_Mybpc1	16
i1_i90_COL5A2_Ndnf_Car4	90
i2_i77_LHX6_Sst_Cbln4	77
i3_i56_BAGE2_Ndnf_Cxcl14	56
i4_i54_MC4R_Ndnf_Cxcl14	54
i5_i47_TRPC3_Ndnf_Car4	47
i6_i44_GPR149_Vip_Mybpc1	44
i7_i31_CLMP_Ndnf_Cxcl14	31
i8_i27_SNCG_Vip_Mybpc1	27
i9_i22_TAC3_Vip_Mybpc1	22

The second dataset is sampled from the full laminar depth (layer 1-6) of the MTG cerebral cortex (Hodge et al., 2019). The MTG dataset consist of 13945 genes and 15603 cells; 75 cell type clusters are identified and labeled by the authors. We are going to devote the next section to introduce a function that compiles the necessary input data files from the MTG dataset into a data object that can be used in our matching function.

The biological ground truth is that these two datasets are from neuroanatomical overlapping regions, therefore, we would expect each layer 1 cell type to be matched with one full MTG cell type. Good matches should reflect the laminar characteristics of these cell types, i.e. matches should be found in the upper layer cell types of the full MTG dataset.

Construct data object using `make_data_object()`

The following raw data files are here.

## read in pieces of input data - this may take a few minutes
cell_by_gene_expression <- fread("cell_by_gene_expression.csv")
cell_cluster_labels <- fread("cell_cluster_labels.csv")
NSForest_marker_genes <- fread("NSForest_marker_genes.csv")
NSForest_fscores <- fread("NSForest_fscores.csv")
MTG_taxonomy <- fread("MTG_taxonomy.csv")$x #need to be a vector

## unique markers
unique_markers <- unique(NSForest_marker_genes$markerGene)

Use the make_data_object() function to compile the pieces of input data into a data object.

sce.MTG <- make_data_object(dat = cell_by_gene_expression,
                            tab = cell_cluster_labels,
                            markers = unique_markers,
                            ## below are optional input data
                            cluster_marker_info = NSForest_marker_genes,
                            f_score = NSForest_fscores,
                            cluster_order = MTG_taxonomy)

Take a look at the data object for the full MTG.

sce.MTG
#> class: SingleCellExperiment 
#> dim: 13945 15603 
#> metadata(3): cluster_marker_info f_score cluster_order
#> assays(1): logcounts
#> rownames(13945): A1BG A2M ... ZZEF1 ZZZ3
#> rowData names(1): marker_gene
#> colnames(15603): F1S4_160106_001_B01 F1S4_160106_001_C01 ...
#>   F2S4_170405_060_F01 F2S4_170405_060_H01
#> colData names(1): cluster_membership
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

It may cause some confusion on the cell-by-gene or gene-by-cell orientation of the expression data. To be clear, the raw input data (.csv or .txt file) for gene expression should have cells in rows and genes in columns (cell-by-gene). After the input data is read into the data object (e.g. sce.example), the expression data is stored as genes in rows and cells in columns (gene-by-cell). We made such arrangement because conventionally R packages, such as Seurat, deal with gene-by-cell expression data; and Python packages, such as Scanpy, deal with cell-by-gene expression data.

FR-Match

Cell type “barcode”

In the FR-Match workflow, we utilize the informative marker genes for dimensionality reduction. Good marker genes display on-off binary expression pattern producing, in combination, a unique gene expression “barcode” for each cell type cluster. We provide the plot_cluster_by_markers() function for plotting the marker gene expression patterns for a random set of cells in each cluster, visualizing the cell type “barcode”. Examples below show that the “barcodes” for excitatory cell type, glial cell type, and inhibitory cell type from the layer 1 data are distinctively different using the NS-Forest marker genes.

plot_cluster_by_markers(sce.example, cluster.name = "e1_e299_SLC17A7_L5b_Cdh13", name.self = "Layer1_")

plot_cluster_by_markers(sce.example, cluster.name = "g1_g48_GLI3_Astro_Gja1", name.self = "Layer1_")

plot_cluster_by_markers(sce.example, cluster.name = "i1_i90_COL5A2_Ndnf_Car4", name.self = "Layer1_")

Cluster sizes

Take a look at the cluster sizes in these two datasets.

plot_clusterSize(sce.layer1, sce.MTG, name.E1 = "Layer1", name.E2 = "MTG")

Note that cluster sizes range widely, which may cause the unbalance cluster size issue in the FR test.

Run `FRmatch()`

The FRmatch() function is a wrapper function that takes in two input data objects in the sce.query = and sce.ref = arguments. By performing this function, it carries out our matching workflow in default setting. Key steps are reported while running. To start, let’s regard sce.layer1 as the query data and sce.MTG as the reference data.

rst.layer1toMTG <- FRmatch(sce.query = sce.layer1, sce.ref = sce.MTG, subsamp.size = 10)
#> * Check query data object. 
#> * Check reference data object. 
#> * Feature selection: 108 out of 157 reference marker genes are presented in the query experiment. 
#> * Filtering small clusters: query and reference clusters with less than 5 cells are not considered. 
#> * Comparing 15 query clusters with 75 reference clusters... 
#> ** Parallel computing with 12 cores. 
#> ** method = cluster2cluster | subsamp.size = 10 | subsamp.iter = 1000 
#> * Done parallel and returning results.

A few remarks:

Note that not all clusters are passed to the matching algorithm. In the default setting, query and reference clusters with less 10 cells are filtered out. This is can be tuned using the filter.size = argument. It is also optional to filter clusters based on F-beta score using the filter.fscore = argument.
FRmatch() uses an iterative subsampling scheme to overcome the unbalanced cluster sizes between the query and reference clusters when applying the FR test. The subsampling parameters are subsamp.size =, subsamp.iter =, subsamp.seed =.
Due to the iterative scheme, it may take some computational time. We provide the numCores = argument to specific the number of cores to utilize for parallel computing.

Also, we may want to swap the query and reference data to perform FR-Match in the other direction. In practice, if we want to match between two independently conducted experiments (for example, each experiment may focus on a different specimen region, inducing different cell type clusters and different set of marker genes), different directions of matching may lead to quiet different matching results. We recommend to perform both directions of matching, and conclude with a consensus matching results from both directions.

rst.MTGtolayer1 <- FRmatch(sce.query = sce.MTG, sce.ref = sce.layer1, subsamp.size = 10)
#> * Check query data object. 
#> * Check reference data object. 
#> * Feature selection: 30 out of 32 reference marker genes are presented in the query experiment. 
#> * Filtering small clusters: query and reference clusters with less than 5 cells are not considered. 
#> * Comparing 75 query clusters with 15 reference clusters... 
#> ** Parallel computing with 12 cores. 
#> ** method = cluster2cluster | subsamp.size = 10 | subsamp.iter = 1000 
#> * Done parallel and returning results.

The FRmatch() output is a list of results. The best way to present these results is to use our customized plotting functions.

Plot bi-directional matching results

The final matching results from both directions can be combined using the plot_bi_FRmatch() function. The function takes in the FRmatch() outputs from both directions, and displays the two-way match (match found in both directions), one-way match (match found in either direction), and no match in the following plot.

plot_bi_FRmatch(rst.layer1toMTG, rst.MTGtolayer1, name.E1="Layer1_", name.E2="MTG_")

FR-Match uniquely maps cell types reflecting the overlapping anatomic regions. Using FR-Match, we mapped most of the layer 1 cell types uniquely to one MTG cell type, i.e. 1-to-1 two-way matches. These matches precisely reflect the overlapping anatomic regions in these two independent experiments in that the matched MTG cell types all have an ‘L1’ layer indicator in their nomenclature. The couple exceptions of 1-to-many two-way matches may suggest under-partitioning of the layer 1 cell type; but the multiple matches are next to each other given the order of the MTG cell types follow the hierarchical taxonomy in Figure 1c of Hodge et al. (2019).

Plot one-directional matching results

We also provide the function plot_FRmatch() that takes in single FRmatch() output with argument type = "matches" by default, and optionally type = "padj", to visualize the matching results from one direction.

## to visualize the one-directional matches
plot_FRmatch(rst.layer1toMTG)

## to visualize the adjusted p-values for each query cluster
plot_FRmatch(rst.layer1toMTG, type = "padj")

A few remarks:

For the plots above, note that the prefixes of these cluster names are “query.” and “ref.”, which are coded in our program as default option. We deliberately leave it that way since: 1) it helps our users to distinguish the direction of the one-directional matching; and 2) we sort of regard the one-directional matching as an internal step and strongly recommend to perform the matching in both directions. These prefixes can be changed by specifying the prefix= argument in the FRmatch() function.
Users may also obtain the actual values from above plots by specifying the return.value = TRUE argument in the corresponding plotting function.

Look at the matching results in the other direction

## to visualize the one-directional matches
plot_FRmatch(rst.MTGtolayer1)

## to visualize the adjusted p-values for each query cluster
plot_FRmatch(rst.MTGtolayer1, type = "padj")

It is interesting to see that the matching performance depends on the granularity and/or quality of the reference dataset. It is expected that the reference dataset is more comprehensive, such as the MTG data in this example, so that the one directional matching results in this direction align more with the bi-directional matching results.

Other useful functions

Friedman-Rafsky test

We also implemented our own function FRtest() for Friedman-Rafsky (FR) test with customized options. FR test is a multivariate generalization of non-parametric two-sample test. It is a graphical model based on the concept of minimum spanning tree (MST). The MST provides a way to visualize high-dimensional clustered data in a low-dimensional visualization. A minimal working example of FR test and MST visualization is below.

# simulate some synthetic data from the same distribution
samp1 <- matrix(rnorm(1000), nrow = 50) #a 50-by-20 matrix: 50 dimensional, 20 data points
samp2 <- matrix(rnorm(1000), nrow = 50) #a 50-by-20 matrix: 50 dimensional, 20 data points
# FR test with MST plot
FRtest(samp1, samp2, plot.MST = TRUE, main = "Minimum spanning tree")

#>       runs runs.samp1 runs.samp2       stat    p.value 
#> 20.0000000  7.0000000 13.0000000 -0.3378778  0.3677276

In the above test, the p-value suggests that no difference between the two simulated samples.

We encourage our users to visually examine their interested cell type clusters on MST plots.

Session info

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] data.table_1.14.2           tibble_3.1.8               
#>  [3] dplyr_1.0.10                FRmatch_2.0.0              
#>  [5] pbmcapply_1.5.1             SingleCellExperiment_1.18.1
#>  [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
#>  [9] GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
#> [11] IRanges_2.30.1              S4Vectors_0.34.0           
#> [13] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
#> [15] matrixStats_0.62.0          shiny_1.7.2                
#> 
#> loaded via a namespace (and not attached):
#>  [1] lsa_0.73.3             bitops_1.0-7           fs_1.5.2              
#>  [4] usethis_2.1.6          devtools_2.4.4         RColorBrewer_1.1-3    
#>  [7] rprojroot_2.0.3        SnowballC_0.7.0        bslib_0.4.0           
#> [10] tools_4.2.2            profvis_0.3.7          utf8_1.2.2            
#> [13] R6_2.5.1               DBI_1.1.3              colorspace_2.0-3      
#> [16] ade4_1.7-19            urlchecker_1.0.1       withr_2.5.0           
#> [19] gridExtra_2.3          tidyselect_1.1.2       prettyunits_1.1.1     
#> [22] processx_3.7.0         compiler_4.2.2         cli_3.4.0             
#> [25] xml2_1.3.3             desc_1.4.2             DelayedArray_0.22.0   
#> [28] labeling_0.4.2         sass_0.4.2             scales_1.2.1          
#> [31] callr_3.7.2            stringr_1.4.1          digest_0.6.29         
#> [34] rmarkdown_2.16         XVector_0.36.0         pkgconfig_2.0.3       
#> [37] htmltools_0.5.3        sessioninfo_1.2.2      highr_0.9             
#> [40] fastmap_1.1.0          htmlwidgets_1.5.4      rlang_1.0.5           
#> [43] rstudioapi_0.14        farver_2.1.1           jquerylib_0.1.4       
#> [46] generics_0.1.3         jsonlite_1.8.0         RCurl_1.98-1.8        
#> [49] magrittr_2.0.3         GenomeInfoDbData_1.2.8 Matrix_1.5-1          
#> [52] Rcpp_1.0.9             munsell_0.5.0          fansi_1.0.3           
#> [55] viridis_0.6.2          lifecycle_1.0.2        yaml_2.3.5            
#> [58] stringi_1.7.8          MASS_7.3-58.1          zlibbioc_1.42.0       
#> [61] pkgbuild_1.3.1         grid_4.2.2             promises_1.2.0.1      
#> [64] forcats_0.5.2          crayon_1.5.1           miniUI_0.1.1.1        
#> [67] lattice_0.20-45        knitr_1.40             ps_1.7.1              
#> [70] pillar_1.8.1           igraph_1.3.5           pkgload_1.3.0         
#> [73] glue_1.6.2             evaluate_0.16          remotes_2.4.2         
#> [76] vctrs_0.4.1            httpuv_1.6.6           gtable_0.3.1          
#> [79] purrr_0.3.4            tidyr_1.2.1            assertthat_0.2.1      
#> [82] cachem_1.0.6           ggplot2_3.3.6          xfun_0.33             
#> [85] mime_0.12              xtable_1.8-4           roxygen2_7.2.1        
#> [88] later_1.3.0            viridisLite_0.4.1      pheatmap_1.0.12       
#> [91] memoise_2.0.1          ellipsis_0.3.2

FR-Match: A cell type matching algorithm for single cell RNA sequencing (scRNAseq) data

Yun (Renee) Zhang, zhangy@jcvi.org

November 21, 2022 - update raw data link

Overview

Workflow

Installation

Getting data ready

Input data

An example data object

Layer 1 and full MTG data

Construct data object using `make_data_object()`

FR-Match

Cell type “barcode”

Cluster sizes

Run `FRmatch()`

Plot bi-directional matching results

Plot one-directional matching results

Other useful functions

Friedman-Rafsky test

Session info

FR-Match: A cell type matching algorithm for single cell RNA sequencing (scRNAseq) data

Yun (Renee) Zhang, zhangy@jcvi.org

November 21, 2022 - update raw data link

Overview

Workflow

Installation

Getting data ready

Input data

An example data object

Layer 1 and full MTG data

Construct data object using make_data_object()

FR-Match

Cell type “barcode”

Cluster sizes

Run FRmatch()

Plot bi-directional matching results

Plot one-directional matching results

Other useful functions

Friedman-Rafsky test

Session info

Construct data object using `make_data_object()`

Run `FRmatch()`