Download data here. For how to construct the data objects, see Tutorial.
Quick facts about the datasets:
- Reference data
- Query data
Load library and data.
library(FRmatch)
load("sce-mouse-M1-10X-snRNA-subclass.rda")
load("sce-mouse-M1-10X-scRNA-subclass.rda")
Rename the data objects for simplification.
## query data
sce.query <- sce.mouse.M1.10X.scRNA.subclass
## reference data
sce.ref <- sce.mouse.M1.10X.snRNA.subclass
Take a look at the gene expression data distribution. We preprocessed the 10X data by taking the log1p()
transformation of the raw count data, which was stored in the @assays$logcounts
slot of the data object.
par(mfrow=c(1,2))
hist(logcounts(sce.query))
hist(logcounts(sce.ref))
Since both datasets follow similar distribution, just take the scaling option in the normalization step.
sce.query.scale <- normalization(sce.query, scale=TRUE)
sce.ref.scale <- normalization(sce.ref, scale=TRUE)
Check again the scaled data distribution, ranging from 0 to 1.
par(mfrow=c(1,2))
hist(logcounts(sce.query.scale))
hist(logcounts(sce.ref.scale))
Take an overview of the cell type clusters in both datasets. In FR-Match, we are going to ignore the clusters with very few cells.
## cluster size
plot_clusterSize(sce.query, sce.ref, name.E1 = "Mouse M1 10X scRNA", name.E2 = "Mouse M1 10X snRNA")
The following codes are suggested to plot all cell type barcode plots in an assigned folder. NOT RUN HERE.
ref.clusters <- unique(colData(sce.ref)$cluster_membership)
for(cl in ref.clusters){
plot_cluster_by_markers(sce.ref.scale, cluster.name = cl, name.E1 = "M1_10X_snRNA_",
filename = paste0("plot_barcodes_M1_10X_snRNA/scale_",cl,".pdf"))
}
query.clusters <- unique(colData(sce.query)$cluster_membership)
for(cl in query.clusters){
plot_cluster_by_markers(sce.ref.scale, sce.query.scale, cluster.name = cl, name.E2 = "M1_10X_scRNA_",
filename = paste0("plot_barcodes_M1_10X_scRNA/scale_",cl,".pdf"))
}
Run FR-Match
Cell-to-cluster matching
Below we provide some guidance on parameter configuration.
- We ignore small clusters that have less than 5 cells (
filter.size=5
)
- For cell2cluster matching, it is recommended to use
subsamp.size=10
. Your best choice depends on your data. The choice of this parameter will impact the matching score (i.e. p-value), which is used to determine a match or unassigned match.
use.cosine=TRUE
indicates to use the cosine distance instead of the Euclidean distance in the calculation, which will bring more robustness in matching between different types of samples.
prefix=c("scRNAseq.","snRNAseq.")
is the customizable names of your query and reference data.
rst.cell2cluster <- FRmatch_cell2cluster(sce.query.scale, sce.ref.scale,
filter.size=5, subsamp.size=10, use.cosine=TRUE,
prefix=c("scRNAseq.","snRNAseq."))
The easiest way to look at the results is to use its plot function, which shows the proportion of query cells that are matched to the reference cluster. Note that the column sum is 1.
plot_FRmatch_cell2cluster(rst.cell2cluster, type="match.prop")
The under-partitioned SMC_Peri cluster
In this example, there is an under-partitioned query cluster (SMC_Peri), which is matched in part to two reference clusters (SMC and Peri). We can look at the specific barcodes of the under-partitioned query cluster and the matched reference clusters for more insights.
plot_cluster_by_markers(sce.ref.scale, sce.query.scale, cluster.name = "SMC_Peri", name.E2 = "scRNA.")
plot_cluster_by_markers(sce.ref.scale, cluster.name = "SMC", name.E1 = "snRNA.")
plot_cluster_by_markers(sce.ref.scale, cluster.name = "Peri", name.E1 = "snRNA.")
Available cell2cluster results
Matching results of each query cell can be accessed below.
rst.cell2cluster$cell2cluster
Cluster-to-cluster matching
Though it is suggested to use slightly larger subsampling size (subsamp.size=20
by default) for the cluster-to-cluster matching. Here, we use subsamp.size=10
again in this use case. Your best setting may depend on your data, therefore we recommend to check the cluster sizes in the beginning.
rst <- FRmatch(sce.query.scale, sce.ref.scale,
filter.size=5, subsamp.size=10, use.cosine=TRUE,
prefix=c("scRNAseq.","snRNAseq."))
Visualize the results. Since larger subsampling size is used and the nature of taking a cluster as a whole, the cluster-level matching is more conserved, resulting to some unassigned matches including the under-partitioned cluster.
plot_FRmatch(rst, sig.level=.1)
Other presentations of the results
We may use the following plot to look at the adjusted p-values, and choose a reasonable sig.level
(red dashed line) to determine matches (above the red line) and unassigned matches (below the red line).
plot_FRmatch(rst, type="padj", sig.level = .1)
We can also use the prediction function to list the most similar reference cluster (i.e. highest adjusted p-value) to each query cluster, regardless the sig.level
threshold.
## top match
predict_most_similar_cluster(rst)
Minimum spanning tree plots
Lastly, we can also use the minimum spanning tree (MST) plot to diagnose the under-partitioned cluster and its partial matches.
plot_MST(sce.query.scale, sce.ref.scale, "SMC_Peri", "SMC", use.cosine=T)
runs runs.samp1 runs.samp2 stat p.value
15.000000000 8.000000000 7.000000000 -2.540459701 0.005535343
plot_MST(sce.query.scale, sce.ref.scale, "SMC_Peri", "Peri", use.cosine=T)
runs runs.samp1 runs.samp2 stat p.value
17.0000000000 11.0000000000 6.0000000000 -3.3707719915 0.0003747893
---
title: "Matching mouse M1 cell subclasses across different sample types - scRNAseq and snRNAseq"
output: html_notebook
date: "`r format(Sys.time(), '%B %d, %Y')`"
---

```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
setwd("~/Documents/BICCN-BrainStandards-2020/Anlayses/Mapping/Mouse-M1/FRmatch/tutorial/")
```

---

Download data [here](https://jcvioffice365-my.sharepoint.com/:f:/r/personal/zhangy_jcvi_org/Documents/FR-Match%20demo%20data/Tutorial_mouse_M1_10X_scRNAseq_snRNAseq?csf=1&web=1&e=bPna1r). 
For how to construct the data objects, see [Tutorial](https://jcventerinstitute.github.io/celligrate/FRmatch-vignette.html).

Quick facts about the datasets:

+ Reference data
  + Anatomic region: healthy mouse primary motor cortex (M1)
  + Experimental platform: 10X v3
  + Sample type: single-nucleus
  + Citation: [An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types](https://www.biorxiv.org/content/10.1101/2020.02.29.970558v2). Data used here is part of the integrated transcriptomic and epigenomic atlas in above publication. Also used as the reference dataset in the [Azimuth app](https://app.azimuth.hubmapconsortium.org/app/mouse-motorcortex).
+ Query data
  + Anatomic region: healthy mouse primary motor cortex (MOp, a.k.a. M1)
  + Experimental platform: 10X v2
  + Sample type: single-cell
  + Citation: [A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation](https://doi.org/10.1016/j.cell.2021.04.021). Data used here is part of the mouse whole cortex taxonomy in the above publication. Also used as the demo dataset in the [Azimuth app](https://app.azimuth.hubmapconsortium.org/app/mouse-motorcortex).

---

Load library and data.
```{r}
library(FRmatch)
load("sce-mouse-M1-10X-snRNA-subclass.rda")
load("sce-mouse-M1-10X-scRNA-subclass.rda")
```


Rename the data objects for simplification.
```{r}
## query data
sce.query <- sce.mouse.M1.10X.scRNA.subclass
## reference data
sce.ref <- sce.mouse.M1.10X.snRNA.subclass
```


Take a look at the gene expression data distribution. _We preprocessed the 10X data by taking the `log1p()` transformation of the raw count data, which was stored in the `@assays$logcounts` slot of the data object._ 
```{r, fig.width=10, fig.height=4}
par(mfrow=c(1,2))
hist(logcounts(sce.query))
hist(logcounts(sce.ref))
```

Since both datasets follow similar distribution, just take the scaling option in the normalization step.
```{r}
sce.query.scale <- normalization(sce.query, scale=TRUE)
sce.ref.scale <- normalization(sce.ref, scale=TRUE)
```

Check again the scaled data distribution, ranging from 0 to 1.
```{r, fig.width=10, fig.height=4}
par(mfrow=c(1,2))
hist(logcounts(sce.query.scale))
hist(logcounts(sce.ref.scale))
```


Take an overview of the cell type clusters in both datasets. _In FR-Match, we are going to ignore the clusters with very few cells._
```{r, fig.width=15,fig.height=10}
## cluster size
plot_clusterSize(sce.query, sce.ref, name.E1 = "Mouse M1 10X scRNA", name.E2 = "Mouse M1 10X snRNA")
```


The following codes are suggested to plot all cell type barcode plots in an assigned folder. NOT RUN HERE.
```{r, eval=FALSE}
ref.clusters <- unique(colData(sce.ref)$cluster_membership)
for(cl in ref.clusters){
  plot_cluster_by_markers(sce.ref.scale, cluster.name = cl, name.E1 = "M1_10X_snRNA_",
                          filename = paste0("plot_barcodes_M1_10X_snRNA/scale_",cl,".pdf"))
}

query.clusters <- unique(colData(sce.query)$cluster_membership)
for(cl in query.clusters){
  plot_cluster_by_markers(sce.ref.scale, sce.query.scale, cluster.name = cl, name.E2 = "M1_10X_scRNA_",
                          filename = paste0("plot_barcodes_M1_10X_scRNA/scale_",cl,".pdf"))
}
```


# Run FR-Match

## Cell-to-cluster matching

Below we provide some guidance on parameter configuration.

  + We ignore small clusters that have less than 5 cells (`filter.size=5`)
  + For cell2cluster matching, it is recommended to use `subsamp.size=10`. Your best choice depends on your data. The choice of this parameter will impact the matching score (i.e. p-value), which is used to determine a match or unassigned match.
  + `use.cosine=TRUE` indicates to use the cosine distance instead of the Euclidean distance in the calculation, which will bring more robustness in matching between different types of samples.
  + `prefix=c("scRNAseq.","snRNAseq.")` is the customizable names of your query and reference data.

```{r, eval=FALSE}
rst.cell2cluster <- FRmatch_cell2cluster(sce.query.scale, sce.ref.scale, 
                                         filter.size=5, subsamp.size=10, use.cosine=TRUE,
                                         prefix=c("scRNAseq.","snRNAseq."))
```

The easiest way to look at the results is to use its plot function, which shows the proportion of query cells that are matched to the reference cluster. Note that the column sum is 1.
```{r, fig.width=10, fig.height=9}
plot_FRmatch_cell2cluster(rst.cell2cluster, type="match.prop")
```

## The under-partitioned SMC_Peri cluster

In this example, there is an under-partitioned query cluster (SMC_Peri), which is matched in part to two reference clusters (SMC and Peri). We can look at the specific barcodes of the under-partitioned query cluster and the matched reference clusters for more insights.
```{r, fig.height=7}
plot_cluster_by_markers(sce.ref.scale, sce.query.scale, cluster.name = "SMC_Peri", name.E2 = "scRNA.")
plot_cluster_by_markers(sce.ref.scale, cluster.name = "SMC", name.E1 = "snRNA.")
plot_cluster_by_markers(sce.ref.scale, cluster.name = "Peri", name.E1 = "snRNA.")
```


## Available cell2cluster results

Matching results of each query cell can be accessed below.
```{r}
rst.cell2cluster$cell2cluster
```

## Cluster-to-cluster matching

Though it is suggested to use slightly larger subsampling size (`subsamp.size=20` by default) for the cluster-to-cluster matching. Here, we use `subsamp.size=10` again in this use case. Your best setting may depend on your data, therefore we recommend to check the cluster sizes in the beginning.
```{r, eval=FALSE}
rst <- FRmatch(sce.query.scale, sce.ref.scale, 
               filter.size=5, subsamp.size=10, use.cosine=TRUE,
               prefix=c("scRNAseq.","snRNAseq."))
```

Visualize the results. Since larger subsampling size is used and the nature of taking a cluster as a whole, the cluster-level matching is more conserved, resulting to some unassigned matches including the under-partitioned cluster.
```{r, fig.width=3, fig.height=3}
plot_FRmatch(rst, sig.level=.1)
```

## Other presentations of the results

We may use the following plot to look at the adjusted p-values, and choose a reasonable `sig.level` (red dashed line) to determine matches (above the red line) and unassigned matches (below the red line).
```{r, fig.width=3, fig.height=2}
plot_FRmatch(rst, type="padj", sig.level = .1)
```

We can also use the prediction function to list the most similar reference cluster (i.e. highest adjusted p-value) to each query cluster, regardless the `sig.level` threshold.
```{r}
## top match
predict_most_similar_cluster(rst)
```


## Minimum spanning tree plots

Lastly, we can also use the minimum spanning tree (MST) plot to diagnose the under-partitioned cluster and its partial matches. 

```{r, fig.width=5,fig.height=5}
plot_MST(sce.query.scale, sce.ref.scale, "SMC_Peri", "SMC", use.cosine=T)
plot_MST(sce.query.scale, sce.ref.scale, "SMC_Peri", "Peri", use.cosine=T)
```

