Download data here. For how to construct the data objects, see Tutorial.

Quick facts about the datasets:


Load library and data.

library(FRmatch)
load("sce-human-M1-10X-subclass.rda")
load("sce-human-M1C-SS4-subclass.rda")

Rename the data objects for simplification.

## query data
sce.query <- sce.human.M1C.SS4.subclass
## reference data
sce.ref <- sce.human.M1.10X.subclass

Take a look at the gene expression data distribution. We preprocessed the 10X data by taking the log1p() transformation of the raw count data; we preprocessed the SMART-seq data by taking the logCPM of the raw count data. The log-transformed data were stored in the @assays$logcounts slot of each data object.

par(mfrow=c(1,2))
hist(logcounts(sce.query))
stats.query <- summary(as.vector(logcounts(sce.query)))
pzero.query <- sum(as.vector(logcounts(sce.query))==0)/length(as.vector(logcounts(sce.query)))
legend("topright", paste(c(names(stats.query), "zero pct."),"=",round(c(stats.query,pzero.query),3)))
hist(logcounts(sce.ref))
stats.ref <- summary(as.vector(logcounts(sce.ref)))
pzero.ref <- sum(as.vector(logcounts(sce.ref))==0)/length(as.vector(logcounts(sce.ref)))
legend("topright", paste(c(names(stats.ref), "zero pct."),"=",round(c(stats.ref,pzero.ref),3)))

The SMART-seq and 10X data form quite different data distributions. It is noticeable that the SMART-seq protocol has better sensitivity, and the 10X data has more zero values (potentially higher dropout rate). For the cross-platform matching, normalization is necessary!

sce.query.norm <- normalization(sce.query, scale=TRUE, norm.by="mean") #SMART-seq
sce.ref.scale <- normalization(sce.ref, scale=TRUE) #10X data only need scaling

Check again the normalized data distribution. Now they range from 0 to 1, and form similar distribution.

par(mfrow=c(1,2))
hist(logcounts(sce.query.norm))
hist(logcounts(sce.ref.scale))

Take an overview of the cell type clusters in both datasets. Only cells with cluster labels are used, i.e. removed cells without label from the original publication. In FR-Match, we are going to ignore the clusters with very few cells.

## cluster size
plot_clusterSize(sce.query, sce.ref, name.E1 = "Huamn M1C SS4", name.E2 = "Huamn M1 10X")

The following codes are suggested to plot all cell type barcode plots in an assigned folder. NOT RUN HERE.

ref.clusters <- unique(colData(sce.ref)$cluster_membership)
for(cl in ref.clusters){
  plot_cluster_by_markers(sce.ref.scale, cluster.name = cl, name.E1 = "M1_10X_",
                          filename = paste0("plot_barcodes_M1_10X/scale_",cl,".pdf"))
}

query.clusters <- unique(colData(sce.query)$cluster_membership)
for(cl in query.clusters){
  plot_cluster_by_markers(sce.query.norm, sce.query.norm, cluster.name = cl, name.E2 = "M1C_SS4_",
                          filename = paste0("plot_barcodes_M1C_SS4/norm_",cl,".pdf"))
}

Run FR-Match

Cell-to-cluster matching

Below we provide some guidance on parameter configuration.

  • We ignore small clusters that have less than 5 cells (filter.size=5)
  • For cell2cluster matching, it is recommended to use subsamp.size=10. Your best choice depends on your data. The choice of this parameter will impact the matching score (i.e. p-value), which is used to determine a match or unassigned match.
  • use.cosine=TRUE indicates to use the cosine distance instead of the Euclidean distance in the calculation, which will bring more robustness in matching between different types of samples.
  • prefix=c("ss4.","tenx.") is the customizable names of your query and reference data.
rst.cell2cluster <- FRmatch_cell2cluster(sce.query.norm, sce.ref.scale, 
                                         filter.size=5, subsamp.size=10, use.cosine=TRUE,
                                         prefix=c("ss4.","tenx."))

The easiest way to look at the results is to use its plot function, which shows the proportion of query cells that are matched to the reference cluster. Note that the column sum is 1.

plot_FRmatch_cell2cluster(rst.cell2cluster, type="match.prop")

Matching results of each query cell can be accessed below.

rst.cell2cluster$cell2cluster

Cluster-to-cluster matching

Though it is suggested to use slightly larger subsampling size (subsamp.size=20 by default) for the cluster-to-cluster matching. Here, we use subsamp.size=10 again in this example. Your best setting may depend on your data, therefore we recommend to check the cluster sizes in the beginning.

rst <- FRmatch(sce.query.norm, sce.ref.scale, 
               filter.size=5, subsamp.size=10, use.cosine=TRUE,
               prefix=c("ss4.","tenx."))

Visualize the results.

plot_FRmatch(rst)

plot_FRmatch(rst, type="padj")

The prediction function below returns the most similar reference cluster for each query cluster, though some adjusted p-values may be below the threshold (indicating not a firm match).

predict_most_similar_cluster(rst)
---
title: "Matching human M1 cell subclasses across different platforms - SMART-seq and 10X"
output: html_notebook
author: "Yun (Renee) Zhang, zhangy@jcvi.org"
date: "`r format(Sys.time(), '%B %d, %Y')`"
---

```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
setwd("~/Documents/BICCN-BrainStandards-2020/Anlayses/Mapping/Human-M1/FRmatch/tutorial/")
```

---

Download data [here](https://jcvioffice365-my.sharepoint.com/:f:/r/personal/zhangy_jcvi_org/Documents/FR-Match%20demo%20data/Tutorial_human_M1_10X_SS4?csf=1&web=1&e=2NEV5s). 
For how to construct the data objects, see [Tutorial](https://jcventerinstitute.github.io/celligrate/FRmatch-vignette.html).

Quick facts about the datasets:

+ Reference data
  + Anatomic region: healthy human primary motor cortex (M1)
  + Experimental platform: 10X Chromium v3 (Cv3)
  + Sample type: single-nucleus
  + Citation: [Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse](https://www.biorxiv.org/content/10.1101/2020.03.31.016972v2). Data used here is the human Cv3 subset of the cross-species M1 data published in the above publication (see Figure 1b).
  + Data portal: [Human M1 10x](https://portal.brain-map.org/atlases-and-data/rnaseq/human-m1-10x). Also used as the reference dataset in the [Azimuth app](https://app.azimuth.hubmapconsortium.org/app/human-motorcortex).
  
+ Query data
  + Anatomic region: healthy human primary motor cortex (M1C or M1)
  + Experimental platform: SMART-Seq v4 (SS4)
  + Sample type: single-nucleus
  + Data portal: [Human Multiple Cortical Areas SMART-seq](https://portal.brain-map.org/atlases-and-data/rnaseq/human-multiple-cortical-areas-smart-seq). Also used as the demo dataset in the [Azimuth app](https://app.azimuth.hubmapconsortium.org/app/human-motorcortex).

---

Load library and data.
```{r}
library(FRmatch)
load("sce-human-M1-10X-subclass.rda")
load("sce-human-M1C-SS4-subclass.rda")
```


Rename the data objects for simplification.
```{r}
## query data
sce.query <- sce.human.M1C.SS4.subclass
## reference data
sce.ref <- sce.human.M1.10X.subclass
```

Take a look at the gene expression data distribution. _We preprocessed the 10X data by taking the `log1p()` transformation of the raw count data; we preprocessed the SMART-seq data by taking the logCPM of the raw count data. The log-transformed data were stored in the `@assays$logcounts` slot of each data object._ 
```{r, fig.width=5, fig.height=2.5}
par(mfrow=c(1,2))
hist(logcounts(sce.query))
stats.query <- summary(as.vector(logcounts(sce.query)))
pzero.query <- sum(as.vector(logcounts(sce.query))==0)/length(as.vector(logcounts(sce.query)))
legend("topright", paste(c(names(stats.query), "zero pct."),"=",round(c(stats.query,pzero.query),3)))
hist(logcounts(sce.ref))
stats.ref <- summary(as.vector(logcounts(sce.ref)))
pzero.ref <- sum(as.vector(logcounts(sce.ref))==0)/length(as.vector(logcounts(sce.ref)))
legend("topright", paste(c(names(stats.ref), "zero pct."),"=",round(c(stats.ref,pzero.ref),3)))
```

The SMART-seq and 10X data form quite different data distributions. It is noticeable that the SMART-seq protocol has better sensitivity, and the 10X data has more zero values (potentially higher dropout rate).
__For the cross-platform matching, normalization is necessary!__
```{r}
sce.query.norm <- normalization(sce.query, scale=TRUE, norm.by="mean") #SMART-seq
sce.ref.scale <- normalization(sce.ref, scale=TRUE) #10X data only need scaling
```

Check again the normalized data distribution. Now they range from 0 to 1, and form similar distribution.
```{r, fig.width=5, fig.height=2.5}
par(mfrow=c(1,2))
hist(logcounts(sce.query.norm))
hist(logcounts(sce.ref.scale))
```


Take an overview of the cell type clusters in both datasets. _Only cells with cluster labels are used, i.e. removed cells without label from the original publication. In FR-Match, we are going to ignore the clusters with very few cells._
```{r, fig.width=5,fig.height=4}
## cluster size
plot_clusterSize(sce.query, sce.ref, name.E1 = "Huamn M1C SS4", name.E2 = "Huamn M1 10X")
```

The following codes are suggested to plot all cell type barcode plots in an assigned folder. NOT RUN HERE.

```{r, eval=FALSE}
ref.clusters <- unique(colData(sce.ref)$cluster_membership)
for(cl in ref.clusters){
  plot_cluster_by_markers(sce.ref.scale, cluster.name = cl, name.E1 = "M1_10X_",
                          filename = paste0("plot_barcodes_M1_10X/scale_",cl,".pdf"))
}

query.clusters <- unique(colData(sce.query)$cluster_membership)
for(cl in query.clusters){
  plot_cluster_by_markers(sce.query.norm, sce.query.norm, cluster.name = cl, name.E2 = "M1C_SS4_",
                          filename = paste0("plot_barcodes_M1C_SS4/norm_",cl,".pdf"))
}
```


# Run FR-Match

## Cell-to-cluster matching

Below we provide some guidance on parameter configuration.

  + We ignore small clusters that have less than 5 cells (`filter.size=5`)
  + For cell2cluster matching, it is recommended to use `subsamp.size=10`. Your best choice depends on your data. The choice of this parameter will impact the matching score (i.e. p-value), which is used to determine a match or unassigned match.
  + `use.cosine=TRUE` indicates to use the cosine distance instead of the Euclidean distance in the calculation, which will bring more robustness in matching between different types of samples.
  + `prefix=c("ss4.","tenx.")` is the customizable names of your query and reference data.

```{r, eval=FALSE}
rst.cell2cluster <- FRmatch_cell2cluster(sce.query.norm, sce.ref.scale, 
                                         filter.size=5, subsamp.size=10, use.cosine=TRUE,
                                         prefix=c("ss4.","tenx."))
```

The easiest way to look at the results is to use its plot function, which shows the proportion of query cells that are matched to the reference cluster. Note that the column sum is 1.
```{r, fig.width=4.5, fig.height=4}
plot_FRmatch_cell2cluster(rst.cell2cluster, type="match.prop")
```

Matching results of each query cell can be accessed below.
```{r}
rst.cell2cluster$cell2cluster
```


## Cluster-to-cluster matching

Though it is suggested to use slightly larger subsampling size (`subsamp.size=20` by default) for the cluster-to-cluster matching. Here, we use `subsamp.size=10` again in this example. Your best setting may depend on your data, therefore we recommend to check the cluster sizes in the beginning.

```{r, eval=FALSE}
rst <- FRmatch(sce.query.norm, sce.ref.scale, 
               filter.size=5, subsamp.size=10, use.cosine=TRUE,
               prefix=c("ss4.","tenx."))
```

Visualize the results. 
```{r, fig.width=3, fig.height=2.5}
plot_FRmatch(rst)
```

```{r, fig.width=3, fig.height=2}
plot_FRmatch(rst, type="padj")
```

The prediction function below returns the most similar reference cluster for each query cluster, though some adjusted p-values may be below the threshold (indicating not a firm match).
```{r}
predict_most_similar_cluster(rst)
```


