Applying STANCE to Spatial Transcriptomics Data

library(STANCE)

## Warning: replacing previous import 'gaston::sigma' by 'stats::sigma' when
## loading 'STANCE'

library(Matrix)
library(ggplot2)
library(reshape2)
library(dplyr)

Load in data

Mouse olfactory bulb replicate No.8 is a \(100 \mu m^2\) resolution spatial transcriptomics dataset of 15,928 genes measured on 262 spots. Following the procedure of Miller et al. (2022), data preprocessing and cell type deconvolution by STdeconvolve have been conducted, resulting in the mOB dataset, which is a list containing the gene expression, spot location coordinates and cell type compositions.

Specifically, counts contains the gene expression of 7,365 genes across 260 spots.

data(mOB)
mOB$counts[1:5,1:5]

##               ACAACTATGGGTTGGCGG ACACAGATCCTGTTCTGA ACATCACCTGCGCGCTCT
## 0610007P14Rik                  1                  3                  3
## 0610009B22Rik                  0                  0                  0
## 0610009D07Rik                  1                  2                  5
## 0610009O20Rik                  0                  2                  2
## 0610010F05Rik                  0                  0                  1
##               ACATTTAAGGCGCATGAT ACCACTGTAATCTCCCAT
## 0610007P14Rik                  4                  1
## 0610009B22Rik                  1                  0
## 0610009D07Rik                  5                  1
## 0610009O20Rik                  1                  1
## 0610010F05Rik                  2                  0

pos is the spatial location coordinates of the 260 spots.

mOB$pos[1:5,]

##                         x      y
## ACAACTATGGGTTGGCGG 16.001 16.036
## ACACAGATCCTGTTCTGA 26.073 15.042
## ACATCACCTGCGCGCTCT 13.048 21.079
## ACATTTAAGGCGCATGAT 13.963 18.117
## ACCACTGTAATCTCCCAT 24.104 13.074

prop is cell type proportion of 12 cell types across the 260 spots.

mOB$prop[1:5,]

##                             1         2          3         4         5
## ACAACTATGGGTTGGCGG 0.11641560 0.0000000 0.08669002 0.0000000 0.0000000
## ACACAGATCCTGTTCTGA 0.00000000 0.1561022 0.00000000 0.0000000 0.0000000
## ACATCACCTGCGCGCTCT 0.09625801 0.0000000 0.00000000 0.3728770 0.1578947
## ACATTTAAGGCGCATGAT 0.11485435 0.0000000 0.00000000 0.7181721 0.1669736
## ACCACTGTAATCTCCCAT 0.07077041 0.0000000 0.00000000 0.0000000 0.0914882
##                            6         7          8          9        10
## ACAACTATGGGTTGGCGG 0.0000000 0.1204314 0.13020670 0.09144814 0.1808773
## ACACAGATCCTGTTCTGA 0.0000000 0.0000000 0.08471756 0.00000000 0.0000000
## ACATCACCTGCGCGCTCT 0.1174574 0.0000000 0.00000000 0.25551295 0.0000000
## ACATTTAAGGCGCATGAT 0.0000000 0.0000000 0.00000000 0.00000000 0.0000000
## ACCACTGTAATCTCCCAT 0.0000000 0.0000000 0.23230714 0.00000000 0.2202639
##                           11        12
## ACAACTATGGGTTGGCGG 0.2739309 0.0000000
## ACACAGATCCTGTTCTGA 0.0000000 0.7591802
## ACATCACCTGCGCGCTCT 0.0000000 0.0000000
## ACATTTAAGGCGCATGAT 0.0000000 0.0000000
## ACCACTGTAATCTCCCAT 0.3256000 0.0595703

Create STANCE object

To run STANCE, we need to first create a STANCE object using creatSTANCEobject function, which requires 4 inputs: 1. counts: gene expression matrix with genes in rows and spots in columns; 2. pos: dimensional location coordinates matrix. The rownames of pos matrix should match the colnames of counts matrix; 3. prop: he cell type proportion of each spot. Rows represent spots and columns represent cell types. The rownames of prop matrix should match the colnames of counts matrix; 4. covariates (default NULL) the covariate (if any) design matrix modeling gene expression. The rownames of covariates matrix should match the colnames of counts matrix.

All the inputs should be convertible to ‘matrix’ form.

mySTANCE <- creatSTANCEobject(counts = mOB$counts,
                       pos = mOB$pos,
                       prop = mOB$prop,
                       covariates = NULL)

Preprocess the input data

Normalize gene expression count data, scale location coordinates and perform quality conrol steps, including removing spots with gene expression less than \(10\), and removing genes whose proportion of zero-express spots greater than \(0.05\) by default. normalized = TRUE is used to specify that the gene expression input has been normalized so as to skip the normalization step.

mySTANCE <- data_preprocess(object = mySTANCE, 
                            gene.threshold = 0.05, spot.threshold = 10,
                            normalized = FALSE)

## The location matrix has been scaled.
## 0 spots with gene expression less than 10 have been removed. 
## 0 mt genes and the genes with gene expression rate less than 0.05 have been removed. 
## Normalizing the count data...

Build Gaussian kernel matrix and construct the important matrices

Build the Gaussian kernel matrix to model spatial correlation between spots. The Gaussian kernel bandwidth will be selected automatically: - for datasets with 5000 spots or fewer, non-parametric Sheather-Jones’ bandwidths are computed for each gene, and the median of these values is used as the common bandwidth; - for datasets with more than 5000 spots, Silverman’s bandwidths are computed for each gene, and the median of these gene-specific bandwidths is used as the common bandwidth.

This step also constructs the important matrices, including design matrices \(\mathbf{X}\), covariance matrices (i.e. \(\mathbf{K}\) for spatial effects \(\gamma_k\) and \(\mathbf{I}\) for residual error \(\varepsilon\)) and \(\mathbf{\Sigma}_k = \mathbf{\Pi}_k \mathbf{K} \mathbf{\Pi}_k^T\) matrices for all the \(\gamma_k\)’s, before fitting STANCE model.

mySTANCE <- build_kernelMatrix(object = mySTANCE)

## Building kernel matrix...
## Computing covariance matrices...

Perform STANCE overall test for utSVGs (including both SVGs & ctSVGs)

Here, for efficiency, we skip the bias correction procedure (correction = F). Use “Benjamini-Yekutieli” method to adjust p-values to control false discovery rate (pv.adjust = "BY").

mySTANCE <- runTest1(object = mySTANCE,
                     correction = F, pv.adjust = "BY")

## Start running STANCE Stage 1 test...

The function outputs a data frame containing the test statistic values, the associated p-values as well as the “BY” adjusted p-values.

head(mySTANCE@Test_1)

##               statistic   p_value p_value_adj
## 0610007P14Rik  697.0190 0.0730441           1
## 0610009B22Rik  643.3463 0.5322611           1
## 0610009D07Rik  339.1009 0.6041702           1
## 0610009O20Rik  890.6290 0.2941044           1
## 0610010F05Rik  705.0588 0.5180449           1
## 0610010K14Rik  694.9316 0.1702137           1

Perform STANCE individual test

# List for all the genes in the experiment
genes.list <- rownames(mySTANCE@gene_expression)
# List for utSVGs detected by STANCE overall test
utSVG.list <- genes.list[mySTANCE@Test_1$p_value_adj < 0.05]
utSVG.list[171:180]

##  [1] "Cpe"   "Cplx1" "Cplx2" "Cpne2" "Cpne4" "Cpne6" "Crim1" "Crip2" "Crtc1"
## [10] "Cryab"

Perform STANCE individual test on each of genes passing the overall test (we called these genes “utSVGs”). Cell_types_to_test = NULL indicates that all the cell types would be tested. Genes_to_test = utSVG.list[1:100] specifies that the first 100 significant genes detected by overall test will be tested.

mySTANCE <- runTest2(object = mySTANCE, 
                     Genes_to_test = utSVG.list[101:200], 
                     Cell_types_to_test = NULL, 
                     correction = F, 
                     ncores = 1)

## Start running STANCE Stage 2 test...

The function outputs a list containing a data frame of test statistic values, as well as the associated p-values for each cell type.

# Individual testing results for cell type "1"
head(mySTANCE@Test_2[["1"]])

##        statistic     p_value
## Cabin1 32.189117 0.006338967
## Cabp1   8.277671 0.720506455
## Cacnb3  5.408803 0.627886687
## Cacnb4  4.204393 0.796376056
## Cacng3  8.888090 0.828823859
## Cadm2   9.944913 0.437784111

# Individual testing results for cell type "2"
head(mySTANCE@Test_2[["2"]])

##        statistic   p_value
## Cabin1  34.27551 0.5671753
## Cabp1   37.02897 0.5966591
## Cacnb3  27.48364 0.4167491
## Cacnb4  43.96634 0.1495813
## Cacng3  82.10546 0.2512983
## Cadm2   12.18509 0.9038280

Estimate the variance components of STANCE model for all the genes

Genes_to_test = genes.sig.1[1:100] specifies that the first 100 significant genes detected by overall test will be estimated.

mySTANCE <- varcomp_est(object = mySTANCE, 
                        Genes_to_est = utSVG.list[101:200],
                        ncores = 1)

head(mySTANCE@VarComp_estimates)

##                1          2          3          4         5          6
## Cabin1 0.2707131 0.00000100 0.03740342 0.06001812 0.0000010 0.04635161
## Cabp1  0.0000010 0.00000100 0.85834618 0.10474024 0.2698156 0.02106626
## Cacnb3 0.0000010 0.00000100 0.07156818 0.08217742 0.7436404 0.15448227
## Cacnb4 0.0000010 0.03835654 0.93057709 0.00000100 0.1765448 0.09052980
## Cacng3 0.0000010 0.03935319 0.00000100 0.10606986 0.0000010 0.13619183
## Cadm2  0.0000010 0.00000100 0.00000100 0.09824489 0.1126940 0.17077836
##                7           8          9          10         11         12
## Cabin1 0.0000010 0.000001000 0.00000100 0.000001000 0.00000100 0.02082207
## Cabp1  0.0000010 0.009390651 0.03807576 1.128242690 0.31302512 0.02143620
## Cacnb3 0.0000010 0.000001000 0.11339463 0.006910621 0.36386536 0.00000100
## Cacnb4 0.0000010 0.000001000 0.00000100 0.877788517 1.11742047 0.00000100
## Cacng3 0.0000010 0.000001000 0.00000100 0.000001000 0.03674453 0.00000100
## Cadm2  0.1226651 0.043506983 0.13440258 0.000001000 0.00000100 0.08973406
##        Error_residuals
## Cabin1      0.06559035
## Cabp1       0.05293168
## Cacnb3      0.10684810
## Cacnb4      0.11050081
## Cacng3      0.04797411
## Cadm2       0.06939496

Display top ctSVGs for the cell type of interest

Given the significance level 0.05, pick out top 20 ctSVGs (if less than 20, then all the ctSVGs will be displayed) for cell type of interest (CT_of_interest = "1").

mySTANCE <- CT_topGenes(object = mySTANCE, 
                        CT_of_interest = "1", numTopGenes = 20)

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

# Print the list of top genes for cell type "1"
cat(paste0('The top cell type specific SVGs for cell type \"',
           mySTANCE@cell_type_top_genes$cell_type_of_interest,
           '\": \n',
           paste(mySTANCE@cell_type_top_genes$top_genes, collapse = ","),
           '. \n'))

## The top cell type specific SVGs for cell type "1": 
## Cpne2,Cabin1,Dab2,Crip2,Chst8,Cbx6,Cxx1b,D4Wsu53e,Cdk17,Clstn2.

Draw a stacked bar plot for top genes visualizing the proportion of variance explained by all the cell types and error.

print(mySTANCE@cell_type_top_genes$stacked_bar_plot)

Visualize gene expression

For a given gene, its expression levels across spatial locations within the tissue can be visualized, providing an intuitive representation of spatial gene expression patterns. Here, we employ the visualizeGenePattern function to display the spatial expression pattern of “Cabin1,” which was identified as a cell-type-1-specific ctSVG by the STANCE Stage 2 test results.

When setting scaled_gene_expression = TRUE, the gene expression \(y_i\) at the \(i\)-th spot is scaled using the formula: \[y^{\text{scaled}}_i = \frac{y_i - \min(\mathbf{y})}{\max(\mathbf{y}) - \min(\mathbf{y})}\] where \(\mathbf{y}\) represents the vector of gene expression values across all spots. This scaling ensures the expression values are normalized to a range between 0 and 1, facilitating comparison and visualization.

visualizeGenePattern(object = mySTANCE, gene_to_display = "Cabin1",
                    filling_colors = c("white", "darkblue"),
                    scaled_gene_expression = TRUE,
                    point_size = 4, title_size = 16, legend_title_size = 14, legend_text_size = 10)