Analyse of Spatial Transcriptome data using VISGP

This tutorial shows loading, preprocessing, VISGP analyse of Spatial Transcriptome dataset.

Import packages

Here, we’ll import scbean along with other popular packages.

[1]:
from scbean.model import visgp
import pandas as pd
import multiprocessing as mp
import anndata as ad
import warnings
warnings.filterwarnings('ignore')

Loading dataset

This tutorial uses spatial transcriptome data of human breast cancer (layer 2), which contains 14,789 genes measured on 251 spots. It can be downloaded from Spatial Transcriptomics Research.

[2]:
filepath = 'Users/wyr/data/Layer2_BC_count_matrix-1.tsv'
data = pd.read_csv(filepath, sep='\t')
[3]:
data
[3]:
Unnamed: 0 GAPDH USP4 MAPKAPK2 CPEB1 LANCL2 MCL1 TMEM109 TMEM189 ITPK1 ... TREML1 C12orf79 ZCCHC12 ZNF222 TRIM17 RNASEK KSR2 PCDHGB4 ACOXL CASQ2
0 17.907x4.967 1 1 1 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 18.965x5.003 7 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 18.954x5.995 5 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 17.846x5.993 1 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 20.016x6.019 2 0 1 0 0 2 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
246 23.094x23.975 4 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
247 24.981x23.964 4 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
248 21.874x24.852 4 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
249 23.096x24.93 2 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
250 24.076x25.951 1 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

251 rows × 14790 columns

Each row represents a two-dimensional position and each column represents a gene.

Preprocessing

Here, we separate location information and gene expression from the original data to meet the input of the model. In addition, we will filter out some genes and spots with low expression levels.

[4]:
location = pd.DataFrame(index=data.index)
location['x'] = data['Unnamed: 0'].str.split('x').str.get(0).map(float)
location['y'] = data['Unnamed: 0'].str.split('x').str.get(1).map(float)
data.drop('Unnamed: 0', axis=1, inplace=True)
# Filter practically unobserved genes
data = data.T[data.sum(0) >= 10].T
# genes * location
data = data.T
location = location[data.sum(0) >= 10]
data = data.T[data.sum(0) >= 10].T

# model inputs (anndata)
obs = pd.DataFrame()
obs['gene_name'] = data.index.values
adata = ad.AnnData(data.values, obs=obs, var=location, dtype='float64')

Identify SV genes via VGP

The result will be a DataFrame with p-value and adjusted p-value (q-value) for each gene. We identified genes with q-value less than 0.05 as spatially variable genes.

[5]:
obj = visgp.VISGP(adata, processes=mp.cpu_count())
results = obj.run()
100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [4:36:11<00:00,  1.67s/it]
  0%|                                                                                         | 0/9907 [00:00<?, ?it/s]
Results....................
100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [00:04<00:00, 2143.60it/s]
[7]:
results
[7]:
gene p_value q_value
0 GAPDH 0.0 0.0
1 USP4 0.550989 0.551075
2 MAPKAPK2 0.000157 0.000158
3 CPEB1 0.608838 0.608953
4 LANCL2 0.890499 0.89099
... ... ... ...
9902 AC245100.1 0.265572 0.265635
9903 HSD17B6 0.572269 0.572319
9904 ARNTL 0.56616 0.566218
9905 PAQR8 0.000011 0.000011
9906 XRCC4 0.475503 0.475589

9907 rows × 3 columns

According to the result, we screened the SV genes, that is, the genes with q-value less than 0.05.

[8]:
results.sort_values(["q_value"], inplace=True)
len(results[results["q_value"]<0.05])
[8]:
4134
[9]:
results.head(10)
[9]:
gene p_value q_value
0 GAPDH 0.0 0.0
608 GRINA 0.0 0.0
2094 PSMD8 0.0 0.0
610 HNRNPL 0.0 0.0
2080 PGK1 0.0 0.0
274 KTN1 0.0 0.0
2073 ARPC1B 0.0 0.0
2061 ANP32B 0.0 0.0
271 ACACA 0.0 0.0
623 COL6A1 0.0 0.0