Analyse of Spatial Transcriptome data using VISGP
This tutorial shows loading, preprocessing, VISGP analyse of Spatial Transcriptome dataset.
Import packages
Here, we’ll import scbean along with other popular packages.
[1]:
from scbean.model import visgp
import pandas as pd
import multiprocessing as mp
import anndata as ad
import warnings
warnings.filterwarnings('ignore')
Loading dataset
This tutorial uses spatial transcriptome data of human breast cancer (layer 2), which contains 14,789 genes measured on 251 spots. It can be downloaded from Spatial Transcriptomics Research.
[2]:
filepath = 'Users/wyr/data/Layer2_BC_count_matrix-1.tsv'
data = pd.read_csv(filepath, sep='\t')
[3]:
data
[3]:
Unnamed: 0 | GAPDH | USP4 | MAPKAPK2 | CPEB1 | LANCL2 | MCL1 | TMEM109 | TMEM189 | ITPK1 | ... | TREML1 | C12orf79 | ZCCHC12 | ZNF222 | TRIM17 | RNASEK | KSR2 | PCDHGB4 | ACOXL | CASQ2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.907x4.967 | 1 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 18.965x5.003 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 18.954x5.995 | 5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 17.846x5.993 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 20.016x6.019 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
246 | 23.094x23.975 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
247 | 24.981x23.964 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
248 | 21.874x24.852 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
249 | 23.096x24.93 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
250 | 24.076x25.951 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
251 rows × 14790 columns
Each row represents a two-dimensional position and each column represents a gene.
Preprocessing
Here, we separate location information and gene expression from the original data to meet the input of the model. In addition, we will filter out some genes and spots with low expression levels.
[4]:
location = pd.DataFrame(index=data.index)
location['x'] = data['Unnamed: 0'].str.split('x').str.get(0).map(float)
location['y'] = data['Unnamed: 0'].str.split('x').str.get(1).map(float)
data.drop('Unnamed: 0', axis=1, inplace=True)
# Filter practically unobserved genes
data = data.T[data.sum(0) >= 10].T
# genes * location
data = data.T
location = location[data.sum(0) >= 10]
data = data.T[data.sum(0) >= 10].T
# model inputs (anndata)
obs = pd.DataFrame()
obs['gene_name'] = data.index.values
adata = ad.AnnData(data.values, obs=obs, var=location, dtype='float64')
Identify SV genes via VGP
The result will be a DataFrame with p-value and adjusted p-value (q-value) for each gene. We identified genes with q-value less than 0.05 as spatially variable genes.
[5]:
obj = visgp.VISGP(adata, processes=mp.cpu_count())
results = obj.run()
100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [4:36:11<00:00, 1.67s/it]
0%| | 0/9907 [00:00<?, ?it/s]
Results....................
100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [00:04<00:00, 2143.60it/s]
[7]:
results
[7]:
gene | p_value | q_value | |
---|---|---|---|
0 | GAPDH | 0.0 | 0.0 |
1 | USP4 | 0.550989 | 0.551075 |
2 | MAPKAPK2 | 0.000157 | 0.000158 |
3 | CPEB1 | 0.608838 | 0.608953 |
4 | LANCL2 | 0.890499 | 0.89099 |
... | ... | ... | ... |
9902 | AC245100.1 | 0.265572 | 0.265635 |
9903 | HSD17B6 | 0.572269 | 0.572319 |
9904 | ARNTL | 0.56616 | 0.566218 |
9905 | PAQR8 | 0.000011 | 0.000011 |
9906 | XRCC4 | 0.475503 | 0.475589 |
9907 rows × 3 columns
According to the result, we screened the SV genes, that is, the genes with q-value less than 0.05.
[8]:
results.sort_values(["q_value"], inplace=True)
len(results[results["q_value"]<0.05])
[8]:
4134
[9]:
results.head(10)
[9]:
gene | p_value | q_value | |
---|---|---|---|
0 | GAPDH | 0.0 | 0.0 |
608 | GRINA | 0.0 | 0.0 |
2094 | PSMD8 | 0.0 | 0.0 |
610 | HNRNPL | 0.0 | 0.0 |
2080 | PGK1 | 0.0 | 0.0 |
274 | KTN1 | 0.0 | 0.0 |
2073 | ARPC1B | 0.0 | 0.0 |
2061 | ANP32B | 0.0 | 0.0 |
271 | ACACA | 0.0 | 0.0 |
623 | COL6A1 | 0.0 | 0.0 |