Analyse of Spatial Transcriptome data using VISGP

This tutorial shows loading, preprocessing, VISGP analyse of Spatial Transcriptome dataset.

Import packages

Here, we’ll import scbean along with other popular packages.

[1]:

from scbean.model import visgp
import pandas as pd
import multiprocessing as mp
import anndata as ad
import warnings
warnings.filterwarnings('ignore')

Loading dataset

This tutorial uses spatial transcriptome data of human breast cancer (layer 2), which contains 14,789 genes measured on 251 spots. It can be downloaded from Spatial Transcriptomics Research.

[2]:

filepath = 'Users/wyr/data/Layer2_BC_count_matrix-1.tsv'
data = pd.read_csv(filepath, sep='\t')

[3]:

data

[3]:

	Unnamed: 0	GAPDH	USP4	MAPKAPK2	CPEB1	LANCL2	MCL1	TMEM109	TMEM189	ITPK1	...	TREML1	C12orf79	ZCCHC12	ZNF222	TRIM17	RNASEK	KSR2	PCDHGB4	ACOXL	CASQ2
0	17.907x4.967	1	1	1	0	0	2	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	18.965x5.003	7	0	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	18.954x5.995	5	0	0	0	0	2	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	17.846x5.993	1	0	0	0	0	2	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	20.016x6.019	2	0	1	0	0	2	0	0	1	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
246	23.094x23.975	4	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
247	24.981x23.964	4	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
248	21.874x24.852	4	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
249	23.096x24.93	2	0	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
250	24.076x25.951	1	0	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0

251 rows × 14790 columns

Each row represents a two-dimensional position and each column represents a gene.

Preprocessing

Here, we separate location information and gene expression from the original data to meet the input of the model. In addition, we will filter out some genes and spots with low expression levels.

[4]:

location = pd.DataFrame(index=data.index)
location['x'] = data['Unnamed: 0'].str.split('x').str.get(0).map(float)
location['y'] = data['Unnamed: 0'].str.split('x').str.get(1).map(float)
data.drop('Unnamed: 0', axis=1, inplace=True)
# Filter practically unobserved genes
data = data.T[data.sum(0) >= 10].T
# genes * location
data = data.T
location = location[data.sum(0) >= 10]
data = data.T[data.sum(0) >= 10].T

# model inputs (anndata)
obs = pd.DataFrame()
obs['gene_name'] = data.index.values
adata = ad.AnnData(data.values, obs=obs, var=location, dtype='float64')

Identify SV genes via VGP

The result will be a DataFrame with p-value and adjusted p-value (q-value) for each gene. We identified genes with q-value less than 0.05 as spatially variable genes.

[5]:

obj = visgp.VISGP(adata, processes=mp.cpu_count())
results = obj.run()

100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [4:36:11<00:00,  1.67s/it]
  0%|                                                                                         | 0/9907 [00:00<?, ?it/s]

Results....................

100%|████████████████████████████████████████████████████████████████████████████| 9907/9907 [00:04<00:00, 2143.60it/s]

[7]:

results

[7]:

	gene	p_value	q_value
0	GAPDH	0.0	0.0
1	USP4	0.550989	0.551075
2	MAPKAPK2	0.000157	0.000158
3	CPEB1	0.608838	0.608953
4	LANCL2	0.890499	0.89099
...	...	...	...
9902	AC245100.1	0.265572	0.265635
9903	HSD17B6	0.572269	0.572319
9904	ARNTL	0.56616	0.566218
9905	PAQR8	0.000011	0.000011
9906	XRCC4	0.475503	0.475589

9907 rows × 3 columns

According to the result, we screened the SV genes, that is, the genes with q-value less than 0.05.

[8]:

results.sort_values(["q_value"], inplace=True)
len(results[results["q_value"]<0.05])

[8]:

[9]:

results.head(10)

[9]:

	gene	p_value	q_value
0	GAPDH	0.0	0.0
608	GRINA	0.0	0.0
2094	PSMD8	0.0	0.0
610	HNRNPL	0.0	0.0
2080	PGK1	0.0	0.0
274	KTN1	0.0	0.0
2073	ARPC1B	0.0	0.0
2061	ANP32B	0.0	0.0
271	ACACA	0.0	0.0
623	COL6A1	0.0	0.0