Integrating multiple scRNA-seq datasets

This tutorial walks through loading, preprocessing, DAVAE integration, and visualization of 293T and Jurkat cells from three batches (the Mixed Cell Lines dataset).

Importing scbean package

Here, we’ll import scbean along with other popular packages.

import pandas as pd
import scbean.model.davae as davae
import scbean.tools.utils as tl  # module path assumed; it is missing from the original line
import scanpy as sc
import matplotlib
from numpy.random import seed

# Command for Jupyter notebooks only
%matplotlib inline
import warnings
from matplotlib.axes._axes import _log as matplotlib_axes_logger

Loading data

This tutorial uses the Mixed Cell Lines datasets from 10x Genomics, which come in three batches. Two batches have non-overlapping populations, containing only 293T cells (2885 cells) and only Jurkat cells (3258 cells) respectively; the third is a 1:1 mixture of 293T and Jurkat cells (3388 cells).

  • Read from 10x mtx file: The files in 10x mtx format can be downloaded here. Set the fmt parameter of tl.read_sc_data() to '10x_mtx' to read data downloaded from 10x Genomics. If the downloaded file is in h5 format instead, set fmt to '10x_h5'.

file1 = './data/' + "293t/hg19/"
file2 = './data/' + "jurkat/hg19/"
file3 = './data/' + "jurkat_293t/hg19/"

adata_b1 = tl.read_sc_data(file1, fmt='10x_mtx', batch_name="293t")
adata_b2 = tl.read_sc_data(file2, fmt='10x_mtx', batch_name="jurkat")
adata_b3 = tl.read_sc_data(file3, fmt='10x_mtx', batch_name="mixed")
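Before calling the reader, it can help to confirm that each directory really has the 10x mtx layout. The sketch below assumes the uncompressed layout of older Cell Ranger output (matrix.mtx, genes.tsv, barcodes.tsv), which hg19/ folders typically contain. The helper name missing_10x_files is hypothetical and not part of scbean.

```python
from pathlib import Path

# Assumed file names for an uncompressed 10x mtx directory (older Cell Ranger output).
EXPECTED_10X_FILES = ("matrix.mtx", "genes.tsv", "barcodes.tsv")

def missing_10x_files(directory):
    """Return the expected 10x mtx files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_10X_FILES if not (root / name).is_file()]

# Report the state of the three batch directories used in this tutorial.
for path in ("./data/293t/hg19/", "./data/jurkat/hg19/", "./data/jurkat_293t/hg19/"):
    missing = missing_10x_files(path)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{path}: {status}")
```

If a directory holds gzipped files or a features.tsv instead of genes.tsv, the expected names above would need to be adjusted accordingly.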

Data preprocessing

Here, we filter and normalize each dataset separately and then concatenate them into one AnnData object. For more details, see the preprocessing API.

adata_all = tl.davae_preprocessing([adata_b1, adata_b2, adata_b3], index_unique="-")
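To make the three steps concrete (filter, normalize, concatenate), here is a minimal NumPy sketch on toy count matrices. It is not scbean's implementation: the function names, the min_genes threshold, and the target_sum value are illustrative assumptions, and the real pipeline also performs steps such as highly variable gene selection that are omitted here.

```python
import numpy as np

def preprocess_batch(counts, min_genes=2, target_sum=1e4):
    """Filter low-complexity cells, then library-size normalize and log-transform.

    counts: 2-D array of raw counts, cells x genes.
    The thresholds are illustrative assumptions, not scbean defaults.
    """
    counts = np.asarray(counts, dtype=float)
    genes_per_cell = (counts > 0).sum(axis=1)      # number of detected genes per cell
    kept = counts[genes_per_cell >= min_genes]     # drop cells with too few genes
    totals = kept.sum(axis=1, keepdims=True)       # library size of each kept cell
    return np.log1p(kept / totals * target_sum)

def concatenate_batches(batches):
    """Stack processed batches and record which batch each cell came from."""
    data = np.vstack(batches)
    labels = np.concatenate([np.full(len(b), i) for i, b in enumerate(batches)])
    return data, labels

batch_a = preprocess_batch([[5, 0, 3], [0, 0, 1], [2, 2, 2]])  # middle cell is dropped
batch_b = preprocess_batch([[1, 4, 0], [3, 3, 3]])
combined, batch_labels = concatenate_batches([batch_a, batch_b])
print(combined.shape, batch_labels)
```

Processing each batch before stacking means that filtering and scaling use only that batch's own statistics, which matches the "separately, then concatenate" order described above.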

DAVAE Integration

The code for integration using DAVAE is as follows:

# Command for Jupyter notebooks only
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

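adata_integrate = davae.fit_integration(
    adata_all,  # assumed first argument; the original call is truncated and may take further options
    hidden_layers=[64, 32, 6],
)
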
AnnData object with n_obs × n_vars = 9530 × 2000
    obs: '_batch', 'n_genes', 'percent_mito', 'n_counts', 'size_factor', 'loss_weight', 'batch_label', 'batch'
    var: 'gene_ids', 'n_cells-0-0', 'highly_variable-0-0', 'means-0-0', 'dispersions-0-0', 'dispersions_norm-0-0', 'n_cells-1-0', 'highly_variable-1-0', 'means-1-0', 'dispersions-1-0', 'dispersions_norm-1-0', 'n_cells-1', 'highly_variable-1', 'means-1', 'dispersions-1', 'dispersions_norm-1'
    obsm: 'X_davae'

1. The per-cell metadata (batch labels and quality-control metrics) is stored in adata.obs.

2. The DAVAE embedding of each cell is stored in adata.obsm['X_davae'].
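As a quick consistency check, the cell counts quoted for the three batches can be compared with the n_obs value in the AnnData summary above. The difference is presumably the cells dropped during filtering; that interpretation is an assumption, since the tutorial does not state it.

```python
# Cell counts per batch as quoted in the data description of this tutorial.
batch_sizes = {"293t": 2885, "jurkat": 3258, "mixed": 3388}
n_obs_after_integration = 9530  # n_obs reported in the AnnData summary

total_loaded = sum(batch_sizes.values())
n_removed = total_loaded - n_obs_after_integration
print(f"loaded {total_loaded} cells, kept {n_obs_after_integration}, removed {n_removed}")
```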

Loading results from an h5ad file: You can also download the integrated results and use them directly. The output.h5ad file containing the DAVAE result can be downloaded here.

adata_integrate = sc.read_h5ad('./adata_integrate.h5ad')

AnnData object with n_obs × n_vars = 9530 × 2000
    obs: '_batch', 'n_genes', 'percent_mito', 'n_counts', 'size_factor', 'loss_weight', 'batch_label', 'batch', 'leiden', 'celltype'
    var: 'gene_ids', 'n_cells-0-0', 'highly_variable-0-0', 'means-0-0', 'dispersions-0-0', 'dispersions_norm-0-0', 'n_cells-1-0', 'highly_variable-1-0', 'means-1-0', 'dispersions-1-0', 'dispersions_norm-1-0', 'n_cells-1', 'highly_variable-1', 'means-1', 'dispersions-1', 'dispersions_norm-1'
    uns: '_batch_colors', 'celltype_colors', 'leiden', 'leiden_colors', 'neighbors'
    obsm: 'X_davae', 'X_umap'
    obsp: 'connectivities', 'distances'

UMAP Visualization

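We use UMAP to reduce the embedding produced by DAVAE to two dimensions.

import umap
adata_integrate.obsm['X_umap'] = umap.UMAP().fit_transform(adata_integrate.obsm['X_davae'])
adata_integrate.uns['_batch_colors'] = ['#FF34FF', '#4FC601', '#3B5DFF']
sc.pl.umap(adata_integrate, color=['_batch', 'celltype'], s=3)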
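The '_batch_colors' entry works because scanpy looks up the colors for a categorical obs column in uns['<column>_colors'] and pairs them with the column's categories by position. The plain-Python sketch below illustrates that pairing. The helper name and the example labels are hypothetical; the actual category values stored in adata_integrate.obs['_batch'] may differ.

```python
def pair_categories_with_colors(labels, colors):
    """Map each distinct label to a color, by position in sorted label order.

    Sorted order mirrors how pandas orders the categories of a categorical
    built from plain strings; scanpy then pairs categories and colors by index.
    """
    categories = sorted(set(labels))
    if len(colors) < len(categories):
        raise ValueError("need at least one color per category")
    return dict(zip(categories, colors))

# Hypothetical per-cell batch labels; the real values live in adata.obs['_batch'].
labels = ["mixed", "293t", "jurkat", "293t", "mixed"]
palette = ["#FF34FF", "#4FC601", "#3B5DFF"]
color_of = pair_categories_with_colors(labels, palette)
print(color_of)
```

Because the pairing is positional, reordering the palette list changes which batch gets which color, so it is worth checking the category order before choosing colors.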