Author: Hui Ma
Date: 2020-02-01
Notebook Source: doublet_detection.ipynb
import pegasus as pg
In this tutorial, we'll use the output result of Pegasus Tutorial to demonstrate how to detect and remove doublet cells in Pegasus. The dataset consists of human bone marrow single cells from 8 donors.
The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip. You can also use gsutil
to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip).
Now load the count matrix:
data = pg.read_input("MantonBM_result.zarr.zip")
data
2021-02-01 18:07:20,082 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded. 2021-02-01 18:07:20,083 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.76s.
MultimodalData object with 1 UnimodalData: 'GRCh38-rna' It currently binds to UnimodalData object GRCh38-rna UnimodalData object with n_obs x n_vars = 35465 x 25653 Genome: GRCh38; Modality: rna It contains 2 matrices: 'X', 'raw.X' It currently binds to matrix 'X' as X obs: 'n_genes', 'Channel', 'n_counts', 'percent_mito', 'scale', 'Group', 'louvain_labels', 'anno' var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank' obsm: 'X_diffmap', 'X_fle', 'X_pca', 'X_pca_harmony', 'X_phi', 'X_tsne', 'X_umap' varm: 'gmeans', 'gstds', 'means', 'partial_sum', 'de_res' uns: 'genome', 'louvain_resolution', 'modality', 'norm_count', 'pca_features', 'stdzn_max_value', 'Channels', 'Groups', 'PCs', 'c2gid', 'diffmap_evals', 'diffmap_knn_distances', 'diffmap_knn_indices', 'gncells', 'ncells', 'pca_harmony_knn_distances', 'pca_harmony_knn_indices', 'stdzn_mean', 'stdzn_std', 'W_diffmap', 'W_pca_harmony', 'df_qcplot', 'pca'
pg.infer_doublets(data, channel_attr = 'Channel', clust_attr = 'anno')
2021-02-01 18:07:20,238 - pegasus.tools.preprocessing - INFO - After filtration, 21108/25653 genes are kept. Among 21108 genes, 17375 genes are robust. 2021-02-01 18:07:20,239 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.12s. 2021-02-01 18:07:20,308 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s. 2021-02-01 18:07:20,324 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:20,355 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:20,355 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s. 2021-02-01 18:07:22,343 - pegasus.tools.doublet_detection - INFO - Sample MantonBM1_HiSeq_1: doublet threshold = 0.1606; total cells = 4415; neotypic doublet rate = 2.22% 2021-02-01 18:07:22,806 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.45s. 2021-02-01 18:07:23,541 - pegasus.tools.preprocessing - INFO - After filtration, 20329/25653 genes are kept. Among 20329 genes, 16739 genes are robust. 2021-02-01 18:07:23,542 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.15s. 2021-02-01 18:07:23,609 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s. 2021-02-01 18:07:23,624 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:23,654 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:23,656 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s. 2021-02-01 18:07:25,736 - pegasus.tools.doublet_detection - INFO - Sample MantonBM2_HiSeq_1: doublet threshold = 0.1401; total cells = 4935; neotypic doublet rate = 3.12% 2021-02-01 18:07:26,096 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.44s. 2021-02-01 18:07:26,843 - pegasus.tools.preprocessing - INFO - After filtration, 20231/25653 genes are kept. Among 20231 genes, 16424 genes are robust. 2021-02-01 18:07:26,844 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.15s. 2021-02-01 18:07:26,903 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.06s. 2021-02-01 18:07:26,916 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:26,945 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:26,946 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s. 2021-02-01 18:07:28,657 - pegasus.tools.doublet_detection - INFO - Sample MantonBM3_HiSeq_1: doublet threshold = 0.1495; total cells = 4225; neotypic doublet rate = 1.78% 2021-02-01 18:07:29,020 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.07s. 2021-02-01 18:07:29,716 - pegasus.tools.preprocessing - INFO - After filtration, 20593/25653 genes are kept. Among 20593 genes, 16941 genes are robust. 2021-02-01 18:07:29,716 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.16s. 2021-02-01 18:07:29,780 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.06s. 2021-02-01 18:07:29,794 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:29,822 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:29,822 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s. 2021-02-01 18:07:31,445 - pegasus.tools.doublet_detection - INFO - Sample MantonBM4_HiSeq_1: doublet threshold = 0.1453; total cells = 4172; neotypic doublet rate = 1.73% 2021-02-01 18:07:31,888 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.07s. 2021-02-01 18:07:32,648 - pegasus.tools.preprocessing - INFO - After filtration, 20955/25653 genes are kept. Among 20955 genes, 17374 genes are robust. 2021-02-01 18:07:32,649 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.18s. 2021-02-01 18:07:32,728 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.08s. 2021-02-01 18:07:32,744 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:32,775 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:32,775 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s. 2021-02-01 18:07:34,458 - pegasus.tools.doublet_detection - INFO - Sample MantonBM5_HiSeq_1: doublet threshold = 0.1651; total cells = 4090; neotypic doublet rate = 1.59% 2021-02-01 18:07:34,836 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.06s. 2021-02-01 18:07:35,570 - pegasus.tools.preprocessing - INFO - After filtration, 21035/25653 genes are kept. Among 21035 genes, 17344 genes are robust. 2021-02-01 18:07:35,571 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.19s. 2021-02-01 18:07:35,647 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.08s. 2021-02-01 18:07:35,664 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.02s. 2021-02-01 18:07:35,695 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:35,696 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s. 2021-02-01 18:07:37,572 - pegasus.tools.doublet_detection - INFO - Sample MantonBM6_HiSeq_1: doublet threshold = 0.1871; total cells = 4665; neotypic doublet rate = 1.59% 2021-02-01 18:07:37,932 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.24s. 2021-02-01 18:07:38,642 - pegasus.tools.preprocessing - INFO - After filtration, 20449/25653 genes are kept. Among 20449 genes, 16754 genes are robust. 2021-02-01 18:07:38,643 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.17s. 2021-02-01 18:07:38,703 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.06s. 2021-02-01 18:07:38,717 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:38,745 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:38,745 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s. 2021-02-01 18:07:40,644 - pegasus.tools.doublet_detection - INFO - Sample MantonBM7_HiSeq_1: doublet threshold = 0.1420; total cells = 4452; neotypic doublet rate = 3.01% 2021-02-01 18:07:41,021 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.28s. 2021-02-01 18:07:41,755 - pegasus.tools.preprocessing - INFO - After filtration, 20070/25653 genes are kept. Among 20070 genes, 16434 genes are robust. 2021-02-01 18:07:41,756 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.18s. 2021-02-01 18:07:41,821 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.06s. 2021-02-01 18:07:41,836 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s. 2021-02-01 18:07:41,866 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:41,866 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s. 2021-02-01 18:07:43,816 - pegasus.tools.doublet_detection - INFO - Sample MantonBM8_HiSeq_1: doublet threshold = 0.1923; total cells = 4511; neotypic doublet rate = 1.15% 2021-02-01 18:07:44,172 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.31s. 2021-02-01 18:07:44,715 - pegasus.tools.doublet_detection - INFO - Doublets are predicted! 2021-02-01 18:07:44,716 - pegasus.tools.doublet_detection - INFO - Function 'infer_doublets' finished in 24.62s.
Here, plot annotation and Scrublet-like doublet score.
pg.scatter(data,attrs=['anno','doublet_score'], basis='umap', wspace=1.2)
We also want to see the doublet percentage of each cluster to decide if there is a doublet cluster.
data.uns['pred_dbl_cluster']
cluster | percentage | pval | qval | |
---|---|---|---|---|
0 | CD14+ Monocyte | 4.462044 | 0.000003 | 0.000018 |
1 | Natural killer cell | 3.307692 | 0.000013 | 0.000036 |
2 | B cell | 3.275663 | 0.000004 | 0.000018 |
All clusters have doublet percentage under 5%, so no need to mark any doublet clusters here. If any cluster has doublet percentage more than $50\%$, we can consider to mark it as doublet cluster.
For example, If we want to mark 'CD14+ Monocyte' and 'CD14+ Monocyte-2' as doublet clusters, use command
pg.mark_doublets(data, dbl_clusts = 'anno:CD14+ Monocyte,CD14+ Monocyte-2')
The pg.mark_doublets command will mark doublet cluster (if any), and write singlet/doublet assignment to the "demux_type" column attribute in data.obs. The "demux_type" attribute is also used for singlet/doublet assignment of cell hashing, nucleus hashing and genetics pooling data (see documentation).
For this demonstration dataset, among 35,465 cells, 748 doublets detected. Doublet rate is $2.11\%$
pg.mark_doublets(data)
data.obs['demux_type'].value_counts()
singlet 34741 doublet 724 Name: demux_type, dtype: int64
Doublets distribution can be better observed in umap plot.
pg.scatter(data, attrs = ['anno', 'demux_type'], legend_loc = ['on data', 'right margin'],
wspace = 0.1,alpha = [1.0, 0.8], palettes = 'demux_type:gainsboro,red')
pg.qc_metrics(data,select_singlets=True)
pg.filter_data(data)
2021-02-01 18:07:55,023 - pegasusio.qc_utils - INFO - After filtration, 34741 out of 35465 cell barcodes are kept in UnimodalData object GRCh38-rna. 2021-02-01 18:07:55,024 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.43s.
Start the reclustering process from re-selecting highly variable genes. Batch effect is observed, so we also want to use harmony algorithm to correct bach effect for reclustering.
pg.highly_variable_features(data, consider_batch=True)
pg.pca(data)
pca_key = pg.run_harmony(data)
pg.neighbors(data,rep=pca_key)
pg.louvain(data,rep=pca_key)
pg.umap(data,rep=pca_key)
2021-02-01 18:07:55,224 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.19s. 2021-02-01 18:07:55,257 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2021-02-01 18:07:55,258 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.23s. 2021-02-01 18:07:59,289 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 4.03s. 2021-02-01 18:07:59,680 - pegasus.tools.batch_correction - INFO - Start integration using Harmony. Initialization is completed. Completed 1 / 10 iteration(s). Completed 2 / 10 iteration(s). Completed 3 / 10 iteration(s). Completed 4 / 10 iteration(s). Completed 5 / 10 iteration(s). Completed 6 / 10 iteration(s). Reach convergence after 6 iteration(s). 2021-02-01 18:08:18,253 - pegasus.tools.batch_correction - INFO - Function 'run_harmony' finished in 18.96s. 2021-02-01 18:08:22,957 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.70s. 2021-02-01 18:08:23,875 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.92s. 2021-02-01 18:08:25,138 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.26s. 2021-02-01 18:08:48,071 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 16 clusters. 2021-02-01 18:08:48,177 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 24.30s. 2021-02-01 18:08:48,177 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required. 2021-02-01 18:08:48,178 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s. 2021-02-01 18:08:48,188 - pegasus.tools.visualization - INFO - UMAP(min_dist=0.5, random_state=0, verbose=True) 2021-02-01 18:08:48,189 - pegasus.tools.visualization - INFO - Construct fuzzy simplicial set 2021-02-01 18:08:49,833 - pegasus.tools.visualization - INFO - Construct embedding completed 0 / 200 epochs completed 20 / 200 epochs completed 40 / 200 epochs completed 60 / 200 epochs completed 80 / 200 epochs completed 100 / 200 epochs completed 120 / 200 epochs completed 140 / 200 epochs completed 160 / 200 epochs completed 180 / 200 epochs 2021-02-01 18:09:11,139 - pegasus.tools.visualization - INFO - Function 'umap' finished in 22.96s.
Re-annotate:
pg.de_analysis(data, cluster='louvain_labels')
celltype_dict = pg.infer_cell_types(data, markers = 'human_immune',de_test='mwu',output_file='BM_celltype_re_dict.txt')
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='louvain_labels', anno_dict=cluster_names)
2021-02-01 18:09:11,984 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.8375s. 2021-02-01 18:09:18,461 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 6.4771s. 2021-02-01 18:09:18,571 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.1103s. 2021-02-01 18:09:18,669 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished. 2021-02-01 18:09:18,669 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 7.53s.
Umap of annotation after re-clustering:
pg.scatter(data,attrs='anno',legend_loc='on data',basis='umap')