pegasus.infer_doublets

pegasus.infer_doublets(data, dbl_attr='scrublet_score', channel_attr=None, clust_attr=None, n_components=50, robust=True, n_clusters=30, n_clusters2=50, n_init=10, min_avg_cells_per_final_cluster=10, alpha=0.05, min_dbl_rate=0.01, random_state=0, verbose=False)

Infer doublets based on Scrublet scores.

This implementation is inspired by [Pijuan-Sala19] and [Popescu19].

Parameters
  • data (pegasusio.MultimodalData) – Annotated data matrix with rows for cells and columns for genes.

  • dbl_attr (str, optional, default: 'scrublet_score') – Attribute holding the doublet scores calculated by Scrublet.

  • channel_attr (str, optional, default: None) – Attribute indicating sample channels. If None, assume that cell-level and sample-level doublets have already been calculated and saved in data.obs['pred_dbl_type'].

  • clust_attr (str, optional, default: None) – Attribute indicating cluster labels. If None, cluster-level doublet detection is not performed.

  • n_components (int, optional, default: 50) – Number of principal components (PCs) for sample-level doublet inference. Note that all genes (sparse matrix) and truncated SVD are used to infer the PCs. Because truncated SVD does not subtract the means and the first component correlates with the mean, n_components + 1 is passed to sklearn.decomposition.TruncatedSVD.

  • robust (bool, optional, default: True) – If True, use algorithm = 'arpack'; otherwise, use algorithm = 'randomized'.

  • n_clusters (int, optional, default: 30) – The number of first-level clusters.

  • n_clusters2 (int, optional, default: 50) – The number of second-level clusters.

  • n_init (int, optional, default: 10) – Number of k-means initializations for the first-level clustering. The default is set to match that of scikit-learn's KMeans.

  • min_avg_cells_per_final_cluster (int, optional, default: 10) – Expected number of cells per cluster after the two-level KMeans.

  • alpha (float, optional, default: 0.05) – FDR significance level for statistical tests.

  • min_dbl_rate (float, optional, default: 0.01) – Minimum expected doublet rate for one channel. In some cases, the algorithm would iterate until no doublets are detected, which contradicts the expectation that at least a small percentage of cells are doublets. With this parameter, the algorithm stops before the detected ratio of doublets falls below min_dbl_rate.

  • random_state (int, optional, default: 0) – Random seed for reproducing results.

  • verbose (bool, optional, default: False) – If True, Pegasus generates a density plot for each channel in the working directory, named channel.dbl.png. In addition, diagnostic outputs are generated.
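The interaction between n_components and robust described above can be sketched directly with scikit-learn. This is a minimal illustration, not the Pegasus implementation: the matrix X is synthetic stand-in data, and the correlation check at the end is only there to show why one extra component is requested.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

n_components = 50  # value a caller would pass to infer_doublets
robust = True      # mirrors the `robust` flag

# Synthetic stand-in for a sparse cell-by-gene count matrix (assumption).
rng = np.random.default_rng(0)
X = csr_matrix(rng.poisson(0.3, size=(300, 200)).astype(np.float64))

# Truncated SVD does not center the data, so one extra component is requested.
svd = TruncatedSVD(
    n_components=n_components + 1,
    algorithm="arpack" if robust else "randomized",
    random_state=0,
)
embedding = svd.fit_transform(X)
print(embedding.shape)  # (300, 51)

# On uncentered data the first component tends to track the per-cell mean.
row_means = np.asarray(X.mean(axis=1)).ravel()
corr = np.corrcoef(embedding[:, 0], row_means)[0, 1]
print(abs(corr))
```

With algorithm='arpack', scikit-learn requires the number of components to be strictly smaller than both matrix dimensions, which is worth keeping in mind for small channels.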

Return type

None

Returns

  • None

  • Updates data.obs:

    • data.obs['pred_dbl_type']: Predicted singlet/doublet types.

  • Updates data.uns:

    • data.uns['pred_dbl_cluster']: Only generated if clust_attr is not None. A DataFrame with two columns, 'Cluster' and 'Qval'. Only clusters with significantly more doublets than expected are recorded here.
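To make the role of alpha and the 'Qval' column concrete, the sketch below applies a Benjamini-Hochberg adjustment to a set of per-cluster p-values and keeps the clusters whose q-value falls below alpha. The documentation above does not state which FDR procedure Pegasus uses, so Benjamini-Hochberg is an assumption here, and the cluster names and p-values are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg q-values in the original input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


alpha = 0.05  # same meaning as the `alpha` parameter

# Hypothetical per-cluster p-values (not produced by Pegasus).
cluster_pvals = {"1": 0.001, "2": 0.04, "3": 0.03, "4": 0.6}
names = list(cluster_pvals)
qvals = benjamini_hochberg([cluster_pvals[n] for n in names])

# Keep only clusters that pass the FDR threshold.
significant = {n: q for n, q in zip(names, qvals) if q < alpha}
print(significant)
```

In this toy input only cluster "1" survives the threshold; clusters "2" and "3" have raw p-values below 0.05 but adjusted q-values above it.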

Examples

>>> import pegasus as pg
>>> pg.infer_doublets(data, channel_attr='Channel', clust_attr='Annotation')