pegasus.infer_doublets

pegasus.infer_doublets(data, channel_attr=None, clust_attr=None, min_cell=100, expected_doublet_rate=None, sim_doublet_ratio=2.0, n_prin_comps=30, k=None, n_jobs=-1, alpha=0.05, random_state=0, plot_hist='sample', manual_correction=None)

Infer doublets by first calculating Scrublet-like [Wolock18] doublet scores and then automatically determining an appropriate doublet score cutoff [Li20-2].

This function should be called after clustering if clust_attr is not None. In this case, we will test if each cluster is significantly enriched for doublets using Fisher’s exact test.
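The cluster-level test can be pictured with a small, standard-library-only sketch. It is not the Pegasus implementation: the one-sided alternative, the way the contingency counts are built, and the Benjamini-Hochberg correction are assumptions made for illustration, and the cluster counts are made up.

```python
from math import comb

def fisher_enrichment_pvalue(dbl_in, total_in, dbl_all, total_all):
    """One-sided Fisher's exact test p-value, P(X >= dbl_in).

    X is the number of doublets among `total_in` cells drawn without
    replacement from `total_all` cells, `dbl_all` of which are doublets.
    """
    upper = min(total_in, dbl_all)
    denom = comb(total_all, total_in)
    tail = sum(
        comb(dbl_all, x) * comb(total_all - dbl_all, total_in - x)
        for x in range(dbl_in, upper + 1)
    )
    return tail / denom

def benjamini_hochberg(pvals):
    """Convert p-values to FDR q-values (Benjamini-Hochberg step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# Hypothetical counts: cluster -> (predicted doublets, cells in cluster).
clusters = {"A": (5, 100), "B": (30, 100), "C": (5, 100)}
dbl_all = sum(d for d, _ in clusters.values())
total_all = sum(n for _, n in clusters.values())

names = list(clusters)
pvals = [fisher_enrichment_pvalue(*clusters[c], dbl_all, total_all) for c in names]
qvals = benjamini_hochberg(pvals)

alpha = 0.05
significant = [c for c, q in zip(names, qvals) if q < alpha]
print(significant)
```

In this toy data only cluster B carries far more doublets than its share of cells would suggest, so it is the only cluster that passes the alpha threshold.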

Parameters
  • data (pegasusio.MultimodalData) – Annotated data matrix with rows for cells and columns for genes.

  • channel_attr (str, optional, default: None) – Attribute indicating sample channels. If set, calculate Scrublet-like doublet scores per channel.

  • clust_attr (str, optional, default: None) – Attribute indicating cluster labels. If set, estimate the proportion of doublets in each cluster and its statistical significance.

  • min_cell (int, optional, default: 100) – Minimum number of cells per sample required to calculate doublet scores. For samples with fewer than min_cell cells, doublet score calculation is skipped.

  • expected_doublet_rate (float, optional, default: None) – The expected doublet rate for the experiment. If None, calculate the expected rate from the number of cells, using the 10x multiplet rate table.

  • sim_doublet_ratio (float, optional, default: 2.0) – The ratio between synthetic doublets and observed cells.

  • n_prin_comps (int, optional, default: 30) – Number of principal components.

  • k (int, optional, default: None) – Number of observed cell neighbors. If None, k = round(0.5 * sqrt(number of observed cells)). Total neighbors k_adj = round(k * (1.0 + sim_doublet_ratio)).

  • n_jobs (int, optional, default: -1) – Number of threads to use. If -1, use all physical CPU cores.

  • alpha (float, optional, default: 0.05) – FDR significance level for the cluster-level Fisher's exact test.

  • random_state (int, optional, default: 0) – Random seed for reproducing results.

  • plot_hist (str, optional, default: sample) – If not None, plot diagnostic histograms using plot_hist as the file name prefix. If channel_attr is None, plot_hist.dbl.png is generated; otherwise, one plot_hist.channel_name.dbl.png file is generated per channel. Each figure consists of 4 panels: a histogram of doublet scores for observed cells (panel 1, density in log scale), a histogram of doublet scores for simulated doublets (panel 2, density in log scale), a KDE plot (panel 3) and a signed curvature plot (panel 4) of log doublet scores for simulated doublets.

  • manual_correction (str, optional, default: None) – Manually guide the doublet threshold for certain channels. This is a string representing a comma-separated list. Each item in the list represents one sample, with the sample name and correction guide separated by ‘:’. The only correction guide supported is ‘peak’, which means cut at the center of the peak. If only one sample is available, use ‘’ as the sample name.
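The neighbor counts described for k can be worked through by hand. The helper below is a hypothetical illustration that only restates the two formulas given above; it is not part of the Pegasus API, and it uses Python's built-in round, which may treat exact ties differently from the library.

```python
import math

def neighbor_counts(n_obs_cells, sim_doublet_ratio=2.0, k=None):
    """Return (k, k_adj) following the formulas in the parameter list.

    Hypothetical helper for illustration only.
    """
    if k is None:
        # Default: half the square root of the number of observed cells.
        k = round(0.5 * math.sqrt(n_obs_cells))
    # Total neighbors, scaled up to account for the simulated doublets.
    k_adj = round(k * (1.0 + sim_doublet_ratio))
    return k, k_adj

print(neighbor_counts(10000))        # (50, 150)
print(neighbor_counts(400))          # (10, 30)
print(neighbor_counts(400, k=20))    # (20, 60)
```

With the default sim_doublet_ratio of 2.0, the total neighborhood is three times the number of observed-cell neighbors, since there are twice as many simulated doublets as observed cells.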

Return type

None

Returns

  • None

  • Update data.obs and data.uns:

    • data.obs['pred_dbl_type']: Predicted singlet/doublet types.

    • data.uns['pred_dbl_cluster']: Only generated if ‘clust_attr’ is not None. This is a dataframe with two columns, ‘Cluster’ and ‘Qval’. Only clusters with significantly more doublets than expected will be recorded here.

Examples

>>> import pegasus as pg
>>> pg.infer_doublets(data, channel_attr='Channel', clust_attr='Annotation')
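The manual_correction string can be assembled programmatically. The sketch below is illustrative only: build_manual_correction is a hypothetical helper, the sample names are made up, and the single-sample form ':peak' is inferred from the parameter description rather than confirmed against the library.

```python
def build_manual_correction(guides):
    """Join {sample_name: guide} into 'name1:guide1,name2:guide2'.

    Hypothetical helper; 'peak' is the only guide the documentation lists.
    """
    for name, guide in guides.items():
        if guide != "peak":
            raise ValueError(f"Unsupported correction guide for {name!r}: {guide!r}")
    return ",".join(f"{name}:{guide}" for name, guide in guides.items())

# Two hypothetical channels whose thresholds should be cut at the peak center.
corr = build_manual_correction({"sample_1": "peak", "sample_3": "peak"})
print(corr)  # sample_1:peak,sample_3:peak

# Single-sample data: the sample name is the empty string.
single = build_manual_correction({"": "peak"})
print(single)  # :peak

# The string would then be passed as, for example:
# pg.infer_doublets(data, channel_attr='Channel', manual_correction=corr)
```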