pegasus.calc_signature_score

pegasus.calc_signature_score(data, signatures, n_bins=50, show_omitted_genes=False, random_state=0)

Calculate signature / gene module score. [Li20-1]

This is an improved version of the implementation in [Jerby-Arnon18].

Parameters
  • data (MultimodalData, UnimodalData, or anndata.AnnData object.) – Single cell expression data.

  • signatures (Dict[str, List[str]] or str) –

    This argument accepts either a dictionary or a string. If signatures is a dictionary, it can contain multiple signature score calculation requests. Each key in the dictionary represents a separate signature score calculation, and its corresponding value is a list of gene symbols. Each score is stored in data.obs under its dictionary key. If signatures is a string, it should refer to a Gene Matrix Transposed (GMT)-formatted file, and Pegasus will load the signatures from that file.

    Pegasus also provides 5 default signature panels for each of human and mouse. They are cell_cycle_human, gender_human, mitochondrial_genes_human, ribosomal_genes_human and apoptosis_human for human; cell_cycle_mouse, gender_mouse, mitochondrial_genes_mouse, ribosomal_genes_mouse and apoptosis_mouse for mouse.
    • cell_cycle_human contains two cell-cycle signatures, G1/S and G2/M, obtained from Tirosh et al. 2016, with gene symbols updated according to Seurat’s cc.genes.updated.2019 vector. Two additional signature scores are calculated: cycling, which is max{G2/M, G1/S}, and cycle_diff, which is G2/M - G1/S. Predicted cell cycle phases are stored in data.obs['predicted_phase'] and are derived from the G1/S and G2/M scores as follows. First, G0 cells are identified by applying the KMeans algorithm to the cycling signature to obtain 2 clusters; G0 cells are those in the cluster with the smallest mean value. Then, each cell in the other cluster is labeled G1/S if G1/S > G2/M, and G2/M otherwise.

    • gender_human contains two gender-specific signatures, female_score and male_score. Genes were selected by DE analysis between genders using 8 channels of bone marrow data from HCA Census of Immune Cells and the brain nuclei data from Gaublomme and Li et al, 2019, Nature Communications. Three signature scores are calculated: female_score, male_score and gender_score. female_score and male_score are calculated from the female and male signatures respectively, and a larger score represents a higher likelihood of that gender. gender_score is calculated as male_score - female_score; a large positive score likely represents male and a large negative score likely represents female. Pegasus also provides a predicted gender for each cell based on gender_score, stored in data.obs['predicted_gender']. To predict genders, we apply the KMeans algorithm to gender_score and ask for 3 clusters. The clusters with the minimum and maximum cluster centers are predicted as female and male respectively, and the cluster in the middle is predicted as uncertain. Note that this approach is conservative: users can likely predict genders from gender_score for cells in the uncertain cluster with reasonable accuracy.

    • mitochondrial_genes_human contains two signatures, mito_genes and mito_ribo. mito_genes contains 13 mitochondrial genes from chrM and mito_ribo contains mitochondrial ribosomal genes that are not from chrM. Note that mito_genes correlates well with the percent of mitochondrial UMIs, whereas mito_ribo does not.

    • ribosomal_genes_human contains one signature, ribo_genes, which includes ribosomal genes from both the large and small subunits.

    • apoptosis_human contains one signature, apoptosis, which includes apoptosis-related genes from the KEGG pathway.

    • cell_cycle_mouse, gender_mouse, mitochondrial_genes_mouse, ribosomal_genes_mouse and apoptosis_mouse are the corresponding signatures for mouse. Gene symbols are directly translated from human genes.

  • n_bins (int, optional, default: 50) – Number of bins on expression levels for grouping genes.

  • show_omitted_genes (bool, optional, default: False) – Signature genes that are not expressed in the data will be omitted. By default, Pegasus does not report which genes are omitted. If this option is turned on, the omitted genes are reported.

  • random_state (int, optional, default: 0) – Random state used by KMeans when signatures is gender_human or gender_mouse.
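
The KMeans-based labeling rules described above for cell_cycle_human and gender_human can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Pegasus's implementation: kmeans_1d is a hypothetical, deterministic stand-in for a library KMeans (so it takes no random_state), and predict_phase and predict_gender are illustrative names that operate on plain per-cell score lists rather than on data.obs.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int, n_iter: int = 100) -> Tuple[List[int], List[float]]:
    """Tiny 1-D Lloyd's k-means (hypothetical stand-in for a library KMeans).

    Centers are initialized at evenly spaced quantiles, so no random state is needed.
    Returns per-value cluster labels and the cluster centers.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels: List[int] = []
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Recompute each center as the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels, centers


def predict_phase(g1s: Sequence[float], g2m: Sequence[float]) -> List[str]:
    """Label cells G0, G1/S or G2/M following the rule described for cell_cycle_human."""
    cycling = [max(a, b) for a, b in zip(g1s, g2m)]  # cycling = max{G2/M, G1/S}
    labels, centers = kmeans_1d(cycling, k=2)
    g0_cluster = centers.index(min(centers))  # cluster with the smallest mean
    phases = []
    for lab, a, b in zip(labels, g1s, g2m):
        if lab == g0_cluster:
            phases.append("G0")
        else:
            phases.append("G1/S" if a > b else "G2/M")
    return phases


def predict_gender(female_score: Sequence[float], male_score: Sequence[float]) -> List[str]:
    """Label cells female, male or uncertain following the rule described for gender_human."""
    gender_score = [m - f for f, m in zip(female_score, male_score)]
    labels, centers = kmeans_1d(gender_score, k=3)
    female_cluster = centers.index(min(centers))  # minimum cluster center
    male_cluster = centers.index(max(centers))    # maximum cluster center
    names = {female_cluster: "female", male_cluster: "male"}
    return [names.get(lab, "uncertain") for lab in labels]


# Toy scores for a handful of cells (made-up numbers for illustration only).
phases = predict_phase([0.0, 0.05, 0.9, 0.2, 0.1], [0.05, 0.0, 0.1, 1.0, 0.02])
genders = predict_gender([1.0, 0.9, 0.1, 0.0, 0.5, 0.4], [0.0, 0.1, 1.0, 0.9, 0.5, 0.45])
```

In this sketch the middle cluster is left as uncertain rather than being forced into either gender, which mirrors the conservative behavior noted for gender_human.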

Return type

None

Returns

  • None.

  • Update data.obs

    • data.obs["key"]: signature / gene module score for signature “key”

  • Update data.var

    • data.var["mean"]: Mean expression of each gene across all cells. Only updated if “mean” does not exist in data.var.

    • data.var["bins"]: Bin category for each gene. Only updated if data.uns["sig_n_bins"] is updated.

  • Update data.obsm

    • data.obsm["sig_background"]: Expected signature score for each bin category. Only updated if data.uns["sig_n_bins"] is updated.

  • Update data.uns

    • data.uns["sig_n_bins"]: Number of bins to partition genes into. Only updated if “sig_n_bins” does not exist or the recorded number of bins does not match n_bins.

Examples

>>> pg.calc_signature_score(data, {"T_cell_sig": ["CD3D", "CD3E", "CD3G", "TRAC"]})
>>> pg.calc_signature_score(data, "cell_cycle_human")
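
The signatures argument accepts either a dictionary or the path to a GMT file. The sketch below shows how the two forms relate, assuming the standard GMT layout of one signature per line with tab-separated fields: name, description, then gene symbols. write_gmt and read_gmt are hypothetical helpers written for illustration, not Pegasus functions, and the B_cell_sig entry is a made-up second signature.

```python
import os
import tempfile
from typing import Dict, List


def write_gmt(signatures: Dict[str, List[str]], path: str, description: str = "NA") -> None:
    """Write one signature per line: name, description, then gene symbols (tab-separated)."""
    with open(path, "w") as fout:
        for name, genes in signatures.items():
            fout.write("\t".join([name, description, *genes]) + "\n")


def read_gmt(path: str) -> Dict[str, List[str]]:
    """Parse a GMT file into the dictionary form accepted by the signatures argument."""
    signatures: Dict[str, List[str]] = {}
    with open(path) as fin:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank lines and lines without any genes
            name, _description, *genes = fields
            signatures[name] = [gene for gene in genes if gene]
    return signatures


signatures = {
    "T_cell_sig": ["CD3D", "CD3E", "CD3G", "TRAC"],
    "B_cell_sig": ["CD79A", "CD79B", "MS4A1"],  # illustrative second signature
}

with tempfile.TemporaryDirectory() as tmp_dir:
    gmt_path = os.path.join(tmp_dir, "signatures.gmt")
    write_gmt(signatures, gmt_path)
    loaded = read_gmt(gmt_path)
    # Per the signatures parameter description, either form could be passed:
    #   pg.calc_signature_score(data, signatures)
    #   pg.calc_signature_score(data, gmt_path)
```

Each key of the dictionary (or each signature name in the GMT file) becomes the data.obs column that holds the corresponding score.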