Pegasus for Single Cell Analysis¶
Pegasus is a tool for analyzing transcriptomes of millions of single cells. It is a command line tool, a python package and a base for Cloud-based analysis workflows.
Release Highlights in Current Stable¶
1.8.1 August 23, 2023¶
Bug fix in cell marker JSON files for infer_cell_types function.
1.8.0 July 21, 2023¶
New Feature and Improvement
Update human_immune and human_lung marker sets.
Add mouse_liver marker set.
Add split_one_cluster function to subcluster one cluster into a specified number of subclusters.
Update neighbors function to set use_cache=False by default, and adjust K to min(K, int(sqrt(n_samples))). [PR 272]
In infer_doublets function, argument manual_correction now accepts a float number threshold specified by users for cut-off. [PR 275]
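The K adjustment in the neighbors update can be illustrated with a small sketch. The helper name effective_k is hypothetical and is not part of the Pegasus API; it only reproduces the formula min(K, int(sqrt(n_samples))) quoted above.

```python
import math


def effective_k(k: int, n_samples: int) -> int:
    """Cap the requested neighbor count at int(sqrt(n_samples)).

    Hypothetical helper: it mirrors the formula from the release note,
    not the actual Pegasus implementation.
    """
    return min(k, int(math.sqrt(n_samples)))


# With 2,500 cells, sqrt(2500) = 50, so a request for K=100 is capped.
print(effective_k(100, 2500))
# With 1,000,000 cells, sqrt is 1000, so K=100 is kept unchanged.
print(effective_k(100, 1_000_000))
```

The cap only matters for small datasets, where the requested K would otherwise cover a large fraction of all cells.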
Bug Fix
Fix divide by zero issue in integrative_nmf function. [PR 258]
Compatibility with Pandas v2.0. [PR 261]
Allow infer_doublets to use any count matrix with key name specified by users. [PR 268 Thanks to Donghoon Lee]
Installation¶
Pegasus works with Python 3.7, 3.8 and 3.9.
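As a quick pre-installation check, the sketch below tests whether an interpreter version is one of the three listed above. The names SUPPORTED and is_supported are illustrative only and do not come from Pegasus.

```python
import sys

# Python versions listed as supported in this document.
SUPPORTED = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) pair is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED


if is_supported():
    print("This interpreter is in the supported list.")
else:
    print("This interpreter is not in the supported list.")
```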
Linux¶
Ubuntu/Debian¶
On Ubuntu/Debian Linux, first install the following dependency by:
sudo apt install build-essential
Next, you can install Pegasus system-wide by PyPI (see Ubuntu/Debian install via PyPI), or within a Miniconda environment (see Install via Conda).
To use the Force-directed-layout (FLE) embedding feature, you’ll need Java. You can either install Oracle JDK, or install OpenJDK which is included in Ubuntu official repository:
sudo apt install default-jdk
First, install Python 3, pip tool for Python 3 and Cython package:
sudo apt install python3 python3-pip
python3 -m pip install --upgrade pip
python3 -m pip install cython
Now install Pegasus with the required dependencies via pip:
python3 -m pip install pegasuspy
or install Pegasus with all dependencies:
python3 -m pip install pegasuspy[all]
Alternatively, you can install Pegasus with some of the additional optional dependencies as below:
torch: This includes harmony-pytorch for data integration and nmf-torch for NMF and iNMF data integration, both of which use PyTorch:
python3 -m pip install pegasuspy[torch]
louvain: This includes the louvain package, which provides the Louvain clustering algorithm in addition to the default Leiden algorithm in Pegasus:
python3 -m pip install pegasuspy[louvain]
Note
If you are installing louvain under Python 3.9, first install the following packages system-wide so that it can be compiled locally:
sudo apt install flex bison libtool
tsne: This installs the package used to compute t-SNE plots with the fast FIt-SNE algorithm:
sudo apt install libfftw3-dev
python3 -m pip install pegasuspy[tsne]
forceatlas: This includes the forceatlas2-python package, a multi-threaded Force Atlas 2 implementation for trajectory analysis:
python3 -m pip install pegasuspy[forceatlas]
scanorama: This includes the scanorama package, a widely used method for batch correction:
python3 -m pip install pegasuspy[scanorama]
mkl: This includes the mkl package, which improves math routines for science and engineering applications. Note that mkl is not included in pegasuspy[all] above:
python3 -m pip install pegasuspy[mkl]
rpy2: This includes the rpy2 package, which is used by Pegasus wrappers around R functions such as fgsea and DESeq2:
python3 -m pip install pegasuspy[rpy2]
scvi: This includes the scvi-tools package for data integration:
python3 -m pip install pegasuspy[scvi]
Fedora¶
On Fedora Linux, first install the following dependency by:
sudo dnf install gcc gcc-c++
Next, you can install Pegasus system-wide by PyPI (see Fedora install via PyPI), or within a Miniconda environment (see Install via Conda).
To use the Force-directed-layout (FLE) embedding feature, you’ll need Java. You can either install Oracle JDK, or install OpenJDK, which is included in the official Fedora repository (e.g. java-latest-openjdk):
sudo dnf install java-latest-openjdk
or choose another OpenJDK version from the search results of this command:
dnf search openjdk
We’ll use Python 3.8 in this tutorial.
First, install Python 3 and pip tool for Python 3:
sudo dnf install python3.8
python3.8 -m ensurepip --user
python3.8 -m pip install --upgrade pip
Now install Pegasus with the required dependencies via pip:
python3.8 -m pip install pegasuspy
or install Pegasus with all dependencies:
python3.8 -m pip install pegasuspy[all]
Alternatively, you can install Pegasus with some of the additional optional dependencies as below:
torch: This includes harmony-pytorch for data integration and nmf-torch for NMF and iNMF data integration, both of which use PyTorch:
python3.8 -m pip install pegasuspy[torch]
louvain: This includes the louvain package, which provides the Louvain clustering algorithm in addition to the default Leiden algorithm in Pegasus:
python3.8 -m pip install pegasuspy[louvain]
Note
If you are installing louvain under Python 3.9, first install the following packages system-wide so that it can be compiled locally:
sudo dnf install flex bison libtool
tsne: This installs the package used to compute t-SNE plots with the fast FIt-SNE algorithm:
sudo dnf install fftw-devel
python3.8 -m pip install pegasuspy[tsne]
forceatlas: This includes the forceatlas2-python package, a multi-threaded Force Atlas 2 implementation for trajectory analysis:
python3.8 -m pip install pegasuspy[forceatlas]
scanorama: This includes the scanorama package, a widely used method for batch correction:
python3.8 -m pip install pegasuspy[scanorama]
mkl: This includes the mkl package, which improves math routines for science and engineering applications. Note that mkl is not included in pegasuspy[all] above:
python3.8 -m pip install pegasuspy[mkl]
rpy2: This includes the rpy2 package, which is used by Pegasus wrappers around R functions such as fgsea and DESeq2:
python3.8 -m pip install pegasuspy[rpy2]
scvi: This includes the scvi-tools package for data integration:
python3.8 -m pip install pegasuspy[scvi]
macOS¶
Prerequisites¶
First, install Homebrew by following the instructions on its website: https://brew.sh/. Then install the following dependency:
brew install libomp
Then install the macOS command line tools:
xcode-select --install
Next, you can install Pegasus system-wide by PyPI (see macOS installation via PyPI), or within a Miniconda environment (see Install via Conda).
To use the Force-directed-layout (FLE) embedding feature, you’ll need Java. You can either install Oracle JDK, or install OpenJDK via Homebrew:
brew install java
macOS install via PyPI¶
You need to install Python and the pip tool first:
brew install python3
python3 -m pip install --upgrade pip
Now install Pegasus with required dependencies via pip:
python3 -m pip install pegasuspy
or install Pegasus with all dependencies:
python3 -m pip install pegasuspy[all]
Alternatively, you can install Pegasus with some of the additional optional dependencies as below:
torch: This includes harmony-pytorch for data integration and nmf-torch for NMF and iNMF data integration, both of which use PyTorch:
python3 -m pip install pegasuspy[torch]
louvain: This includes the louvain package, which provides the Louvain clustering algorithm in addition to the default Leiden algorithm in Pegasus:
python3 -m pip install pegasuspy[louvain]
tsne: This installs the package used to compute t-SNE plots with the fast FIt-SNE algorithm:
brew install fftw
python3 -m pip install pegasuspy[tsne]
forceatlas: This includes the forceatlas2-python package, a multi-threaded Force Atlas 2 implementation for trajectory analysis:
python3 -m pip install pegasuspy[forceatlas]
scanorama: This includes the scanorama package, a widely used method for batch correction:
python3 -m pip install pegasuspy[scanorama]
mkl: This includes the mkl package, which improves math routines for science and engineering applications. Note that mkl is not included in pegasuspy[all] above:
python3 -m pip install pegasuspy[mkl]
rpy2: This includes the rpy2 package, which is used by Pegasus wrappers around R functions such as fgsea and DESeq2:
python3 -m pip install pegasuspy[rpy2]
scvi: This includes the scvi-tools package for data integration:
python3 -m pip install pegasuspy[scvi]
Install via Conda¶
Alternatively, you can install Pegasus via Conda, which places it in a separate virtual environment without touching your system-wide packages and settings.
You can install Anaconda, or Miniconda (a minimal installer of conda). In this tutorial, we’ll use Miniconda.
Download Miniconda installer for your OS. For example, if on 64-bit Linux, then use the following commands to install Miniconda:
export CONDA_PATH=/home/foo
bash Miniconda3-latest-Linux-x86_64.sh -p $CONDA_PATH/miniconda3
mv Miniconda3-latest-Linux-x86_64.sh $CONDA_PATH/miniconda3
source ~/.bashrc
where /home/foo should be replaced by the directory in which you want to install Miniconda. The steps are similar on macOS.
Create a conda environment for pegasus. This tutorial uses pegasus as the environment name, but you are free to choose your own:
conda create -n pegasus -y python=3.8
Also notice that Python 3.8 is used in this tutorial. To choose a different version of Python, simply change the version number in the command above.
Enter the pegasus environment by activating it:
conda activate pegasus
Install Pegasus via conda:
conda install -c bioconda pegasuspy
(Optional) Use the following command to add support for nmf-torch:
pip install nmf-torch
Enable Force Atlas 2 for trajectory analysis:
conda install -c bioconda forceatlas2-python
Enable support for scanorama:
conda install -c bioconda scanorama
Enable support for the fgsea and deseq2 packages:
conda install -c bioconda rpy2 bioconductor-fgsea bioconductor-deseq2
Enable support for scvi-tools:
conda install -c conda-forge scvi-tools
Install via Singularity¶
Singularity is a container engine similar to Docker. Its main difference from Docker is that Singularity can be used with unprivileged permissions.
Note
Please notice that Singularity Hub has been offline since April 26th, 2021 (see blog post). All existing containers held there are in archive, and we can no longer push new builds.
So if you fetch the container from Singularity Hub using the following command:
singularity pull shub://klarman-cell-observatory/pegasus
you will only get a Singularity container of Pegasus v1.2.0, built on an Ubuntu Linux 20.04 base with Python 3.8. The file is named pegasus_latest.sif and is about 2.4 GB.
On your local machine, first install Singularity, then you can use our Singularity spec file to build a Singularity container by yourself:
singularity build pegasus.sif Singularity
where Singularity is the spec filename.
After that, you can interact with it by running the following command:
singularity run pegasus.sif
Please refer to Singularity image interaction guide for details.
Development Version¶
To install the Pegasus development version directly from its GitHub repository, follow these steps:
Install prerequisite libraries as mentioned in above sections.
Install Git. See here for how to install Git.
Use git to fetch repository source code, and install from it:
git clone https://github.com/lilab-bcb/pegasus.git
cd pegasus
pip install -e .[all]
where the -e option of pip installs the package in editable mode, so that your Pegasus installation is updated automatically whenever the source code is modified.
Use Pegasus as a command line tool¶
Pegasus can be used as a command line tool. Type:
pegasus -h
to see the help information:
Usage:
pegasus <command> [<args>...]
pegasus -h | --help
pegasus -v | --version
pegasus has 9 sub-commands in 6 groups.
Preprocessing:
- aggregate_matrix
Aggregate sample count matrices into a single count matrix. It also enables users to import metadata into the count matrix.
Demultiplexing:
- demuxEM
Demultiplex cells/nuclei based on DNA barcodes for cell-hashing and nuclei-hashing data.
Analyzing:
- cluster
Perform first-pass analysis using the count matrix generated from ‘aggregate_matrix’. This subcommand could perform low quality cell filtration, batch correction, variable gene selection, dimension reduction, diffusion map calculation, graph-based clustering, visualization. The final results will be written into zarr-formatted file, or h5ad file, which Seurat could load.
- de_analysis
Detect markers for each cluster by performing differential expression analysis per cluster (within cluster vs. outside cluster). DE tests include Welch’s t-test, Fisher’s exact test, Mann-Whitney U test. It can also calculate AUROC values for each gene.
- find_markers
Find markers for each cluster by training classifiers using LightGBM.
- annotate_cluster
This subcommand is used to automatically annotate cell types for each cluster based on existing markers. Currently, it works for human/mouse immune/brain cells, etc.
Plotting:
- plot
Make static plots, which includes plotting tSNE, UMAP, and FLE embeddings by cluster labels and different groups.
Web-based visualization:
- scp_output
Generate output files for single cell portal.
MISC:
- check_indexes
Check CITE-Seq/hashing indexes to avoid index collision.
Quick guide¶
Suppose you have example.csv ready with the following contents:
Sample,Source,Platform,Donor,Reference,Location
sample_1,bone_marrow,NextSeq,1,GRCh38,/my_dir/sample_1/raw_feature_bc_matrices.h5
sample_2,bone_marrow,NextSeq,2,GRCh38,/my_dir/sample_2/raw_feature_bc_matrices.h5
sample_3,pbmc,NextSeq,1,GRCh38,/my_dir/sample_3/raw_gene_bc_matrices_h5.h5
sample_4,pbmc,NextSeq,2,GRCh38,/my_dir/sample_4/raw_gene_bc_matrices_h5.h5
You want to analyze all four samples but correct batch effects for bone marrow and pbmc samples separately. You can run the following commands:
pegasus aggregate_matrix --attributes Source,Platform,Donor example.csv example.aggr
pegasus cluster -p 20 --correct-batch-effect --batch-group-by Source --louvain --umap example.aggr.zarr.zip example
pegasus de_analysis -p 20 --labels louvain_labels example.zarr.zip example.de.xlsx
pegasus annotate_cluster example.zarr.zip example.anno.txt
pegasus plot compo --groupby louvain_labels --condition Donor example.zarr.zip example.composition.pdf
pegasus plot scatter --basis umap --attributes louvain_labels,Donor example.zarr.zip example.umap.pdf
The above analysis will give you UMAP embedding and Louvain cluster labels in example.zarr.zip, along with differential expression analysis results stored in example.de.xlsx, and putative cluster-specific cell type annotation stored in example.anno.txt.
You can investigate donor-specific effects by looking at example.composition.pdf.
example.umap.pdf shows the UMAP colored by louvain_labels and by Donor, side by side.
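To see which samples fall into the same batch-correction group under --batch-group-by Source, the sketch below groups the rows of the example sample sheet by one attribute. It does not call Pegasus; the helper group_samples is a hypothetical name introduced only for illustration.

```python
import csv
import io
from collections import defaultdict

# The example.csv contents from the quick guide above.
SAMPLE_SHEET = """\
Sample,Source,Platform,Donor,Reference,Location
sample_1,bone_marrow,NextSeq,1,GRCh38,/my_dir/sample_1/raw_feature_bc_matrices.h5
sample_2,bone_marrow,NextSeq,2,GRCh38,/my_dir/sample_2/raw_feature_bc_matrices.h5
sample_3,pbmc,NextSeq,1,GRCh38,/my_dir/sample_3/raw_gene_bc_matrices_h5.h5
sample_4,pbmc,NextSeq,2,GRCh38,/my_dir/sample_4/raw_gene_bc_matrices_h5.h5
"""


def group_samples(sheet_text: str, attr: str) -> dict:
    """Map each value of column `attr` to the sample names that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        groups[row[attr]].append(row["Sample"])
    return dict(groups)


batch_groups = group_samples(SAMPLE_SHEET, "Source")
for source, samples in batch_groups.items():
    print(source, samples)
```

Here the two bone_marrow samples form one group and the two pbmc samples form another, which matches the intent of correcting batch effects for the two sources separately.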
pegasus aggregate_matrix¶
The first step for single cell analysis is to generate one count matrix from cellranger’s channel-specific count matrices. pegasus aggregate_matrix allows aggregating arbitrary matrices with the help of a CSV file.
Type:
pegasus aggregate_matrix -h
to see the usage information:
Usage:
pegasus aggregate_matrix <csv_file> <output_name> [--restriction <restriction>... --attributes <attributes> --default-reference <reference> --select-only-singlets --min-genes <number>]
pegasus aggregate_matrix -h
Arguments:
- csv_file
Input csv-formatted file containing information of each sc/snRNA-seq sample. This file must contain at least 2 columns: Sample, the sample name; and Location, the location of the sample count matrix in either 10x v2/v3, DGE, mtx, csv, tsv or loom format. Additionally, an optional Reference column can be used to select samples generated from the same reference (e.g. mm10). If the count matrix is in either DGE, mtx, csv, tsv, or loom format, the value in this column will be used as the reference, since the count matrix file does not contain reference name information. Moreover, the Reference column can be used to aggregate count matrices generated from different genome versions or gene annotations together under a unified reference. For example, if we have one matrix generated from mm9 and the other one generated from mm10, we can write mm9_10 for these two matrices in their Reference column. Pegasus will change their references to ‘mm9_10’ and use the union of gene symbols from the two matrices as the gene symbols of the aggregated matrix. For HDF5 files (e.g. 10x v2/v3), the reference name contained in the file does not need to match the value in this column. In fact, we use this column to rename references in HDF5 files. For example, if we have two HDF5 files, one generated from mm9 and the other generated from mm10, we can set these two files’ Reference column value to ‘mm9_10’, which will rename their reference names to mm9_10, and the aggregated matrix will contain all genes from either mm9 or mm10. This renaming feature does not work if one HDF5 file contains multiple references (e.g. mm10 and GRCh38).
csv_file can optionally contain two columns, nUMI and nGene. These two columns define the minimum number of UMIs and genes for cell selection for each sample. The values in these two columns overwrite the --min-genes and --min-umis arguments. See below for an example csv:
Sample,Source,Platform,Donor,Reference,Location
sample_1,bone_marrow,NextSeq,1,GRCh38,/my_dir/sample_1/raw_feature_bc_matrices.h5
sample_2,bone_marrow,NextSeq,2,GRCh38,/my_dir/sample_2/raw_feature_bc_matrices.h5
sample_3,pbmc,NextSeq,1,GRCh38,/my_dir/sample_3/raw_gene_bc_matrices_h5.h5
sample_4,pbmc,NextSeq,2,GRCh38,/my_dir/sample_4/raw_gene_bc_matrices_h5.h5
- output_name
The output file name.
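Since csv_file must contain at least the Sample and Location columns, it can help to check the header before running the aggregation. The following is a minimal sketch using only the Python standard library; missing_columns is a hypothetical helper and is not part of Pegasus.

```python
import csv
import io

# The two columns the csv_file argument is documented to require.
REQUIRED_COLUMNS = ("Sample", "Location")


def missing_columns(sheet_text: str) -> list:
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(sheet_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]


good = "Sample,Reference,Location\ns1,GRCh38,/my_dir/s1/raw_feature_bc_matrices.h5\n"
bad = "Sample,Reference\ns1,GRCh38\n"

print(missing_columns(good))  # nothing missing
print(missing_columns(bad))   # Location is missing
```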
Options:
- --restriction <restriction>…
Select channels that satisfy all restrictions. Each restriction takes the format of name:value,…,value or name:~value,…,value, where ~ refers to not. You can specify multiple restrictions by setting this option multiple times.
- --attributes <attributes>
Specify a comma-separated list of outputted attributes. These attributes should be column names in the csv file.
- --default-reference <reference>
If sample count matrix is in either DGE, mtx, csv, tsv or loom format and there is no Reference column in the csv_file, use <reference> as the reference.
- --select-only-singlets
If we have demultiplexed data, turning on this option will make pegasus only include barcodes that are predicted as singlets.
- --remap-singlets <remap_string>
Remap singlet names using <remap_string>, where <remap_string> takes the format “new_name_i:old_name_1,old_name_2;new_name_ii:old_name_3;…”. For example, if we hashed 5 libraries from 3 samples sample1_lib1, sample1_lib2, sample2_lib1, sample2_lib2 and sample3, we can remap them to 3 samples using this string: “sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2”. In this way, the new singlet names will be in metadata field with key ‘assignment’, while the old names will be kept in metadata field with key ‘assignment.orig’.
- --subset-singlets <subset_string>
If singlets are selected, only select singlets in the <subset_string>, which takes the format “name1,name2,…”. Note that if --remap-singlets is specified, subsetting happens after remapping. For example, we can select only singlets from samples 1 and 3 using “sample1,sample3”.
- --min-genes <number>
Only keep barcodes with at least <number> expressed genes.
- --max-genes <number>
Only keep cells with less than <number> of genes.
- --min-umis <number>
Only keep cells with at least <number> of UMIs.
- --max-umis <number>
Only keep cells with less than <number> of UMIs.
- --mito-prefix <prefix>
Prefix for mitochondrial genes. If multiple prefixes are provided, separate them by comma (e.g. “MT-,mt-”).
- --percent-mito <percent>
Only keep cells with mitochondrial percent less than <percent>%. The mitochondrial filter is triggered only when both mito_prefix and percent_mito are set.
- --no-append-sample-name
Turn this option on if you do not want to append sample name in front of each sample’s barcode (concatenated using ‘-’).
- -h, --help
Print out help information.
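The <remap_string> format described under --remap-singlets can be sketched as a small parser. This is an illustration of the string format only, not the Pegasus implementation; parse_remap_string is a hypothetical name, and keeping unlisted names unchanged is an assumption inferred from the five-library example, where sample3 is not listed.

```python
def parse_remap_string(remap_string: str) -> dict:
    """Map each old singlet name to its new name.

    Entries are separated by ';', each entry is 'new_name:old1,old2,...'.
    """
    old_to_new = {}
    for entry in remap_string.split(";"):
        if not entry:
            continue  # tolerate a trailing ';'
        new_name, old_names = entry.split(":", 1)
        for old_name in old_names.split(","):
            old_to_new[old_name] = new_name
    return old_to_new


mapping = parse_remap_string(
    "sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2"
)

# Original names play the role of 'assignment.orig'; remapped names of 'assignment'.
assignment_orig = ["sample1_lib1", "sample2_lib2", "sample3"]
# Assumption: names not listed in the remap string are kept as they are.
assignment = [mapping.get(name, name) for name in assignment_orig]
print(assignment)
```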
Outputs:
- output_name.zarr.zip
A zipped Zarr file containing aggregated data.
Examples:
pegasus aggregate_matrix --restriction Source:BM,CB --restriction Individual:1-8 --attributes Source,Platform Manton_count_matrix.csv aggr_data
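The restriction syntax (name:value,…,value or name:~value,…,value) can be sketched as follows. The helpers parse_restriction and satisfies are hypothetical and match values literally; whether Pegasus expands a value such as 1-8 into a range is not stated in the option text, so this sketch treats it as a plain string.

```python
def parse_restriction(restriction: str):
    """Split 'name:v1,v2' or 'name:~v1,v2' into (name, negate, values)."""
    name, _, values = restriction.partition(":")
    negate = values.startswith("~")
    if negate:
        values = values[1:]
    return name, negate, set(values.split(","))


def satisfies(row: dict, restrictions: list) -> bool:
    """Return True only if the row satisfies every restriction."""
    for restriction in restrictions:
        name, negate, allowed = parse_restriction(restriction)
        hit = row.get(name) in allowed
        # A plain restriction needs a hit; a negated one needs a miss.
        if hit == negate:
            return False
    return True


rows = [
    {"Sample": "s1", "Source": "BM", "Platform": "NextSeq"},
    {"Sample": "s2", "Source": "CB", "Platform": "HiSeq"},
    {"Sample": "s3", "Source": "PB", "Platform": "NextSeq"},
]
selected = [r["Sample"] for r in rows if satisfies(r, ["Source:BM,CB", "Platform:~HiSeq"])]
print(selected)
```

The row values above are made up for illustration; only the restriction syntax comes from the option description.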
pegasus demuxEM¶
Demultiplex cell-hashing/nucleus-hashing data.
Type:
pegasus demuxEM -h
to see the usage information:
Usage:
pegasus demuxEM [options] <input_raw_gene_bc_matrices_h5> <input_hto_csv_file> <output_name>
pegasus demuxEM -h | --help
pegasus demuxEM -v | --version
Arguments:
- input_raw_gene_bc_matrices_h5
Input raw RNA expression matrix in 10x hdf5 format. It is important to feed raw (unfiltered) count matrix, as demuxEM uses it to estimate the background information.
- input_hto_csv_file
Input HTO (antibody tag) count matrix in CSV format.
- output_name
Output name. All outputs will use it as the prefix.
Options:
- -p <number>, --threads <number>
Number of threads. [default: 1]
- --genome <genome>
Reference genome name. If not provided, we will infer it from the expression matrix file.
- --alpha-on-samples <alpha>
The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. [default: 0.0]
- --min-num-genes <number>
We only demultiplex cells/nuclei with at least <number> of expressed genes. [default: 100]
- --min-num-umis <number>
We only demultiplex cells/nuclei with at least <number> of UMIs. [default: 100]
- --min-signal-hashtag <count>
Any cell/nucleus with less than <count> hashtags from the signal will be marked as unknown. [default: 10.0]
- --random-state <seed>
The random seed used in the KMeans algorithm to separate empty ADT droplets from others. [default: 0]
- --generate-diagnostic-plots
Generate a series of diagnostic plots, including the background/signal between HTO counts, estimated background probabilities, HTO distributions of cells and non-cells etc.
- --generate-gender-plot <genes>
Generate violin plots using gender-specific genes (e.g. Xist). <genes> is a comma-separated list of gene names.
- -v, --version
Show DemuxEM version.
- -h, --help
Print out help information.
Outputs:
- output_name_demux.zarr.zip
RNA expression matrix with demultiplexed sample identities in Zarr format.
- output_name.out.demuxEM.zarr.zip
DemuxEM-calculated results in Zarr format, containing two datasets, one for HTO and one for RNA.
- output_name.ambient_hashtag.hist.pdf
Optional output. A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets.
- output_name.background_probabilities.bar.pdf
Optional output. A bar plot visualizing the estimated hashtag background probability distribution.
- output_name.real_content.hist.pdf
Optional output. A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay.
- output_name.rna_demux.hist.pdf
Optional output. A histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells.
- output_name.gene_name.violin.pdf
Optional outputs. Violin plots depicting gender-specific gene expression across samples. We can have multiple plots if a gene list is provided in the ‘--generate-gender-plot’ option.
Examples:
pegasus demuxEM -p 8 --generate-diagnostic-plots sample_raw_gene_bc_matrices.h5 sample_hto.csv sample_output
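Because every output uses output_name as its prefix, the file names to expect can be listed ahead of time. The sketch below assembles them from the Outputs list above. The function name is hypothetical, and it assumes that the four optional PDF plots are produced when --generate-diagnostic-plots is set, which the text implies but does not state outright.

```python
def expected_demuxem_outputs(output_name, diagnostic_plots=False, gender_genes=()):
    """List the output file names documented for `pegasus demuxEM`."""
    outputs = [
        f"{output_name}_demux.zarr.zip",
        f"{output_name}.out.demuxEM.zarr.zip",
    ]
    if diagnostic_plots:
        # Assumption: these optional plots come from --generate-diagnostic-plots.
        for suffix in (
            "ambient_hashtag.hist.pdf",
            "background_probabilities.bar.pdf",
            "real_content.hist.pdf",
            "rna_demux.hist.pdf",
        ):
            outputs.append(f"{output_name}.{suffix}")
    # One violin plot per gene passed to --generate-gender-plot.
    for gene in gender_genes:
        outputs.append(f"{output_name}.{gene}.violin.pdf")
    return outputs


for name in expected_demuxem_outputs("sample_output", diagnostic_plots=True):
    print(name)
```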
pegasus cluster¶
Once we have collected the count matrix in 10x (example_10x.h5) or Zarr (example.zarr.zip) format, we can perform single cell analysis using pegasus cluster.
Type:
pegasus cluster -h
to see the usage information:
Usage:
pegasus cluster [options] <input_file> <output_name>
pegasus cluster -h
Arguments:
- input_file
Input file in either ‘zarr’, ‘h5ad’, ‘loom’, ‘10x’, ‘mtx’, ‘csv’, ‘tsv’ or ‘fcs’ format. If first-pass analysis has been performed, but you want to run some additional analysis, you could also pass a zarr-formatted file.
- output_name
Output file name. All outputs will use it as the prefix.
Options:
- -p <number>, --threads <number>
Number of threads. [default: 1]
- --processed
Input file is processed. Assume quality control, data normalization and log transformation, highly variable gene selection, batch correction/PCA and kNN graph building is done.
- --channel <channel_attr>
Use <channel_attr> to create a ‘Channel’ column metadata field. All cells within a channel are assumed to come from a same batch.
- --black-list <black_list>
Cell barcode attributes in black list will be popped out. Format is “attr1,attr2,…,attrn”.
- --select-singlets
Only select DemuxEM-predicted singlets for analysis.
- --remap-singlets <remap_string>
Remap singlet names using <remap_string>, where <remap_string> takes the format “new_name_i:old_name_1,old_name_2;new_name_ii:old_name_3;…”. For example, if we hashed 5 libraries from 3 samples sample1_lib1, sample1_lib2, sample2_lib1, sample2_lib2 and sample3, we can remap them to 3 samples using this string: “sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2”. In this way, the new singlet names will be in metadata field with key ‘assignment’, while the old names will be kept in metadata field with key ‘assignment.orig’.
- --subset-singlets <subset_string>
If singlets are selected, only select singlets in the <subset_string>, which takes the format “name1,name2,…”. Note that if --remap-singlets is specified, subsetting happens after remapping. For example, we can select only singlets from samples 1 and 3 using “sample1,sample3”.
- --genome <genome_name>
If sample count matrix is in either DGE, mtx, csv, tsv or loom format, use <genome_name> as the genome reference name.
- --focus <keys>
Focus analysis on Unimodal data with <keys>. <keys> is a comma-separated list of keys. If None, the self._selected will be the focused one.
- --append <key>
Append Unimodal data <key> to any <keys> in --focus.
- --output-loom
Output loom-formatted file.
- --output-h5ad
Output h5ad-formatted file.
- --min-genes <number>
Only keep cells with at least <number> of genes. [default: 500]
- --max-genes <number>
Only keep cells with less than <number> of genes. [default: 6000]
- --min-umis <number>
Only keep cells with at least <number> of UMIs.
- --max-umis <number>
Only keep cells with less than <number> of UMIs.
- --mito-prefix <prefix>
Prefix for mitochondrial genes. Can provide multiple prefixes for multiple organisms (e.g. “MT-” means to use “MT-”, “GRCh38:MT-,mm10:mt-,MT-” means to use “MT-” for GRCh38, “mt-” for mm10 and “MT-” for all other organisms). [default: GRCh38:MT-,mm10:mt-,MT-]
- --percent-mito <percent>
Only keep cells with mitochondrial percent less than <percent>%. [default: 20.0]
- --gene-percent-cells <percent>
Only use genes that are expressed in at least <percent>% of cells to select variable genes. [default: 0.05]
- --output-filtration-results
Output filtration results as a spreadsheet.
- --plot-filtration-results
Plot filtration results as PDF files.
- --plot-filtration-figsize <figsize>
Figure size for filtration plots. <figsize> is a comma-separated list of two numbers, the width and height of the figure (e.g. 6,4).
- --min-genes-before-filtration <number>
If a raw data matrix is input, empty barcodes will dominate pre-filtration statistics. To avoid this, for a raw data matrix, only consider barcodes with at least <number> genes for the pre-filtration condition. [default: 100]
- --counts-per-cell-after <number>
Total counts per cell after normalization. [default: 1e5]
- --select-hvf-flavor <flavor>
Highly variable feature selection method. <flavor> can be ‘pegasus’ or ‘Seurat’. [default: pegasus]
- --select-hvf-ngenes <nfeatures>
Select top <nfeatures> highly variable features. If <flavor> is ‘Seurat’ and <nfeatures> is ‘None’, select HVGs with z-score cutoff at 0.5. [default: 2000]
- --no-select-hvf
Do not select highly variable features.
- --plot-hvf
Plot highly variable feature selection.
- --correct-batch-effect
Correct for batch effects.
- --correction-method <method>
Batch correction method, can be either ‘L/S’ for location/scale adjustment algorithm (Li and Wong. The analysis of Gene Expression Data 2003), ‘harmony’ for Harmony (Korsunsky et al. Nature Methods 2019), ‘scanorama’ for Scanorama (Hie et al. Nature Biotechnology 2019) or ‘inmf’ for integrative NMF (Yang and Michailidis Bioinformatics 2016, Welch et al. Cell 2019, Gao et al. Nature Biotechnology 2021) [default: harmony]
- --batch-group-by <expression>
Batch correction assumes the differences in gene expression between channels are due to batch effects. However, in many cases, we know that channels can be partitioned into several groups and each group is biologically different from others. In this case, we will only perform batch correction for channels within each group. This option defines the groups. If <expression> is None, we assume all channels are from one group. Otherwise, groups are defined according to <expression>. <expression> takes the form of either ‘attr’, or ‘attr1+attr2+…+attrn’, or ‘attr=value11,…,value1n_1;value21,…,value2n_2;…;valuem1,…,valuemn_m’. In the first form, ‘attr’ should be an existing sample attribute, and groups are defined by ‘attr’. In the second form, ‘attr1’,…,’attrn’ are n existing sample attributes and groups are defined by the Cartesian product of these n attributes. In the last form, there will be m + 1 groups. A cell belongs to group i (i > 0) if and only if its sample attribute ‘attr’ has a value among valuei1,…,valuein_i. A cell belongs to group 0 if it does not belong to any other groups.
- --harmony-nclusters <nclusters>
Number of clusters used for Harmony batch correction.
- --inmf-lambda <lambda>
Coefficient of regularization for iNMF. [default: 5.0]
- --random-state <seed>
Random number generator seed. [default: 0]
- --temp-folder <temp_folder>
Joblib temporary folder for memmapping numpy arrays.
- --calc-signature-scores <sig_list>
Calculate signature scores for gene sets in <sig_list>. <sig_list> is a comma-separated list of strings. Each string should either be a <GMT_file> or one of ‘cell_cycle_human’, ‘cell_cycle_mouse’, ‘gender_human’, ‘gender_mouse’, ‘mitochondrial_genes_human’, ‘mitochondrial_genes_mouse’, ‘ribosomal_genes_human’ and ‘ribosomal_genes_mouse’.
- --pca-n <number>
Number of principal components. [default: 50]
- --nmf
Compute nonnegative matrix factorization (NMF) on highly variable features.
- --nmf-n <number>
Number of NMF components. If iNMF is used for batch correction, this parameter also sets the number of iNMF components. [default: 20]
- --knn-K <number>
Number of nearest neighbors for building kNN graph. [default: 100]
- --knn-full-speed
For the sake of reproducibility, we only run one thread for building kNN indices. Turning on this option allows multiple threads to be used for index building. However, it also reduces reproducibility because of race conditions between threads.
- --kBET
Calculate kBET.
- --kBET-batch <batch>
kBET batch keyword.
- --kBET-alpha <alpha>
kBET rejection alpha. [default: 0.05]
- --kBET-K <K>
kBET K. [default: 25]
- --diffmap
Calculate diffusion maps.
- --diffmap-ndc <number>
Number of diffusion components. [default: 100]
- --diffmap-solver <solver>
Solver for eigen decomposition, either ‘randomized’ or ‘eigsh’. [default: eigsh]
- --diffmap-maxt <max_t>
Maximum time stamp to search for the knee point. [default: 5000]
- --calculate-pseudotime <roots>
Calculate diffusion-based pseudotimes based on <roots>. <roots> should be a comma-separated list of cell barcodes.
- --louvain
Run louvain clustering algorithm.
- --louvain-resolution <resolution>
Resolution parameter for the louvain clustering algorithm. [default: 1.3]
- --louvain-class-label <label>
Louvain cluster label name in result. [default: louvain_labels]
- --leiden
Run leiden clustering algorithm.
- --leiden-resolution <resolution>
Resolution parameter for the leiden clustering algorithm. [default: 1.3]
- --leiden-niter <niter>
Number of iterations of running the Leiden algorithm. If <niter> is negative, run Leiden iteratively until no improvement. [default: -1]
- --leiden-class-label <label>
Leiden cluster label name in result. [default: leiden_labels]
- --spectral-louvain
Run spectral-louvain clustering algorithm.
- --spectral-louvain-basis <basis>
Basis used for KMeans clustering. Can be ‘pca’ or ‘diffmap’. If ‘diffmap’ is not calculated, use ‘pca’ instead. [default: diffmap]
- --spectral-louvain-nclusters <number>
Number of first level clusters for Kmeans. [default: 30]
- --spectral-louvain-nclusters2 <number>
Number of second level clusters for Kmeans. [default: 50]
- --spectral-louvain-ninit <number>
Number of Kmeans tries for first level clustering. Default is the same as scikit-learn Kmeans function. [default: 10]
- --spectral-louvain-resolution <resolution>
Resolution parameter for louvain. [default: 1.3]
- --spectral-louvain-class-label <label>
Spectral-louvain label name in result. [default: spectral_louvain_labels]
- --spectral-leiden
Run spectral-leiden clustering algorithm.
- --spectral-leiden-basis <basis>
Basis used for KMeans clustering. Can be ‘pca’ or ‘diffmap’. If ‘diffmap’ is not calculated, use ‘pca’ instead. [default: diffmap]
- --spectral-leiden-nclusters <number>
Number of first level clusters for Kmeans. [default: 30]
- --spectral-leiden-nclusters2 <number>
Number of second level clusters for Kmeans. [default: 50]
- --spectral-leiden-ninit <number>
Number of Kmeans tries for first level clustering. Default is the same as scikit-learn Kmeans function. [default: 10]
- --spectral-leiden-resolution <resolution>
Resolution parameter for leiden. [default: 1.3]
- --spectral-leiden-class-label <label>
Spectral-leiden label name in result. [default: spectral_leiden_labels]
- --tsne
Run FIt-SNE package to compute t-SNE embeddings for visualization.
- --tsne-perplexity <perplexity>
t-SNE’s perplexity parameter. [default: 30]
- --tsne-initialization <choice>
<choice> can be either ‘random’ or ‘pca’. ‘random’ refers to random initialization. ‘pca’ refers to PCA initialization as described in (Kobak et al. 2019). [default: pca]
- --umap
Run umap for visualization.
- --umap-K <K>
K neighbors for umap. [default: 15]
- --umap-min-dist <number>
Umap parameter. [default: 0.5]
- --umap-spread <spread>
Umap parameter. [default: 1.0]
- --fle
Run force-directed layout embedding.
- --fle-K <K>
K neighbors for building graph for FLE. [default: 50]
- --fle-target-change-per-node <change>
Target change per node to stop forceAtlas2. [default: 2.0]
- --fle-target-steps <steps>
Maximum number of iterations before stopping the forceAtlas2 algorithm. [default: 5000]
- --fle-memory <memory>
Memory size in GB for the Java FA2 component. [default: 8]
- --net-down-sample-fraction <frac>
Down sampling fraction for net-related visualization. [default: 0.1]
- --net-down-sample-K <K>
Use <K> neighbors to estimate local density for each data point for down sampling. [default: 25]
- --net-down-sample-alpha <alpha>
Weighted down sample, proportional to radius^alpha. [default: 1.0]
- --net-regressor-L2-penalty <value>
L2 penalty parameter for the deep net regressor. [default: 0.1]
- --net-umap
Run net umap for visualization.
- --net-umap-polish-learning-rate <rate>
After running the deep regressor to predict new coordinate, what is the learning rate to use to polish the coordinates for UMAP. [default: 1.0]
- --net-umap-polish-nepochs <nepochs>
Number of iterations for polishing UMAP run. [default: 40]
- --net-umap-out-basis <basis>
Output basis for net-UMAP. [default: net_umap]
- --net-fle
Run net FLE.
- --net-fle-polish-target-steps <steps>
After running the deep regressor to predict new coordinate, what is the number of force atlas 2 iterations. [default: 1500]
- --net-fle-out-basis <basis>
Output basis for net-FLE. [default: net_fle]
- --infer-doublets
Infer doublets using the method described here. Obs attribute ‘doublet_score’ stores Scrublet-like doublet scores and attribute ‘demux_type’ stores ‘doublet/singlet’ assignments.
- --expected-doublet-rate <rate>
The expected doublet rate per sample. By default, calculate the expected rate based on number of cells from the 10x multiplet rate table.
- --dbl-cluster-attr <attr>
<attr> refers to a cluster attribute containing cluster labels (e.g. ‘louvain_labels’). Doublet clusters will be marked based on <attr> with the following criteria: passing the Fisher’s exact test and having >= 50% of cells identified as doublets. By default, the first computed cluster attribute in the list of leiden, louvain, spectral_leiden and spectral_louvain is used.
- --citeseq
Input data contain both RNA and CITE-Seq modalities. This will set --focus to be the RNA modality and --append to be the CITE-Seq modality. In addition, ‘ADT-’ will be added in front of each antibody name to avoid name conflict with genes in the RNA modality.
- --citeseq-umap
For high quality cells kept in the RNA modality, generate a UMAP based on their antibody expression.
- --citeseq-umap-exclude <list>
<list> is a comma-separated list of antibodies to be excluded from the UMAP calculation (e.g. Mouse-IgG1,Mouse-IgG2a).
- -h, --help
Print out help information.
Outputs:
- output_name.zarr.zip
Output file in Zarr format. To load this file in python, use import pegasus; data = pegasus.read_input('output_name.zarr.zip'). The log-normalized expression matrix is stored in data.X as a CSR-format sparse matrix. The obs field contains cell related attributes, including clustering results. For example, data.obs_names records cell barcodes; data.obs['Channel'] records the channel each cell comes from; data.obs['n_genes'], data.obs['n_counts'], and data.obs['percent_mito'] record the number of expressed genes, total UMI count, and mitochondrial rate for each cell respectively; data.obs['louvain_labels'] and data.obs['approx_louvain_labels'] record each cell’s cluster labels using different clustering algorithms; data.obs['pseudo_time'] records the inferred pseudotime for each cell. The var field contains gene related attributes. For example, data.var_names records gene symbols, data.var['gene_ids'] records Ensembl gene IDs, and data.var['selected'] records selected variable genes. The obsm field records embedding coordinates. For example, data.obsm['X_pca'] records PCA coordinates, data.obsm['X_tsne'] records tSNE coordinates, data.obsm['X_umap'] records UMAP coordinates, data.obsm['X_diffmap'] records diffusion map coordinates, and data.obsm['X_fle'] records the force-directed layout coordinates from the diffusion components. The uns field stores other related information, such as reference genome (data.uns['genome']). This file can be loaded into R and converted into a Seurat object.
- output_name.<group>.h5ad
Optional output. Only exists if ‘--output-h5ad’ is set. Results in h5ad format per focused <group>. This file can be loaded into R and converted into a Seurat object.
- output_name.<group>.loom
Optional output. Only exists if ‘--output-loom’ is set. Results in loom format per focused <group>.
- output_name.<group>.filt.xlsx
Optional output. Only exists if ‘--output-filtration-results’ is set. Filtration statistics per focused <group>. This file has two sheets: Cell filtration stats and Gene filtration stats. The first sheet records cell filtering results and has 10 columns: Channel, channel name; kept, number of cells kept; median_n_genes, median number of expressed genes in kept cells; median_n_umis, median number of UMIs in kept cells; median_percent_mito, median mitochondrial rate in kept cells, computed as the ratio of UMIs from mitochondrial genes to UMIs from all genes; filt, number of cells filtered out; total, total number of cells before filtration (if the input contains all barcodes, this is the number of cells left after ‘--min-genes-on-raw’ filtration); median_n_genes_before, median expressed genes per cell before filtration; median_n_umis_before, median UMIs per cell before filtration; median_percent_mito_before, median mitochondrial rate per cell before filtration. The channels are sorted in ascending order with respect to the number of kept cells per channel. The second sheet records genes that failed to pass the filtering. This sheet has 3 columns: gene, gene name; n_cells, number of cells in which this gene is expressed; percent_cells, the fraction of cells in which this gene is expressed. Genes are ranked in ascending order according to the number of cells in which the gene is expressed. Note that only genes not expressed in any cell are removed from the data. Other filtered genes are marked as non-robust and not used for TPM-like normalization.
- output_name.<group>.filt.gene.pdf
Optional output. Only exists if ‘--plot-filtration-results’ is set. This file contains violin plots contrasting gene count distributions before and after filtration per channel per focused <group>.
- output_name.<group>.filt.UMI.pdf
Optional output. Only exists if ‘--plot-filtration-results’ is set. This file contains violin plots contrasting UMI count distributions before and after filtration per channel per focused <group>.
- output_name.<group>.filt.mito.pdf
Optional output. Only exists if ‘--plot-filtration-results’ is set. This file contains violin plots contrasting mitochondrial rate distributions before and after filtration per channel per focused <group>.
- output_name.<group>.hvf.pdf
Optional output. Only exists if ‘--plot-hvf’ is set. This file contains a scatter plot describing the highly variable gene selection procedure per focused <group>.
- output_name.<group>.<channel>.dbl.png
Optional output. Only exists if ‘--infer-doublets’ is set. Each figure consists of 4 panels showing diagnostic plots for doublet inference. If there is only one channel in <group>, file name becomes output_name.<group>.dbl.png.
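As a rough illustration of how the obs and obsm fields described for output_name.zarr.zip might be inspected after loading, here is a minimal sketch. The helper names cluster_sizes and embedding_keys are hypothetical (they are not part of Pegasus), and the stand-in object with made-up values only mimics the obs/obsm layout so that the snippet runs without Pegasus installed.

```python
from collections import Counter
from types import SimpleNamespace


def cluster_sizes(data, label_key="louvain_labels"):
    """Return {cluster label: number of cells} from data.obs[label_key]."""
    return dict(Counter(data.obs[label_key]))


def embedding_keys(data):
    """List the embeddings stored in data.obsm, e.g. 'X_pca' or 'X_umap'."""
    return sorted(data.obsm.keys())


# Stand-in object with made-up values (hypothetical, for illustration only).
# With a real result you would instead load the file as described above:
#   import pegasus
#   data = pegasus.read_input('output_name.zarr.zip')
toy = SimpleNamespace(
    obs={"louvain_labels": ["1", "1", "2", "3", "2", "1"]},
    obsm={"X_umap": [[0.0, 0.0]] * 6, "X_pca": [[0.0, 0.0]] * 6},
)

print(cluster_sizes(toy))
print(embedding_keys(toy))
```

The same two helpers should work on a loaded object as long as obs[label_key] is iterable and obsm behaves like a dictionary, which is an assumption about the loaded data structure rather than something this page states.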
Examples:
pegasus cluster -p 20 --correct-batch-effect --louvain --tsne example_10x.h5 example_out
pegasus cluster -p 20 --leiden --umap --net-fle example.zarr.zip example_out
pegasus de_analysis¶
Once we have the clusters, we can detect markers using pegasus de_analysis. We will calculate Mann-Whitney U test and AUROC values by default.
Type:
pegasus de_analysis -h
to see the usage information:
Usage:
pegasus de_analysis [options] (--labels <attr>) <input_data_file> <output_spreadsheet>
pegasus de_analysis -h
Arguments:
- input_data_file
Single cell data with clustering calculated. DE results would be written back.
- output_spreadsheet
Output spreadsheet with DE results.
Options:
- --labels <attr>
<attr> used as cluster labels. [default: louvain_labels]
- -p <threads>
Use <threads> threads. [default: 1]
- --de-key <key>
Store DE results into AnnData varm with key = <key>. [default: de_res]
- --t
Calculate Welch’s t-test.
- --fisher
Calculate Fisher’s exact test.
- --temp-folder <temp_folder>
Joblib temporary folder for memmapping numpy arrays.
- --alpha <alpha>
Control false discovery rate at <alpha>. [default: 0.05]
- --ndigits <ndigits>
Round non p-values and q-values to <ndigits> after decimal point in the excel. [default: 3]
- --quiet
Do not show detailed intermediate outputs.
- -h, --help
Print out help information.
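The --alpha option controls the false discovery rate, but this page does not say which procedure is used. As an assumption for illustration only, the sketch below uses the Benjamini-Hochberg procedure, one common way to turn p-values into q-values and then threshold them at alpha. The function names and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is capped by
    # the q-values of all larger p-values (keeps q-values monotone).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def significant(pvalues, alpha=0.05):
    """Flag tests whose q-value is at most alpha."""
    return [q <= alpha for q in benjamini_hochberg(pvalues)]


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for illustration
print(benjamini_hochberg(pvals))
print(significant(pvals, alpha=0.05))
```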
Outputs:
- input_data_file
DE results would be written back to the ‘varm’ field with name set by ‘--de-key <key>’.
- output_spreadsheet
An excel spreadsheet containing DE results. Each cluster has two tabs in the spreadsheet: one for up-regulated genes and the other for down-regulated genes. If DE was performed on conditions within each cluster, each cluster will have as many tabs as there are conditions, and each condition tab contains two spreadsheets: up for up-regulated genes and down for down-regulated genes.
Examples:
pegasus de_analysis -p 26 --labels louvain_labels --t --fisher example.zarr.zip example_de.xlsx
pegasus find_markers¶
Once we have the DE results, we can optionally find cluster-specific markers with gradient boosting using pegasus find_markers.
Type:
pegasus find_markers -h
to see the usage information:
Usage:
pegasus find_markers [options] <input_data_file> <output_spreadsheet>
pegasus find_markers -h
Arguments:
- input_data_file
Single cell data after running the de_analysis.
- output_spreadsheet
Output spreadsheet with LightGBM detected markers.
Options:
- -p <threads>
Use <threads> threads. [default: 1]
- --labels <attr>
<attr> used as cluster labels. [default: louvain_labels]
- --de-key <key>
Key for storing DE results in ‘varm’ field. [default: de_res]
- --remove-ribo
Remove ribosomal genes with either RPL or RPS as prefixes.
- --min-gain <gain>
Only report genes with a feature importance score (in gain) of at least <gain>. [default: 1.0]
- --random-state <seed>
Random state for initializing LightGBM and KMeans. [default: 0]
- -h, --help
Print out help information.
Outputs:
- output_spreadsheet
An excel spreadsheet containing detected markers. Each cluster has one tab in the spreadsheet and each tab has six columns, listing markers that are strongly up-regulated, weakly up-regulated, down-regulated and their associated LightGBM gains.
Examples:
pegasus find_markers --labels louvain_labels --remove-ribo --min-gain 10.0 -p 10 example.zarr.zip example.markers.xlsx
pegasus annotate_cluster¶
Once we have the DE results, we could optionally identify putative cell types for each cluster using pegasus annotate_cluster.
This command has two forms: the first form generates putative annotations, and the second form writes annotations into the Zarr object.
Type:
pegasus annotate_cluster -h
to see the usage information:
Usage:
pegasus annotate_cluster [--markers <str> --de-test <test> --de-alpha <alpha> --de-key <key> --minimum-report-score <score> --do-not-use-non-de-genes] <input_data_file> <output_file>
pegasus annotate_cluster --annotation <annotation_string> <input_data_file>
pegasus annotate_cluster -h
Arguments:
- input_data_file
Single cell data with DE analysis done by pegasus de_analysis.
- output_file
Output annotation file.
Options:
- --markers <str>
<str> is a comma-separated list. Each element in the list either refers to a JSON file containing legacy markers, or ‘human_immune’/’mouse_immune’/’human_brain’/’mouse_brain’/’human_lung’ for predefined markers. [default: human_immune]
- --de-test <test>
DE test to use to infer cell types. [default: mwu]
- --de-alpha <alpha>
False discovery rate to control family-wise error rate. [default: 0.05]
- --de-key <key>
Key under which the DE results are stored in the ‘varm’ field. [default: de_res]
- --minimum-report-score <score>
Minimum cell type score to report a potential cell type. [default: 0.5]
- --do-not-use-non-de-genes
Do not count non DE genes as down-regulated.
- --annotation <annotation_string>
Write cell type annotations in <annotation_string> into <input_data_file>. <annotation_string> has this format: 'anno_name:clust_name:anno_1;anno_2;...;anno_n', where anno_name is the annotation attribute in the Zarr object, clust_name is the attribute with cluster ids, and anno_i is the annotation for cluster i.
- -h, --help
Print out help information.
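To make the <annotation_string> format of the --annotation option concrete, the sketch below splits such a string into its three parts. It is not how Pegasus parses the string internally; parse_annotation_string is a hypothetical helper, and it assumes cluster ids are the strings "1" to "n" in order.

```python
def parse_annotation_string(annotation_string):
    """Split 'anno_name:clust_name:anno_1;anno_2;...;anno_n' into its parts.

    Returns (anno_name, clust_name, {cluster id: annotation}).
    Assumption: cluster ids are the strings "1", "2", ..., "n".
    """
    # Split on the first two colons only, so annotations may contain colons.
    anno_name, clust_name, anno_part = annotation_string.split(":", 2)
    annotations = anno_part.split(";")
    cluster_to_anno = {
        str(i): anno for i, anno in enumerate(annotations, start=1)
    }
    return anno_name, clust_name, cluster_to_anno


name, clusters, mapping = parse_annotation_string(
    "anno:louvain_labels:T cells;B cells;NK cells;Monocytes"
)
print(name, clusters)
print(mapping)
```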
Outputs:
- output_file
This is a text file. For each cluster, all its putative cell types are listed in descending order of the cell type score. For each putative cell type, all markers supporting this cell type are listed. If one putative cell type has cell subtypes, all subtypes will be listed under this cell type.
Examples:
pegasus annotate_cluster example.zarr.zip example.anno.txt
pegasus annotate_cluster --markers human_immune,human_lung lung.zarr.zip lung.anno.txt
pegasus annotate_cluster --annotation "anno:louvain_labels:T cells;B cells;NK cells;Monocytes" example.zarr.zip
pegasus plot¶
We can make a variety of figures using pegasus plot.
Type:
pegasus plot -h
to see the usage information:
Usage:
pegasus plot [options] [--restriction <restriction>...] [--palette <palette>...] <plot_type> <input_file> <output_file>
pegasus plot -h
Arguments:
- plot_type
Plot type, either ‘scatter’ for scatter plots, ‘compo’ for composition plots, or ‘wordcloud’ for word cloud plots.
- input_file
Single cell data in Zarr or H5ad format.
- output_file
Output image file.
Options:
- --dpi <dpi>
DPI value for the figure. [default: 500]
- --basis <basis>
Basis for 2D plotting, chosen from ‘tsne’, ‘fitsne’, ‘umap’, ‘pca’, ‘fle’, ‘net_tsne’, ‘net_umap’ or ‘net_fle’. [default: umap]
- --attributes <attrs>
<attrs> is a comma-separated list of attributes to color the basis. This option is only used in ‘scatter’.
- --restriction <restriction>…
Set restriction if you only want to plot a subset of data. Multiple <restriction> strings are allowed. Each <restriction> takes the format of ‘attr:value,value’, or ‘attr:~value,value..’ which means excluding values. This option is used in ‘composition’ and ‘scatter’.
- --alpha <alpha>
Point transparency parameter. Can be a single value or a comma-separated list of values, one for each attribute in <attrs>.
- --legend-loc <str>
Legend location, can be either “right margin” or “on data”. If a list is provided, set ‘legend_loc’ for each attribute in ‘attrs’ separately. [default: “right margin”]
- --palette <str>
Used for setting colors for every category in categorical attributes. Multiple <palette> strings are allowed. Each string takes the format of ‘attr:color1,color2,…,colorn’. ‘attr’ is the categorical attribute and ‘color1’ - ‘colorn’ are the colors for each category in ‘attr’ (e.g. ‘cluster_labels:black,blue,red,…,yellow’). If there is only one categorical attribute in ‘attrs’, palettes can be set as a single string and the ‘attr’ keyword can be omitted (e.g. “blue,yellow,red”).
- --show-background
Show points that are not selected as gray.
- --nrows <nrows>
Number of rows in the figure. If not set, pegasus will figure it out automatically.
- --ncols <ncols>
Number of columns in the figure. If not set, pegasus will figure it out automatically.
- --panel-size <sizes>
Panel size in inches, w x h, separated by comma. Note that margins are not counted in the sizes. For composition, default is (6, 4). For scatter plots, default is (4, 4).
- --left <left>
Figure’s left margin in fraction with respect to panel width.
- --bottom <bottom>
Figure’s bottom margin in fraction with respect to panel height.
- --wspace <wspace>
Horizontal space between panels in fraction with respect to panel width.
- --hspace <hspace>
Vertical space between panels in fraction with respect to panel height.
- --groupby <attr>
Use <attr> to categorize the cells for the composition plot, e.g. cell type.
- --condition <attr>
Use <attr> to calculate frequency within each category defined by ‘--groupby’ for the composition plot, e.g. donor.
- --style <style>
Composition plot styles. Can be either ‘frequency’ or ‘normalized’. [default: normalized]
- --factor <factor>
Factor index (column index in data.uns['W']) to be used to generate word cloud plot.
- --max-words <max_words>
Maximum number of genes to show in the image. [default: 20]
- -h, --help
Print out help information.
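The <restriction> format accepted by --restriction (‘attr:value,value’ to keep values, ‘attr:~value,value’ to exclude them) can be illustrated with a small parser. This is only a sketch of the format as described above, not the code Pegasus runs; parse_restriction and keep_cell are hypothetical names, and the attribute values are made up.

```python
def parse_restriction(restriction):
    """Parse 'attr:value,value' or 'attr:~value,value' into its parts.

    Returns (attr, exclude, values), where exclude is True when the
    value list starts with '~'.
    """
    attr, value_str = restriction.split(":", 1)
    exclude = value_str.startswith("~")
    if exclude:
        value_str = value_str[1:]
    values = [v for v in value_str.split(",") if v]
    return attr, exclude, values


def keep_cell(cell_value, exclude, values):
    """Decide whether a cell with this attribute value passes the restriction."""
    if exclude:
        return cell_value not in values
    return cell_value in values


attr, exclude, values = parse_restriction("louvain_labels:~3,4")
print(attr, exclude, values)
print([keep_cell(v, exclude, values) for v in ["1", "3", "4", "5"]])
```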
Examples:
pegasus plot scatter --basis tsne --attributes louvain_labels,Donor example.h5ad scatter.pdf
pegasus plot compo --groupby louvain_labels --condition Donor example.zarr.zip compo.pdf
pegasus plot wordcloud --factor 0 example.zarr.zip word_cloud_0.pdf
pegasus scp_output¶
If we want to visualize analysis results on single cell portal (SCP), we can generate required files for SCP using this subcommand.
Type:
pegasus scp_output -h
to see the usage information:
Usage:
pegasus scp_output <input_data_file> <output_name>
pegasus scp_output -h
Arguments:
- input_data_file
Analyzed single cell data in zarr format.
- output_name
Name prefix for all outputted files.
Options:
- --dense
Output dense expression matrix instead.
- --round-to <ndigit>
Round expression to <ndigit> after the decimal point. [default: 2]
- -h, --help
Print out help information.
Outputs:
- output_name.scp.metadata.txt, output_name.scp.barcodes.tsv, output_name.scp.genes.tsv, output_name.scp.matrix.mtx, output_name.scp.*.coords.txt, output_name.scp.expr.txt
Files that single cell portal needs.
Examples:
pegasus scp_output example.zarr.zip example
pegasus check_indexes¶
If we run CITE-Seq or any kind of hashing, we need to make sure that the library indexes of CITE-Seq/hashing do not collide with 10x’s RNA indexes. This command can help us to determine which 10x index sets we should use.
Type:
pegasus check_indexes -h
to see the usage information:
Usage:
pegasus check_indexes [--num-mismatch <mismatch> --num-report <report>] <index_file>
pegasus check_indexes -h
Arguments:
- index_file
Index file containing CITE-Seq/hashing index sequences. One sequence per line.
Options:
- --num-mismatch <mismatch>
Number of mismatches allowed for each index sequence. [default: 1]
- --num-report <report>
Number of valid 10x indexes to report. Default is to report all valid indexes. [default: 9999]
- -h, --help
Print out help information.
Outputs:
Up to <report> number of valid 10x indexes will be printed out to standard output.
Examples:
pegasus check_indexes --num-report 8 index_file.txt
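The idea of an index collision can be sketched with a Hamming distance check. The exact rule that pegasus check_indexes applies is not stated on this page, so the sketch assumes two equal-length indexes collide when they differ in at most num_mismatch positions. The function names and the sequences below are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def colliding_pairs(custom_indexes, tenx_indexes, num_mismatch=1):
    """Return (custom, 10x) pairs within num_mismatch mismatches.

    Assumption: a collision means Hamming distance <= num_mismatch.
    """
    return [
        (c, t)
        for c in custom_indexes
        for t in tenx_indexes
        if hamming(c, t) <= num_mismatch
    ]


custom = ["ACGTACGT"]              # made-up CITE-Seq/hashing index
tenx = ["ACGTACGA", "TTTTCCCC"]    # made-up 10x indexes
print(colliding_pairs(custom, tenx, num_mismatch=1))
```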
Use Pegasus on Terra Notebook¶
You need to first have a Terra account.
1. Start Notebook Runtime on Terra¶
The first time you use Terra notebook, you need to create a Cloud Environment.
On the top-right panel of your workspace, click the following button within red circle:

Then click the CUSTOMIZE button shown in red rectangle:

Then you’ll need to set the configuration of your cloud environment in the pop-out dialog (see image below):

1.1. Create from Terra official environment¶
Terra team maintains a list of cloud environments for users to quickly set up their own. In this way, you’ll use the most recent stable Pegasus cloud environment.
In Application configuration field, select Pegasus from the drop-down menu:

In case you are interested in looking at which packages and tools are included in the Pegasus cloud environment, please click the What's installed on this environment? link right below the drop-down menu:

After that set other fields in the pop-out dialog:
In Cloud compute profile field, you can set the computing resources (CPUs and Memory size) you want to use.
In Compute type input, choose Standard VM, as this is the cheapest type and is enough for using Pegasus.
(NEW) In case you need GPUs, select Enable GPUs, then choose GPU type and type in number of GPUs to use. Notice that only version 1.4.3 or later supports this GPU mode.
In Persistent disk size (GB) field, choose the size of the persistent disk for your cloud environment. This disk space by default remains after you delete your cloud environment (unless you choose to delete the persistent disk as well), but it costs even if you stop your cloud environment.
Now click the CREATE button to start the creation. After waiting for 1-2 minutes, your cloud environment will be ready to use, and it’s started automatically.
1.2. Create from custom environment¶
Alternatively, you can create from a custom environment if you want to use an older version of Pegasus.

In the cloud environment setting page:
In Application configuration field, choose Custom Environment (see the first red rectangle above).
In Container image input, type cumulusprod/pegasus-terra:<version> (see the second red rectangle above), where <version> should be chosen from this list. All the tags are for different versions of Pegasus.
In Cloud compute profile field, set the computing resources (CPUs and Memory size) you want to use.
In Compute type field, choose Standard VM, as this is the cheapest type and is enough for using Pegasus.
In Persistent disk size (GB) field, choose the size of the persistent disk for your cloud environment.
When you finish the configuration, click the NEXT button. You’ll see a warning page; then click the CREATE button to start the creation.
After waiting for 1-2 minutes, your cloud environment will be ready to use, and it’s started automatically.
1.3. Start an environment already created¶
After creation, this cloud environment is associated with your Terra workspace. You can start the same environment anytime in your workspace by clicking the following button within red circle on the top-right panel:

2. Create Your Terra Notebook¶
In the NOTEBOOKS tab of your workspace, you can either create a blank notebook, or upload your local notebook. After creation, click EDIT button to enter the edit mode, and Terra will automatically start your cloud environment.
When the start-up is done, you can type the following code in your notebook to check if Pegasus can be loaded and if it’s the correct version you want to use:
import pegasus as pg
pg.__version__
3. Load Data into Cloud Environment¶
To use your data on Cloud (i.e. from the Google Bucket of your workspace), you should first copy it into your notebook’s cloud environment by Google Cloud SDK:
!gsutil -m cp gs://link-to-count-matrix .
where gs://link-to-count-matrix is the Google Bucket URL to your count matrix data file, and ! is the indicator of running terminal commands within Jupyter notebook.
After that, you can use Pegasus function to load it into memory.
Please refer to tutorials for how to use Pegasus on Terra notebook.
4. Stop Notebook Runtime¶
When you are done with the interactive analysis, to avoid being charged by Google Cloud while not using it, don’t forget to stop your cloud environment by clicking the following button of the top-right panel of your workspace within red circle:

If you forget to stop manually, you’ll still be safe as long as you’ve closed all the webpages related to your cloud environment (e.g. Terra notebooks, Terra terminals, etc.). In this case, Terra will automatically stop the cloud environment for you after waiting for a few minutes.
Tutorials¶
Analysis Tutorial: A case study on single-cell RNA sequencing data analysis using Pegasus.
Plotting Tutorial: Show how to use Pegasus to generate different kinds of plots for analysis.
Batch Correction Tutorial: Introduction on batch correction / data integration methods available in Pegasus.
Spatial Analysis Tutorial: Introduction on spatial analysis using Pegasus.
Doublet Detection Tutorial: Introduction on doublet detection method in Pegasus.
Regress Out Tutorial: Introduction on regressing out via a case study on cell-cycle gene effects.
Pseudobulk Analysis Tutorial: Introduction on pseudobulk analysis using Pegasus.
API¶
Pegasus can also be used as a python package. Import pegasus by:
import pegasus as pg
Read and Write¶
|
Load data into memory. |
|
Write data back to disk. |
|
Aggregate channel-specific count matrices into one big count matrix. |
Analysis Tools¶
Preprocess¶
|
Generate Quality Control (QC) metrics regarding cell barcodes on the dataset. |
|
Calculate filtration stats on cell barcodes. |
|
Filter data based on qc_metrics calculated in |
|
Identify robust genes as candidates for HVG selection and remove genes that are not expressed in any cells. |
|
Normalize each cell by total counts, and then apply natural logarithm to the data. |
|
Apply the Log(1+x) transformation on the given count matrix. |
|
Normalize each cell by total count. |
|
Conduct arcsinh transform on the current matrix. |
|
Highly variable features (HVF) selection. |
|
Subset the features and store the resulting matrix in dense format in data.uns with '_tmp_fmat_' prefix, with the option of standardization and truncating based on max_value. |
|
Perform Principal Component Analysis (PCA) to the data. |
|
Perform Nonnegative Matrix Factorization (NMF) to the data using Frobenius norm. |
|
Regress out effects due to specific observational attributes. |
|
Calculate the standardized z scores of the count matrix. |
Batch Correction¶
|
Batch correction on PCs using Harmony. |
|
Batch correction using Scanorama. |
|
Perform Integrative Nonnegative Matrix Factorization (iNMF) [Yang16] for data integration. |
|
Run scVI embedding. |
Nearest Neighbors¶
|
Compute k nearest neighbors and affinity matrix, which will be used for diffmap and graph-based community detection algorithms. |
|
Find K nearest neighbors for each data point and return the indices and distances arrays. |
|
Calculate the kBET metric of the data regarding a specific sample attribute and embedding. |
|
Calculate the kSIM metric of the data regarding a specific sample attribute and embedding. |
Diffusion Map¶
|
Calculate Diffusion Map. |
|
Calculate Pseudotime based on Diffusion Map. |
|
Inference on path of a cluster. |
Cluster Algorithms¶
|
Cluster the data using the chosen algorithm. |
|
Cluster the cells using Louvain algorithm. |
|
Cluster the data using Leiden algorithm. |
|
Use Leiden algorithm to split 'clust_id' in 'clust_label' into 'n_components' sub-clusters and write the new clustering results to 'res_label'. |
|
Cluster the data using Spectral Louvain algorithm. |
|
Cluster the data using Spectral Leiden algorithm. |
Visualization Algorithms¶
|
Calculate t-SNE embedding of cells using the FIt-SNE package. |
|
Calculate UMAP embedding of cells. |
|
Construct the Force-directed (FLE) graph. |
|
Calculate Net-UMAP embedding of cells. |
|
Construct Net-Force-directed (FLE) graph. |
Doublet Detection¶
|
Infer doublets by first calculating Scrublet-like [Wolock18] doublet scores and then smartly determining an appropriate doublet score cutoff [Li20-2] . |
|
Convert doublet prediction into doublet annotations that Pegasus can recognize. |
Gene Module Score¶
|
Calculate signature / gene module score. |
Label Transfer¶
|
Run scArches training. |
|
Run scArches training. |
Differential Expression and Gene Set Enrichment Analysis¶
|
Perform Differential Expression (DE) Analysis on data. |
|
Extract DE results into a human readable structure. |
|
Write DE analysis results into Excel workbook. |
|
Perform Gene Set Enrichment Analysis using fGSEA. |
Annotate clusters¶
|
Infer putative cell types for each cluster using legacy markers. |
|
Add annotation to the data object as a categorical variable. |
Plotting¶
|
Generate scatter plots for different attributes |
|
Generate scatter plots of attribute 'attr' for each category in attribute 'group'. |
|
Scatter plot on spatial coordinates. |
|
Generate a composition plot, which shows the percentage of cells from each condition for every cluster. |
|
Generate a stacked violin plot. |
|
Generate a heatmap. |
|
Generate a dot plot. |
|
Generate a dendrogram on hierarchical clustering result. |
|
Generate highly variable feature plot. |
|
Plot quality control statistics (before filtration vs. |
|
Generate Volcano plots (-log10 p value vs. |
|
Generate a barcode rank plot, which shows the total UMIs against barcode rank (in descending order with respect to total UMIs) |
|
Generate ridge plots, up to 8 features can be shown in one figure. |
|
Generate one word cloud image for factor (starts from 0) in data.uns['W']. |
|
Generate GSEA barplots |
|
Generate Elbowplot and suggest n_comps to select based on random matrix theory (see utils.largest_variance_from_random_matrix). |
Pseudo-bulk analysis¶
|
Generate Pseudo-bulk count matrices. |
|
Perform Differential Expression (DE) Analysis using DESeq2 on pseudobulk data. |
|
Extract pseudobulk DE results into a human readable structure. |
|
Write pseudo-bulk DE analysis results into an Excel workbook. |
|
Generate Volcano plots (-log10 p value vs. |
Demultiplexing¶
|
For cell-hashing data, estimate antibody background probability using KMeans algorithm. |
|
Demultiplexing cell/nucleus-hashing data, using the estimated antibody background probability calculated in |
|
Write demultiplexing results into raw gene expression matrix. |
Miscellaneous¶
|
Extract and display gene expressions for each cluster. |
|
Extract and display differential expression analysis results of markers for each cluster. |
|
Using MWU test to detect if any cluster is an outlier regarding one of the qc attributes: n_genes, n_counts, percent_mito. |
|
Find markers using gradient boosting method. |
Release Notes¶
Note
Also see the release notes of PegasusIO.
Version 1.8¶
1.8.1 August 23, 2023¶
Bug fix in cell marker JSON files for infer_cell_types function.
1.8.0 July 21, 2023¶
New Feature and Improvement
Updata
human_immune
andhuman_lung
marker sets.Add
mouse_liver
marker set.Add split_one_cluster function to subcluster one cluster into a specified number of subclusters.
Update neighbors function to set
use_cache=False
by default, and adjust K tomin(K, int(sqrt(n_samples)))
. [PR 272]In infer_doublets function, argument
manual_correction
now accepts a float number threshold specified by users for cut-off. [PR 275]
Bug Fix
Fix divide by zero issue in integrative_nmf function. [PR 258]
Compatibility with Pandas v2.0. [PR 261]
Allow infer_doublets to use any count matrix with key name specified by users. [PR 268 Thanks to Donghoon Lee]
Version 1.7¶
1.7.1 July 29, 2022¶
1.7.0 July 5, 2022¶
New Features
Add pegasus.elbowplot function to generate elbowplot, with an automated suggestion on number of PCs to be selected based on random matrix theory ([Johnstone 2001] and [Shekhar 2022]).
Add arcsinh_transform function for arcsinh transformation on the count matrix.
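As a rough illustration of what an arcsinh transformation does to a count matrix, the sketch below applies asinh(x / cofactor) element-wise using only the standard library. The function name arcsinh_counts and the cofactor parameter (with its value of 5.0) are assumptions made for this example; they are not taken from the Pegasus API.

```python
import math
from typing import List, Sequence


def arcsinh_counts(
    counts: Sequence[Sequence[float]], cofactor: float = 5.0
) -> List[List[float]]:
    """Apply asinh(x / cofactor) element-wise to a cells-by-features matrix.

    Hypothetical helper; the cofactor scaling is an assumption of this sketch.
    """
    return [[math.asinh(x / cofactor) for x in row] for row in counts]


# Toy matrix: 2 cells x 3 features.
toy = [[0, 5, 50], [10, 0, 500]]
transformed = arcsinh_counts(toy)
for row in transformed:
    print([round(v, 3) for v in row])
```

Zeros stay at zero, small counts change roughly linearly, and large counts are compressed roughly logarithmically.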
Improvement
Function nearest_neighbors has additional argument n_comps to allow using part of the components from the source embedding for calculating the nearest neighbor graph.
Add n_comps argument for run_harmony, tsne, umap, and fle (argument name is rep_ncomps) functions to allow selecting part of the components from the source embedding.
Function scatter can plot multiple components and bases.
Version 1.6¶
1.6.0 April 16, 2022¶
New Features
Add support for scVI-tools:
Function pegasus.run_scvi, which is a wrapper of scVI for data integration.
Add a dedicated section for scVI method in Pegasus batch correction tutorial.
Functions pegasus.train_scarches_scanvi and pegasus.predict_scarches_scanvi to wrap scArches for transfer learning on single-cell data labels.
Version 1.5¶
1.5.0 March 9, 2022¶
New Features
Spatial data analysis:
Enable pegasus.read_input function to load 10x Visium data: set file_type="visium" option.
Add pegasus.spatial function to generate spatial plot for 10x Visium data.
Add Spatial Analysis Tutorial in Tutorials.
Pseudobulk analysis: see summary
Add pegasus.pseudobulk function to generate pseudobulk matrix.
Add pegasus.deseq2 function to perform pseudobulk differential expression (DE) analysis, which is a Python wrapper of DESeq2.
Requires rpy2 and the original DESeq2 R package installed.
Add pegasus.pseudo.markers, pegasus.pseudo.write_results_to_excel and pegasus.pseudo.volcano functions for processing pseudobulk DE results.
Add Pseudobulk Analysis Tutorial in Tutorials.
Add pegasus.fgsea function to perform Gene Set Enrichment Analysis (GSEA) on DE results and plotting, which is a Python wrapper of fgsea.
Requires rpy2 and the original fgsea R package installed.
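The idea behind a pseudobulk matrix can be sketched without any Pegasus call: per-cell counts are aggregated within each sample so that bulk-style DE tools can be applied. The example below assumes aggregation by summation and a single grouping label per cell; the helper pseudobulk_sum is hypothetical and does not reflect how pegasus.pseudobulk is implemented or parameterized.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def pseudobulk_sum(
    counts: Sequence[Sequence[int]], samples: Sequence[str]
) -> Dict[str, List[int]]:
    """Sum per-cell gene counts within each sample label (hypothetical helper)."""
    if len(counts) != len(samples):
        raise ValueError("counts and samples must have the same number of cells")
    n_genes = len(counts[0]) if counts else 0
    bulk: Dict[str, List[int]] = defaultdict(lambda: [0] * n_genes)
    for row, sample in zip(counts, samples):
        totals = bulk[sample]
        for j, value in enumerate(row):
            totals[j] += value
    return dict(bulk)


# Three cells, three genes; cells 1 and 3 come from sample "s1".
cell_counts = [[1, 0, 2], [3, 1, 0], [0, 4, 1]]
cell_samples = ["s1", "s2", "s1"]
bulk_matrix = pseudobulk_sum(cell_counts, cell_samples)
print(bulk_matrix)
```

Each resulting row is one sample-level profile, which is the kind of input a DESeq2-style analysis expects.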
API Changes
Function correct_batch, which implements the L/S adjustment batch correction method, is obsolete. We recommend using run_harmony instead, which is also the default of --correct-batch-effect option in pegasus cluster command.
pegasus.highly_variable_features allows specifying a custom attribute key for batches (batch option), and thus removes the consider_batch option. To select HVGs without considering batch effects, simply use the default, or equivalently use batch=None option.
Add dist option to pegasus.neighbors function to allow using distances other than L2. (Contribution by hoondy in PR 233)
Available options: l2 for L2 (default), ip for inner product, and cosine for cosine similarity.
The kNN graph returned by pegasus.neighbors function is now stored in obsm field of the data object, no longer in uns field. Moreover, the kNN affinity matrix is stored in obsp field.
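For readers unfamiliar with the three measures named by the dist option, the following sketch computes each of them for a pair of vectors in plain Python. It only illustrates the underlying quantities; how a nearest-neighbor library converts inner product or cosine similarity into a distance is implementation-specific and is not shown here. All function names are hypothetical.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice the length
print(l2_distance(u, v), inner_product(u, v), cosine_similarity(u, v))
```

The two vectors point in the same direction, so their cosine similarity is 1 even though their L2 distance is not 0, which is why the choice of measure can change which cells count as neighbors.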
Improvements
Adjust pegasus.write_output function to work with Zarr v2.11.0+.
Version 1.4¶
1.4.5 January 24, 2022¶
Make several dependencies optional to meet different use cases.
Adjust to umap-learn v0.5.2 interface.
1.4.4 October 22, 2021¶
Use PegasusIO v0.4.0+ for data manipulation.
Add calculate_z_score function to calculate standardized z-scored count matrix.
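A z-scored matrix standardizes each feature to zero mean and unit variance. The sketch below shows that computation per column with the standard library. It assumes the population standard deviation and maps constant columns to zeros; neither choice is confirmed by the release note, and the helper z_score_columns is hypothetical rather than part of Pegasus.

```python
import statistics
from typing import List, Sequence


def z_score_columns(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Standardize each column of a cells-by-features matrix (hypothetical helper)."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation is an assumption of this sketch.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [
            (value - means[j]) / stds[j] if stds[j] > 0 else 0.0
            for j, value in enumerate(row)
        ]
        for row in matrix
    ]


# Column 0 varies across cells; column 1 is constant and maps to zeros.
expr = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
z = z_score_columns(expr)
for row in z:
    print([round(v, 3) for v in row])
```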
1.4.3 July 25, 2021¶
Allow run_harmony function to use GPU for computation.
1.4.2 July 19, 2021¶
Bug fix for --output-h5ad and --citeseq options in pegasus cluster command.
1.4.1 July 17, 2021¶
Add NMF-related options to pegasus cluster command.
Add word cloud graph plotting feature to pegasus plot command.
pegasus aggregate_matrix command now allows sample-specific filtration with parameters set in the input CSV-format sample sheet.
Update doublet detection method: infer_doublets and mark_doublets functions.
Bug fix.
1.4.0 June 24, 2021¶
Add nmf and integrative_nmf functions to compute NMF and iNMF using nmf-torch package; integrative_nmf supports quantile normalization proposed in the LIGER papers ([Welch19], [Gao21]).
Change the parameter defaults of function qc_metrics: Now all defaults are None, meaning not performing any filtration on cell barcodes.
In Annotate Clusters API functions:
Improve human immune cell markers and auto cell type assignment for human immune cells. (infer_cell_types function)
Update mouse brain cell markers. (infer_cell_types function)
annotate function now adds annotation as a categorical variable and sorts categories in natural order.
Add find_outlier_clusters function to detect if any cluster is an outlier regarding one of the qc attributes (n_genes, n_counts, percent_mito) using MWU test.
In Plotting API functions:
scatter function now plots all cells if attrs == None; Add fix_corners option to fix the four corners when only a subset of cells is shown.
Fix a bug in heatmap plotting function.
Fix bugs in functions spectral_leiden and spectral_louvain.
Improvements:
Support umap-learn v0.5+. (umap and net_umap functions)
Update doublet detection algorithm. (infer_doublets function)
Improve the message reporting execution time spent in each step. (Pegasus command line tool)
Version 1.3¶
1.3.0 February 2, 2021¶
Make PCA more reproducible. No need to keep options for robust PCA calculation:
In pca function, remove argument robust.
In infer_doublets function, remove argument robust.
In pegasus cluster command, remove option --pca-robust.
Add control on number of parallel threads for OpenMP/BLAS.
Now n_jobs = -1 means using all physical CPU cores instead of logical CPU cores.
Remove function reduce_diffmap_to_3d. In pegasus cluster command, remove option --diffmap-to-3d.
Enhance compo_plot and dotplot functions’ usability.
Bug fix.
Version 1.2¶
1.2.0 December 25, 2020¶
tSNE support:
tsne function in API: Use FIt-SNE for tSNE embedding calculation. No longer support MulticoreTSNE.
Determine learning_rate argument in tsne more dynamically. ([Belkina19], [Kobak19])
By default, use PCA embedding for initialization in tsne. ([Kobak19])
Remove net_tsne and fitsne functions from API.
Remove --net-tsne and --fitsne options from pegasus cluster command.
Add multimodal support on RNA and CITE-Seq data back: --citeseq, --citeseq-umap, and --citeseq-umap-exclude in pegasus cluster command.
Doublet detection:
Add automated doublet cutoff inference to infer_doublets function in API. ([Li20-2])
Expose doublet detection to command-line tool: --infer-doublets, --expected-doublet-rate, and --dbl-cluster-attr in pegasus cluster command.
Add doublet detection tutorial.
Allow multiple marker files used in cell type annotation: annotate function in API; --markers option in pegasus annotate_cluster command.
Rename pc_regress_out function in API to regress_out.
Update the regress out tutorial.
Bug fix.
Version 1.1¶
1.1.0 December 7, 2020¶
Improve doublet detection in Scrublet-like way using automatic threshold selection strategy: infer_doublets, and mark_doublets. Remove Scrublet from dependency, and remove run_scrublet function.
Enhance performance of log-normalization (log_norm) and signature score calculation (calc_signature_score).
In pegasus cluster command, add --genome option to specify reference genome name for input data of dge, csv, or loom format.
Update Regress out tutorial.
Add ridgeplot.
Improve plotting functions: heatmap, and dendrogram.
Bug fix.
Version 1.0¶
1.0.0 September 22, 2020¶
New features:
Use zarr file format to handle data, which has a better I/O performance in general.
Multi-modality support:
Data are manipulated in Multi-modal structure in memory.
Support focus analysis on Unimodal data, and appending other Unimodal data to it. (--focus and --append options in cluster command)
Calculate signature / gene module scores. (calc_signature_score)
Doublet detection based on Scrublet: run_scrublet, infer_doublets, and mark_doublets.
Principal-Component-level regress out. (pc_regress_out)
Batch correction using Scanorama. (run_scanorama)
Allow DE analysis with sample attribute as condition. (Set condition argument in de_analysis)
Use static plots to show results (see Plotting):
Provide static plots: composition plot, embedding plot (e.g. tSNE, UMAP, FLE, etc.), dot plot, feature plot, and volcano plot;
Add more gene-specific plots: dendrogram, heatmap, violin plot, quality-control violin, HVF plot.
Deprecations:
No longer support h5sc file format, which was the output format of aggregate_matrix command in Pegasus version 0.x.
Remove net_fitsne function.
API changes:
In cell quality-control, default percent of mitochondrial genes is changed from 10.0 to 20.0. (percent_mito argument in qc_metrics; --percent-mito option in cluster command)
Move gene quality-control out of filter_data function to be a separate step. (identify_robust_genes)
DE analysis now uses MWU test by default, not t test. (de_analysis)
infer_cell_types uses MWU test as the default de_test.
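To make the effect of the 1.0.0 default change from 10.0 to 20.0 concrete, the sketch below applies a mitochondrial-percentage cutoff to a single cell. The helper passes_mito_filter and the strict less-than comparison are assumptions for illustration only; they are not copied from qc_metrics.

```python
def passes_mito_filter(
    mito_counts: int, total_counts: int, percent_mito: float = 20.0
) -> bool:
    """Return True if the cell's mitochondrial percentage is below the cutoff.

    Hypothetical helper; the strict inequality is an assumption of this sketch.
    """
    if total_counts == 0:
        return False  # a cell with no counts cannot pass quality control here
    return 100.0 * mito_counts / total_counts < percent_mito


# A cell with 15% mitochondrial UMIs is kept under the 20.0 cutoff
# but would have been removed under the earlier 10.0 cutoff.
kept_new_default = passes_mito_filter(150, 1000)
kept_old_default = passes_mito_filter(150, 1000, percent_mito=10.0)
print(kept_new_default, kept_old_default)
```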
Performance improvement:
Speed up MWU test in DE analysis, which is inspired by Presto.
Integrate Fisher’s exact test via Cython in DE analysis to improve speed.
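For reference, the statistic behind the MWU test mentioned above can be written directly from its definition. The sketch below is the naive pairwise version, quadratic in the number of observations, and is meant only to show what is being computed; it does not reproduce the Presto-inspired speed-up, and the function name is hypothetical.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x: count pairs (xi, yj) with xi > yj, ties count as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Expression of one gene in two groups of cells.
group_a = [3, 5, 8]
group_b = [1, 4, 5]
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two statistics always sum to the number of pairs, len(group_a) * len(group_b), which is a convenient sanity check for any faster implementation.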
Other highlights:
Make I/O and count matrices aggregation a dedicated package PegasusIO.
Tutorials:
Update Analysis tutorial;
Add 3 more tutorials: one on plotting library, one on batch correction and data integration, and one on regress out.
Version 0.x¶
0.17.2 June 26, 2020¶
Make Pegasus compatible with umap-learn v0.4+.
Use louvain 0.7+ for Louvain clustering.
Update tutorial.
0.17.1 April 6, 2020¶
Improve pegasus command-line tool log.
Add human lung markers.
Improve log-normalization speed.
Provide robust version of PCA calculation as an option.
Add signature score calculation API.
Fix bugs.
0.17.0 March 10, 2020¶
Support anndata 0.7 and pandas 1.0.
Better loom format output writing function.
Bug fix on mtx format output writing function.
Update human immune cell markers.
Improve pegasus scp_output command.
0.16.11 February 28, 2020¶
Add --remap-singlets and --subset-singlets options to ‘cluster’ command.
Allow reading loom file with user-specified batch key and black list.
0.16.9 February 17, 2020¶
Allow reading h5ad file with user-specified batch key.
0.16.8 January 30, 2020¶
Allow input annotated loom file.
0.16.7 January 28, 2020¶
Allow input mtx files of more filename formats.
0.16.5 January 23, 2020¶
Add Harmony algorithm for data integration.
0.16.3 December 17, 2019¶
Add support for loading mtx files generated from BUStools.
0.16.2 December 8, 2019¶
Fix bug in ‘subcluster’ command.
0.16.1 December 4, 2019¶
Fix one bug in clustering pipeline.
0.16.0 December 3, 2019¶
Change options in ‘aggregate_matrix’ command: remove ‘--google-cloud’, add ‘--default-reference’.
Fix bug in ‘--annotation’ option of ‘annotate_cluster’ command.
Fix bug in ‘net_fle’ function with 3-dimension coordinates.
Use fisher package version 0.1.9 or above, as modifications in our forked fisher-modified package have been merged into it.
0.15.0 October 2, 2019¶
Rename package to PegasusPy, with module name pegasus.
0.14.0 September 17, 2019¶
Provide Python API for interactive analysis.
0.10.0 January 31, 2019¶
Added ‘find_markers’ command to find markers using LightGBM.
Improved file loading speed and enabled the parsing of channels from barcode strings for cellranger aggregated h5 files.
0.9.0 January 17, 2019¶
In ‘cluster’ command, changed ‘--output-seurat-compatible’ to ‘--make-output-seurat-compatible’. Do not generate output_name.seurat.h5ad. Instead, output_name.h5ad should be able to convert to a Seurat object directly. In the seurat object, raw.data slot refers to the filtered count data, data slot refers to the log-normalized expression data, and scale.data refers to the variable-gene-selected, scaled data.
In ‘cluster’ command, added ‘--min-umis’ and ‘--max-umis’ options to filter cells based on UMI counts.
In ‘cluster’ command, ‘--output-filtration-results’ option does not require a spreadsheet name anymore. In addition, added more statistics such as median number of genes per cell in the spreadsheet.
In ‘cluster’ command, added ‘--plot-filtration-results’ and ‘--plot-filtration-figsize’ to support plotting filtration results. Improved documentation on ‘cluster command’ outputs.
Added ‘parquet’ command to transfer h5ad file into a parquet file for web-based interactive visualization.
0.8.0 November 26, 2018¶
Added support for checking index collision for CITE-Seq/hashing experiments.
0.7.0 October 26, 2018¶
Added support for CITE-Seq analysis.
0.4.0 August 2, 2018¶
Added mouse brain markers.
Allow aggregate matrix to take ‘Sample’ as attribute.
0.3.0 June 26, 2018¶
scrtools supports fast preprocessing, batch-correction, dimension reduction, graph-based clustering, diffusion maps, force-directed layouts, differential expression analysis, cluster annotation, and plotting.
References¶
- Belkina19
A. C. Belkina, C. O. Ciccolella, R. Anno, R. Halpert, J. Spidlen, and J. E. Snyder-Cappione, “Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets”, In Nature Communications, 2019.
- Blondel08
V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks”, In Journal of Statistical Mechanics: Theory and Experiment, 2008.
- Büttner18
M. Büttner, et al., “A test metric for assessing single-cell RNA-seq batch correction”, In Nature Methods, 2018.
- Gao21
Ch. Gao, et al., “Iterative single-cell multi-omic integration using online learning”, In Nature Biotechnology, 2021.
- Hie19
B. Hie, B. Bryson, and B. Berger, “Efficient integration of heterogeneous single-cell transcriptomes using Scanorama”, In Nature Biotechnology, 2019.
- Jacomy14
M. Jacomy, T. Venturini, S. Heymann, and M. Bastian, “ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software”, In PLoS ONE, 2014.
- Kobak19
D. Kobak, and P. Berens, “The art of using t-SNE for single-cell transcriptomics”, In Nature Communications, 2019.
- Korsunsky19
I. Korsunsky, et al., “Fast, sensitive and accurate integration of single-cell data with Harmony”, In Nature Methods, 2019.
- Li20
B. Li, J. Gould, Y. Yang, et al., “Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq”, In Nature Methods, 2020.
- Li20-1
B. Li, “A streamlined method for signature score calculation”, In Pegasus repository, 2020.
- Li20-2
B. Li, “Doublet detection in Pegasus”, In Pegasus repository, 2020.
- Linderman19
G.C. Linderman, et al., “Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data”, In Nature Methods, 2019.
- Jerby-Arnon18
L. Jerby-Arnon, et al., “A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade”, In Cell, 2018.
- Malkov16
Yu. A. Malkov, and D.A. Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs”, In arXiv, 2016.
- McInnes18
L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction”, Preprint at arXiv, 2018.
- Traag19
V. A. Traag, L. Waltman, and N. J. van Eck, “From Louvain to Leiden: guaranteeing well-connected communities”, In Scientific Reports, 2019.
- Welch19
J. D. Welch, et al., “Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity”, In Cell, 2019.
- Wolock18
S.L. Wolock, R. Lopez, and A.M. Klein, “Scrublet: computational identification of cell doublets in single-cell transcriptomic Data”, In Cell Systems, 2018.
- Yang16
Z. Yang, and G. Michailidis, “A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data”, In Bioinformatics, 2016.
Contributors¶
Besides the Pegasus team, we would like to sincerely give our thanks to the following contributors to this project:
Name | Commits | Date | Note |
---|---|---|---|
Tariq Daouda | 774c2cc | 2019/10/17 | Decorator for time logging and garbage collection. |
Contact us¶
If you have any questions related to Pegasus, please feel free to contact us via Cumulus Support Google Group.