Release Notes
Note
Also see the release notes of PegasusIO.
Version 1.10
1.10.1 February 10, 2025
1.10.0 June 12, 2024
New Features
Add pegasus.pseudo.get_original_DE_result function to return the DE result as a Pandas DataFrame.
Implement a new version of Dendrogram (PR 295):
It now uses Connection Specific Index (CSI) for the distance calculation.
A new function calc_dendrogram to calculate the linkage, and store it in data.uns field.
A new function plot_dendrogram to plot the dendrogram based on the linkage calculated.
For deseq2 function (PR 300, 304):
It now has 2 backends: pydeseq2 by default, which uses Python package PyDESeq2; deseq2 for R package DESeq2. Specify it in backend parameter.
Add compute_all parameter to decide whether to apply DE analysis to all count matrices of the input pseudobulk data object, or only to the default count matrix.
Add alpha parameter to allow users to choose the significance level for independent filtering to calculate adjusted p-values.
For pydeseq2 backend, it accepts inference on multiple contrasts in the same model.
Reorganize GSEA functions (PR 305):
Rename fgsea function to gsea. The new gsea function accepts two methods: gseapy to use GSEAPy’s prerank function, which is the default; fgsea to use R package fgsea.
Use parameter rank_key to specify which attribute in the DE result to be used as gene ranks/signatures.
The number of threads n_jobs is 4 by default.
Rename write_fgsea_results_to_excel function to write_gsea_results_to_excel.
Improvement
Add label_fontsize parameter to plot_gsea function to allow changing the label font size in GSEA plots.
Improve scatter function:
Add aspect parameter. By default it’s aspect=auto to enforce square plots; set to aspect=equal for the actual y to x axis ratio.
Improve spatial function (PR 294):
Now accepts non-Visium data, which may not have spatial images stored in data.img field.
Add parameter aspect. Default is aspect=equal to plot the actual y to x axis ratio; set to aspect=auto to enforce square plots.
Add parameter margin_percent to set the image margin to the 4 sides.
Add parameters restrictions, show_background and palettes, which have the same behaviors as in scatter function.
Add parameter y_flip with default True. This is for the case where y coordinates start from top, which needs a flip. Set to False if the spatial y-coordinates start from bottom.
When plotting Visium data with no image background: set resolution=None and y_flip=True.
For pseudobulk function (PR 300): rename parameters sample to groupby, and cluster to condition; it returns a MultimodalData object, and does not write to the input data object.
For pegasus.pseudo.volcano function, add rank_by parameter to rank genes by log2 fold change (log2fc) or -log10(p-value) (neglog10p). This affects the top gene names shown in the plots.
Bug fix in plotting functions to work with Matplotlib v3.9+.
Add bimodality index to doublet detection (PR 307).
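The two rank_by choices can order genes differently, which is why the top gene names in a volcano plot may change. The following is a minimal sketch in plain Python, not Pegasus code: the gene records are invented toy values, the helper top_genes is hypothetical, and ranking log2fc by absolute value is an assumption made for illustration.

```python
import math

# Toy DE records (invented values, for illustration only).
de_results = [
    {"gene": "A", "log2fc": 3.0, "pval": 1e-2},
    {"gene": "B", "log2fc": -1.0, "pval": 1e-8},
    {"gene": "C", "log2fc": 0.5, "pval": 1e-4},
]


def top_genes(records, rank_by="log2fc", n=2):
    """Return the n top gene names under the chosen ranking (hypothetical helper)."""
    if rank_by == "log2fc":
        # Assumption: rank by magnitude of the fold change, up or down.
        key = lambda r: abs(r["log2fc"])
    elif rank_by == "neglog10p":
        # Smaller p-values give larger -log10(p).
        key = lambda r: -math.log10(r["pval"])
    else:
        raise ValueError(f"unknown rank_by: {rank_by}")
    ranked = sorted(records, key=key, reverse=True)
    return [r["gene"] for r in ranked[:n]]


print(top_genes(de_results, "log2fc"))     # ['A', 'B']
print(top_genes(de_results, "neglog10p"))  # ['B', 'C']
```

Gene C has a small effect but a strong p-value, so it is only labelled under the neglog10p ranking.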
Version 1.9
1.9.1 March 16, 2024
New Feature
Add write_fgsea_results_to_excel function (see documentation)
Improvement
For dotplot function, add show_only_expressed parameter to decide whether the color intensity of dots is based on cells expressing the genes to show or all cells. By default, show_only_expressed=True. (PR 292)
Allow scatter and spatial functions to have a list of vmin and vmax values when plotting multiple features.
For scatter, spatial, dotplot, violin functions, when some features are not in the data, emit a warning message and continue with features existing in the data.
In spatial function, add nrows and ncols to organize subplots.
In calculate_z_score function, enforce the input count matrix to be dense or scipy.csr_matrix.
In plot_gsea function, add label_fontsize parameter to allow changing the label font size in GSEA plots.
1.9.0 January 19, 2024
New Feature and Improvement
calculate_z_score works with sparse count matrix. [PR 276 Thanks to Jayaram Kancherla]
Plotting functions (scatter, dotplot, violin, heatmap) now give warnings on genes/attributes not existing in the data, and skip them in the plots.
Improve heatmap:
Add show_sample_name parameter for cases of pseudo-bulk data, nanoString DSP data, etc.
Use Scipy’s linkage (scipy.cluster.hierarchy.linkage) for dendrograms to use its optimal ordering feature for better results (see groupby_optimal_ordering parameter).
Update human lung and mouse immune markers used by infer_cell_types function.
run_harmony can accept multiple attributes to be the batch key, by providing a list of attribute names to its batch parameter.
Expose online_batch_size parameter in nmf and integrative_nmf functions.
Version 1.8
1.8.1 August 23, 2023
Bug fix in cell marker JSON files for infer_cell_types function.
1.8.0 July 21, 2023
New Feature and Improvement
Update human_immune and human_lung marker sets.
Add mouse_liver marker set.
Add split_one_cluster function to subcluster one cluster into a specified number of subclusters.
Update neighbors function to set use_cache=False by default, and adjust K to min(K, int(sqrt(n_samples))). [PR 272]
In infer_doublets function, argument manual_correction now accepts a float number threshold specified by users for cut-off. [PR 275]
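The K adjustment in the neighbors function follows directly from the formula above. A minimal sketch of that arithmetic in plain Python; the helper name adjusted_k is hypothetical and is not part of the Pegasus API.

```python
import math


def adjusted_k(K, n_samples):
    """Cap the requested number of neighbors at int(sqrt(n_samples))."""
    return min(K, int(math.sqrt(n_samples)))


# Small dataset: the cap applies (sqrt(2500) = 50).
print(adjusted_k(100, 2500))   # 50
# Larger dataset: the requested K is kept (sqrt(40000) = 200).
print(adjusted_k(100, 40000))  # 100
```

The cap only matters for small datasets, where a large K would otherwise connect each cell to a sizeable fraction of all cells.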
Bug Fix
Fix divide by zero issue in integrative_nmf function. [PR 258]
Compatibility with Pandas v2.0. [PR 261]
Allow infer_doublets to use any count matrix with key name specified by users. [PR 268 Thanks to Donghoon Lee]
Version 1.7
1.7.1 July 29, 2022
1.7.0 July 5, 2022
New Features
Add pegasus.elbowplot function to generate elbowplot, with an automated suggestion on number of PCs to be selected based on random matrix theory ([Johnstone 2001] and [Shekhar 2022]).
Add arcsinh_transform function for arcsinh transformation on the count matrix.
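For readers unfamiliar with the arcsinh transformation, here is a minimal sketch of the underlying math in plain Python. It is not the Pegasus implementation: the helper name arcsinh_rows and the cofactor parameter (a scaling divisor commonly used with this transform) are assumptions for illustration.

```python
import math


def arcsinh_rows(counts, cofactor=1.0):
    """Apply asinh(x / cofactor) to every entry of a list-of-lists matrix.

    `cofactor` is a hypothetical scaling parameter, not a confirmed Pegasus argument.
    """
    return [[math.asinh(x / cofactor) for x in row] for row in counts]


counts = [[0.0, 1.0], [10.0, 1000.0]]
transformed = arcsinh_rows(counts)

# Zero stays zero, unlike log(x), which is undefined at 0.
print(transformed[0][0])
# For large x, asinh(x) is close to log(2x).
print(transformed[1][1], math.log(2 * 1000.0))
```

The transform is roughly linear near zero and roughly logarithmic for large values, so it compresses high counts without needing a pseudocount.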
Improvement
Function nearest_neighbors has an additional argument n_comps to allow using part of the components from the source embedding for calculating the nearest neighbor graph.
Add n_comps argument for run_harmony, tsne, umap, and fle (argument name is rep_ncomps) functions to allow selecting part of the components from the source embedding.
Function scatter can plot multiple components and bases.
Version 1.6
1.6.0 April 16, 2022
New Features
Add support for scVI-tools:
Function pegasus.run_scvi, which is a wrapper of scVI for data integration.
Add a dedicated section for scVI method in Pegasus batch correction tutorial.
Function pegasus.train_scarches_scanvi and pegasus.predict_scarches_scanvi to wrap scArches for transfer learning on single-cell data labels.
Version 1.5
1.5.0 March 9, 2022
New Features
Spatial data analysis:
Enable pegasus.read_input function to load 10x Visium data: set file_type="visium" option.
Add pegasus.spatial function to generate spatial plot for 10x Visium data.
Add Spatial Analysis Tutorial in Tutorials.
Pseudobulk analysis: see summary
Add pegasus.pseudobulk function to generate pseudobulk matrix.
Add pegasus.deseq2 function to perform pseudobulk differential expression (DE) analysis, which is a Python wrapper of DESeq2.
Requires rpy2 and the original DESeq2 R package installed.
Add pegasus.pseudo.markers, pegasus.pseudo.write_results_to_excel and pegasus.pseudo.volcano functions for processing pseudobulk DE results.
Add Pseudobulk Analysis Tutorial in Tutorials.
Add pegasus.fgsea function to perform Gene Set Enrichment Analysis (GSEA) on DE results and plotting, which is a Python wrapper of fgsea.
Requires rpy2 and the original fgsea R package installed.
API Changes
Function correct_batch, which implements the L/S adjustment batch correction method, is obsolete. We recommend using run_harmony instead, which is also the default of --correct-batch-effect option in pegasus cluster command.
pegasus.highly_variable_features allows specifying a custom attribute key for batches (batch option), and thus the consider_batch option is removed. To select HVGs without considering batch effects, simply use the default, or equivalently use batch=None option.
Add dist option to pegasus.neighbors function to allow using distances other than L2. (Contribution by hoondy in PR 233)
Available options: l2 for L2 (default), ip for inner product, and cosine for cosine similarity.
The kNN graph returned by pegasus.neighbors function is now stored in obsm field of the data object, no longer in uns field. Moreover, the kNN affinity matrix is stored in obsp field.
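The three measures named by the dist option (l2, ip, cosine) can be compared on a pair of vectors. This is a minimal sketch of the textbook definitions in plain Python with hypothetical helper names; how pegasus.neighbors or its underlying kNN library turns inner product or cosine similarity into a neighbor ranking is not shown here.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance: smaller means closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def inner_product(u, v):
    """Inner product: larger means more similar, and it grows with vector length."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product of the length-normalized vectors, in [-1, 1]."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    return inner_product(u, v) / (norm_u * norm_v)


u, v = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(u, v))        # sqrt(2)
print(inner_product(u, v))      # 24.0
print(cosine_similarity(u, v))  # 24 / 25
```

Scaling v by a constant changes the L2 distance and the inner product but leaves the cosine similarity unchanged, which is the main practical difference between the options.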
Improvements
Adjust pegasus.write_output function to work with Zarr v2.11.0+.
Version 1.4
1.4.5 January 24, 2022
Make several dependencies optional to meet different use cases.
Adjust to umap-learn v0.5.2 interface.
1.4.4 October 22, 2021
Use PegasusIO v0.4.0+ for data manipulation.
Add calculate_z_score function to calculate standardized z-scored count matrix.
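As a reminder of what z-score standardization does, here is a minimal sketch for a single feature in plain Python. It is not the Pegasus implementation: the helper name z_scores is hypothetical, and using the population standard deviation and returning zeros for a constant feature are assumptions.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and unit standard deviation.

    Assumption: population standard deviation; a constant feature maps to zeros.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Toy expression values for one gene across eight cells (mean 5, SD 2).
expression = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(expression))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After standardization, genes with very different expression ranges become comparable on one scale, which is useful for heatmaps and similar displays.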
1.4.3 July 25, 2021
Allow run_harmony function to use GPU for computation.
1.4.2 July 19, 2021
Bug fix for --output-h5ad and --citeseq options in pegasus cluster command.
1.4.1 July 17, 2021
Add NMF-related options to pegasus cluster command.
Add word cloud graph plotting feature to pegasus plot command.
pegasus aggregate_matrix command now allows sample-specific filtration with parameters set in the input CSV-format sample sheet.
Update doublet detection method: infer_doublets and mark_doublets functions.
Bug fix.
1.4.0 June 24, 2021
Add nmf and integrative_nmf functions to compute NMF and iNMF using nmf-torch package; integrative_nmf supports quantile normalization proposed in the LIGER papers ([Welch19], [Gao21]).
Change the parameter defaults of function qc_metrics: Now all defaults are None, meaning not performing any filtration on cell barcodes.
In Annotate Clusters API functions:
Improve human immune cell markers and auto cell type assignment for human immune cells. (infer_cell_types function)
Update mouse brain cell markers (infer_cell_types function)
annotate function now adds annotation as a categorical variable and sorts categories in natural order.
Add find_outlier_clusters function to detect if any cluster is an outlier regarding one of the qc attributes (n_genes, n_counts, percent_mito) using MWU test.
In Plotting API functions:
scatter function now plots all cells if attrs==None; Add fix_corners option to fix the four corners when only a subset of cells is shown.
Fix a bug in heatmap plotting function.
Fix bugs in functions spectral_leiden and spectral_louvain.
Improvements:
Support umap-learn v0.5+. (umap and net_umap functions)
Update doublet detection algorithm. (infer_doublets function)
Improve the message reporting execution time spent in each step. (Pegasus command line tool)
Version 1.3
1.3.0 February 2, 2021
Make PCA more reproducible. No need to keep options for robust PCA calculation:
In pca function, remove argument robust.
In infer_doublets function, remove argument robust.
In pegasus cluster command, remove option --pca-robust.
Add control on number of parallel threads for OpenMP/BLAS.
Now n_jobs = -1 refers to using all physical CPU cores instead of logical CPU cores.
Remove function reduce_diffmap_to_3d. In pegasus cluster command, remove option --diffmap-to-3d.
Enhance compo_plot and dotplot functions’ usability.
Bug fix.
Version 1.2
1.2.0 December 25, 2020
tSNE support:
tsne function in API: Use FIt-SNE for tSNE embedding calculation. No longer support MulticoreTSNE.
Determine learning_rate argument in tsne more dynamically. ([Belkina19], [Kobak19])
By default, use PCA embedding for initialization in tsne. ([Kobak19])
Remove net_tsne and fitsne functions from API.
Remove --net-tsne and --fitsne options from pegasus cluster command.
Add multimodal support on RNA and CITE-Seq data back: --citeseq, --citeseq-umap, and --citeseq-umap-exclude in pegasus cluster command.
Doublet detection:
Add automated doublet cutoff inference to infer_doublets function in API. ([Li20-2])
Expose doublet detection to command-line tool: --infer-doublets, --expected-doublet-rate, and --dbl-cluster-attr in pegasus cluster command.
Add doublet detection tutorial.
Allow multiple marker files used in cell type annotation: annotate function in API; --markers option in pegasus annotate_cluster command.
Rename pc_regress_out function in API to regress_out.
Update the regress out tutorial.
Bug fix.
Version 1.1
1.1.0 December 7, 2020
Improve doublet detection in Scrublet-like way using automatic threshold selection strategy: infer_doublets, and mark_doublets. Remove Scrublet from dependency, and remove run_scrublet function.
Enhance performance of log-normalization (log_norm) and signature score calculation (calc_signature_score).
In pegasus cluster command, add --genome option to specify reference genome name for input data of dge, csv, or loom format.
Update Regress out tutorial.
Add ridgeplot.
Improve plotting functions: heatmap, and dendrogram.
Bug fix.
Version 1.0
1.0.0 September 22, 2020
New features:
Use zarr file format to handle data, which has a better I/O performance in general.
Multi-modality support:
Data are manipulated in Multi-modal structure in memory.
Support focus analysis on Unimodal data, and appending other Unimodal data to it. (--focus and --append options in cluster command)
Calculate signature / gene module scores. (calc_signature_score)
Doublet detection based on Scrublet: run_scrublet, infer_doublets, and mark_doublets.
Principal-Component-level regress out. (pc_regress_out)
Batch correction using Scanorama. (run_scanorama)
Allow DE analysis with sample attribute as condition. (Set condition argument in de_analysis)
Use static plots to show results (see Plotting):
Provide static plots: composition plot, embedding plot (e.g. tSNE, UMAP, FLE, etc.), dot plot, feature plot, and volcano plot;
Add more gene-specific plots: dendrogram, heatmap, violin plot, quality-control violin, HVF plot.
Deprecations:
No longer support h5sc file format, which was the output format of aggregate_matrix command in Pegasus version 0.x.
Remove net_fitsne function.
API changes:
In cell quality-control, default percent of mitochondrial genes is changed from 10.0 to 20.0. (percent_mito argument in qc_metrics; --percent-mito option in cluster command)
Move gene quality-control out of filter_data function to be a separate step. (identify_robust_genes)
DE analysis now uses MWU test by default, not t test. (de_analysis)
infer_cell_types uses MWU test as the default de_test.
Performance improvement:
Speed up MWU test in DE analysis, which is inspired by Presto.
Integrate Fisher’s exact test via Cython in DE analysis to improve speed.
Other highlights:
Make I/O and count matrices aggregation a dedicated package PegasusIO.
Tutorials:
Update Analysis tutorial;
Add 3 more tutorials: one on plotting library, one on batch correction and data integration, and one on regress out.
Version 0.x
0.17.2 June 26, 2020
Make Pegasus compatible with umap-learn v0.4+.
Use louvain 0.7+ for Louvain clustering.
Update tutorial.
0.17.1 April 6, 2020
Improve pegasus command-line tool log.
Add human lung markers.
Improve log-normalization speed.
Provide robust version of PCA calculation as an option.
Add signature score calculation API.
Fix bugs.
0.17.0 March 10, 2020
Support anndata 0.7 and pandas 1.0.
Better loom format output writing function.
Bug fix on mtx format output writing function.
Update human immune cell markers.
Improve pegasus scp_output command.
0.16.11 February 28, 2020
Add --remap-singlets and --subset-singlets options to ‘cluster’ command.
Allow reading loom file with user-specified batch key and black list.
0.16.9 February 17, 2020
Allow reading h5ad file with user-specified batch key.
0.16.8 January 30, 2020
Allow input annotated loom file.
0.16.7 January 28, 2020
Allow input mtx files of more filename formats.
0.16.5 January 23, 2020
Add Harmony algorithm for data integration.
0.16.3 December 17, 2019
Add support for loading mtx files generated from BUStools.
0.16.2 December 8, 2019
Fix bug in ‘subcluster’ command.
0.16.1 December 4, 2019
Fix one bug in clustering pipeline.
0.16.0 December 3, 2019
Change options in ‘aggregate_matrix’ command: remove ‘--google-cloud’, add ‘--default-reference’.
Fix bug in ‘--annotation’ option of ‘annotate_cluster’ command.
Fix bug in ‘net_fle’ function with 3-dimension coordinates.
Use fisher package version 0.1.9 or above, as modifications in our forked fisher-modified package has been merged into it.
0.15.0 October 2, 2019
Rename package to PegasusPy, with module name pegasus.
0.14.0 September 17, 2019
Provide Python API for interactive analysis.
0.10.0 January 31, 2019
Added ‘find_markers’ command to find markers using LightGBM.
Improved file loading speed and enabled the parsing of channels from barcode strings for cellranger aggregated h5 files.
0.9.0 January 17, 2019
In ‘cluster’ command, changed ‘--output-seurat-compatible’ to ‘--make-output-seurat-compatible’. Do not generate output_name.seurat.h5ad. Instead, output_name.h5ad should be able to convert to a Seurat object directly. In the Seurat object, raw.data slot refers to the filtered count data, data slot refers to the log-normalized expression data, and scale.data refers to the variable-gene-selected, scaled data.
In ‘cluster’ command, added ‘--min-umis’ and ‘--max-umis’ options to filter cells based on UMI counts.
In ‘cluster’ command, ‘--output-filtration-results’ option does not require a spreadsheet name anymore. In addition, added more statistics such as median number of genes per cell in the spreadsheet.
In ‘cluster’ command, added ‘--plot-filtration-results’ and ‘--plot-filtration-figsize’ to support plotting filtration results. Improved documentation on ‘cluster command’ outputs.
Added ‘parquet’ command to transfer h5ad file into a parquet file for web-based interactive visualization.
0.8.0 November 26, 2018
Added support for checking index collision for CITE-Seq/hashing experiments.
0.7.0 October 26, 2018
Added support for CITE-Seq analysis.
0.4.0 August 2, 2018
Added mouse brain markers.
Allow aggregate matrix to take ‘Sample’ as attribute.
0.3.0 June 26, 2018
scrtools supports fast preprocessing, batch correction, dimension reduction, graph-based clustering, diffusion maps, force-directed layouts, differential expression analysis, cluster annotation, and plotting.