Pegasus Plotting Tutorial

Author: Hui Ma, Yiming Yang
Date: 2021-06-24
Notebook Source: plotting_tutorial.ipynb

For this plotting tutorial, we provide an analysis result of a gene-count matrix dataset on human bone marrow from 8 donors. You can get the data from https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip, or use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip).

After downloading, load the file using the Pegasus read_input function.
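As a minimal sketch of these two steps, the helpers below download the file over HTTPS with the Python standard library and then hand it to read_input. The helper names (local_filename, download, load) are hypothetical and not part of Pegasus. The Pegasus import is deferred so that the other helpers work without Pegasus installed.

```python
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse

# URL quoted from the tutorial text.
DATA_URL = (
    "https://storage.googleapis.com/terra-featured-workspaces/"
    "Cumulus/MantonBM_result.zarr.zip"
)


def local_filename(url: str) -> str:
    """Return the last path component of a URL, used as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download(url: str = DATA_URL) -> str:
    """Download the file into the current directory (hypothetical helper)."""
    dest = local_filename(url)
    urllib.request.urlretrieve(url, dest)
    return dest


def load(path: str):
    """Load the downloaded file with the Pegasus read_input function."""
    import pegasus as pg  # assumes Pegasus is installed

    return pg.read_input(path)


print(local_filename(DATA_URL))
```

Calling load(download()) would fetch the file and return the loaded data object. Neither function runs until you call it.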

In the following sections, we'll cover Pegasus plotting functions using this dataset. Moreover, for gene plots, the canonical gene markers below will be used:

QC Violin Plot

The first step in preprocessing is to perform the quality control analysis, and remove cells and genes of low quality.

pg.qcviolin shows the effect of quality control more intuitively by presenting the violin plot of cell distribution before and after filtration.

plot_type='gene' shows the number of expressed genes per cell before and after filtration.

Quality control statistics on the percentage of mitochondrial genes:

The number of UMIs before and after filtration is also an important aspect of quality control.
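To make the before/after comparison concrete, here is a small sketch of threshold-based cell filtering on the three statistics discussed above. The cell records and thresholds are hypothetical values chosen for illustration. They are not the settings used to produce this dataset.

```python
from statistics import median

# Hypothetical per-cell QC statistics; real values come from the dataset.
cells = [
    {"barcode": "c1", "n_genes": 1500, "n_counts": 5200, "percent_mito": 3.1},
    {"barcode": "c2", "n_genes": 120, "n_counts": 300, "percent_mito": 1.0},
    {"barcode": "c3", "n_genes": 2300, "n_counts": 9100, "percent_mito": 25.4},
    {"barcode": "c4", "n_genes": 900, "n_counts": 2800, "percent_mito": 6.7},
]

# Hypothetical thresholds, for illustration only.
MIN_GENES, MAX_GENES, MAX_PERCENT_MITO = 500, 6000, 20.0


def passes_qc(cell: dict) -> bool:
    """Keep cells with enough expressed genes and a low mitochondrial rate."""
    return (MIN_GENES <= cell["n_genes"] < MAX_GENES
            and cell["percent_mito"] < MAX_PERCENT_MITO)


kept = [c for c in cells if passes_qc(c)]

# Compare each statistic before and after filtration.
for stat in ("n_genes", "n_counts", "percent_mito"):
    before = median(c[stat] for c in cells)
    after = median(c[stat] for c in kept)
    print(f"{stat}: median before={before}, after={after}")
```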

Highly Variable Feature Plot

Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.

Pegasus provides the hvfplot function to generate a scatterplot of genes after HVG selection. This plot only works for Pegasus-flavor HVGs (i.e., flavor='pegasus' in the Pegasus highly_variable_features function).

After selecting 2000 HVGs using the Pegasus selection method, the plot below is generated. Each point stands for one gene. Blue points are the genes selected as HVGs, which account for the majority of the variation in the dataset. By default, the plot prints labels for the top 20 HVGs. You can change this number with the top_n parameter.
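The idea of ranking genes by variability and keeping the top n can be sketched with a deliberately simplified criterion. The code below ranks genes by their variance-to-mean ratio. This is a stand-in for illustration, not the Pegasus selection method, and the gene names and values are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical expression values (one list of per-cell values per gene).
expression = {
    "GENE_A": [0.0, 5.0, 0.0, 6.0, 0.0, 7.0],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "GENE_C": [0.0, 0.0, 4.0, 0.0, 0.0, 3.0],
    "GENE_D": [1.0, 1.0, 1.1, 0.9, 1.0, 1.0],
}


def dispersion(values):
    """Variance-to-mean ratio; 0 for genes with no expression."""
    mu = mean(values)
    return pvariance(values) / mu if mu > 0 else 0.0


def top_variable_genes(expr, n):
    """Return the n genes with the largest dispersion, most variable first."""
    ranked = sorted(expr, key=lambda g: dispersion(expr[g]), reverse=True)
    return ranked[:n]


selected = top_variable_genes(expression, n=2)
print(selected)
```

Genes with nearly constant expression (GENE_B and GENE_D here) rank last, which matches the intuition that they carry little information for separating cell types.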

Composition Plot

A composition plot is a bar plot showing the cell composition (under different conditions) of each cluster. The plot below shows the composition of different samples in each Louvain cluster:

Composition plots are useful for quickly assessing library quality and batch effects.
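The quantity behind each bar can be sketched in a few lines: for every cluster, count the cells per sample and divide by the cluster size. The labels below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster, sample) labels, one pair per cell.
cells = [
    ("1", "donor1"), ("1", "donor1"), ("1", "donor2"), ("1", "donor2"),
    ("2", "donor1"), ("2", "donor2"), ("2", "donor2"), ("2", "donor2"),
]


def composition(pairs):
    """Return {cluster: {sample: fraction of the cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, sample in pairs:
        counts[cluster][sample] += 1
    return {
        cluster: {s: n / sum(c.values()) for s, n in c.items()}
        for cluster, c in counts.items()
    }


fractions = composition(cells)
for cluster, parts in fractions.items():
    print(cluster, parts)
```

A cluster dominated by a single sample, such as cluster 2 in this toy input, is the kind of pattern that may point to a batch effect.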

Scatter Plot

Scatter plot requires at least 2 parameters: the cell attributes to plot (attrs) and the cell embedding on which to plot them (basis).

For this demonstration, we select annotation and channel as data attributes, and tsne as basis.

Cell Visualization

To visualize the data, users first need to run visualization functions to calculate the corresponding cell embeddings. See the Pegasus documentation for all visualization functions in Pegasus.

tSNE Plot. Pegasus uses the FIt-SNE algorithm for tSNE plots, which works better on large-scale datasets:

You can show both cluster-specific cell types (anno) and samples (Channel) in one function call.

Note that the legend of the left plot is not fully displayed. We need to adjust the horizontal distance between the two plots by changing "wspace".

"wspace" is the ratio of the horizontal distance between plots to the width of a plot. For example, the default value of wspace is 0.4, which means the gap is 0.4 times the width of a plot.
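The arithmetic is simple enough to check by hand. The sketch below assumes a hypothetical panel width of 4 inches and compares the default wspace of 0.4 with a larger, arbitrarily chosen value of 1.2.

```python
def gap_width(panel_width: float, wspace: float) -> float:
    """Horizontal gap between neighbouring panels, in the unit of panel_width."""
    return wspace * panel_width


def row_width(panel_width: float, wspace: float, ncols: int) -> float:
    """Total width of ncols panels plus the gaps between them."""
    return ncols * panel_width + (ncols - 1) * gap_width(panel_width, wspace)


# Hypothetical panel width of 4 inches, two panels side by side.
for wspace in (0.4, 1.2):
    print(wspace, gap_width(4.0, wspace), row_width(4.0, wspace, ncols=2))
```

A larger wspace leaves more room for a legend placed to the right of the left panel, at the cost of a wider figure.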

There are also other optional parameters. Commonly used ones include alpha (point transparency) and dpi (figure resolution).

In the plots below, the left one has alpha=1.0, while the right one has alpha=0.1.

Plots with lower dpi are smaller and have lower resolution.
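The relationship between dpi and image size is the usual one: pixels = inches × dpi. A short sketch, assuming a hypothetical 4 × 4 inch panel:

```python
def pixel_size(width_in: float, height_in: float, dpi: float) -> tuple:
    """Pixel dimensions of a figure of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical 4 x 4 inch panel rendered at two resolutions.
for dpi in (100, 300):
    print(dpi, pixel_size(4, 4, dpi))
```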

UMAP Plot. Pegasus also provides 2 different kinds of UMAP plots: umap and net_umap.

The embedding method can be changed through the basis parameter.

Cell Developmental Trajectory. Pegasus provides Force-directed Layout (FLE) based plots to show pseudo-trajectories of cell development. To show them, you first need to run diffmap, then an FLE-like visualization (fle or net_fle), and finally plot using scatter with the corresponding basis:

Feature Plot

Besides plotting cell attributes in data.obs, you can also specify gene names in the attrs parameter of the scatter function to generate Feature Plots, which are useful for inspecting specific genes in detail.

Below are feature plots of 4 genes, together with UMAP plot of cell type annotations:

As the plots show, TRAC is highly expressed in T-cell clusters, CD79A in B-cell and Plasma-cell clusters, CD14 in CD14+ Monocytes, and CD34 in the Stem-cell cluster.

Scatter Plot against Categories

By using the scatter_groups function, you can generate a set of scatter plots against a categorical cell attribute (for example, Channel), with one plot per category.

Below is an example:

There are 8 categories in the Channel attribute, since the dataset contains 8 samples. As a result, 9 FIt-SNE plots are generated: the first one shows cells from all samples, followed by 8 plots, each of which shows only the cells from one sample.
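The panel layout described above (one panel for all cells, then one per category) can be sketched as a grouping step. The cell IDs and Channel labels are hypothetical, and the helper name panels_by_category is not part of Pegasus.

```python
from collections import defaultdict

# Hypothetical cells: (cell id, Channel label).
cells = [("c1", "S1"), ("c2", "S2"), ("c3", "S1"), ("c4", "S3"), ("c5", "S2")]


def panels_by_category(records):
    """Return an ordered mapping: first 'All', then one entry per category."""
    groups = defaultdict(list)
    for cell_id, category in records:
        groups[category].append(cell_id)
    panels = {"All": [cell_id for cell_id, _ in records]}
    for category in sorted(groups):
        panels[category] = groups[category]
    return panels


panels = panels_by_category(cells)
print(len(panels), list(panels))
```

With 3 categories this yields 4 panels. With the 8 samples of this dataset, the same logic gives the 9 plots mentioned above.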

You can see that all 8 sample-specific plots have similar cell distributions across clusters. This is due to the batch correction applied to the original data.

Violin Plot

The violin plot gives us an idea of the distribution of gene expression values across clusters. The violin plot below shows the expression of the marker gene set we use in this tutorial against Louvain clusters:

Notice that the violin plots of all the genes are stacked together. You can check that their expression levels across clusters are consistent with prior knowledge of these markers.

The Pegasus violin plot can also be split by a categorical cell attribute with 2 categories. For example, below is a plot showing the expression of genes XIST and RPS4Y1, with cells split into male and female donors:

It's easy to see that in the plot above, XIST is highly expressed in female samples across all clusters, while RPS4Y1 is highly expressed in male samples.

Heat Map

Besides Violin Plot, Pegasus also provides Heat Map.

It is also useful to add a dendrogram to the graph to bring together similar clusters or genes, based on the expression of selected genes.

By default, heatmap shows a Heat Map based on the average expression of the selected genes within each cluster, and groups similar genes together. You can set attrs_cluster=True to merge similar clusters:

Setting switch_axes=True swaps the x-axis (genes) and y-axis (clusters):

In addition, you can show a Heat Map based on the expression of individual cells by setting on_average=False. Notice that this can take some time for a large dataset.

Comparing the Heat Maps with the Violin plot, you can see that they are consistent with each other.
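The matrix behind an averaged heat map can be sketched as a group-by-mean followed by an optional transpose. The expression values below are hypothetical, and the code only illustrates the averaging idea, not how Pegasus computes it internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-cell data: cluster label plus expression of two genes.
cells = [
    {"cluster": "1", "TRAC": 2.0, "CD79A": 0.0},
    {"cluster": "1", "TRAC": 3.0, "CD79A": 0.2},
    {"cluster": "2", "TRAC": 0.0, "CD79A": 2.5},
    {"cluster": "2", "TRAC": 0.4, "CD79A": 3.5},
]
genes = ["TRAC", "CD79A"]


def cluster_means(records, gene_names):
    """Return {cluster: [mean expression of each gene]}, one row per cluster."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["cluster"]].append(rec)
    return {
        cluster: [mean(r[g] for r in recs) for g in gene_names]
        for cluster, recs in sorted(grouped.items())
    }


def transpose(matrix):
    """Swap rows and columns, which is what switching the axes amounts to."""
    return [list(col) for col in zip(*matrix)]


rows = cluster_means(cells, genes)
print(rows)
print(transpose(list(rows.values())))
```

Averaging reduces the matrix to one row per cluster. This is why the per-cell variant (on_average=False) is much more expensive on a large dataset.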

Dot Plot

Dot plot is an alternative to Heat map. Besides the mean expression of a gene within a cluster (dot color), it also conveys the fraction of cells within the cluster expressing that gene (dot size):

As with heatmap, switch_axes=True switches the two axes:
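The two quantities encoded by each dot can be sketched as follows. The expression values are hypothetical, and treating "expressing" as "value above 0" is an assumption made for this illustration.

```python
from statistics import mean

# Hypothetical expression of one gene, grouped by cluster.
expression_by_cluster = {
    "1": [0.0, 2.0, 3.0, 0.0],
    "2": [0.0, 0.0, 0.0, 1.0],
}


def dot_stats(values, threshold=0.0):
    """Return (mean expression, fraction of cells above threshold)."""
    expressing = sum(1 for v in values if v > threshold)
    return mean(values), expressing / len(values)


for cluster, values in expression_by_cluster.items():
    avg, frac = dot_stats(values)
    print(f"cluster {cluster}: color <- {avg:.2f}, size <- {frac:.0%}")
```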

Dendrogram

Pegasus provides dendrograms based on hierarchical clustering. There are 2 ways to perform the clustering. One is based on the expression of a selected set of genes:

The other way is to use a cell embedding for clustering. As this dataset is corrected using the Harmony algorithm, with the corrected PCA embedding stored at data.obsm['X_pca_harmony'], we can simply set the rep parameter in dendrogram to 'pca_harmony' to use this embedding for clustering:

Now we have a different dendrogram.
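To see why the choice of representation changes the tree, it helps to look at what hierarchical clustering does with its input. The sketch below runs a plain agglomerative clustering on hypothetical cluster centroids. Complete linkage and Euclidean distance are arbitrary choices for this illustration and are not claimed to match the Pegasus defaults.

```python
from itertools import combinations
from math import dist

# Hypothetical cluster centroids in a 2-D embedding.
centroids = {
    "1": (0.0, 0.0),
    "2": (0.5, 0.0),
    "3": (5.0, 5.0),
    "4": (5.0, 6.0),
}


def complete_linkage(a, b, points):
    """Distance between two groups: the largest pairwise distance."""
    return max(dist(points[x], points[y]) for x in a for y in b)


def agglomerate(points):
    """Merge the two closest groups until one remains; return the merges."""
    groups = [(name,) for name in sorted(points)]
    merges = []
    while len(groups) > 1:
        a, b = min(
            combinations(groups, 2),
            key=lambda pair: complete_linkage(pair[0], pair[1], points),
        )
        height = complete_linkage(a, b, points)
        groups = [g for g in groups if g not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


for left, right, height in agglomerate(centroids):
    print(left, right, round(height, 3))
```

Each merge corresponds to one junction in a dendrogram, drawn at the recorded height. Feeding in different coordinates (gene expression versus an embedding) changes the distances and therefore the merge order.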

Volcano Plot

Differential Expression (DE) Analysis aims to discover cluster-specific marker genes. For each cluster, it compares the cells within the cluster against all other cells, then finds genes that are significantly highly expressed (up-regulated) or lowly expressed (down-regulated) in the cluster.
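The one-versus-rest comparison can be illustrated with a log fold change of mean expression. This sketch covers only the effect-size part. A real DE analysis also runs statistical tests, which are omitted here. The data and the pseudocount of 1.0 are assumptions made for illustration.

```python
from math import log2
from statistics import mean

# Hypothetical expression of one gene: (cluster label, value) per cell.
cells = [("1", 4.0), ("1", 4.0), ("2", 1.0), ("3", 1.0), ("2", 1.0), ("3", 1.0)]


def one_vs_rest_log2fc(records, cluster, pseudocount=1.0):
    """log2 ratio of mean expression inside `cluster` to mean expression outside."""
    inside = [v for c, v in records if c == cluster]
    outside = [v for c, v in records if c != cluster]
    return log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))


for cluster in ("1", "2", "3"):
    print(cluster, round(one_vs_rest_log2fc(cells, cluster), 3))
```

A positive value means the gene is more highly expressed inside the cluster than in the rest of the cells, and a negative value means the opposite.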

After running the de_analysis function in Pegasus, we can use a Volcano plot to inspect the DE result. Here we use Louvain cluster 1 for illustration. You can see that this cluster is annotated as "Naive T cell":

Below is its Volcano plot:

The plot uses the default thresholds: Log fold change at $\pm 1.0$, and Q-value at $0.05$. Each point stands for a gene. Red ones are significant marker genes: those at right-hand side are up-regulated genes for Cluster 1, while those at left-hand side are down-regulated genes.

Specifically, TRAC, a marker gene of T cells, is the second rightmost up-regulated gene for Cluster 1. This is consistent with the cluster's cell type annotation.
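The coloring rule of a volcano plot can be sketched as a two-threshold classification. The thresholds are the defaults quoted above. The DE values are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
# Hypothetical DE results for one cluster: gene -> (log fold change, q-value).
de_results = {
    "GENE_UP": (2.3, 1e-8),
    "GENE_DOWN": (-1.7, 3e-4),
    "GENE_SMALL_EFFECT": (0.4, 1e-6),
    "GENE_NOT_SIGNIFICANT": (1.9, 0.2),
}

LOG_FC_THRESHOLD = 1.0    # default threshold stated in the text
Q_VALUE_THRESHOLD = 0.05  # default threshold stated in the text


def classify(log_fc, q_value):
    """Label a gene by the two thresholds; boundaries count as passing."""
    if q_value > Q_VALUE_THRESHOLD or abs(log_fc) < LOG_FC_THRESHOLD:
        return "not significant"
    return "up-regulated" if log_fc > 0 else "down-regulated"


labels = {gene: classify(*stats) for gene, stats in de_results.items()}
for gene, label in labels.items():
    print(f"{gene}: {label}")
```

A gene must pass both thresholds to be highlighted: a large fold change with a poor Q-value, or a strong Q-value with a small fold change, is not marked as a marker.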

The clusters used in the Volcano plot depend on how you ran the de_analysis function. For this dataset, we performed DE analysis on Louvain clusters, so we have to use their labels in the Volcano plot.

By default, volcano uses t-test results. You can set the de_test parameter to either 'fisher' or 'mwu' to use the results of another test, as long as that test has been performed in de_analysis beforehand.

Read more ...

See Pegasus_Tutorial for a tutorial on running downstream analysis using Pegasus.

See Plotting_Documentation for details on the parameters of Pegasus plotting functions.