Pegasus Plotting Tutorial

Author: Hui Ma
Date: 2020-08-20
Notebook Source: plotting_tutorial.ipynb

In [1]:
import pegasus as pg

For this plotting tutorial, we provide an analysis result of gene-count matrix dataset on Human Bone Marrow with 8 donors. You can get the data from https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip, or use gsutil to download via its Google bucket URL (gs://terra-featured-workspace/Cumulus/MantonBM_result.zarr.zip):

After downloading, load the file using Pegasus read_input function:

In [2]:
data = pg.read_input("MantonBM_result.zarr.zip")
data
2020-08-20 20:01:36,868 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded.
2020-08-20 20:01:36,869 - pegasusio.readwrite - INFO - Function 'read_input' finished in 3.59s.
Out[2]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'n_counts', 'percent_mito', 'scale', 'Group', 'louvain_labels', 'anno', 'gender'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_diffmap', 'X_fitsne', 'X_fle', 'X_pca', 'X_pca_harmony', 'X_phi', 'X_umap'
    varm: 'gmeans', 'gstds', 'means', 'partial_sum', 'de_res'
    uns: 'genome', 'modality', 'Channels', 'Groups', 'PCs', 'c2gid', 'diffmap_evals', 'diffmap_knn_distances', 'diffmap_knn_indices', 'fmat_highly_variable_features', 'gncells', 'ncells', 'pca_harmony_knn_distances', 'pca_harmony_knn_indices', 'W_diffmap', 'W_pca_harmony', 'df_qcplot', 'pca'

In the following sections, we'll cover Pegasus plotting functions using this dataset. Moreover, for gene plots, the canonical gene markers below will be used:

  • B cells and Plasma cells: CD38, JCHAIN, CD79A, MS4A1.
  • T cells: TRAC, CD3D, CCR7.
  • Cytotoxic T cells: CD8A, CD8B.
  • NK cells: NKG7.
  • Monocytes: CD14, FCGR3A.
  • Erythroid cells: GYPA.
  • Hematopoietic Stem cells: CD34, SELL.
  • Dendritic cells: HLA-DPA1, CD4.
In [3]:
marker_genes = ['CD38', 'JCHAIN', 'FCGR3A', 'HLA-DPA1', 'CD14', 'CD79A', 'MS4A1', 'CD34', 'TRAC', 'CD3D', 'CD8A',
                'CD8B', 'GYPA', 'NKG7', 'CD4', 'SELL', 'CCR7']

QC Violin Plot

The first step in preprocessing is to perform the quality control analysis, and remove cells and genes of low quality.

pg.qcviolin shows the effect of quality control more intuitively by presenting the violin plot of cell distribution before and after filtration.

plot_type='gene' shows the number of expressed cells before and after filtration.

In [4]:
pg.qcviolin(data, plot_type='gene', dpi=100)

Quality control stats on number of percentage of mitochondrial genes:

In [5]:
pg.qcviolin(data, plot_type='mito', dpi=100)

The number of UMIs before and after filtration is also an important aspect of quality control.

In [6]:
pg.qcviolin(data, plot_type='count', dpi=100)

Highly Variable Feature Plot

Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.

Pegasus provides hvfplot function to generate a scatterplot of genes upon HVG selection. This plot only works for Pegasus-flavor HVGs (i.e. flavor='pegasus' in Pegasus highly_variable_features function).

After selecting 2000 HVGs using the Pegasus selection method, the plot below is generated. Each point stands for one gene. Blue points are selected to be HVGs, which account for the majority of variation of the dataset. By default, it prints labels of 20 top HVGs. You can change this number in top_n parameter.

In [7]:
pg.hvfplot(data, dpi=200)

Composition Plot

Composition plot is a bar plot showing the cell compositions (under different conditions) in each cluster. Below is to show the composition of different samples in each Louvain cluster:

In [8]:
fig = pg.compo_plot(data, 'louvain_labels', 'Channel', style = 'frequency')

Composition plot is useful to fast assess library quality and batch effects.

Scatter Plot

Scatter plot requires at least 2 parameters

  • data – Gene-count matrix to show.
  • basis – Cell embedding to show. Can be either 'umap', 'tsne', ‘fitsne’, ‘fle’, ‘net_fitsne’, ‘net_umap’ or ‘net_fle’.

For this demonstration, we select annotation and channel as data attributes, and fitsne as basis.

Cell Visualization

To visualize the data, users need to first run visualization functions to calcualte the corresponding cell embeddings. See here for all visualization functions in Pegasus.

tSNE Plot. Pegasus provides 2 different kinds of tSNE plots: tsne, or fitsne. We only show FIt-SNE plot in this tutorial, as FIt-SNE works better on large-scale datasets:

In [9]:
pg.scatter(data, attrs=['anno','Channel'], basis='fitsne')

You can show both cluster-specific cell types (anno) and samples (Channel) in one function.

Note here, the legend of left plot is not fully dislplayed. We need to adjust the horizon distance between two plots by changing "wspace"

"wspace" is the ratio of horizon distance to width of plots. For example ,the default value of wspace is 0.4, which means the gap is 0.4 times the width of plot.

In [10]:
pg.scatter(data, attrs=['anno','Channel'], basis='fitsne', wspace=1.2)

There are also other optional parameters. For example, the commanly used ones

  • alpha (float or List[float], optional, default: 1.0) – Alpha value for blending, from 0.0 (transparent) to 1.0 (opaque). If this is a list, the length must match attrs, which means we set a separate alpha value for each attribute.
  • dpi (float, optional, default: 300.0) – The resolution of the figure in dots-per-inch.

For the plots below, Left one has alpha=1.0, while right one has alpha=0.1

In [11]:
pg.scatter(data, attrs=['Channel','Channel'], basis='fitsne',alpha=[1.0,0.1], wspace=0.6)

Plots with lower dpi are smaller and have lower resolution.

In [12]:
pg.scatter(data, attrs=['Channel'], basis='fitsne',dpi=50)

UMAP Plot. Pegasus also provides 2 different kinds of UMAP plots: umap, or net_umap.

The embedding methods can be changed by changing basis

In [13]:
pg.scatter(data,attrs=['anno', 'Channel'],basis='umap', wspace=1.2)

Cell Developmental Trajectory. Pegasus provides Force-directed Layout (FLE) based plots to show pseudo-trajectories of cell development. To show it, you need to first run diffmap, then FLE-like visualization, fle or net_fle, and finally plot using scatter with the corresponding basis:

In [14]:
pg.scatter(data,attrs='anno', basis='fle')

Feature Plot

Besides plotting cell attributes in data.obs, you can also specify gene names in attrs parameter of scatter function to generate Feature Plots, which is useful for inspecting specific genes in details.

Below are feature plots of 4 genes, together with UMAP plot of cell type annotations:

In [15]:
pg.scatter(data, attrs=['TRAC', 'CD79A', 'CD14', 'CD34', 'anno'], basis='umap', ncols=2)

As the plots show, TRAC is highly expressed in T-cell clusters, CD79A in B-cell and Plasma-cell clusters, CD14 in CD14+ Monocytes, and CD34 in the Stem-cell cluster.

Scatter Plot against Categories

By using scatter_groups function, user can generate a set of scatter plots against a categorical cell attribute, for example Channel, one plot regarding a category.

Below is an example

In [16]:
pg.scatter_groups(data, attr='louvain_labels', groupby='Channel', basis='fitsne', 
                  nrows = 3, ncols = 3, legend_fontsize=10)

There are 8 categories in Channel attribute, since the dataset contains 8 samples. As a result, 9 FIt-SNE plots are generated: first one shows cells from all samples, followed by 8 plots, one of which only shows cells from one sample.

You can see that all the 8 sample-specific plots have similar cell distribution regarding clusters. This is due to the batch correction on the original data.

Violin Plot

The violin plot gives us an idea of the distribution of gene expression values across clusters. The violin plot below shows the expression of the marker gene set we use in this tutorial against Louvain clusters:

In [17]:
pg.violin(data, attrs=marker_genes, groupby='anno', dpi=80)

Notice that all the genes' violin plots are stacked together. And you can check that their expression levels across clusters are pretty consistent with our knowledge.

Pegasus violin plot can also be split regarding a categorical cell attributes with 2 categories. For example, below is a plot showing expression of gene XIST and RPS4Y1 with cells split into male and female donors:

In [18]:
pg.violin(data, attrs=['XIST', 'RPS4Y1'], groupby='anno', hue='gender', dpi=100)

It's easy to see that in the plot above, XIST is highly expressed in female samples across all clusters, while RPS4Y1 is highly expressed in male samples.

Heat Map

Besides Violin Plot, Pegasus also provides Heat Map.

It is also useful to add a dendrogram to the graph to bring together similar clusters or genes, based on the expression of selected genes.

By default, heatmap shows a Heat Map based on average expression of selected genes within each cluster, and cluster similar genes altogether. You can set row_cluster=True to merge similar clusters:

In [19]:
pg.heatmap(data, genes=marker_genes, groupby='anno', row_cluster=True, dpi=80)

switch_axes=True can switch x-axis (genes) and y-axis (clusters):

In [20]:
pg.heatmap(data, genes=marker_genes, groupby='anno', col_cluster=True, switch_axes=True, dpi=80)

In addition, you can also show Heat Map based on expression of individual cell by setting on_average=False:

In [21]:
pg.heatmap(data, genes=marker_genes, groupby='anno', on_average=False, dpi=80)
/Users/yy939/mgh/miniconda3/envs/pegasus-python37/lib/python3.7/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

By comparing with Violin plot, you can see the Heat maps are consistent with it.

Dot Plot

Dot plot is an alternative to Heat map. Besides the mean expression of a gene within a cluster (dot color), it also conveys the fraction of cells within the cluster expressing that gene (dot size):

In [22]:
pg.dotplot(data, genes=marker_genes, groupby='anno')

Similarly as heatmap, switch_axes=True switches the two axes:

In [23]:
pg.dotplot(data, genes=marker_genes, groupby='anno',switch_axes=True)

Dendrogram

Pegasus provides dendrograms based on hierarchical clustering. There are 2 ways for clustering: one is regarding the expression of a selected set of genes:

In [24]:
pg.dendrogram(data, genes=marker_genes, groupby='anno', dpi=100)

Another way is to use a cell embedding for clustering. As this dataset is corrected using Harmony algorithm, with the corrected PCA embedding stored at data.obsm['X_pca_harmony'], we can simply set rep parameter in dendrogram to 'pca_harmony' to use this embedding for clustering:

In [25]:
pg.dendrogram(data, rep='pca_harmony', groupby='anno', dpi=100)

Now we have a different dendrogram.

Volcano Plot

Differential Expression (DE) Analysis is to discover cluster-specific marker genes. For each cluster, it compares cells within the cluster with all the others, then finds genes significantly highly expressed (up-regulated) and lowly expressed (down-regulated) for the cluster.

After running de_analysis function in Pegasus, we can use Volcano plot to see the DE result. Here we use Louvain cluster 1 for illustration. You can see that this cluster is annotated as "Naive T cell":

In [26]:
data.obs.loc[data.obs['louvain_labels']=='1', 'anno'].astype('str').value_counts()
Out[26]:
Naive T cell    6230
Name: anno, dtype: int64

Below is its Volcano plot:

In [27]:
pg.volcano(data, cluster_id = '1', dpi=200)

The plot uses the default thresholds: Log fold change at $\pm 1.0$, and Q-value at $0.05$. Each point stands for a gene. Red ones are significant marker genes: those at right-hand side are up-regulated genes for Cluster 1, while those at left-hand side are down-regulated genes.

In specific, TRAC gene, a marker gene of T cells, is the second to the rightmost up-regulated gene for Cluster 1. This is consistent with its cell type annotation.

The clusters used in Volcano plot depend on how you run de_analysis function. For this dataset, we performed DE analysis on Louvain clusters, as a result, we have to use their labels in Volcano plot.

By default, volcano uses t test results. User can set de_test parameter to either 'fisher' or 'mwu' to use other test result, as long as those tests have been performed in de_analysis ahead.

Read more ...

See Pegasus_Tutorial for tutorial on running downstream analysis using Pegasus.

See Plotting_Documentation for details on parameters of Pegasus plotting functions.