# Pegasus Tutorial¶

Author: Yiming Yang
Date: 2021-06-24
Notebook Source: pegasus_analysis.ipynb

## Count Matrix File¶

For this tutorial, we provide a count matrix dataset on Human Bone Marrow with 8 donors stored in zarr format (with file extension ".zarr.zip").

This file is achieved by aggregating gene-count matrices of the 8 10X channels using PegasusIO, and further filtering out cells with fewer than $100$ genes expressed. Please see here for how to do it interactively.

Now load the file using pegasus read_input function:

The count matrix is managed as a UnimodalData object defined in PegasusIO module, and users can manipulate the data from top level via MultimodalData structure, which can contain multiple UnimodalData objects as members.

For this example, as show above, data is a MultimodalData object, with only one UnimodalData member of key "GRCh38-rna", which is its default UnimodalData. Any operation on data will be applied to this default UnimodalData object.

UnimodalData has the following structure:

It has 6 major parts:

• Raw count matrix: data.X, a Scipy sparse matrix (sometimes can be a dense matrix), with rows the cell barcodes, columns the genes/features:

This dataset contains $48,219$ barcodes and $36,601$ genes.

• Cell barcode attributes: data.obs, a Pandas data frame with barcode as the index. For now, there is only one attribute "Channel" referring to the donor from which the cell comes from:
• Gene attributes: data.var, also a Pandas data frame, with gene name as the index. For now, it only has one attribute "gene_ids" referring to the unique gene ID in the experiment:
• Unstructured information: data.uns, a Python hashed dictionary. It usually stores information not restricted to barcodes or features, but about the whole dataset, such as its genome reference and modality type:
• Finally, embedding attributes on cell barcodes: data.obsm; as well as on genes, data.varm. We'll see it in later sections.

## Preprocessing¶

### Filtration¶

The first step in preprocessing is to perform the quality control analysis, and remove cells and genes of low quality.

We can generate QC metrics using the following method with default settings:

The metrics considered are:

• Number of genes: Keep cells with $500 \leq \text{# Genes} < 6000$. **(Notice that starting from v1.4.0, Pegasus doesn't filter cells by number of genes threshold by default.)**
• Number of UMIs: Don't filter cells due to UMI bounds (Default);
• Percent of Mitochondrial genes: Look for mitochondrial genes with name prefix MT- **(Notice that starting from v1.4.0, you'll need to manually specify this prefix.)**, and keep cells with percent $< 10\%$.

For details on customizing your own thresholds, see documentation.

Numeric summaries on filtration on cell barcodes and genes can be achieved by get_filter_stats method:

The results is a Pandas data frame on samples.

You can also check the QC stats via plots. Below is on number of genes:

Then on number of UMIs:

On number of percentage of mitochondrial genes:

Now filter cells based on QC metrics set in qc_metrics:

You can see that $35,465$ cells ($73.55\%$) are kept.

Moreover, for genes, only those with no cell expression are removed. After that, we identify robust genes for downstream analysis:

The metric is the following:

• Gene is expressed in at least $0.05\%$ of cells, i.e. among every 6000 cells, there are at least 3 cells expressing this gene.

Please see its documentation for details.

As a result, $25,653$ ($70.09\%$) genes are kept. Among them, $17,516$ are robust.

We can now view the cells of each sample after filtration:

### Normalization and Logarithmic Transformation¶

After filtration, we need to first normalize the distribution of counts w.r.t. each cell to have the same sum (default is $10^5$, see documentation), and then transform into logarithmic space by $log(x + 1)$ to avoid number explosion:

For the downstream analysis, we may need to make a copy of the count matrix, in case of coming back to this step and redo the analysis:

### Highly Variable Gene Selection¶

Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.

You need to set consider_batch flag to consider or not consider batch effect. At this step, set it to False:

By default, we select 2000 HVGs using the pegasus selection method. Alternative, you can also choose the traditional method that both Seurat and SCANPY use, by setting flavor='Seurat'. See documentation for details.

We can view HVGs by ranking them from top:

We can also view HVGs in a scatterplot:

In this plot, each point stands for one gene. Blue points are selected to be HVGs, which account for the majority of variation of the dataset.

### Principal Component Analysis¶

To reduce the dimension of data, Principal Component Analysis (PCA) is widely used. Briefly speaking, PCA transforms the data from original dimensions into a new set of Principal Components (PC) of a much smaller size. In the transformed data, dimension is reduced, while PCs still cover a majority of the variation of data. Moreover, the new dimensions (i.e. PCs) are independent with each other.

pegasus uses the following method to perform PCA:

By default, pca uses:

• Before PCA, scale the data to standard Normal distribution $N(0, 1)$, and truncate them with max value $10$;
• Number of PCs to compute: 50;
• Apply PCA only to highly variable features.

See its documentation for customization.

To explain the meaning of PCs, let's look at the first PC (denoted as $PC_1$), which covers the most of variation:

This is an array of 2000 elements, each of which is a coefficient corresponding to one HVG.

With the HVGs as the following:

$PC_1$ is computed by

\begin{equation*} PC_1 = \text{coord_pc1}[0] \cdot \text{HES4} + \text{coord_pc1}[1] \cdot \text{ISG15} + \text{coord_pc1}[2] \cdot \text{TNFRSF18} + \cdots + \text{coord_pc1}[1997] \cdot \text{RPS4Y2} + \text{coord_pc1}[1998] \cdot \text{MT-CO1} + \text{coord_pc1}[1999] \cdot \text{MT-CO3} \end{equation*}

Therefore, all the 50 PCs are the linear combinations of the 2000 HVGs.

The calculated PCA count matrix is stored in the obsm field, which is the first embedding object we have

For each of the $35,465$ cells, its count is now w.r.t. 50 PCs, instead of 2000 HVGs.

## Nearest Neighbors¶

All the downstream analysis, including clustering and visualization, needs to construct a k-Nearest-Neighbor (kNN) graph on cells. We can build such a graph using neighbors method:

It uses the default setting:

• For each cell, calculate its 100 nearest neighbors;
• Use PCA matrix for calculation;
• Use L2 distance as the metric;
• Use hnswlib search algorithm to calculate the approximated nearest neighbors in a really short time.

See its documentation for customization.

Below is the result:

Each row corresponds to one cell, listing its neighbors (not including itself) from nearest to farthest. data_trial.uns['pca_knn_indices'] stores their indices, and data_trial.uns['pca_knn_distances'] stores distances.

## Clustering and Visualization¶

Now we are ready to cluster the data for cell type detection. pegasus provides 4 clustering algorithms to use:

• louvain: Louvain algorithm, using louvain package.
• leiden: Leiden algorithm, using leidenalg package.
• spectral_louvain: Spectral Louvain algorithm, which requires Diffusion Map.
• spectral_leiden: Spectral Leiden algorithm, which requires Diffusion Map.

See this documentation for details.

In this tutorial, we use the Louvain algorithm:

As a result, Louvain algorithm finds 19 clusters:

We can check each cluster's composition regarding donors via a composition plot: