Regress Out Tutorial

Author: Yiming Yang
Date: 2020-09-01
Notebook Source: regress_out.ipynb

This tutorial shows how to regress out cell-cycle effects using Pegasus. The dataset contains human bone marrow data collected from 8 donors, and is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip.

Cell-Cycle Scores

First, load the data:

In [1]:
import pegasus as pg

data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
2020-09-01 00:02:10,425 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded.
2020-09-01 00:02:10,425 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.27s.
Out[1]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 48219 x 36601
    Genome: GRCh38; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel'
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

Some preprocessing is needed before calculating the cell-cycle scores:

In [2]:
pg.qc_metrics(data, percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)
2020-09-01 00:02:11,734 - pegasusio.qc_utils - INFO - After filtration, 35465 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna.
2020-09-01 00:02:12,165 - pegasus.tools.preprocessing - INFO - After filtration, 25653/36601 genes are kept. Among 25653 genes, 17516 genes are robust.
2020-09-01 00:02:13,291 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 1.13s.

Now calculate cell-cycle scores using calc_signature_score (see the Pegasus documentation for details):

In [3]:
pg.calc_signature_score(data, 'cell_cycle_human')
2020-09-01 00:02:18,557 - pegasus.tools.signature_score - INFO - Loaded signatures from GMT file /Users/yy939/GitHub/pegasus/pegasus/data_files/cell_cycle_human.gmt.
2020-09-01 00:02:18,562 - pegasus.tools.signature_score - INFO - Signature G1/S: 43 out of 43 genes are used in signature score calculation.
2020-09-01 00:02:18,663 - pegasus.tools.signature_score - INFO - Signature G2/M: 54 out of 54 genes are used in signature score calculation.
2020-09-01 00:02:18,806 - pegasus.tools.signature_score - INFO - Function 'calc_signature_score' finished in 5.51s.

With the cell_cycle_human signature, Pegasus uses the cell-cycle genes defined in [Tirosh et al. 2015] for the calculation. These genes can be fetched from the file shipped with Pegasus (https://raw.githubusercontent.com/klarman-cell-observatory/pegasus/master/pegasus/data_files/cell_cycle_human.gmt), as we'll need them in the following sections. The cell below assumes the file has been downloaded to the working directory:

In [4]:
cell_cycle_genes = []
with open("cell_cycle_human.gmt", 'r') as f:
    for line in f:
        # Each GMT line is tab-separated: signature name, description, then gene symbols.
        cell_cycle_genes += line.strip().split('\t')[2:]
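
The cell above flattens all signatures into one gene list. If the per-signature grouping is also useful, a GMT file can be parsed into a dictionary instead. The following is a minimal sketch, not part of Pegasus: parse_gmt is a hypothetical helper, and the two example lines are toy input in GMT layout, not the actual contents of cell_cycle_human.gmt.

```python
from typing import Dict, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {signature name: list of gene symbols}.

    Hypothetical helper for illustration; assumes tab-separated lines of the
    form: name, description, gene1, gene2, ...
    """
    signatures: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines that carry no genes.
            continue
        signatures[fields[0]] = [gene for gene in fields[2:] if gene]
    return signatures


# Toy input in GMT layout (NOT the real file contents).
example = (
    "G1/S\ttoy description\tMCM5\tPCNA\tTYMS\n"
    "G2/M\ttoy description\tHMGB2\tCDK1\tNUSAP1\n"
)
sigs = parse_gmt(example)

# Flatten to a single list, matching what the tutorial cell produces.
all_genes = [gene for genes in sigs.values() for gene in genes]
```

With the real file, parse_gmt(open("cell_cycle_human.gmt").read()) would give one entry per signature, and the flattened list plays the role of cell_cycle_genes.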

Moreover, the predicted phase of each cell is stored in data.obs under the key predicted_phase:

In [5]:
data.obs['predicted_phase'].value_counts()
Out[5]:
G0      30980
G1/S     2636
G2/M     1849
Name: predicted_phase, dtype: int64

The corresponding scores are stored in data.obs['G1/S'] and data.obs['G2/M'].

Cell-Cycle Effects

Cell-cycle effects can be seen in the principal components (PCs) computed from the expression matrix restricted to the cell-cycle genes:

In [6]:
data_cc_genes = data[:, cell_cycle_genes].copy()
pg.pca(data_cc_genes)
data.obsm['X_pca'] = data_cc_genes.obsm['X_pca']
2020-09-01 00:02:19,240 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 0.25s.
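
To illustrate what computing PCs on a gene subset means, here is a minimal NumPy sketch on synthetic data. It is a simplification under stated assumptions, not what pg.pca does internally: pca_on_subset is a hypothetical helper, the gene names are made up, and the PCA is a plain centered SVD without any scaling or feature selection.

```python
import numpy as np


def pca_on_subset(X, gene_names, subset, n_components=2):
    """Return PC scores computed only from the columns whose gene is in `subset`.

    Hypothetical helper: plain PCA via SVD of the column-centered submatrix.
    """
    wanted = set(subset)
    idx = [i for i, gene in enumerate(gene_names) if gene in wanted]
    sub = X[:, idx]
    centered = sub - sub.mean(axis=0)
    # Rows of `u * s` are the cells' coordinates in PC space.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e", "gene_f"]  # made-up names
X = rng.normal(size=(100, len(gene_names)))  # synthetic cells x genes matrix

scores = pca_on_subset(X, gene_names, ["gene_b", "gene_d", "gene_e"])
```

The resulting scores array has one row per cell and one column per PC, which is the same shape of object that the tutorial copies into data.obsm['X_pca'].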

With the PCA result available, we can generate a scatter plot of the first two PCs, coloring cells by their predicted phase:

In [7]:
pg.scatter(data, attrs='predicted_phase', basis='pca', dpi=130)

In this plot, cells separate clearly by phase when only the cell-cycle genes are considered.

Regress Out Cell-Cycle Effects

This section shows how to regress out cell cycle effects.

Pegasus performs the regression on the PCs. Starting from the original PCA matrix, it regresses out the cell-cycle scores:

In [8]:
pca_key = pg.pc_regress_out(data, attrs=['G1/S', 'G2/M'])
2020-09-01 00:02:19,718 - pegasus.tools.preprocessing - INFO - Function 'pc_regress_out' finished in 0.21s.

When pc_regress_out finishes, the new PCA matrix is stored in data.obsm['X_pca_regressed'], and the representation key 'pca_regressed' is returned and assigned to the variable pca_key.

In downstream analysis, pass either pca_key or 'pca_regressed' as the rep argument of Pegasus functions wherever applicable.
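
To make the idea of regressing covariates out of PCs concrete, here is a minimal NumPy sketch on synthetic data. It assumes an ordinary least-squares fit of each PC on the covariates (plus an intercept) and keeps the residuals. This is an illustration of the general technique only; regress_out is a hypothetical helper and is not claimed to match the internals of pc_regress_out.

```python
import numpy as np


def regress_out(pcs, covariates):
    """Return the residuals of each PC column after a least-squares fit on the covariates.

    Hypothetical helper: fits pcs ~ intercept + covariates and subtracts the fit.
    """
    n_cells = pcs.shape[0]
    design = np.column_stack([np.ones(n_cells), covariates])
    coef, *_ = np.linalg.lstsq(design, pcs, rcond=None)
    return pcs - design @ coef


rng = np.random.default_rng(0)
n_cells = 500
scores = rng.normal(size=(n_cells, 2))  # stand-ins for the 'G1/S' and 'G2/M' scores
pcs = rng.normal(size=(n_cells, 5))     # stand-in for the original PCA matrix
pcs[:, 0] += 3.0 * scores[:, 0]         # inject a score-driven effect into PC1
pcs[:, 1] += 2.0 * scores[:, 1]         # and into PC2

residuals = regress_out(pcs, scores)

# Correlation between PC1 and the first score, before and after regression.
corr_before = abs(np.corrcoef(pcs[:, 0], scores[:, 0])[0, 1])
corr_after = abs(np.corrcoef(residuals[:, 0], scores[:, 0])[0, 1])
```

In this toy setting the residual PCs are linearly uncorrelated with the scores by construction, which is the property the regressed representation is meant to have.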

Now let's see how the PC scatter plot looks on the new PCs:

In [9]:
pg.scatter(data, attrs=['predicted_phase'], basis=pca_key, dpi=130)

After regressing out the cell-cycle scores, the plot shows no evident cell-cycle effects.