Author: Yiming Yang
Date: 2020-12-02
Notebook Source: regress_out.ipynb
This tutorial shows how to regress out cell cycle using Pegasus. To benchmark the work analogous to Seurat's tutorial and SCANPY's tutorial, we use the same dataset of murine hematopoietic progenitors from [Nestorowa et al., Blood 2016], which can be downloaded here.
First, load the data:
import pegasus as pg
data = pg.read_input("nestorawa_forcellcycle_expressionMatrix.txt")
data
2020-12-02 00:06:30,984 - pegasusio.readwrite - INFO - tsv file 'nestorawa_forcellcycle_expressionMatrix.txt' is loaded. 2020-12-02 00:06:30,985 - pegasusio.readwrite - INFO - Function 'read_input' finished in 3.00s.
MultimodalData object with 1 UnimodalData: 'unknown-rna' It currently binds to UnimodalData object unknown-rna UnimodalData object with n_obs x n_vars = 773 x 24193 Genome: unknown; Modality: rna It contains 1 matrices: 'X' It currently binds to matrix 'X' as X obs: var: obsm: varm: uns: 'genome', 'modality'
We need to do some preprocessing work before calculating the cell-cycle scores:
pg.qc_metrics(data, min_genes=0, max_genes=1e5)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data, norm_count=1e4)
2020-12-02 00:06:31,125 - pegasusio.qc_utils - INFO - After filtration, 773 out of 773 cell barcodes are kept in UnimodalData object unknown-rna. 2020-12-02 00:06:31,259 - pegasus.tools.preprocessing - INFO - After filtration, 24158/24193 genes are kept. Among 24158 genes, 24158 genes are robust. 2020-12-02 00:06:31,374 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.11s.
To be consistent with Seurat's tutorial, the filtration threshold on number of genes is set to be large enough to avoid removing any cell barcode, and a normalization of TP10k (transcripts per 10k) is performed.
Now calculate cell-cycle scores using calc_signature_score
(see here for its documentation):
pg.calc_signature_score(data, 'cell_cycle_human')
2020-12-02 00:06:31,424 - pegasus.tools.signature_score - INFO - Loaded signatures from GMT file /Users/yy939/GitHub/pegasus/pegasus/data_files/cell_cycle_human.gmt. 2020-12-02 00:06:31,427 - pegasus.tools.signature_score - INFO - Signature G1/S: 42 out of 43 genes are used in signature score calculation. 2020-12-02 00:06:31,447 - pegasus.tools.signature_score - INFO - Signature G2/M: 52 out of 54 genes are used in signature score calculation. 2020-12-02 00:06:31,475 - pegasus.tools.signature_score - INFO - Function 'calc_signature_score' finished in 0.09s.
By setting cell_cycle_human
, Pegasus uses the cell cycle genes defined in [Tirosh et al. 2015] for calculation. We can fetch these genes from Pegasus pre-stored file (https://raw.githubusercontent.com/klarman-cell-observatory/pegasus/master/pegasus/data_files/cell_cycle_human.gmt), as we'll need them in the following sections:
cell_cycle_genes = []
with open("cell_cycle_human.gmt", 'r') as f:
for line in f:
cell_cycle_genes += line.strip().split('\t')[2:]
Moreover, the predicted phases of cells are stored in data.obs
field with key predicted_phase
:
data.obs['predicted_phase'].value_counts()
G0 468 G1/S 157 G2/M 148 Name: predicted_phase, dtype: int64
And the corresponding scores are in data.obs['G1/S']
and data.obs['G2/M']
.
The presence of cell cycle effects can be shown in terms of Principal Components (PC), which are calculated from gene-count matrix with respect to cell cycle genes:
data_cc_genes = data[:, cell_cycle_genes].copy()
pg.pca(data_cc_genes)
data.obsm['X_pca'] = data_cc_genes.obsm['X_pca']
2020-12-02 00:06:31,626 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 0.07s.
After achieving PCA result, we can generate a scatter plot of first two PCs with predicted phase being the colors for cells:
pg.scatter(data, attrs='predicted_phase', basis='pca', dpi=130)
In this plot, we can see that cells are quite separated due to their phases when cell cycle genes are considered.
This section shows how to regress out cell cycle effects.
Pegasus performs this regressing out on PCs. After getting the original PCA matrix, it uses the cell-cycle scores for this operation:
pca_key = pg.pc_regress_out(data, attrs=['G1/S', 'G2/M'])
2020-12-02 00:06:31,935 - pegasus.tools.preprocessing - INFO - Function 'pc_regress_out' finished in 0.13s.
When pc_regress_out
is finished, the resulting new PCA embedding matrix is stored in data.obsm['X_pca_regressed']
, with the representation key 'pca_regressed'
returned to variable pca_key
.
Now let's see how PC scatter plot looks on the new PCs:
pg.scatter(data, attrs=['predicted_phase'], basis=pca_key, dpi=130)
You can see that after regressing out cell-cycle scores, no evident cell-cycle effects are observed in the first 2 PCs.
The scatter plot above only shows the first 2 PCs after regressing out.
Since it's not possible to visualize all the 50 PCs in 2D plot, we need to perform another PCA step on them, and then plot the top 2 PCs from the generated new ones, which account for most variability of the PCs after regressing out:
import numpy as np
from sklearn.decomposition import PCA
X = data.obsm['X_' + pca_key]
pca = PCA(n_components=X.shape[1], random_state=0, svd_solver='full')
X_pca_new = pca.fit_transform(X)
data.obsm['X_pca_new'] = np.ascontiguousarray(X_pca_new)
pg.scatter(data, attrs=['predicted_phase'], basis='pca_new', dpi=130)
So indeed, no evident cell-cycle effects are observed in all the 50 PCs after regressing out.
You can use the immediate result of pc_regress_out
function (i.e. data.obsm['X_pca_regressed']
) for the downstream analysis. Just remember to set the basis
argument of the Pegasus functions whenever applicable.
The major difference between Pegasus' method and conventional method is that Pegasus performs regressing out on PCs, rather than all the features. For example, in this case, Pegasus only considers $50$ PCs, instead of all the $24,158$ features. Therefore, Pegasus method can be more efficient on large-scale datasets which contains many more cell barcodes.