Author: Hui Ma, Yiming Yang
Date: 2020-08-20
Notebook Source: batch_correction.ipynb
import pegasus as pg
In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, and show how to use the batch correction methods in Pegasus to tackle the batch effects in the data.
The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).
Now load the count matrix:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
'Channel' is the batch key. Each batch corresponds to one donor, so there are 8 batches in total.
First, preprocess the data. This includes filtration, selecting robust genes, and log-normalization:
pg.qc_metrics(data, percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)
After quality control, the distribution of cells across the 8 batches is:
data.obs['Channel'].value_counts()
We first perform downstream steps without considering batch effects. In this way, you can see where the batch effects exist, and moreover, we'll use this result as the baseline when comparing different batch correction methods.
data_baseline = data.copy()
pg.highly_variable_features(data_baseline, consider_batch=False)
In this tutorial, the downstream steps consist of: PCA, kNN graph construction, Louvain clustering, and UMAP embedding. In the pca function, robust=True is set for reproducibility:
pg.pca(data_baseline, robust=True)
pg.neighbors(data_baseline)
pg.louvain(data_baseline)
pg.umap(data_baseline)
pg.scatter(data_baseline, attrs=['louvain_labels', 'Channel'], basis='umap')
Let's have a quick look at the UMAP plots above. Looking at the cells in Louvain clusters 10 and 13 in the right-hand-side plot, you can see that most of them come from sample MantonBM3_HiSeq_1. This indicates strong batch effects.
Batch effects occur when data samples are generated under different conditions, such as date, lab setting, equipment, etc. Unless you know that all the samples were generated under similar conditions, you may suspect batch effects whenever a visualization shows samples largely isolated from one another.
In this tutorial, you'll see how to apply the batch correction methods in Pegasus to this dataset.
As a common step ahead, we need to re-select HVGs considering batch effects:
pg.highly_variable_features(data, consider_batch=True)
Harmony is a widely-used method for data integration. Pegasus uses the harmony-pytorch package to perform Harmony batch correction.
Harmony operates on the PCA matrix, so we first need to calculate the original PCA matrix:
data_harmony = data.copy()
pg.pca(data_harmony, robust=True)
In the pca function, robust=True is set for reproducibility.
Now we are ready to run Harmony integration:
harmony_key = pg.run_harmony(data_harmony)
When finished, the corrected PCA matrix is stored in data_harmony.obsm['X_pca_harmony'], and run_harmony returns the representation key 'pca_harmony' as the variable harmony_key. In the downstream steps, you can set the rep parameter to either harmony_key or 'pca_harmony' in Pegasus functions whenever applicable.
For details on the parameters of run_harmony beyond the default settings, please see here.
With the new corrected PCA matrix, we can perform kNN-graph-based clustering and calculate UMAP embeddings as follows:
pg.neighbors(data_harmony, rep=harmony_key)
pg.louvain(data_harmony, rep=harmony_key)
pg.umap(data_harmony, rep=harmony_key)
Then show UMAP plot:
pg.scatter(data_harmony, attrs=['louvain_labels', 'Channel'], basis='umap')
Pegasus also provides a canonical Location and Scale (L/S) adjustment method (see reference) for batch correction. Unlike Harmony, the L/S method modifies the log-normalized count matrix directly. It is faster, and works well on large-scale datasets.
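To build intuition for the location/scale idea, the toy function below shifts and rescales each batch, per gene, so its mean and standard deviation match the overall ones. This is only a simplified NumPy sketch of the general L/S concept, not Pegasus's actual correct_batch implementation:

```python
import numpy as np

def ls_adjust(X, batches):
    """Toy location/scale adjustment.

    X: (cells x genes) log-normalized matrix; batches: per-cell batch labels.
    For each gene, each batch is standardized and then rescaled to the
    overall mean/std, removing per-batch location and scale differences.
    """
    Xc = X.astype(float).copy()
    overall_mean = X.mean(axis=0)
    overall_std = X.std(axis=0)
    for b in np.unique(batches):
        idx = batches == b
        m = X[idx].mean(axis=0)
        s = X[idx].std(axis=0)
        s[s == 0] = 1.0  # avoid division by zero for constant genes
        Xc[idx] = (X[idx] - m) / s * overall_std + overall_mean
    return Xc

# Two batches of the same "cell population", the second shifted by a batch effect
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(2.0, 1.0, (100, 5))])  # batch 'b2' shifted by +2
batches = np.array(['b1'] * 100 + ['b2'] * 100)
Xc = ls_adjust(X, batches)
# After adjustment, the per-batch gene means coincide
print(np.allclose(Xc[:100].mean(axis=0), Xc[100:].mean(axis=0)))  # True
```

Because the adjustment is a closed-form per-gene transformation, it scales linearly with matrix size, which is consistent with the speed advantage noted above.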
Pegasus uses this method for batch correction in the Cumulus paper.
After HVG selection, directly run the correct_batch function to perform L/S batch correction:
data_ls = data.copy()
pg.correct_batch(data_ls, features='highly_variable_features')
In the correct_batch function, the features parameter specifies which genes/features to consider in batch correction. By default, it considers all features. Here, since we've already selected an HVG set, we can assign its key in the data_ls.var field, which is 'highly_variable_features', to this parameter.
data_ls.uns['fmat_highly_variable_features'].shape
As shown above, the corrected count matrix is stored at data_ls.uns['fmat_highly_variable_features'], with dimension cell-by-HVG.
See its documentation for customization.
Now we can perform the downstream analysis:
pg.pca(data_ls, robust=True)
pg.neighbors(data_ls)
pg.louvain(data_ls)
pg.umap(data_ls)
pg.scatter(data_ls, attrs=['louvain_labels', 'Channel'], basis='umap')
Scanorama is another popular data integration method. Pegasus provides the run_scanorama function to perform Scanorama batch correction:
data_scan = data.copy()
scan_key = pg.run_scanorama(data_scan)
You can check details on run_scanorama parameters here.
By default, it considers the count matrix restricted to the selected HVGs, and calculates a corrected PCA matrix of 50 PCs. When finished, this new PCA matrix is stored in data_scan.obsm['X_scanorama'], and run_scanorama returns its representation key 'scanorama' as the variable scan_key. In the downstream steps, you can set the rep parameter to either scan_key or 'scanorama' in Pegasus functions whenever applicable:
pg.neighbors(data_scan, rep=scan_key)
pg.louvain(data_scan, rep=scan_key)
pg.umap(data_scan, rep=scan_key)
Now check its UMAP plot:
pg.scatter(data_scan, attrs=['louvain_labels', 'Channel'], basis='umap')
To compare the performance of the three methods, one metric is runtime, which you can see from the logs in the sections above: the L/S adjustment method is the fastest, then Harmony, and Scanorama is the slowest.
In this section, we'll use two other metrics for comparison: the kBET acceptance rate and the kSIM acceptance rate.
We have 4 results: no batch correction (Baseline), Harmony, L/S, and Scanorama. For each result, kBET and kSIM acceptance rates are calculated on its (corrected) PCA matrix for comparison.
Details on these 2 metrics can be found in the Cumulus paper.
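Conceptually, both metrics examine each cell's k nearest neighbors: kBET runs a chi-squared test of whether the neighborhood's batch composition matches the global batch frequencies (acceptance means good mixing is not rejected), while kSIM measures how many neighbors share the cell's own type label. The toy functions below sketch this intuition with brute-force neighbors; they are simplified illustrations, not the actual calc_kBET/calc_kSIM implementations:

```python
import numpy as np
from scipy.stats import chisquare

def knn_indices(X, k):
    # Brute-force k nearest neighbors (excluding the cell itself)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def toy_kbet(X, batches, k=25, alpha=0.05):
    """Fraction of cells whose kNN batch composition is consistent with
    the global batch frequencies (chi-squared test not rejected)."""
    labels, global_counts = np.unique(batches, return_counts=True)
    expected = k * global_counts / len(batches)
    nbrs = knn_indices(X, k)
    accepted = 0
    for i in range(len(X)):
        observed = np.array([(batches[nbrs[i]] == l).sum() for l in labels])
        if chisquare(observed, expected).pvalue >= alpha:
            accepted += 1
    return accepted / len(X)

def toy_ksim(X, celltypes, k=25):
    """Mean fraction of kNN sharing the cell's own type label."""
    nbrs = knn_indices(X, k)
    return np.mean([(celltypes[nbrs[i]] == celltypes[i]).mean()
                    for i in range(len(X))])

rng = np.random.default_rng(0)
mixed = rng.normal(size=(200, 2))                 # two batches well mixed
split = np.vstack([rng.normal(0, 1, (100, 2)),
                   rng.normal(8, 1, (100, 2))])   # batches fully separated
batches = np.array(['b1'] * 100 + ['b2'] * 100)
print(toy_kbet(mixed, batches) > toy_kbet(split, batches))  # True
```

A higher kBET acceptance rate thus indicates better batch mixing, while a higher kSIM acceptance rate indicates neighborhoods that preserve the cell type structure, which is why the two are reported together below.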
We can use the calc_kBET function to calculate kBET acceptance rates. Besides the data object, two parameters matter here:
For the attr parameter, use the batch key, which is 'Channel' in this tutorial.
For the rep parameter, set it to the corresponding (corrected) PCA representation if needed; by default, it uses the 'pca' representation key.
_, _, kBET_baseline = pg.calc_kBET(data_baseline, attr='Channel')
_, _, kBET_harmony = pg.calc_kBET(data_harmony, attr='Channel', rep=harmony_key)
_, _, kBET_ls = pg.calc_kBET(data_ls, attr='Channel')
_, _, kBET_scan = pg.calc_kBET(data_scan, attr='Channel', rep=scan_key)
We need pre-annotated cell type information as ground truth to calculate kSIM acceptance rates.
This ground truth is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/cell_types.csv, or you can download it using gsutil from its Google bucket URL (gs://terra-featured-workspaces/Cumulus/cell_types.csv):
Now load this file, and attach its 'cell_types' column to the 4 results above:
import pandas as pd
import numpy as np
df_celltypes = pd.read_csv("cell_types.csv", index_col='barcodekey')
assert np.sum(df_celltypes.index!=data_baseline.obs_names) == 0
data_baseline.obs['cell_types'] = df_celltypes['cell_types']
assert np.sum(df_celltypes.index!=data_harmony.obs_names) == 0
data_harmony.obs['cell_types'] = df_celltypes['cell_types']
assert np.sum(df_celltypes.index!=data_ls.obs_names) == 0
data_ls.obs['cell_types'] = df_celltypes['cell_types']
assert np.sum(df_celltypes.index!=data_scan.obs_names) == 0
data_scan.obs['cell_types'] = df_celltypes['cell_types']
We can then use the calc_kSIM function to calculate kSIM acceptance rates. Besides the data object:
For the attr parameter, use the ground truth key 'cell_types'.
For the rep parameter, as in the kBET section, set it to the corresponding (corrected) PCA representation if needed; by default, it uses the 'pca' representation key.
_, kSIM_baseline = pg.calc_kSIM(data_baseline, attr='cell_types')
_, kSIM_harmony = pg.calc_kSIM(data_harmony, attr='cell_types', rep=harmony_key)
_, kSIM_ls = pg.calc_kSIM(data_ls, attr='cell_types')
_, kSIM_scan = pg.calc_kSIM(data_scan, attr='cell_types', rep=scan_key)
Now draw a scatterplot regarding these two metrics on the 4 results:
import seaborn as sns
import matplotlib.pyplot as plt
df_plot = pd.DataFrame({'method': ['Baseline', 'Harmony', 'L/S', 'Scanorama'],
'kBET': [kBET_baseline, kBET_harmony, kBET_ls, kBET_scan],
'kSIM': [kSIM_baseline, kSIM_harmony, kSIM_ls, kSIM_scan]})
plt.figure(dpi=100)
ax = sns.scatterplot(x='kSIM', y='kBET', hue='method', data=df_plot, legend=False)
for line in range(0, df_plot.shape[0]):
    x_pos = df_plot.kSIM[line] + 0.003
    if df_plot.method[line] == 'Baseline':
        x_pos = df_plot.kSIM[line] - 0.003
    y_pos = df_plot.kBET[line]
    alignment = 'right' if df_plot.method[line] == 'Baseline' else 'left'
    ax.text(x_pos, y_pos, df_plot.method[line], ha=alignment, size='medium', color='black')
plt.xlabel('kSIM acceptance rate')
plt.ylabel('kBET acceptance rate')
As this plot shows, there is a trade-off between mixing cells well across batches and preserving the underlying biology. Harmony achieves the best mixture of cells, while its consistency with the ground truth biology is the lowest. Scanorama seems to strike the best balance between the two measurements.
Therefore, in general, the choice of batch correction method really depends on the dataset and your analysis goal.