Batch Correction Tutorial

Author: Hui Ma, Yiming Yang
Date: 2020-08-20
Notebook Source: batch_correction.ipynb

In [1]:
import pegasus as pg

Dataset

In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, and show how to use the batch correction methods in Pegasus to tackle the batch effects in the data.

The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).

Now load the count matrix:

In [2]:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
2020-08-20 20:53:05,729 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded.
2020-08-20 20:53:05,730 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.24s.
Out[2]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 48219 x 36601
    Genome: GRCh38; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel'
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

'Channel' is the batch key. Each batch is associated with one donor, so there are 8 batches in total.

Preprocessing

First, preprocess the data. This includes filtration, selecting robust genes, and log-normalization:

In [3]:
pg.qc_metrics(data, percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)
2020-08-20 20:53:06,293 - pegasusio.qc_utils - INFO - After filtration, 35465 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna.
2020-08-20 20:53:06,710 - pegasus.tools.preprocessing - INFO - After filtration, 25653/36601 genes are kept. Among 25653 genes, 17516 genes are robust.
2020-08-20 20:53:07,649 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.94s.

After quality-control, distribution of cells in all the 8 batches is:

In [4]:
data.obs['Channel'].value_counts()
Out[4]:
MantonBM2_HiSeq_1    4935
MantonBM6_HiSeq_1    4665
MantonBM8_HiSeq_1    4511
MantonBM7_HiSeq_1    4452
MantonBM1_HiSeq_1    4415
MantonBM3_HiSeq_1    4225
MantonBM4_HiSeq_1    4172
MantonBM5_HiSeq_1    4090
Name: Channel, dtype: int64

Clustering without Correcting Batch Effects

We first perform downstream steps without considering batch effects. In this way, you can see where the batch effects exist, and moreover, we'll use this result as the baseline when comparing different batch correction methods.

In [5]:
data_baseline = data.copy()
pg.highly_variable_features(data_baseline, consider_batch=False)
2020-08-20 20:53:08,502 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2020-08-20 20:53:08,502 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.57s.

In this tutorial, the downstream steps consists of:

  • PCA calculation: set robust=True for reproducibility;
  • kNN graph construction, and Louvain clustering based on it;
  • Calculate UMAP embedding, and show UMAP plot.
In [6]:
pg.pca(data_baseline, robust=True)
pg.neighbors(data_baseline)
pg.louvain(data_baseline)
pg.umap(data_baseline)
pg.scatter(data_baseline, attrs=['louvain_labels', 'Channel'], basis='umap')
2020-08-20 20:53:14,368 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 5.86s.
2020-08-20 20:53:18,373 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.00s.
2020-08-20 20:53:19,410 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.04s.
2020-08-20 20:53:20,687 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.27s.
2020-08-20 20:53:38,158 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get <function cluster at 0x7ff2902fc200> clusters.
2020-08-20 20:53:38,284 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 18.87s.
2020-08-20 20:53:38,297 - pegasus.tools.visualization - INFO - UMAP(min_dist=0.5, random_state=0, verbose=True)
2020-08-20 20:53:38,298 - pegasus.tools.visualization - INFO - Construct fuzzy simplicial set
2020-08-20 20:53:40,107 - pegasus.tools.visualization - INFO - Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
2020-08-20 20:54:01,246 - pegasus.tools.visualization - INFO - UMAP is calculated. Time spent = 22.96s.
2020-08-20 20:54:01,247 - pegasus.tools.visualization - INFO - Function 'umap' finished in 22.96s.

Let's have a quick look at the UMAP plot above. If checking the cells in Louvain cluster 10 and 13 in the right-hand-side plot, you can see that most of them are from sample MantonBM3_HiSeq_1. This indicates strong batch effects.

Batch Correction Methods

Batch effect occurs when data samples are generated in different conditions, such as date, weather, lab setting, equipment, etc. Unless informed that all the samples were generated under the similar condition, people may suspect presumable batch effects if they see a visualization graph with samples kind-of isolated from each other.

In this tutorial, you'll see how to apply the batch correction methods in Pegasus to this dataset.

As a common step ahead, we need to re-select HVGs considering batch effects:

In [7]:
pg.highly_variable_features(data, consider_batch=True)
2020-08-20 20:54:03,131 - pegasus.tools.hvf_selection - INFO - Estimation on feature statistics per channel is finished. Time spent = 0.71s.
2020-08-20 20:54:03,173 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2020-08-20 20:54:03,173 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.75s.

Harmony Algorithm

Harmony is a widely-used method for data integration. Pegasus uses harmony-pytorch package to perform Harmony batch correction.

Harmony works on PCA matrix. So we need to first calculate the original PCA matrix:

In [8]:
data_harmony = data.copy()
pg.pca(data_harmony, robust=True)
2020-08-20 20:54:09,033 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 5.76s.

In pca function, robust=True is for reproducibility purpose.

Now we are ready to run Harmony integration:

In [9]:
harmony_key = pg.run_harmony(data_harmony)
2020-08-20 20:54:09,344 - pegasus.tools.batch_correction - INFO - Start integration using Harmony.
	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).
	Completed 3 / 10 iteration(s).
	Completed 4 / 10 iteration(s).
	Completed 5 / 10 iteration(s).
	Completed 6 / 10 iteration(s).
	Completed 7 / 10 iteration(s).
	Completed 8 / 10 iteration(s).
	Completed 9 / 10 iteration(s).
	Completed 10 / 10 iteration(s).
Reach convergence after 10 iteration(s).
2020-08-20 20:54:29,357 - pegasus.tools.batch_correction - INFO - Function 'run_harmony' finished in 20.32s.

When finished, the corrected PCA matrix is stored in data_harmony.obsm['X_pca_harmony'], and run_harmony returns the representation key 'pca_harmony' as variable harmony_key. In the downstream steps, you can set rep parameter to either harmony_key or 'pca_harmony' in Pegasus functions whenever applicable.

For details on parameters of run_harmony other than the default setting, please see here.

With the new corrected PCA matrix, we can perform kNN-graph-based clustering and calculate UMAP embeddings as follows:

In [10]:
pg.neighbors(data_harmony, rep=harmony_key)
pg.louvain(data_harmony, rep=harmony_key)
pg.umap(data_harmony, rep=harmony_key)
2020-08-20 20:54:34,284 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.92s.
2020-08-20 20:54:35,290 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.01s.
2020-08-20 20:54:36,738 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.45s.
2020-08-20 20:54:56,995 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get <function cluster at 0x7ff2902fc200> clusters.
2020-08-20 20:54:57,124 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 21.83s.
2020-08-20 20:54:57,134 - pegasus.tools.visualization - INFO - UMAP(min_dist=0.5, random_state=0, verbose=True)
2020-08-20 20:54:57,135 - pegasus.tools.visualization - INFO - Construct fuzzy simplicial set
2020-08-20 20:54:57,281 - pegasus.tools.visualization - INFO - Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
2020-08-20 20:55:17,222 - pegasus.tools.visualization - INFO - UMAP is calculated. Time spent = 20.10s.
2020-08-20 20:55:17,224 - pegasus.tools.visualization - INFO - Function 'umap' finished in 20.10s.

Then show UMAP plot:

In [11]:
pg.scatter(data_harmony, attrs=['louvain_labels', 'Channel'], basis='umap')

L/S Adjustment Method

Pegasus also provides a canonical Location and Scale (L/S) Adjustment method (See reference) for batch correction. Different from Harmony, L/S method modifies the log-normalized count matrix. It is faster, and works well on large-scale datasets.

Pegasus uses this method for batch correction in Cumulus paper.

After HVG selection, directly run correct_batch function to perform L/S batch correction:

In [12]:
data_ls = data.copy()
pg.correct_batch(data_ls, features='highly_variable_features')
2020-08-20 20:55:18,720 - pegasus.tools.batch_correction - INFO - Adjustment parameters are estimated.
2020-08-20 20:55:19,357 - pegasus.tools.batch_correction - INFO - Features are selected.
2020-08-20 20:55:20,059 - pegasus.tools.batch_correction - INFO - Batch correction is finished. Time spent = 0.71s.

In correct_batch function, features parameter specifies which genes/features to consider in batch correction. By default, it considers all the features. Here, as we've already selected a HVG set, we can assign its key in data_ls.var field, which is 'highly_variable_features', to this parameter.

In [13]:
data_ls.uns['fmat_highly_variable_features'].shape
Out[13]:
(35465, 2000)

As shown above, the corrected count matrix is stored at adata.uns['fmt_highly_variable_features'], with dimension cell-by-HVG.

See its documnetation for customization.

Now we can perform the downstream analysis:

In [14]:
pg.pca(data_ls, robust=True)
pg.neighbors(data_ls)
pg.louvain(data_ls)
pg.umap(data_ls)
2020-08-20 20:55:26,347 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 6.28s.
2020-08-20 20:55:31,270 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.92s.
2020-08-20 20:55:32,204 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.93s.
2020-08-20 20:55:33,568 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.36s.
2020-08-20 20:55:49,701 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get <function cluster at 0x7ff2902fc200> clusters.
2020-08-20 20:55:49,822 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 17.62s.
2020-08-20 20:55:49,832 - pegasus.tools.visualization - INFO - UMAP(min_dist=0.5, random_state=0, verbose=True)
2020-08-20 20:55:49,833 - pegasus.tools.visualization - INFO - Construct fuzzy simplicial set
2020-08-20 20:55:49,974 - pegasus.tools.visualization - INFO - Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
2020-08-20 20:56:10,040 - pegasus.tools.visualization - INFO - UMAP is calculated. Time spent = 20.22s.
2020-08-20 20:56:10,040 - pegasus.tools.visualization - INFO - Function 'umap' finished in 20.22s.
In [15]:
pg.scatter(data_ls, attrs=['louvain_labels', 'Channel'], basis='umap')

Scanorama Algorithm

Scanorama is another widely-used batch correction algorithm. Pegasus uses scanorama package.

Similarly as Harmony, Scanorama corrects the original PCA matrix of the dataset. But as PCA step is already integrated in run_scanorama function, we don't need to run pca seperately:

In [16]:
data_scan = data.copy()
scan_key = pg.run_scanorama(data_scan)
2020-08-20 20:56:11,467 - pegasus.tools.batch_correction - INFO - Start integration using Scanorama.
Found 2000 genes among all datasets
[[0.         0.70305776 0.29467456 0.50237826 0.52420538 0.46795284
  0.57802945 0.3832853 ]
 [0.         0.         0.13846154 0.66119554 0.32885086 0.60425532
  0.66666667 0.22256706]
 [0.         0.         0.         0.17809204 0.61633136 0.08660236
  0.16923077 0.55147929]
 [0.         0.         0.         0.         0.46234719 0.63590604
  0.54650048 0.42917313]
 [0.         0.         0.         0.         0.         0.40977995
  0.49731051 0.67872861]
 [0.         0.         0.         0.         0.         0.
  0.68081458 0.33440514]
 [0.         0.         0.         0.         0.         0.
  0.         0.59565507]
 [0.         0.         0.         0.         0.         0.
  0.         0.        ]]
Processing datasets (0, 1)
Processing datasets (5, 6)
Processing datasets (4, 7)
Processing datasets (1, 6)
Processing datasets (1, 3)
Processing datasets (3, 5)
Processing datasets (2, 4)
Processing datasets (1, 5)
Processing datasets (6, 7)
Processing datasets (0, 6)
Processing datasets (2, 7)
Processing datasets (3, 6)
Processing datasets (0, 4)
Processing datasets (0, 3)
Processing datasets (4, 6)
Processing datasets (0, 5)
Processing datasets (3, 4)
Processing datasets (3, 7)
Processing datasets (4, 5)
Processing datasets (0, 7)
Processing datasets (5, 7)
Processing datasets (1, 4)
Processing datasets (0, 2)
Processing datasets (1, 7)
Processing datasets (2, 3)
Processing datasets (2, 6)
Processing datasets (1, 2)
2020-08-20 20:57:27,399 - pegasus.tools.batch_correction - INFO - Function 'run_scanorama' finished in 75.95s.

You can check details on run_scanorama parameters here.

By default, it considers count matrix only regarding the selected HVGs, and calculates the corrected PCA matrix of $50$ PCs. When finished, this new PCA matrix is stored in data_scan.obsm['X_scanorama'], and returns its representation key 'scanorama' as variable scan_key. In the downstream steps, you can set rep parameter to either scan_key or 'scanorama' in Pegasus functions whenever applicable:

In [17]:
pg.neighbors(data_scan, rep=scan_key)
pg.louvain(data_scan, rep=scan_key)
pg.umap(data_scan, rep=scan_key)
2020-08-20 20:57:32,292 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.89s.
2020-08-20 20:57:33,301 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.01s.
2020-08-20 20:57:34,647 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.35s.
2020-08-20 20:57:45,821 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get <function cluster at 0x7ff2902fc200> clusters.
2020-08-20 20:57:45,946 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 12.64s.
2020-08-20 20:57:45,958 - pegasus.tools.visualization - INFO - UMAP(min_dist=0.5, random_state=0, verbose=True)
2020-08-20 20:57:45,959 - pegasus.tools.visualization - INFO - Construct fuzzy simplicial set
2020-08-20 20:57:46,124 - pegasus.tools.visualization - INFO - Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
2020-08-20 20:58:06,645 - pegasus.tools.visualization - INFO - UMAP is calculated. Time spent = 20.70s.
2020-08-20 20:58:06,647 - pegasus.tools.visualization - INFO - Function 'umap' finished in 20.70s.

Now check its UMAP plot:

In [18]:
pg.scatter(data_scan, attrs=['louvain_labels', 'Channel'], basis='umap')

Comparing Batch Correction Methods

To compare the performance on the three methods, one metric is runtime, which you can see from the logs in sections above: L/S Adjustment method is the fastest, then Harmony, and Scanorama is the slowest.

In this section, we'll use 2 other metrics for comparison:

  • kBET acceptance rate: kBET measures whether batches are well-mixed in the local neighborhood of each cell. Then kBET acceptance rate is the percent of cells with kBET p-values $\geq 0.05$. The higher, the better.
  • kSIM acceptance rate: kSIM measures whether cells of the same pre-annotated cell type are still close to each other in the local neighborhoods after batch correction. Then kSIM acceptance rate is the percent of cells with kSIM p-values $\geq 0.05$. We use this metric to reflect whether known biological relationships are preserved after correction. The higher, the better.

We have 4 results: No batch correction (Baseline), Harmony, L/S, and Scanorama. For each result, kBET and kSIM acceptance rates are calculated on its (corrected) PCA matrix for comparison.

Details on these 2 metrics can be found in Cumulus paper.

kBET Acceptance Rate

We can use calc_kBET function to calculate on kBET acceptance rates. Besides,

  • For attr parameter, use the batch key, which is 'Channel' in this tutorial.
  • For rep parameter, set to the corresponding (corrected) PCA matrix if needed. By default, it uses 'pca' representation key;
  • For returned values, we only care about kBET acceptance rates, so just ignore the first two returned values.
In [19]:
_, _, kBET_baseline = pg.calc_kBET(data_baseline, attr='Channel')
_, _, kBET_harmony = pg.calc_kBET(data_harmony, attr='Channel', rep=harmony_key)
_, _, kBET_ls = pg.calc_kBET(data_ls, attr='Channel')
_, _, kBET_scan = pg.calc_kBET(data_scan, attr='Channel', rep=scan_key)
2020-08-20 20:58:07,750 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:07,750 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:17,030 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:17,031 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:19,873 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:19,874 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:22,753 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:22,754 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.

kSIM Acceptance Rate

We need pre-annotated cell type information as ground truth to calculate kSIM acceptance rate. This is achieved by:

  • Starting with Louvain clusters in the baseline case;
  • Merging two CD14+ Monocyte clusters, as one is donor-specific;
  • For the only cluster with no strong cell type evidence, classifying its cells using a Gradient Boosting model trained from its neighbor T-cell clusters.

This ground truth is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/cell_types.csv, or we can get it using gsutil from its Google bucket URL (gs://terra-featured-workspaces/Cumulus/cell_types.csv):

Now load this file, and attach its 'cell_types' column to the 4 resulting count matrices above:

In [20]:
import pandas as pd
import numpy as np

df_celltypes = pd.read_csv("cell_types.csv", index_col='barcodekey')

assert np.sum(df_celltypes.index!=data_baseline.obs_names) == 0
data_baseline.obs['cell_types'] = df_celltypes['cell_types']

assert np.sum(df_celltypes.index!=data_harmony.obs_names) == 0
data_harmony.obs['cell_types'] = df_celltypes['cell_types']

assert np.sum(df_celltypes.index!=data_ls.obs_names) == 0
data_ls.obs['cell_types'] = df_celltypes['cell_types']

assert np.sum(df_celltypes.index!=data_scan.obs_names) == 0
data_scan.obs['cell_types'] = df_celltypes['cell_types']

We can then use calc_kSIM function to calculate kSIM acceptance rates. Besides,

  • For attr parameter, use the ground truth key 'cell_types';
  • For rep parameter, similarly as in kBET section, set to the corresponding (corrected) PCA matrix if needed. By default, it uses 'pca' representation key;
  • For returned values, we only care about kSIM acceptance rates, so just ignore the first returned values.
In [21]:
_, kSIM_baseline = pg.calc_kSIM(data_baseline, attr='cell_types')
_, kSIM_harmony = pg.calc_kSIM(data_harmony, attr='cell_types', rep=harmony_key)
_, kSIM_ls = pg.calc_kSIM(data_ls, attr='cell_types')
_, kSIM_scan = pg.calc_kSIM(data_scan, attr='cell_types', rep=scan_key)
2020-08-20 20:58:25,845 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:25,846 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:25,862 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:25,862 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:25,878 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:25,878 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2020-08-20 20:58:25,888 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2020-08-20 20:58:25,888 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.

Performance Plot

Now draw a scatterplot regarding these two metrics on the 4 results:

In [22]:
import seaborn as sns
import matplotlib.pyplot as plt

df_plot = pd.DataFrame({'method': ['Baseline', 'Harmony', 'L/S', 'Scanorama'],
                        'kBET': [kBET_baseline, kBET_harmony, kBET_ls, kBET_scan],
                        'kSIM': [kSIM_baseline, kSIM_harmony, kSIM_ls, kSIM_scan]})

plt.figure(dpi=100)
ax = sns.scatterplot(x = 'kSIM', y = 'kBET', hue = 'method', data = df_plot, legend = False)
for line in range(0, df_plot.shape[0]):
    x_pos = df_plot.kSIM[line] + 0.003
    if df_plot.method[line] == 'Baseline':
        x_pos = df_plot.kSIM[line] - 0.003
    y_pos = df_plot.kBET[line]
    alignment = 'right' if df_plot.method[line] == 'Baseline' else 'left'
    ax.text(x_pos, y_pos, df_plot.method[line], ha = alignment, size = 'medium', color = 'black')
plt.xlabel('kSIM acceptance rate')
plt.ylabel('kBET acceptance rate')
Out[22]:
Text(0, 0.5, 'kBET acceptance rate')

As this plot shows, a trade-off exists between good mixture of cells and maintaining the biology well. Harmony method achieves the best mixture of cells, while its consistency with the ground truth biology is the least. Scanorama seems to have the best balance between the two measurements.

Therefore, in general, the choice of batch correction method really depends on the dataset and your analysis goal.