Skip to contents

Introduction

The R package malbacR is a new package that deals with batch correction methods of small molecule omics data. It works in conjunction with pmartR, an R package designed for preprocessing, filtering, and analyzing multi-omics data (Stratton et al. 2019; Degnan et al. 2023).

This outline demonstrates the use of the function rank_becas inside the the malbacR package. This function allows users to rank batch correction methods based on multiple commonly used metrics, the coefficient of variation and the distance between centroids of batch clusters from a PCA. Additionally, we incorporate a new metric that is more robust to data that has both a strong biological signal as well as data that may not have a strong of a signal - the difference in R2m/R2c between the batch corrected data and the unnormalized data.

First we load in the necessary libraries which is the malbacR package itself and pmartR (Degnan et al. 2023) which malbacR works with compatibly. Additionally, ggplot2 is loaded for some slight graphical manipulations.

We load in the pmart_amide data set from the package malbacR originally found in the package WaveICA2.0 (Deng and Li 2022).

data(pmart_amide)

We then run through some general pre-processing steps with the use of functions from pmartR by assigning a group designation, log2 transforming the data, and run a median centering normalization.

pmart_amide <- group_designation(pmart_amide,main_effects = "group",batch_id = "batch")
pmart_amide_log <- edata_transform(pmart_amide,"log2")
pmart_amide_norm <- normalize_global(pmart_amide_log,subset_fn = "all",norm_fn = "median",
                                apply_norm = TRUE,backtransform = TRUE)

We then run eight different BECAs using the corresponding functions in malbacR using the preferred data input (i.e. log2 transformed or raw abundance values, normalized data or unnormalized data) for each specific BECA. Missing values are imputed for BECAs that require complete data.

# range scaling
amide_range <- bc_range(omicsData = pmart_amide_log)

# power scaling
amide_power <- bc_power(omicsData = pmart_amide_log)

# pareto scaling
amide_pareto <- bc_pareto(omicsData = pmart_amide_log)

# tiger
tigerFilt <- tiger_filter(pmart_amide,sampletype_cname = "group",test_val = "QC")
pmart_amideFilt <- apply_tigerFilt(tigerFilt,pmart_amide)
amide_tiger_abundance <- bc_tiger(omicsData = pmart_amideFilt,sampletype_cname = "group",
                        test_val = "QC",injection_cname = "Injection_order",group_cname = "group")
amide_tiger <- edata_transform(omicsData = amide_tiger_abundance, data_scale = "log2")

# serrf
impObj <- imputation(omicsData = pmart_amide_log)
amide_imp_log <- apply_imputation(impObj,pmart_amide_log)
amide_imp_raw <- edata_transform(amide_imp_log,"abundance")

amide_serrf_abundance <- bc_serrf(omicsData = amide_imp_raw,sampletype_cname = "group",test_val = "QC",group_cname = "group")
amide_serrf <- edata_transform(omicsData = amide_serrf_abundance,data_scale = "log2")

# qcrfsc
amide_qcrfsc_abundance <- bc_qcrfsc(omicsData = amide_imp_raw,qc_cname = "group",qc_val = "QC",
                                    order_cname = "Injection_order",group_cname = "group",
                                    ntree = 500, keep_qc = FALSE)
amide_qcrfsc <- edata_transform(omicsData = amide_qcrfsc_abundance,data_scale = "log2")

# combat
amide_combat <- bc_combat(omicsData = pmart_amide_norm, use_groups = FALSE)

# eigenMS
amide_eigen <- bc_eigenMS(omicsData = pmart_amide_log)

As some methods (QC-RFSC, TIGER, and SERRF) remove the QC samples in the process, we remove those samples from the other methods as well.

qc_samples <- pmart_amide_log$f_data$SampleID[which(pmart_amide_log$f_data$group == "QC")]

cfilt <- pmartR::custom_filter(amide_combat, f_data_remove = qc_samples)
amide_combat <- pmartR::applyFilt(cfilt,amide_combat)

cfilt <- pmartR::custom_filter(amide_eigen, f_data_remove = qc_samples)
amide_eigen <- pmartR::applyFilt(cfilt,amide_eigen)

cfilt <- pmartR::custom_filter(amide_power, f_data_remove = qc_samples)
amide_power <- pmartR::applyFilt(cfilt,amide_power)

cfilt <- pmartR::custom_filter(amide_pareto, f_data_remove = qc_samples)
amide_pareto <- pmartR::applyFilt(cfilt,amide_pareto)

cfilt <- pmartR::custom_filter(amide_range, f_data_remove = qc_samples)
amide_range <- pmartR::applyFilt(cfilt,amide_range)

We then create a named list becas which contains the pmartR objects for each of the 8 different BECAs being compared.

becas = list(Power = amide_power, Pareto = amide_pareto, Range = amide_range,
             ComBat = amide_combat, EigenMS = amide_eigen,
             QCRFSC = amide_qcrfsc, TIGER = amide_tiger, SERRF = amide_serrf)

Ranking BECAs

We then rank the methods with three different potential metrics: the coefficient of variation (CV), the distance between batch clusters in a PCA, and the difference in R2m/R2c between batch corrected data and unnormalized data. This metric utilizes a mixed effect model for each molecule and determines the conditional and marginal R2 values based on the method introduced by Nakagawa and Schielzeth (Nakagawa and Schielzeth 2012). It then subtracts the ratio of the marginal R2 and the conditional R2 of the unnormalized dataset from the ratio of the batch corrected data and determines the median value. In general, we recommend this final metric as it is more robust to data that has both a large biological signal as well as to data in which there is less biological signal.

We rank the data using the function rank_becas in malbacR. This function has five arguments:

  1. omicsData_beca_list: a named list with the pmartR objects of batch corrected data
  2. comparison_method: a string supplying the metric used to compare BECAs (cv for the coefficient of variation, distance_pca for the distance between clusters of batches in PCA, and r2_diff for the difference in R2m/R2c between batch corrected data and unnormalized data)
  3. batch_effect_cname: a string supplying the column name in f_data element of a pmartR object that contains batch information for each sample
  4. main_effect_cname: a string supplying the column name in f_data element of a pmartR object that contains group/main effect information for each sample
  5. omicsData_unnormalized: pmartR object containing the unnormalized data (this is only required if the comparison_method is r2_diff, otherwise the default is NULL)

The output of this function is a dataframe with three columns. Each row in the dataframe corresponds to one of the pmartR objects in omicsData_beca_list. The first column is BECA which corresponds to the names supplied to the named list in omicsData_beca_list, the second column is Value which is the value of the metric for the specific batch corrected dataset, and the final column is Ranking which is simply an integer from 1 to the number of datasets being analyzed (with 1 being the best BECA based on that specific metric).

First, we rank the BECAs using the coefficient of variation.

cv_rankings <- rank_becas(omicsData_beca_list = becas, comparison_method = "cv",
                          batch_effect_cname = "batch", main_effect_cname = "group")
cv_rankings
##      BECA  Value Ranking
## 1   Power  7.818       1
## 2   Range 10.529       2
## 3 EigenMS 39.976       3
## 4   SERRF 43.358       4
## 5  ComBat 46.881       5
## 6   TIGER 47.648       6
## 7  QCRFSC 50.567       7
## 8  Pareto 62.897       8

Next, we rank the data based on the distance between batch clusters in a PCA.

pca_rankings <- rank_becas(omicsData_beca_list = becas, comparison_method = "distance_pca",
                          batch_effect_cname = "batch", main_effect_cname = "group")
pca_rankings
##      BECA Value Ranking
## 1  ComBat 0.012       1
## 2 EigenMS 0.021       2
## 3   SERRF 0.039       3
## 4  QCRFSC 0.418       4
## 5   TIGER 0.481       5
## 6   Power 0.891       6
## 7  Pareto 0.894       7
## 8   Range 0.894       8

Finally, we rank the data using the difference in R2m/R2c for batch corrected data and unnormalized data.

r2_rankings <- rank_becas(omicsData_beca_list = becas, comparison_method = "r2_diff",
                          batch_effect_cname = "batch", main_effect_cname = "group",
                          omicsData_unnormalized = pmart_amide_log)
r2_rankings
##      BECA  Value Ranking
## 1  ComBat  0.991       1
## 2 EigenMS  0.990       2
## 3   SERRF  0.298       3
## 4  QCRFSC  0.011       4
## 5   TIGER  0.007       5
## 6   Power -0.002       6
## 7  Pareto -0.002       7
## 8   Range -0.002       8

It becomes evident that some of these metrics contradict each other. For example, using the CV, Power Scaling becomes the preferred BECA of choice, but it is ranked 6th by the other 2 metrics. In this scenario, the distance between clusters and the difference in R2m/R2c have very similar rankings as there is a strong biological signal within this dataset. However, in scenarios where there is less signal from the main effect, the distance between PCA clusters is likely to underperform. In contrast, the difference in R2m/R2c is more robust to the amount of biological signal from the data.

References

Degnan, David J, Kelly G Stratton, Rachel Richardson, Daniel Claborne, Evan A Martin, Nathan A Johnson, Damon Leach, Bobbie-Jo M Webb-Robertson, and Lisa M Bramer. 2023. “pmartR 2.0: A Quality Control, Visualization, and Statistics Pipeline for Multiple Omics Datatypes.” Journal of Proteome Research 22 (2): 570–76. https://doi.org/10.1021/acs.jproteome.2c00610.
Deng, Kui, and Kang Li. 2022. WaveICA2.0: WaveICA 2.0.
Nakagawa, Shinichi, and Holger Schielzeth. 2012. “A General and Simple Method for Obtaining R2 from Generalized Linear Mixed-Effects Models.” Methods in Ecology and Evolution 4 (2): 133–42. https://doi.org/10.1111/j.2041-210x.2012.00261.x.
Stratton, Kelly G, Bobbie-Jo M Webb-Robertson, Lee Ann McCue, Daniel Claborne, Bryan Stanfill, Iobani Godinez, Thomas Johansen, et al. 2019. “pmartR: Quality Control and Statistics for Mass Spectrometry-Based Biological Data.” Journal of Proteome Research.