malbacR: A Vignette Pertaining to Pre-Filtered Data

The R package malbacR is a new package that deals with batch correction methods of small molecule omics data. It works in conjunction with pmartR, an R package designed for preprocessing, filtering, and analyzing multi-omics data (Stratton et al. 2019; Degnan et al. 2023).

This outline demonstrates the functionality and use of the malbacR package. Within this package, there are four data sets: pmart_amide, pmart_amideFilt,pmart_mix, and pmart_mixFilt. The first two objects contain data originally found in the package WaveICA2.0 (Deng and Li 2022). The second two objects contain data originally found in the package crmn (Redestig et al. 2009). The data sets with the suffix “Filt” are filtered versions of the original data sets. That is, unlike their non-filtered counterparts, these data objects require no additional data manipulation for batch correction to successfully run. For this example, we work with the filtered data sets to demonstrate how the malbacR functions work.

Within malbacR there are 12 batch correction methods:

Range Scaling (Berg et al. 2006)
Power Scaling (Berg et al. 2006)
Pareto Scaling (Berg et al. 2006)
ComBat (Müller et al. 2016)
EigenMS (Karpievitch et al. 2014)
NOMIS (Sysi-Aho et al. 2007)
RUV-random (De Livera et al. 2015)
QC-RLSC (Dunn et al. 2011)
WaveICA2.0 (Deng et al. 2019)
TIGER (Han et al. 2022)
SERRF (Fan et al. 2019)
QC-RFSC (Luan et al. 2018)

library(malbacR)
library(pmartR)
library(ggplot2)

Data Set 1: Amide Data

Load in the Data

A pmartR friendly version of the “Amide” data that has already undergone filtering and log2 transformations is already implemented within malbacR. Therefore, we simply load in the data.

data(pmart_amideFilt)

Run Batch Correction Methods

The “Amide” data set contains information regarding quality control samples. Therefore the following methods can be used:

Range Scaling
Power Scaling
Pareto Scaling
TIGER
EigenMS
WaveICA2.0
SERRF
ComBat
QCRLSC
QCRFSC

It is important to note that the data pmart_amideFilt has been filtered, transformed, and imputed such that the data runs with all possible methods, but it may be conservative for a given method. For example, some methods do not require imputation to run, but because some methods require imputation (like SERRF and WaveICA2.0) (Fan et al. 2019; Deng et al. 2019), the data has been imputed.

The data is currently on a log2 scale, therefore, the data must be converted to abundance values for the following methods: SERRF, TIGER, QC-RFSC, and WaveICA.

# SCALING METHODS
# range scaling
amide_range <- bc_range(omicsData = pmart_amideFilt)

# power scaling
amide_power <- bc_power(omicsData = pmart_amideFilt)

# pareto scaling
amide_pareto <- bc_pareto(omicsData = pmart_amideFilt)

# QUALITY CONTROL METHODS
# TIGER
pmart_amideFilt_abundance <- edata_transform(omicsData = pmart_amideFilt, data_scale = "abundance")
amide_tiger_abundance <- bc_tiger(omicsData = pmart_amideFilt_abundance,sampletype_cname = "group",
                        test_val = "QC",injection_cname = "Injection_order",group_cname = "group")
amide_tiger <- edata_transform(omicsData = amide_tiger_abundance, data_scale = "log2")

# QC-RLSC
amide_qcrlsc <- bc_qcrlsc(omicsData = pmart_amideFilt,block_cname = "batch",
                          qc_cname = "group", qc_val = "QC", order_cname = "Injection_order",
                          missing_thresh = 0.5, rsd_thresh = 0.3, backtransform  = FALSE,keep_qc = FALSE)

# SERRF
amide_serrf_abundance <- bc_serrf(omicsData = pmart_amideFilt_abundance,sampletype_cname = "group",test_val = "QC",group_cname = "group")
amide_serrf <- edata_transform(omicsData = amide_serrf_abundance, data_scale = "log2")

# OTHER METHODS
# ComBat
amide_combat <- bc_combat(omicsData = pmart_amideFilt, use_groups = FALSE)

# EigenMS
amide_eigen <- bc_eigenMS(omicsData = pmart_amideFilt)

# WaveICA2.0
amide_wave_abundance <- bc_waveica(omicsData = pmart_amideFilt_abundance, injection_cname = "Injection_order",
                         version = "WaveICA2.0", alpha = 0, cutoff_injection = 0.1, K = 10,
                         negative_to_na = TRUE)
amide_wave <- edata_transform(omicsData = amide_wave_abundance, data_scale = "log2")

# QC-RFSC
amide_qcrfsc_abundance <- bc_qcrfsc(omicsData = pmart_amideFilt_abundance,qc_cname = "group",qc_val = "QC",
                          order_cname = "Injection_order",ntree = 500, keep_qc = FALSE,group_cname = "group")
amide_qcrfsc <- edata_transform(amide_qcrfsc_abundance, data_scale = "log2")

Data Visualization

After obtaining all the different batch corrected data sets, we can plot the PPCA to see if they are successfully returning batch corrected data. We set the group_designation to use our batch information as the main_effects so as to color the data by batch. All of this code is run using functions from pmartR demonstrating the utility between the two packages.

p1 <- plot(dim_reduction(omicsData = pmart_amideFilt),omicsData = pmart_amide,color_by = "batch") + labs(title = "Amide: Unadjusted")
p2 <- plot(dim_reduction(omicsData = amide_range),omicsData = amide_range,color_by = "batch") + labs(title = "Amide: Range")
p3 <- plot(dim_reduction(omicsData = amide_power),omicsData = amide_power,color_by = "batch") + labs(title = "Amide: Power")
p4 <- plot(dim_reduction(omicsData = amide_pareto),omicsData = amide_pareto,color_by = "batch") + labs(title = "Amide: Pareto")
p5 <- plot(dim_reduction(omicsData = amide_tiger),omicsData = amide_tiger,color_by = "batch") + labs(title = "Amide: TIGER")
p6 <- plot(dim_reduction(omicsData = amide_qcrlsc),omicsData = amide_qcrlsc,color_by = "batch") + labs(title = "Amide: QC-RLSC")
p7 <- plot(dim_reduction(omicsData = amide_combat),omicsData = amide_combat,color_by = "batch") + labs(title = "Amide: ComBat")
p8 <- plot(dim_reduction(omicsData = amide_wave),omicsData = amide_wave,color_by = "batch") + labs(title = "Amide: WaveICA2.0")
p9 <- plot(dim_reduction(omicsData = amide_serrf),omicsData = amide_serrf,color_by = "batch") + labs(title = "Amide: SERRF")
p10 <- plot(dim_reduction(omicsData = amide_qcrfsc),omicsData = amide_qcrfsc,color_by = "batch") + labs(title = "Amide: QC-RFSC")
p11 <- plot(dim_reduction(omicsData = amide_eigen),omicsData = amide_eigen,color_by = "batch") + labs(title = "Amide: EigenMS")

p1;p2;p3;p4;p5;p6;p7;p8;p9;p10;p11

Data Set 2: Mix Data

Load in the Data

The malbacR package includes another pmartR friendly data set for the “mix” data which has already undergone filtering and log2 transformations is already implemented within malbacR. Therefore, we simply load in the data.

data(pmart_mixFilt)

Run Batch Correction Methods

The “mix” data set contains information regarding negative controls/internal standards. Therefore the following methods can be used:

Range Scaling
Power Scaling
Pareto Scaling
RUV-random
NOMIS
ComBat

As there is no information regarding QC samples or injection order, the other batch correction methods in the package or not able to be used with this data set.

As with the pmart_amideFilt data set, is important to note that the data in pmart_mixFilt has been filtered and transformed such that the data runs with all possible methods, but it may be conservative for a given method. As there were no missing data with the “mix” data set, no imputation was computed.

The data is currently on a log2 scale, therefore, the data must be converted to abundance values for the following methods: NOMIS and WaveICA.

# SCALING METHODS
# range scaling
mix_range <- bc_range(omicsData = pmart_mixFilt)

# pareto scaling
mix_pareto <- bc_pareto(omicsData = pmart_mixFilt)

# power scaling
mix_power <- bc_power(omicsData = pmart_mixFilt)

# INTERNAL STANDARDS/NEGATIVE CONTROLS
# RUV-random
mix_ruv <- bc_ruvRandom(omicsData = pmart_mixFilt, nc_cname = "tag",nc_val = "IS", k = 3)

# NOMIS
pmart_mixFilt_abundance <- edata_transform(omicsData = pmart_mixFilt, data_scale = "abundance")
mix_nomis_abundance <- bc_nomis(omicsData = pmart_mixFilt_abundance, is_cname = "tag", is_val = "IS", num_pc = 2)
mix_nomis <- edata_transform(omicsData = mix_nomis_abundance, data_scale = "log2")

# OTHER METHODS
mix_combat <- bc_combat(omicsData = pmart_mixFilt)

mix_waveica_abundance <- bc_waveica(omicsData = pmart_mixFilt_abundance,batch_cname = "BatchNum",
                                    version = "WaveICA",alpha = 0, K = 20, cutoff_batch = 0.05, 
                                    cutoff_group = 0.05,negative_to_na = TRUE)

## ######Decomposition 1 ########
## ######Decomposition 2 ########
## ######Decomposition 3 ########
## ######Decomposition 4 ########
## ######Decomposition 5 ########
## ######Decomposition 6 ########
## ######Decomposition 7 ########
## ######Decomposition 8 ########
## ######Decomposition 9 ########
## ######Decomposition 10 ########
## ######Decomposition 11 ########
## ######Decomposition 12 ########
## ######Decomposition 13 ########
## ######Decomposition 14 ########
## ######Decomposition 15 ########
## ######Decomposition 16 ########
## ######Decomposition 17 ########
## ######Decomposition 18 ########
## ######Decomposition 19 ########
## ######Decomposition 20 ########
## ######Decomposition 21 ########
## ######Decomposition 22 ########
## ######Decomposition 23 ########
## ######Decomposition 24 ########
## ######Decomposition 25 ########
## ######Decomposition 26 ########
## ######Decomposition 27 ########
## ######Decomposition 28 ########
## ######Decomposition 29 ########
## ######Decomposition 30 ########
## ######Decomposition 31 ########
## ######Decomposition 32 ########
## ######Decomposition 33 ########
## ######Decomposition 34 ########
## ######Decomposition 35 ########
## ######Decomposition 36 ########
## ######Decomposition 37 ########
## ######Decomposition 38 ########
## ######Decomposition 39 ########
## ######Decomposition 40 ########
## ######Decomposition 41 ########
## ######Decomposition 42 ########
## ######Decomposition 43 ########
## ######Decomposition 44 ########
## ######Decomposition 45 ########
## ######Decomposition 46 ########
## ######### ICA 1 #############
## [1] "Removing 0 components with P value less than 0.05"
## ######### ICA 2 #############
## [1] "Removing 10 components with P value less than 0.05"
## ######### ICA 3 #############
## [1] "Removing 13 components with P value less than 0.05"
## ######### ICA 4 #############
## [1] "Removing 11 components with P value less than 0.05"
## ######### ICA 5 #############
## [1] "Removing 19 components with P value less than 0.05"
## ######### ICA 6 #############
## [1] "Removing 1 components with P value less than 0.05"
## ######Reconstruction 1 ########
## ######Reconstruction 2 ########
## ######Reconstruction 3 ########
## ######Reconstruction 4 ########
## ######Reconstruction 5 ########
## ######Reconstruction 6 ########
## ######Reconstruction 7 ########
## ######Reconstruction 8 ########
## ######Reconstruction 9 ########
## ######Reconstruction 10 ########
## ######Reconstruction 11 ########
## ######Reconstruction 12 ########
## ######Reconstruction 13 ########
## ######Reconstruction 14 ########
## ######Reconstruction 15 ########
## ######Reconstruction 16 ########
## ######Reconstruction 17 ########
## ######Reconstruction 18 ########
## ######Reconstruction 19 ########
## ######Reconstruction 20 ########
## ######Reconstruction 21 ########
## ######Reconstruction 22 ########
## ######Reconstruction 23 ########
## ######Reconstruction 24 ########
## ######Reconstruction 25 ########
## ######Reconstruction 26 ########
## ######Reconstruction 27 ########
## ######Reconstruction 28 ########
## ######Reconstruction 29 ########
## ######Reconstruction 30 ########
## ######Reconstruction 31 ########
## ######Reconstruction 32 ########
## ######Reconstruction 33 ########
## ######Reconstruction 34 ########
## ######Reconstruction 35 ########
## ######Reconstruction 36 ########
## ######Reconstruction 37 ########
## ######Reconstruction 38 ########
## ######Reconstruction 39 ########
## ######Reconstruction 40 ########
## ######Reconstruction 41 ########
## ######Reconstruction 42 ########
## ######Reconstruction 43 ########
## ######Reconstruction 44 ########
## ######Reconstruction 45 ########
## ######Reconstruction 46 ########

mix_waveica <- edata_transform(omicsData = mix_waveica_abundance, data_scale = "log2")

Mix: Data Visualization

Similar to the previous data set, we can compare the PCA plots between the unadjusted and adjusted data sets.

p1 <- plot(dim_reduction(omicsData = pmart_mixFilt),omicsData = pmart_mixFilt,color_by = "BatchNum") + labs(title = "Mix: Unadjusted")
p2 <- plot(dim_reduction(omicsData = mix_ruv),omicsData = mix_ruv,color_by = "BatchNum") + labs(title = "Mix: ruv-Random")
p3 <- plot(dim_reduction(omicsData = mix_nomis),omicsData = mix_nomis,color_by = "BatchNum") + labs(title = "Mix: NOMIS")
p4 <- plot(dim_reduction(omicsData = mix_combat),omicsData = mix_combat,color_by = "BatchNum") + labs(title = "Mix: ComBat")
p5 <- plot(dim_reduction(omicsData = mix_range),omicsData = mix_range,color_by = "BatchNum") + labs(title = "Mix: Range")
p6 <- plot(dim_reduction(omicsData = mix_power),omicsData = mix_power,color_by = "BatchNum") + labs(title = "Mix: Power")
p7 <- plot(dim_reduction(omicsData = mix_pareto),omicsData = mix_pareto,color_by = "BatchNum") + labs(title = "Mix: Pareto")
p8 <- plot(dim_reduction(omicsData = mix_waveica),omicsData = mix_waveica,color_by = "BatchNum") + labs(title = "Mix: WaveICA")

p1;p2;p3;p4;p5;p6;p7;p8

References

Berg, Robert A. van der, Huub CJ Hoefsloot, Johan A. Westerhuis, Age K. Smilde, and Mariët J. van der Werf. 2006. “Centering, Scaling, and Transformations: Improving the Biological Information Content of Metabolomics Data.” BMC Genomics 1: 142. https://doi.org/10.1186/1471-2164-7-142.

De Livera, Alysha M., Marko Sysi-Aho, Jacob Laurent, Johann A. Gagnon-Bartsch, Sandra Castillo, Julie A. Simpson, and Terence P. Speed. 2015. “Statistical Methods for Handling Unwanted Variation in Metabolomics Data.” Analytical Chemistry 87 (7): 3606–15. https://doi.org/10.1021/ac502439y.

Degnan, David J, Kelly G Stratton, Rachel Richardson, Daniel Claborne, Evan A Martin, Nathan A Johnson, Damon Leach, Bobbie-Jo M Webb-Robertson, and Lisa M Bramer. 2023. “pmartR 2.0: A Quality Control, Visualization, and Statistics Pipeline for Multiple Omics Datatypes.” Journal of Proteome Research 22 (2): 570–76. https://doi.org/10.1021/acs.jproteome.2c00610.

Deng, Kui, and Kang Li. 2022. WaveICA2.0: WaveICA 2.0.

Deng, Kui, Fan Zhang, Qilong Tan, Yue Huang, Zhiwei Rong, Zheng-Jiang Zhu, Kang Li, and Zhenzi Li. 2019. “WaveICA: A Novel Algorithm to Remove Batch Effects for Large-Scale Untargeted Metabolomics Data Based on Wavelet Analysis.” Analytica Chimica Acta 1061: 60–69. https://doi.org/10.1016/j.aca.2019.02.010.

Dunn, Warwick B., David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyrre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” Nature Protocols 6 (7): 1060–83. https://doi.org/10.1038/nprot.2011.335.

Fan, Sili, Tobias Kind, Tomas Cajka, Stanley L. Hazen, H. W. Wilson Tang, Rima Kaddurah-Daouk, Marguerite R. Irvin, Donna K. Arnett, Dinesh K. Barupal, and Oliver Fiehn. 2019. “Systematic Error Removal Using Random Forest for Normalizing Large-Scale Untargeted Lipidomics Data.” Analytical Chemistry 91 (5): 3590–96. https://doi.org/10.1021/acs.analchem.8b05592.

Han, Siyu, Jialing Huang, Francesco Foppiano, Cornelia Prehn, Jerzy Adamski, Karsten Suhre, Ying Li, et al. 2022. “TIGER: Technical Variation Elimination for Metabolomics Data Using Ensemble Learning Architecture.” Briefings in Bioinformatics 23 (2). https://doi.org/10.1093/bib/bbab535.

Karpievitch, Yuliya V., Sonja B. Nikolic, Richard Wilson, James E. Sharman, and Lindsay M. Edwards. 2014. “Metabolomics Data Normalization with EigenMS.” PLOS ONE 9 (12). https://doi.org/10.1371/journal.pone.0116221.

Luan, Hemi, Fenfen Ji, Yu Chen, and Zongwei Cai. 2018. “statTarget: A Streamlined Tool for Signal Drift Correction and Interpretations of Quantitative Mass Spectrometry-Based Omics Data.” Analytica Chimica Acta 1036: 66–72. https://doi.org/10.1016/j.aca.2018.08.002.

Müller, Christian, Arne Schillert, Caroline Röthmeier, David-Alexandre Trégouët, Caroline Proust, Harald Binder, Norbert Pfeiffer, et al. 2016. “Removing Batch Effects from Longitudinal Gene Expression - Quantile Normalization Plus ComBat as Best Approach for Microarray Transcriptome Data.” PLOS ONE 11 (6). https://doi.org/10.1371/journal.pone.0156594.

Redestig, Henning, Atsushi Fukushima, Hans Stenlund, Thomas Moritz, Masanori Arita, Kazuki Saito, and Miyako Kusano. 2009. “Compensation for Systematic Cross-Contribution Improves Normalization of Mass Spectrometry Based Metabolomics Data.” Analytical Chemistry 81: 7974–80.

Stratton, Kelly G, Bobbie-Jo M Webb-Robertson, Lee Ann McCue, Daniel Claborne, Bryan Stanfill, Iobani Godinez, Thomas Johansen, et al. 2019. “pmartR: Quality Control and Statistics for Mass Spectrometry-Based Biological Data.” Journal of Proteome Research.

Sysi-Aho, Marko, Mikko Katajamaa, Laxman Yetukuri, and Matej Orešič. 2007. “Normalization Method for Metabolomics Data Using Optimal Selection of Multiple Internal Standards.” BMC Bioinformatics 8: 93. https://doi.org/10.1186/1471-2105-8-93.

Damon Leach, Kelly Stratton, Lisa Bramer

2024-02-06

Data Set 1: Amide Data

Load in the Data

Run Batch Correction Methods

Data Visualization

Data Set 2: Mix Data

Load in the Data

Run Batch Correction Methods

Mix: Data Visualization

References