Calculates normalization parameters from the data using the specified subset and normalization functions, with the option to apply the normalization to the data.

normalize_global(
  omicsData,
  subset_fn,
  norm_fn,
  params = NULL,
  apply_norm = FALSE,
  backtransform = FALSE,
  min_prop = NULL,
  check.names = NULL
)

Arguments

omicsData

an object of the class 'pepData', 'proData', 'metabData', 'lipidData', 'nmrData', created by as.pepData, as.proData, as.metabData, as.lipidData, as.nmrData, respectively. The function group_designation must have been run on omicsData to use several of the subset functions (i.e. rip and ppp_rip).

subset_fn

character string indicating the subset function to use for normalization. See details for the current offerings.

norm_fn

character string indicating the normalization function to use for normalization. See details for the current offerings.

params

additional arguments passed to the specified subset function. See details for parameter specification and default values.

apply_norm

logical argument indicating if the normalization should be applied to the data. Defaults to FALSE. If TRUE, the normalization is applied to the data and an S3 object of the same class as omicsData (e.g. 'pepData') with normalized values in e_data is returned. If FALSE, the normalization is not applied to the data and an S3 object of class 'normRes' is returned.

backtransform

logical argument indicating if parameters for back transforming the data, after normalization, should be calculated. Defaults to FALSE. If TRUE, the parameters for back transforming the data after normalization will be calculated, and subsequently included in the data normalization if apply_norm is TRUE. See the details section for an explanation of how these factors are calculated.

min_prop

numeric threshold between 0 and 1 giving the minimum allowed proportion of biomolecules (rows of e_data) retained by the subset function.

check.names

deprecated

Value

If apply_norm is FALSE, an S3 object of type 'normRes' is returned. This object contains a list with: subset method, normalization method, normalization parameters, number of biomolecules used in normalization, and proportion of biomolecules used in normalization. plot() and summary() methods are available for this object. If apply_norm is TRUE, then the normalized data is returned in an object of the appropriate S3 class (e.g. pepData).

Details

Below are details for specifying function and parameter options.

Subset Functions

Specifying a subset function indicates the subset of biomolecules (rows of e_data) that should be used for computing normalization factors. The following are valid options: "all", "los", "ppp", "complete", "rip", and "ppp_rip".

"all": includes all biomolecules (i.e. no subsetting is done).
"los": identifies the subset of biomolecules associated with the top L order statistics, where L is a proportion between 0 and 1. Specifically, the biomolecules falling within the top L proportion of highest absolute abundance are retained for each sample, and the union of these biomolecules is taken as the subset (Wang et al., 2006).
"ppp": (originally stood for percentage of peptides present) identifies the subset of biomolecules that are present/non-missing for a minimum proportion of samples (Karpievitch et al., 2009; Kultima et al., 2009).
"complete": retains biomolecules with no missing data across all samples; equivalent to "ppp" with proportion = 1.
"rip": identifies biomolecules with complete data that have a p-value greater than a defined threshold alpha (common values include 0.1 or 0.25) when subjected to a Kruskal-Wallis test (non-parametric one-way ANOVA) based on group membership (Webb-Robertson et al., 2011).
"ppp_rip": equivalent to "rip", except that rather than requiring complete data, biomolecules with at least a specified proportion of non-missing values are subjected to the Kruskal-Wallis test.
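The "ppp", "complete", and "los" rules can be illustrated on a toy matrix in base R. This is only a sketch of the ideas described above, not the package's internal code: the matrix values, the object names, and the use of ceiling() to turn the proportion L into a count are all assumptions made for illustration.

```r
# Toy log2-abundance matrix: 5 biomolecules (rows) x 4 samples (columns).
# All values are made up for illustration.
e <- matrix(
  c(10, 11, 10, 12,
     8, NA,  9,  8,
     6,  7, NA, NA,
    14, 13, 15, 14,
     5, NA, NA, NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("mol", 1:5), paste0("s", 1:4))
)

# "ppp": keep biomolecules observed in at least a proportion `ppp` of samples.
ppp <- 0.5
prop_present <- rowMeans(!is.na(e))
ppp_subset <- rownames(e)[prop_present >= ppp]

# "complete": no missing values, i.e. "ppp" with proportion = 1.
complete_subset <- rownames(e)[prop_present == 1]

# "los": within each sample keep the top L proportion of biomolecules by
# absolute abundance, then take the union across samples.
# Assumption: the count per sample is rounded up with ceiling().
los <- 0.5
top_per_sample <- lapply(colnames(e), function(s) {
  x <- abs(e[, s])
  x <- x[!is.na(x)]
  n_keep <- ceiling(los * length(x))
  names(sort(x, decreasing = TRUE))[seq_len(n_keep)]
})
los_subset <- sort(unique(unlist(top_per_sample)))

print(ppp_subset)
print(complete_subset)
print(los_subset)
```

In this toy data, "complete" is the strictest of the three rules, which is why a lower "ppp" threshold is often used when missing values are common.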

Normalization Functions

Specifying a normalization function indicates how normalization scale and location parameters should be calculated. The following are valid options: "median", "mean", "zscore", and "mad". For median centering, the location estimates are the sample-wise medians of the subset data and there are no scale estimates. For mean centering, the location estimates are the sample-wise means of the subset data and there are no scale estimates. For z-score transformation, the location estimates are the subset means for each sample and the scale estimates are the subset standard deviations for each sample. For median absolute deviation (MAD) transformation, the location estimates are the subset medians for each sample and the scale estimates are the subset MADs for each sample.
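As a rough illustration of the location and scale estimates described above, the following base R sketch computes them on a toy matrix. The data, the chosen subset, and the object names are hypothetical. Two further assumptions are made: the estimates from the subset are applied to every row of the matrix, and base R's mad() is used as is (it multiplies by a consistency constant of 1.4826 by default, which may or may not match the package).

```r
# Toy log2-abundance matrix: 4 biomolecules x 3 samples (values made up).
e <- matrix(
  c(10, 12,  9,
     8, 11,  7,
     6, 10, NA,
    14, 15, 13),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("mol", 1:4), paste0("s", 1:3))
)

# Hypothetical subset, e.g. the biomolecules with complete data.
subset_rows <- c("mol1", "mol2", "mol4")
sub <- e[subset_rows, , drop = FALSE]

# Sample-wise location and scale estimates, computed from the subset only.
loc_median <- apply(sub, 2, median, na.rm = TRUE)
loc_mean   <- colMeans(sub, na.rm = TRUE)
scale_sd   <- apply(sub, 2, sd, na.rm = TRUE)
scale_mad  <- apply(sub, 2, mad, na.rm = TRUE)

# Median and mean centering: location only, no scale estimate.
median_centered <- sweep(e, 2, loc_median, "-")
mean_centered   <- sweep(e, 2, loc_mean, "-")

# z-score and MAD transformations: subtract location, divide by scale.
zscored    <- sweep(sweep(e, 2, loc_mean, "-"), 2, scale_sd, "/")
mad_scaled <- sweep(sweep(e, 2, loc_median, "-"), 2, scale_mad, "/")

print(loc_median)
print(round(zscored, 3))
```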

Specifying Subset Parameters Using the params Argument

Parameters for the chosen subset function should be specified as a named list, where each element name is the subset function and its value is the desired parameter. For example, if LOS with 0.1 is desired, one should use params = list(los = 0.1). Parameters for ppp_rip can be specified in one of two ways: specify the parameter for each function separately, or combine them in a nested list (e.g. params = list(ppp_rip = list(ppp = 0.5, rip = 0.2))).

The following functions have parameters that can be specified:

los: a value between 0 and 1 indicating the top proportion of order statistics. Defaults to 0.05 if unspecified.
ppp: a value between 0 and 1 specifying the proportion of samples that must have non-missing values for a biomolecule to be retained. Defaults to 0.5 if unspecified.
rip: a value between 0 and 1 specifying the p-value threshold for determining rank invariance. Defaults to 0.2 if unspecified.
ppp_rip: two values corresponding to the PPP and RIP parameters above. Defaults to 0.5 and 0.2, respectively.
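The parameter lists described above might be built as follows. The object names are hypothetical, and the flat form for ppp_rip is one reading of "each function separately" rather than a confirmed calling convention. The guarded call reuses the metab_object data and the preprocessing steps from the Examples section, and runs only if pmartR and pmartRdata are installed.

```r
# Single-parameter subset functions: name the element after the function.
los_params <- list(los = 0.1)
ppp_params <- list(ppp = 0.5)
rip_params <- list(rip = 0.2)

# ppp_rip, nested form (as given in the text above).
ppp_rip_nested <- list(ppp_rip = list(ppp = 0.5, rip = 0.2))

# ppp_rip, flat form. Assumption: this is what specifying the parameter
# for each function separately looks like.
ppp_rip_flat <- list(ppp = 0.5, rip = 0.2)

# Optional: pass a parameter list to normalize_global().
if (requireNamespace("pmartR", quietly = TRUE) &&
    requireNamespace("pmartRdata", quietly = TRUE)) {
  library(pmartR)
  library(pmartRdata)

  mymetab <- edata_transform(omicsData = metab_object, data_scale = "log2")
  mymetab <- group_designation(omicsData = mymetab, main_effects = "Phenotype")

  norm_res <- normalize_global(
    omicsData = mymetab,
    subset_fn = "los",
    norm_fn = "median",
    params = los_params
  )
}
```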

Backtransform

The purpose of back transforming data is to ensure values are on a scale similar to their raw values before normalization. The following values are calculated and/or applied for backtransformation purposes:

median: scale is NULL and location parameter is a global median across all samples
mean: scale is NULL and location parameter is a global median across all samples
zscore: scale is pooled standard deviation and location is global mean across all samples
mad: scale is pooled median absolute deviation and location is global median across all samples
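For the median case, the back transformation can be sketched in base R as below. The toy data and object names are hypothetical, and the sketch assumes that the global median is taken over all values used for normalization and simply added back after centering. The pooled scale estimates used for zscore and mad are not shown, since their exact definition is not given here.

```r
# Toy log2-abundance matrix: 3 biomolecules x 3 samples (values made up).
e <- matrix(
  c(10, 12,  9,
     8, 11,  7,
    14, 15, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("mol", 1:3), paste0("s", 1:3))
)

# Median centering: subtract each sample's median.
loc <- apply(e, 2, median, na.rm = TRUE)
centered <- sweep(e, 2, loc, "-")

# Back transformation (assumed form): add a single global median so the
# values return to a scale similar to the raw data.
global_median <- median(e, na.rm = TRUE)
backtransformed <- centered + global_median

print(apply(centered, 2, median))         # every sample median is 0
print(apply(backtransformed, 2, median))  # every sample median is global_median
```

After centering alone, every sample has median 0, which can be hard to interpret as an abundance. Adding the same constant to all samples preserves the between-sample alignment while restoring a familiar scale.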

References

Webb-Robertson BJ, Matzke MM, Jacobs JM, Pounds JG, Waters KM. A statistical selection strategy for normalization procedures in LC-MS proteomics experiments through dataset-dependent ranking of normalization scaling factors. Proteomics. 2011;11(24):4736-41.

Author

Lisa Bramer

Examples

library(pmartRdata)

mymetab <- edata_transform(
  omicsData = metab_object,
  data_scale = "log2"
)
mymetab <- group_designation(
  omicsData = mymetab,
  main_effects = "Phenotype"
)
norm_object <- normalize_global(
  omicsData = mymetab,
  subset_fn = "all",
  norm_fn = "median"
)
norm_data <- normalize_global(
  omicsData = mymetab,
  subset_fn = "all",
  norm_fn = "median",
  apply_norm = TRUE,
  backtransform = TRUE
)