
Table 1 Commonly used supervised data mining methods for the search and prioritization of biomarker candidates in independent and dependent samples

From: Bioinformatic-driven search for metabolic biomarkers in disease

Columns: Method; Basic principle and key features of the method; Reference
Independent samples
  Unpaired null hypothesis testing (Two-sample t-test*, Mann-Whitney U test°) - univariate filter method
- P value serves as evaluation measure for the discriminatory ability of variables
- is an accepted statistical measure
- appropriate for two class problems only
- P value is sample size dependent
Lehmann, Springer Verlag, 2005 [32]
  Principal component analysis (PCA)# - unsupervised projection method
- PCA calculates linear combinations of variables based on the variance of the original data space
- appropriate for multiple class problems
- visualizable loading and score plots (scores can be labeled according to class membership)
- no ranking and prioritization of features possible
Jolliffe, Springer Verlag, 2005 [33], Ringnér, Nat Biotechnol, 2008 [34]
  Information gain (IG) - univariate filter method
- IG calculates how well a given feature separates data by pursuing reduction of entropy
- appropriate for multiple class problems
- quick and effective ranking of features
- IG scores permit prioritization of features
Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]
  ReliefF (RF) - multivariate filter method
- RF score relies on the concept that values of a significant feature are correlated with the feature values of an instance of the same class, and uncorrelated with the feature values of an instance of the other class
- appropriate for multiple class problems
- RF scores permit prioritization of features
Robnik-Sikonja & Kononenko, Mach Learn, 2003 [35], Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]
  Associative voting (AV) - multivariate filter method
- AV uses a rule-based evaluation criterion by a special form of association rules; considers interaction among features
- appropriate for two class problems only
- AV scores permit prioritization of features
- restriction of the rule search space necessary
Osl et al., Bioinformatics, 2008 [36]
  Unpaired Biomarker Identifier (uBI) - univariate filter method
- uBI uses a statistical evaluation score that combines a discriminance measure with a biological effect term
- appropriate for two class problems only
- quick and effective ranking of features
- uBI scores permit prioritization of features
- uBI scores closely related to pBI scores
Baumgartner et al., Bioinformatics, 2010 [13]
  Guilt-by-association feature selection (GBA-FS) - multivariate subset selection method
- GBA-FS uses a hierarchical clustering with correlation as distance measure; the most relevant features of each cluster are assessed by their discriminatory power, as measured for example by two-sample t-test
- accounts for redundancy between features
- appropriate for two class problems only
Shin et al., J Biomed Inform, 2007 [37]
  Support vector machine-recursive feature elimination (SVM-RFE) - embedded selection method
- SVM-RFE uses optimized weights of SVM classifier to rank features
- appropriate for two class problems only
Guyon et al., Mach Learn, 2002 [38]
  Random forest models (RFM) - embedded selection method
- RFM uses bagging and random subspace methods to construct a collection of decision trees aiming at identifying a complete set of significant features
- appropriate for multiple class problems
Enot et al., PNAS, 2006 [39]
  Aggregating feature selection (AFS) - ensemble selection method
- aggregating multiple feature selection results to a consensus ranking, e.g. using the concept of weighted voting or by counting the most frequently selected features to derive the consensus feature subset
- appropriate for multiple class problems
Saeys et al., Lecture Notes in Artificial Intelligence, 2008 [30]
  Stacked feature ranking (SFR) - ensemble selection method
- stacked learning architecture to construct a consensus feature ranking by combining multiple feature selection methods
- appropriate for multiple class problems
- feature selection by optimizing the discriminatory ability (AUC)
Netzer et al., Bioinformatics, 2009 [31]
  Wrapper approach - evaluating the merit of a feature subset by accuracy estimates using a classifier
- produces subsets of very few features that are dominated by stronger and uncorrelated attributes
- increased computational runtime; necessitates heuristic search methods like forward selection, backward elimination, or more sophisticated methods such as genetic algorithms
Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]
Dependent samples
  Paired null hypothesis testing (Paired t-test*, Wilcoxon signed-rank test°) - univariate filter method
- P value serves as evaluation measure for the discriminatory ability of variables
- is an accepted statistical measure
- appropriate for two class problems only
- P value is sample size dependent
- two dependent samples
Lehmann, Springer Verlag, 2005 [32]
  Repeated measure analysis - univariate and multivariate approaches
- mixed model analysis (GLMM, General Linear Mixed Model)
- time series (multiple time points) analysis
Crowder & Hand, Analysis of repeated measures, 1990 [40]
  Paired Biomarker Identifier (pBI) - univariate filter method
- pBI uses a statistical evaluation score by combining a discriminance measure with a biological effect term
- appropriate for two class problems only
- pBI scores permit prioritization of features
- pBI scores closely related to uBI scores
Baumgartner et al., Bioinformatics, 2010 [13]
* data normally distributed, ° data non-normally distributed. # PCA is an unsupervised method that is also used for data containing class information. All algorithms are run on continuous data, as data generated in metabolomics are usually of metric nature. Data can represent absolute metabolite concentrations (given as intensity counts or, more specifically, in μmol/L if internal standards are available) or simple m/z values from raw or preprocessed mass spectra.
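
As an illustration of the univariate filter idea listed in the table, the following minimal sketch ranks features by information gain (IG), i.e. the reduction in class entropy obtained by splitting a continuous feature at a threshold. It is not the implementation used in the cited references. The function names, the single-threshold discretization, and the toy metabolite data are assumptions made for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Largest entropy reduction over all binary splits of one continuous feature.

    Assumption: the feature is discretized by a single threshold placed between
    consecutive distinct sorted values; other discretization schemes exist.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        split_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - split_entropy)
    return best


def rank_features(data, labels):
    """Return (feature name, IG score) pairs sorted from highest to lowest score."""
    scores = {name: information_gain(vals, labels) for name, vals in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two metabolite concentrations measured in six samples.
labels = ["case", "case", "case", "control", "control", "control"]
data = {
    "metabolite_A": [5.1, 4.8, 5.5, 2.0, 2.3, 1.9],  # separates the classes cleanly
    "metabolite_B": [1.0, 3.0, 2.0, 1.5, 2.5, 3.5],  # overlaps between classes
}
ranking = rank_features(data, labels)
for name, score in ranking:
    print(f"{name}: IG = {score:.3f}")
```

Because the score is computed per feature and depends only on class entropy, the same sketch applies unchanged to problems with more than two classes, and the sorted scores give the kind of prioritized feature list described for IG in the table.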