Table 1 Commonly used supervised data mining methods for the search and prioritization of biomarker candidates in independent and dependent samples

From: Bioinformatic-driven search for metabolic biomarkers in disease

Independent samples

Method

Basic principle and key features of the method

Reference

 

Unpaired null hypothesis testing (two-sample t-test*, Mann-Whitney U test°)

- univariate filter method

- P value serves as evaluation measure for the discriminatory ability of variables

- is an accepted statistical measure

- appropriate for two class problems only

- P value is sample size dependent

Lehmann, Springer Verlag, 2005 [32]
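
As a rough illustration of the unpaired tests in this row, the sketch below computes a pooled-variance two-sample t statistic and a Mann-Whitney U statistic in plain Python, then ranks features by P value. The function names, the metabolite names, and the intensity values are hypothetical. The P value uses the normal approximation without tie or continuity correction, which is crude for small samples; a statistics library would normally be used instead.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def mann_whitney_u(a, b):
    """U statistic for sample a: pairs with a_i > b_j count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def mann_whitney_p(a, b):
    """Two-sided P value from the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(a), len(b)
    z = (mann_whitney_u(a, b) - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical intensities of two metabolites in controls vs. cases.
controls = {"metabolite_A": [1.0, 2.0, 3.0], "metabolite_B": [2.0, 5.0, 3.0]}
cases = {"metabolite_A": [4.0, 5.0, 6.0], "metabolite_B": [3.0, 2.0, 4.0]}

# Rank features by ascending P value (smaller P = stronger discrimination).
ranking = sorted(controls, key=lambda m: mann_whitney_p(controls[m], cases[m]))
```

Because the P value shrinks as the sample size grows, a ranking of this kind is only comparable between features measured on the same samples.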

 

Principal component analysis (PCA)#

- unsupervised projection method

- PCA calculates linear combinations of variables based on the variance of the original data space

- appropriate for multiple class problems

- visualizable loading and score plots (scores can be labeled according to class membership)

- no ranking and prioritization of features possible

Jolliffe, Springer Verlag, 2005 [33]; Ringnér, Nat Biotechnol, 2008 [34]

 

Information gain (IG)

- univariate filter method

- IG calculates how well a given feature separates data by pursuing reduction of entropy

- appropriate for multiple class problems

- quick and effective ranking of features

- IG scores permit prioritization of features

Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]
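
The entropy reduction behind information gain can be sketched as follows. Since metabolomic data are continuous, this example assumes a single binary split per feature and takes the best threshold among the midpoints of adjacent values; that discretization choice, like the function names, is an assumption of this sketch and not something prescribed by the table.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy of the labels minus the weighted entropy after splitting at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - conditional

def best_information_gain(values, labels):
    """Largest gain over all midpoints between adjacent distinct feature values."""
    xs = sorted(set(values))
    cuts = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    return max((information_gain(values, labels, t) for t in cuts), default=0.0)
```

Computing `best_information_gain` for every feature and sorting in descending order gives the ranking; the labels may take more than two values, which is why the method suits multiple class problems.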

 

ReliefF (RF)

- multivariate filter method

- RF score relies on the concept that values of a significant feature are correlated with the feature values of an instance of the same class, and uncorrelated with the feature values of an instance of the other class

- appropriate for multiple class problems

- RF scores permit prioritization of features

Robnik-Sikonja & Kononenko, Mach Learn, 2003 [35]; Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]
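
The nearest-neighbour idea behind the RF score can be illustrated with the basic Relief update, a simplification of ReliefF: one nearest hit and one nearest miss per instance, two classes, and no handling of missing values. The function name and the data layout (rows as instances, columns as features) are assumptions of this sketch, and every class needs at least two instances.

```python
def relief_scores(X, y):
    """Basic Relief weights: reward features that differ on the nearest
    instance of the other class and agree on the nearest same-class instance."""
    n, d = len(X), len(X[0])
    # Feature ranges, used to scale differences into [0, 1].
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights
```

Features with larger weights are prioritized; ReliefF itself averages over several neighbours and over all other classes, which this sketch omits.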

 

Associative voting (AV)

- multivariate filter method

- AV uses a rule-based evaluation criterion by a special form of association rules; considers interaction among features

- appropriate for two class problems only

- AV scores permit prioritization of features

- restriction of the rule search space necessary

Osl et al., Bioinformatics, 2008 [36]

 

Unpaired Biomarker Identifier (uBI)

- univariate filter method

- uBI uses a statistical evaluation score that combines a discriminance measure with a biological effect term

- appropriate for two class problems only

- quick and effective ranking of features

- uBI scores permit prioritization of features

- uBI scores closely related to pBI scores

Baumgartner et al., Bioinformatics, 2010 [13]

 

Guilt-by-association feature selection (GBA-FS)

- multivariate subset selection method

- GBA-FS uses a hierarchical clustering with correlation as distance measure; the most relevant features of each cluster are assessed by their discriminatory power, as measured for example by two-sample t-test

- accounts for redundancy between features

- appropriate for two class problems only

Shin et al., J Biomed Inform, 2007 [37]

 

Support vector machine-recursive feature elimination (SVM-RFE)

- embedded selection method

- SVM-RFE uses optimized weights of SVM classifier to rank features

- appropriate for two class problems only

Guyon et al., Mach Learn, 2002 [38]

 

Random forest models (RFM)

- embedded selection method

- RFM uses bagging and random subspace methods to construct a collection of decision trees aiming at identifying a complete set of significant features

- appropriate for multiple class problems

Enot et al., PNAS, 2006 [39]

 

Aggregating feature selection (AFS)

- ensemble selection method

- aggregating multiple feature selection results to a consensus ranking, e.g. using the concept of weighted voting or by counting the most frequently selected features to derive the consensus feature subset

- appropriate for multiple class problems

Saeys et al., Lecture Notes in Artificial Intelligence, 2008 [30]
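
One of the aggregation strategies named in this row, counting the most frequently selected features, might look like the sketch below. The function name and the alphabetical tie-break are assumptions made here for illustration; weighted voting would replace the plain count with per-method weights.

```python
from collections import Counter

def consensus_by_frequency(selections, k):
    """Return the k features chosen most often across several selection results.

    selections: one list of selected feature names per feature selection run.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(f for sel in selections for f in set(sel))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:k]]
```

For example, `consensus_by_frequency([["ala", "gly", "leu"], ["gly", "leu", "val"], ["leu", "ala", "gly"]], 2)` keeps `gly` and `leu`, the two features selected in all three runs.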

 

Stacked feature ranking (SFR)

- ensemble selection method

- stacked learning architecture to construct a consensus feature ranking by combining multiple feature selection methods

- appropriate for multiple class problems

- feature selection by optimizing the discriminatory ability (AUC)

Netzer et al., Bioinformatics, 2009 [31]

 

Wrapper approach

- evaluating the merit of a feature subset by accuracy estimates using a classifier

- produces subsets of very few features that are dominated by stronger and uncorrelated attributes

- increased computational runtime; necessitates heuristic search methods like forward selection, backward elimination, or more sophisticated methods such as genetic algorithms

Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28]

Dependent samples

Paired null hypothesis testing (Paired t-test*, Wilcoxon signed-rank test°)

- univariate filter method

- P value serves as evaluation measure for the discriminatory ability of variables

- is an accepted statistical measure

- appropriate for two class problems only

- P value is sample size dependent

- two dependent samples

Lehmann, Springer Verlag, 2005 [32]
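
For dependent samples, both tests work on the per-subject differences. The sketch below computes the paired t statistic and the Wilcoxon signed-rank statistic in plain Python; function names are hypothetical, zero differences are dropped, tied absolute differences receive average ranks, and P values are left to a statistics library.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: mean difference divided by its standard error."""
    d = [a - b for b, a in zip(before, after)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def wilcoxon_w(before, after):
    """Signed-rank statistic: the smaller of the positive and negative rank sums."""
    d = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(w_plus, w_minus)
```

Each subject must appear at the same position in `before` and `after`; the pairing is what distinguishes these tests from their unpaired counterparts.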

 

Repeated measure analysis

- univariate and multivariate approaches

- mixed model analysis (GLMM, General Linear Mixed Model)

- time series (multiple time points) analysis

Crowder & Hand, Analysis of repeated measures, 1990 [40]

 

Paired Biomarker Identifier (pBI)

- univariate filter method

- pBI uses a statistical evaluation score by combining a discriminance measure with a biological effect term

- appropriate for two class problems only

- pBI scores permit prioritization of features

- pBI scores closely related to uBI scores

Baumgartner et al., Bioinformatics, 2010 [13]

* Data normally distributed; ° data non-normally distributed. # PCA is an unsupervised method that is also used for data containing class information. All algorithms are run on continuous data, as data generated in metabolomics are usually of metric nature. Data can represent absolute metabolite concentrations (given as intensity counts or, more specifically, in μmol/L if internal standards are available) or simple m/z values from raw or preprocessed mass spectra.