From: Bioinformatic-driven search for metabolic biomarkers in disease
Sample type | Method | Basic principle and key features of the method | Reference |
---|---|---|---|
Independent samples | Unpaired null hypothesis testing (Two-sample t-test*, Mann-Whitney-U test°) | - univariate filter method - P value serves as evaluation measure for the discriminatory ability of variables - is an accepted statistical measure - appropriate for two class problems only - P value is sample size dependent | Lehmann, Springer Verlag, 2005 [32] |
 | Principal component analysis (PCA)# | - unsupervised projection method - PCA calculates linear combinations of variables based on the variance of the original data space - appropriate for multiple class problems - visualizable loading and score plots (scores can be labeled according to class membership) - no ranking and prioritization of features possible | Jolliffe, Springer Verlag, 2005 [33], Ringnér, Nat Biotechnol, 2008 [34] |
 | Information gain (IG) | - univariate filter method - IG calculates how well a given feature separates data by pursuing reduction of entropy - appropriate for multiple class problems - quick and effective ranking of features - IG scores permit prioritization of features | Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
 | ReliefF (RF) | - multivariate filter method - RF score relies on the concept that values of a significant feature are correlated with the feature values of an instance of the same class, and uncorrelated with the feature values of an instance of the other class - appropriate for multiple class problems - RF scores permit prioritization of features | Robnik-Sikonja & Kononenko, Mach Learn, 2003 [35], Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
 | Associative voting (AV) | - multivariate filter method - AV uses a rule-based evaluation criterion by a special form of association rules; considers interaction among features - appropriate for two class problems only - AV scores permit prioritization of features - restriction of the rule search space necessary | Osl et al., Bioinformatics, 2008 [36] |
 | Unpaired Biomarker Identifier (uBI) | - univariate filter method - uBI uses a statistical evaluation score that combines a discriminance measure with a biological effect term - appropriate for two class problems only - quick and effective ranking of features - uBI scores permit prioritization of features - uBI scores closely related to pBI scores | Baumgartner et al., Bioinformatics, 2010 [13] |
 | Guilt-by-association feature selection (GBA-FS) | - multivariate subset selection method - GBA-FS uses a hierarchical clustering with correlation as distance measure; the most relevant features of each cluster are assessed by their discriminatory power, as measured for example by two-sample t-test - accounts for redundancy between features - appropriate for two class problems only | Shin et al., J Biomed Inform, 2007 [37] |
 | Support vector machine-recursive feature elimination (SVM-RFE) | - embedded selection method - SVM-RFE uses optimized weights of SVM classifier to rank features - appropriate for two class problems only | Guyon et al., Mach Learn, 2002 [38] |
 | Random forest models (RFM) | - embedded selection method - RFM uses bagging and random subspace methods to construct a collection of decision trees aiming at identifying a complete set of significant features - appropriate for multiple class problems | Enot et al., PNAS, 2006 [39] |
 | Aggregating feature selection (AFS) | - ensemble selection method - aggregating multiple feature selection results to a consensus ranking, e.g. using the concept of weighted voting or by counting the most frequently selected features to derive the consensus feature subset - appropriate for multiple class problems | Saeys et al., Lecture Notes in Artificial Intelligence, 2008 [30] |
 | Stacked feature ranking (SFR) | - ensemble selection method - stacked learning architecture to construct a consensus feature ranking by combining multiple feature selection methods - appropriate for multiple class problems - feature selection by optimizing the discriminatory ability (AUC) | Netzer et al., Bioinformatics, 2009 [31] |
 | Wrapper approach | - evaluating the merit of a feature subset by accuracy estimates using a classifier - produces subsets of very few features that are dominated by stronger and uncorrelated attributes - increased computational runtime; necessitates heuristic search methods like forward selection, backward elimination, or more sophisticated methods such as genetic algorithms | Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
Dependent samples | Paired null hypothesis testing (Paired t-test*, Wilcoxon signed-rank test°) | - univariate filter method - P value serves as evaluation measure for the discriminatory ability of variables - is an accepted statistical measure - appropriate for two class problems only - P value is sample size dependent - two dependent samples | Lehmann, Springer Verlag, 2005 [32] |
 | Repeated measure analysis | - univariate and multivariate approaches - mixed model analysis (GLMM, General Linear Mixed Model) - time series (multiple time points) analysis | Crowder & Hand, Analysis of repeated measures, 1990 [40] |
 | Paired Biomarker Identifier (pBI) | - univariate filter method - pBI uses a statistical evaluation score by combining a discriminance measure with a biological effect term - appropriate for two class problems only - pBI scores permit prioritization of features - pBI scores closely related to uBI scores | Baumgartner et al., Bioinformatics, 2010 [13] |
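
As an illustration of the information gain (IG) entry in the table, the following is a minimal sketch of how a univariate IG ranking might be computed. It assumes the features have already been discretized (here into "low"/"high"); the metabolite names, the toy data, and the helper functions `entropy`, `information_gain`, and `rank_features` are hypothetical and are not taken from the cited references.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction of class entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the class labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Return (feature name, IG score) pairs sorted by decreasing IG."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: three discretized metabolites, two classes.
labels = ["case", "case", "case", "control", "control", "control"]
table = {
    "metabolite_A": ["high", "high", "high", "low", "low", "low"],  # separates the classes
    "metabolite_B": ["high", "low", "high", "low", "high", "low"],  # weakly informative
    "metabolite_C": ["high", "low", "low", "high", "low", "low"],   # uninformative
}

ranking = rank_features(table, labels)
for name, score in ranking:
    print(f"{name}: IG = {score:.3f}")
```

Because each feature is scored independently of the others, the ranking is quick to compute and handles more than two classes, but it cannot account for interactions or redundancy between features, which is what the multivariate methods in the table address.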