From: Bioinformatic-driven search for metabolic biomarkers in disease
| Sample type | Method | Basic principle and key features of the method | Reference |
|---|---|---|---|
| Independent samples | Unpaired null hypothesis testing (two-sample t-test*, Mann-Whitney U test°) | univariate filter method; P value serves as evaluation measure for the discriminatory ability of variables; is an accepted statistical measure; appropriate for two-class problems only; P value is sample-size dependent | Lehmann, Springer Verlag, 2005 [32] |
| | Principal component analysis (PCA)^{#} | unsupervised projection method; PCA calculates linear combinations of variables based on the variance of the original data space; appropriate for multiple-class problems; visualizable loading and score plots (scores can be labeled according to class membership); no ranking and prioritization of features possible | Jolliffe, Springer Verlag, 2005 [33]; Ringnér, Nat Biotechnol, 2008 [34] |
| | Information gain (IG) | univariate filter method; IG calculates how well a given feature separates the data by measuring the reduction of entropy; appropriate for multiple-class problems; quick and effective ranking of features; IG scores permit prioritization of features | Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
| | ReliefF (RF) | multivariate filter method; RF score relies on the concept that values of a significant feature are correlated with the feature values of an instance of the same class, and uncorrelated with the feature values of an instance of the other class; appropriate for multiple-class problems; RF scores permit prioritization of features | Robnik-Sikonja & Kononenko, Mach Learn, 2003 [35]; Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
| | Associative voting (AV) | multivariate filter method; AV uses a rule-based evaluation criterion built on a special form of association rules; considers interaction among features; appropriate for two-class problems only; AV scores permit prioritization of features; restriction of the rule search space necessary | Osl et al., Bioinformatics, 2008 [36] |
| | Unpaired Biomarker Identifier (uBI) | univariate filter method; statistical evaluation score combining a discriminance measure with a biological effect term; appropriate for two-class problems only; quick and effective ranking of features; uBI scores permit prioritization of features; uBI scores closely related to pBI scores | Baumgartner et al., Bioinformatics, 2010 [13] |
| | Guilt-by-association feature selection (GBA-FS) | multivariate subset selection method; GBA-FS uses hierarchical clustering with correlation as the distance measure; the most relevant features of each cluster are assessed by their discriminatory power, as measured for example by a two-sample t-test; accounts for redundancy between features; appropriate for two-class problems only | Shin et al., J Biomed Inform, 2007 [37] |
| | Support vector machine-recursive feature elimination (SVM-RFE) | embedded selection method; SVM-RFE uses the optimized weights of an SVM classifier to rank features; appropriate for two-class problems only | Guyon et al., Mach Learn, 2002 [38] |
| | Random forest models (RFM) | embedded selection method; RFM uses bagging and random subspace methods to construct a collection of decision trees, aiming at identifying a complete set of significant features; appropriate for multiple-class problems | Enot et al., PNAS, 2006 [39] |
| | Aggregating feature selection (AFS) | ensemble selection method; aggregates multiple feature selection results into a consensus ranking, e.g. using weighted voting or by counting the most frequently selected features to derive the consensus feature subset; appropriate for multiple-class problems | Saeys et al., Lecture Notes in Artificial Intelligence, 2008 [30] |
| | Stacked feature ranking (SFR) | ensemble selection method; stacked learning architecture that constructs a consensus feature ranking by combining multiple feature selection methods; appropriate for multiple-class problems; feature selection by optimizing the discriminatory ability (AUC) | Netzer et al., Bioinformatics, 2009 [31] |
| | Wrapper approach | evaluates the merit of a feature subset by accuracy estimates using a classifier; produces subsets of very few features that are dominated by stronger and uncorrelated attributes; increased computational runtime; necessitates heuristic search methods like forward selection, backward elimination, or more sophisticated methods such as genetic algorithms | Hall and Holmes, IEEE Trans Knowl Data Eng, 2003 [28] |
| Dependent samples | Paired null hypothesis testing (paired t-test*, Wilcoxon signed-rank test°) | univariate filter method; P value serves as evaluation measure for the discriminatory ability of variables; is an accepted statistical measure; appropriate for two-class problems only; P value is sample-size dependent; two dependent samples | Lehmann, Springer Verlag, 2005 [32] |
| | Repeated measure analysis | univariate and multivariate approaches; mixed model analysis (GLMM, General Linear Mixed Model); time series (multiple time points) analysis | Crowder & Hand, Analysis of repeated measures, 1990 [40] |
| | Paired Biomarker Identifier (pBI) | univariate filter method; pBI uses a statistical evaluation score combining a discriminance measure with a biological effect term; appropriate for two-class problems only; pBI scores permit prioritization of features; pBI scores closely related to uBI scores | Baumgartner et al., Bioinformatics, 2010 [13] |
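
To make the information gain (IG) entry in the table more concrete, the following is a minimal sketch of entropy-based feature ranking. It assumes feature values have already been discretized (continuous metabolite concentrations would first need binning), and the function names, metabolite names, and toy data are hypothetical illustrations rather than anything taken from the cited references.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Return (name, IG) pairs sorted from most to least informative."""
    scores = {name: information_gain(values, labels) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two discretized metabolite levels in six samples.
labels = ["case", "case", "case", "control", "control", "control"]
features = {
    "metabolite_a": ["high", "high", "high", "low", "low", "low"],  # separates the classes perfectly
    "metabolite_b": ["high", "low", "high", "low", "high", "low"],  # only weakly informative
}
ranking = rank_features(features, labels)
```

Because each feature is scored independently of the others, this is a univariate filter: it is fast and works for any number of classes, but it cannot detect interactions among features, which is what the multivariate methods in the table address.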