A diagnostic methodology for Alzheimer’s disease
© Hsu et al.; licensee BioMed Central Ltd. 2013
Received: 28 February 2013
Accepted: 19 April 2013
Published: 25 April 2013
Like other neurodegenerative diseases, Alzheimer’s disease (AD) remains very challenging to diagnose and treat. For many years, only historical, behavioral and psychiatric measures have been available for assessing AD cases. Recently, a definitive diagnostic framework, using biomarkers and imaging, has been proposed. In this paper, we propose a promising diagnostic methodology for the framework.
In a previous paper, we developed an efficient SVM (Support Vector Machine) based method, which we have now applied to discover important biomarkers and target networks which provide strategies for AD therapy.
The methodology selects a number of blood-based biomarkers (fewer than 10% of initial numbers on three AD datasets from NCBI), and the results are statistically verified by cross-validation. The resulting SVM is a classifier of AD vs. normal subjects. We construct target networks of AD based on MI (mutual information). In addition, a hierarchical clustering is applied on the initial data and clustered genes are visualized in a heatmap. The proposed method also performs gender analysis by classifying subjects based on gender.
Unlike other traditional statistical analyses, our method uses a machine learning-based algorithm. Our method selects a small set of important biomarkers for AD, differentiates noisy (irrelevant) from relevant biomarkers and also provides the target networks of the selected biomarkers, which will be useful for diagnosis and therapeutic design. Finally, based on the gender analysis, we observe that gender could play a role in AD diagnosis.
Alzheimer’s disease (AD) has long been a challenging problem for diagnosis and therapy. Currently, a definitive clinical diagnosis can be obtained only by historical, behavioral and psychiatric measures, and only when the patient’s condition has sufficiently deteriorated. A dynamic model has been proposed for AD diagnosis and has led to several studies of biomarker-based analysis. However, in order to validate the model, continuous studies of biomarkers are necessary to identify critical time points when changes or permutations of biomarkers occur. The specificity and sensitivity of AD diagnosis remain in doubt due to the lack of comparisons of AD with other neurodegenerative diseases. In addition, the standards and guidelines for blood-sample biomarkers are still in the process of development. The current methods for biomarker collection are also problematic, due to the need for expensive instrumentation and the invasiveness of the procedures. Recently, Zhang et al. integrated three modalities (MRI, FDG-PET, and CSF biomarkers) into a Multi-Kernel SVM to classify AD vs. normal samples.
In the past few years, the technologies for both biomarker analysis and imaging have provided promising contributions to definitive diagnoses of AD. In 2011, the NIH National Institute on Aging and the Alzheimer’s Association also established new guidelines to allow use of biomarkers and imaging for diagnoses. Since the announcement from NIH, research to identify and compare biomarkers has been thriving.
High-dimensional pattern classifiers such as SVMs (Support Vector Machines) have been adapted for this classification task. In [7–9], biomarker selections were performed by SVM-RFE, a feature (biomarker) extraction method that recursively eliminates features based on the validation accuracy of an SVM. During the selection process, the least useful feature is iteratively removed from the current subset. However, a group of weak features can still construct a strong classifier. Once a feature is removed, it can no longer be evaluated in different combinations with the remaining features. Thus, the SVM-RFE approach usually suffers from selection of a sub-optimal subset, since the classification ability of features should be evaluated by subsets instead of by individuals.
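The greedy elimination loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited studies: `importance` is a hypothetical callable standing in for whatever per-feature usefulness measure is derived from a trained SVM, and the `keep` parameter is an assumption added for the example.

```python
def recursive_elimination(features, importance, keep=1):
    """Greedy backward elimination in the spirit of SVM-RFE.

    `importance(subset)` is a hypothetical stand-in for training a
    classifier on `subset` and returning {feature: usefulness}.
    """
    remaining = list(features)
    eliminated = []                      # order in which features were dropped
    while len(remaining) > keep:
        scores = importance(remaining)   # re-evaluated on the current subset
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)          # dropped for good, never revisited
        eliminated.append(worst)
    return remaining, eliminated
```

The single `remove` call is where the limitation arises: a dropped feature never re-enters any later candidate subset, even if it would have been useful in combination with others.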
In earlier work, we first proposed an efficient algorithm, AMFES (Adaptive Multiple FEatures Selection), to select important biomarkers for cancers. Based on that initial success, this paper extends those results to the datasets provided by Maes et al. in an attempt to discover important biomarkers for AD from blood-based samples. Unlike traditional statistical analyses, AMFES is an SVM-based methodology, which can select a much smaller subset of important biomarkers. In addition, AMFES applies an adaptive method which enables selection of a globally optimal subset of important biomarkers compared to SVM-RFE. AMFES is particularly useful for differentiating noisy biomarkers from the relevant ones when interferences between biomarkers exist. Our results are supported by a high ROC/AUC (Receiver Operating Characteristic/Area Under Curve) value under cross-validation. Thus, AMFES should play an important role in the multi-modality classification framework proposed by Zhang et al. In this paper, we develop the details of AMFES for blood-based biomarkers.
The target networks of AD, with statistical dependencies measured by MI (Mutual Information), are constructed from these selected biomarkers. The resulting AD target network is characterized as a signature of the disease, and will enable a more detailed diagnosis. In addition, the MI values of AD subjects are found to be lower than those of normal subjects. Based on our method and results, a promising framework for definitive diagnosis is proposed.
The organization of this paper is as follows: The Methods section describes AMFES, as well as the computations of mutual information between two biomarkers and the construction of the target networks. In the Results section, we describe the PBMC (Peripheral Blood Mononuclear Cells) datasets of sporadic AD: GSE4226, GSE4227, and GSE4229 [12–14]. In addition, we describe the biomarkers and target networks of AD selected according to our approach on these 3 datasets.
Selecting a small subset out of hundreds and thousands of features is always a challenging task due to the COD (Curse of Dimensionality) problem for microarray datasets. To tackle this problem, we use a gene selection methodology, AMFES, to select an optimal subset of genes by training an SVM with subsets of genes generated adaptively. AMFES is developed based on two fundamental processes, ranking and selection.
The gene ranking process contains several stages. In the first stage, all genes are ranked by their ranking scores in a descending order. Then, in the next stage, only the top half of the ranked genes are ranked again, while the bottom half holds the current order in the subsequent stage. The same iteration repeats recursively until only three genes remain to be ranked again to complete one ranking process.
where I is an indicator function, such that I(proposition) = 1 if the proposition is true; otherwise, I(proposition) = 0. In other words, if gene g is randomly selected for the subset S_i, it is denoted as g ∈ S_i and I(g ∈ S_i) = 1.
where ||θ|| is understood as the Euclidean norm of vector θ.
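The staged ranking process (rank all genes, then re-rank only the top half while the bottom half keeps its order, and repeat until three genes remain) can be sketched as below. This is a rough sketch under stated assumptions: `score_fn` is a hypothetical placeholder for the SVM-based ranking score, and the halving uses integer division with a floor of three genes, a detail the text does not specify.

```python
def staged_ranking(genes, score_fn):
    """Rank genes in stages, re-ranking only the leading part each time.

    `score_fn(candidates)` is a hypothetical stand-in for the ranking
    score computation; it returns {gene: score} for the given candidates.
    """
    ranked = list(genes)
    active = len(ranked)                 # how many leading genes are re-ranked
    while True:
        head = ranked[:active]
        scores = score_fn(head)
        head.sort(key=lambda g: scores[g], reverse=True)   # descending order
        ranked[:active] = head           # the tail keeps its current order
        if active <= 3:
            break                        # last stage: only three genes re-ranked
        active = max(3, active // 2)     # assumed rounding rule
    return ranked
```

With 10 genes, for example, the stages re-rank 10, 5 and then 3 genes, so the scoring function is called three times.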
The ranking process is performed by ranking both artificial and original features together. The use of artificial features has been demonstrated as a useful tool to distinguish the relevant features from the irrelevant ones, as in [15–17]. When a set of genes is given, we generate artificial genes and rank them together with original ones. After finishing the ranking of the set, we assign a gene-index to each original gene by the proportion of artificial ones that are ranked above it, where the gene-index is a real numerical value between 0 and 1. Then, we generate a few subset candidates from which the optimal subset is chosen. Each subset has a subset value, p_i, and it contains original genes whose indices are smaller than or equal to p_i. We train an SVM on every subset, and compute its validation accuracy v(p_i). We stop at the first p_k at which its validation accuracy is better than baseline (i.e., the case in which all features are involved in training).
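The gene-index and the stopping rule can be illustrated with a short sketch. The names below are hypothetical: `is_artificial` is an assumed predicate that recognizes artificial genes, `accuracy` stands in for training an SVM on the subset and measuring validation accuracy, and the candidates are assumed to be tried in the order given.

```python
def gene_indices(ranked, is_artificial):
    """Map each original gene to the fraction of artificial genes ranked above it."""
    total_artificial = sum(1 for g in ranked if is_artificial(g))
    indices = {}
    seen_artificial = 0
    for g in ranked:                     # `ranked` is ordered best-first
        if is_artificial(g):
            seen_artificial += 1
        elif total_artificial:
            indices[g] = seen_artificial / total_artificial
        else:
            indices[g] = 0.0             # no artificial genes: index defaults to 0
    return indices


def subset_for(indices, p):
    """Original genes whose gene-index is smaller than or equal to p."""
    return [g for g, idx in indices.items() if idx <= p]


def first_better_than_baseline(candidates, accuracy, baseline):
    """Return the first subset value whose accuracy beats the baseline, else None."""
    for p in candidates:
        if accuracy(p) > baseline:
            return p
    return None
```

A gene-index near 0 means that almost no artificial gene outranks the gene, which suggests relevance; an index near 1 means that the gene is ranked below most of the artificial (irrelevant by construction) genes.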
When starting to apply AMFES, we first divide all samples into either learning samples or testing samples. Then, we randomly extract r training-validation pairs from the learning samples, where r is set by a heuristic rule that depends on n, the number of learning samples in the dataset. The heuristic ratio and rule are chosen based on experience of the balance of time consumption and performance. The ranking and selection processes from previous sections correspond to one training-validation pair. To increase the reliability of validation, we generate r pairs to find the optimal subset.
We calculate the validation accuracy of all pairs and the average accuracy, av(p_i). Then, we perform the subset search as explained in the previous section to find the optimal p_i value, denoted as p*. However, p* is a derived value and does not belong to a unique subset. Thus, we have to adapt all training samples and repeat the entire process in order to find a unique subset.
We generate artificial genes and rank them together with the original genes. Finally, we select the original genes whose indices are smaller than or equal to the p* derived previously as the subset of genes we select for the dataset.
To treat a complex disease or injury such as AD, an optimal approach is to discover important biomarkers for which we can find certain treatments. These biomarkers form a certain dependency network as a framework for diagnosis and therapy. We call such a network a target network of these biomarkers.
where both w, u are indices of samples w,u = 1,2,…M.
Computing mutual information for every pair of genes in a microarray dataset usually involves nested loops, which take a considerable amount of time. Assume a dataset has N genes and each gene has M samples. To calculate the pairwise mutual information values, the computation usually first finds the kernel distance between any two samples for a given gene. Then, the same process goes through every pair of genes in the dataset. In order to be computationally efficient, two improvements are applied. The first is to calculate the marginal probability of each gene in advance and use it repeatedly during the process [21, 22]. The second is to move the summation over each sample pair for a given gene to the outermost for-loop rather than inside a nested for-loop for every gene pair. As a result, the kernel distance between two samples is calculated only twice instead of N times, thereby saving considerable computational time. LNO (Loop Nest Optimization), which changes the order of nested loops, is a common time-saving technique in computer science.
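The idea of computing per-gene quantities once and reusing them for every gene pair can be sketched as below. This is an illustrative sketch, not the authors' implementation: it assumes a Gaussian kernel density estimator of MI, the bandwidth `h` is an arbitrary assumed parameter, and it caches a full M × M kernel matrix per gene instead of reordering loops in place.

```python
import math


def kernel_matrix(values, h):
    """Gaussian kernel between every pair of samples of one gene."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * h * h)) for b in values]
            for a in values]


def pairwise_mi(expr, h=0.3):
    """Kernel-based MI estimate for every gene pair.

    expr: list of N genes, each a list of M sample values.
    Returns an N x N symmetric matrix.
    """
    n, m = len(expr), len(expr[0])
    # Per-gene quantities are computed once, not once per gene pair.
    kernels = [kernel_matrix(g, h) for g in expr]
    marginals = [[sum(row) for row in k] for k in kernels]
    mi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = 0.0
            for w in range(m):
                joint = sum(kernels[i][w][u] * kernels[j][w][u]
                            for u in range(m))
                # Kernel normalization constants cancel in this ratio.
                total += math.log(
                    m * joint / (marginals[i][w] * marginals[j][w]))
            mi[i][j] = mi[j][i] = total / m
    return mi
```

Under this estimator a gene whose expression is constant across samples has an MI of zero with every other gene, which gives a quick sanity check on any implementation.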
In our approach, a constructed target network is represented by an undirected graph in which nodes represent genes in the system and edges represent the dependency between gene pairs. For each gene pair, we use MI (Mutual Information) to measure the dependency between them and represent the weight of linkages. Assuming that the graph contains N nodes (genes), there should be pairwise MI values for all genetic pairs. An adjacency matrix of N × N elements is used to hold MI values of all the linkages in the graph. The adjacency matrix can be visualized as a heatmap. In addition, hierarchical clustering is often used to help verify the dependency between genes. In this paper, we use the Matlab clustergram() function, which uses Euclidean distance as the default method to calculate pairwise distance, to visualize the heatmap after clustering.
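A minimal sketch of the adjacency-matrix representation follows. The `dependency` callable is a hypothetical placeholder for an MI estimator, the diagonal is left at zero as an assumption (the text does not say how self-dependency is stored), and the summary statistics use the population standard deviation, which is also an assumption.

```python
import statistics


def build_target_network(genes, dependency):
    """Return (adjacency, edges) for an undirected, weighted gene graph.

    `dependency(a, b)` is a hypothetical stand-in for the MI estimate
    between genes a and b.
    """
    n = len(genes)
    adjacency = [[0.0] * n for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):        # each unordered pair is visited once
            weight = dependency(genes[i], genes[j])
            adjacency[i][j] = adjacency[j][i] = weight
            edges.append((genes[i], genes[j], weight))
    return adjacency, edges


def summarize(edges):
    """Mean, standard deviation and sign counts of the edge weights."""
    weights = [w for _, _, w in edges]
    return {
        "mean": statistics.mean(weights),
        "std": statistics.pstdev(weights),
        "positive": sum(1 for w in weights if w > 0),
        "negative": sum(1 for w in weights if w < 0),
    }
```

Because the graph is undirected, only the N(N − 1)/2 unordered pairs are evaluated and each weight is mirrored across the diagonal.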
The gene expressions used for this paper are based on PBMC (Peripheral Blood Mononuclear Cells) blood-based biomarkers [12–14]. The AD subjects and the normal elderly subjects all took the MMSE (Mini-Mental State Examination). Those with chronic metabolic conditions such as diabetes, rheumatoid arthritis and other chronic illnesses, or with familial AD problems, are not included in the analysis [12–14]. Fields such as immunology, transplant immunology, and vaccine development often use PBMCs.
AMFES is used to analyze the gene expressions from the BMC (Blood Mononuclear Cell) of AD patients in the GSE4226 dataset. The dataset contains 9600 features from 14 normal elderly control samples (7 females and 7 males) and 14 AD patient samples (7 females and 7 males). The average age of the patients is 79 ± 5 years with 11 ± 4 years of formal educational background. The platform of the dataset is GPL1211 and gene expressions are extracted by using the technology of the NIA (National Institute on Aging) Human MGC (Mammalian Gene Collection) cDNA microarray. The raw normalized dataset can be found in Additional file 1.
The dataset GSE4227 was extracted from BMC and under the same GPL1211 platform as GSE4226. It was used to identify the genes with expressions associated with GSTM3 (Glutathione S-Transferase Mu 3). The dataset contains 9600 features and 34 samples (16 sporadic AD samples and 18 normal elderly control samples). The raw normalized dataset can be found in Additional file 2.
The GSE4229 dataset contains new subjects and some subjects from GSE4226 and GSE4227. The blood samples were extracted by phlebotomy in an EDTA vacutainer. The dataset also contains 9600 features and 40 samples (18 AD samples and 22 normal elderly control samples). The platform is the same as GSE4226 and GSE4227. The raw normalized dataset can be found in Additional file 3.
Table: Descriptions of 3 datasets: GSE4226, GSE4227, and GSE4229 (columns: Number of Biomarkers; Type of Biomarkers; Number of Samples)
GSE4226: 28 samples (14 AD vs. 14 normal)
GSE4227: 34 samples (16 AD vs. 18 normal)
GSE4229: 40 samples (18 AD vs. 22 normal)
Table: Results of selected subsets of genes (column: Number of Biomarkers Selected)
Table: Results of analysis of MI matrices (columns: Mean value of MI; Standard deviation of MI; Number of positive values; Number of negative values)
Table: The comparison of female genes and male genes selected by AMFES (columns: Number of features selected for female; Number of features selected for male)
Table: The selected female genes of GSE4226
Table: The selected male genes of GSE4226
In this paper, the GSE4226 dataset is studied in more detail because the number of female and male subjects is equal, thereby avoiding the biased sampling problem of the datasets (i.e., when the number of samples is unbalanced for two classes). Traditionally, statistical tools such as SAM (Significance Analysis of Microarrays), PAM (Prediction Analysis for Microarrays) or ANOVA (Analysis of Variance) are used for analyses of biomarkers [12–14]. Compared to the results in [12–14], AMFES selects a much smaller, yet important, set of biomarkers which is supported by the cross-validation. In [7, 8, 29], AD biomarker analyses were performed based on SVM-RFE. Here, AMFES can appreciably improve the performance of biomarker analysis. In our current research, we are extending the framework of Zhang et al. with AMFES, and this work will be reported shortly. Finally, in the gender analysis, when we compare results for female AD subjects with those for male AD subjects, there are no overlapping genes, indicating that the important biomarkers may differ according to gender.
We thank the two reviewers for their valuable comments and suggestions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.