### AMFES

Selecting a small subset out of thousands of features is always a challenging task for microarray datasets due to the curse of dimensionality (COD). To tackle this problem, we use a gene selection methodology, AMFES, which selects an optimal subset of genes by training SVMs on adaptively generated subsets of genes [11]. AMFES is built on two fundamental processes: ranking and selection.

The gene ranking process contains several stages. In the first stage, all genes are ranked by their ranking scores in descending order. In the next stage, only the top half of the ranked genes are ranked again, while the bottom half keeps its current order. The same step repeats recursively until only three genes remain to be ranked again, which completes one ranking process.
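
The recursive re-ranking schedule can be sketched in a few lines. This is illustrative code, not the authors' implementation: `rank_once` stands for any function that returns a list of genes sorted from most to least relevant (e.g., by the SVM-based score described below).

```python
def recursive_rank(genes, rank_once):
    """Recursive ranking: re-rank only the top half at each stage,
    keeping the bottom half's order, until at most three genes remain."""
    ranked = rank_once(list(genes))   # first stage: rank all genes
    n = len(ranked)
    while n > 3:
        n //= 2                       # next stage covers only the top half
        ranked[:n] = rank_once(ranked[:n])  # bottom part keeps its order
    return ranked
```

With a deterministic scorer, the schedule simply refines the top of the list while the tail is frozen early, which is what makes the process cheap for large gene sets.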

Assume that at a given ranking stage there are *k* genes indexed from 1 to *k*. To rank these *k* genes, we follow the four steps below. (I) We first generate *m* independent subsets S_{1}, …, S_{m}. Each subset S_{i}, *i* = 1, 2, …, *m*, has *j* genes selected randomly and independently from the *k* genes. (II) Let C_{i} be the SVM classifier trained on subset S_{i}, *i* = 1, 2, …, *m*. For each of the *k* genes, we compute the ranking score *θ*_{m}(g) of gene g, as in equation (1) below [11]. (III) We use the average weight of gene *g*, given by the summation of the weights of *g* in the *m* subsets divided by the number of subsets for which *g* is randomly selected. The *weight*_{i}(g) is defined as the change in the objective function due to *g*, as in equation (2) [11], and the value of *m* is obtained when θ_{m} satisfies equation (3) in [11]. This increases the robustness with which the true classifying ability of gene *g* is represented. (IV) The *k* genes are then ranked in descending order by their ranking scores.

${\theta}_{m}\left(g\right)=\frac{\sum _{i=1}^{m}{I}_{\left\{g\in {S}_{i}\right\}}{\mathit{weight}}_{i}\left(g\right)}{\sum _{i=1}^{m}{I}_{\left\{g\in {S}_{i}\right\}}}$

(1)

where I is an indicator function such that I_{proposition} = 1 if the proposition is true and I_{proposition} = 0 otherwise. In other words, if gene g is randomly selected into subset S_{i}, denoted g ∈ S_{i}, then I_{{g ∈ S_{i}}} = 1.
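
Equation (1) averages a gene's weight over only those subsets that happen to contain it. A minimal sketch (illustrative names, not the authors' code; `weights[i]` maps each gene in S_{i} to its *weight*_{i}(g)):

```python
def ranking_score(g, subsets, weights):
    """theta_m(g): average of weight_i(g) over the subsets S_i containing g.

    subsets : list of sets of gene indices, S_1 ... S_m
    weights : list of dicts, weights[i][g] = weight_i(g) for g in S_i
    """
    hits = [weights[i][g] for i, s in enumerate(subsets) if g in s]
    return sum(hits) / len(hits) if hits else 0.0
```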

We denote the objective function of C_{i} as *obj*_{i}(**v**_{1}, **v**_{2}, …, **v**_{s}), where **v**_{1}, **v**_{2}, …, **v**_{s} are the support vectors of C_{i}. The *weight*_{i}(g) is then defined as the change in the objective function due to g, i.e., [6–8]:

${\mathit{weight}}_{i}\left(g\right)=\left|{\mathit{obj}}_{i}\left({v}_{1},{v}_{2},\dots ,{v}_{s}\right)-{\mathit{obj}}_{i}\left({v}_{1}^{\left(g\right)},{v}_{2}^{\left(g\right)},\dots ,{v}_{s}^{\left(g\right)}\right)\right|$

(2)

Note that if **v** is a vector, **v**^{(g)} is the vector obtained by dropping gene g from **v**. Let θ_{m} be the vector comprising the ranking scores derived from the *m* gene subsets generated thus far, and let θ_{m-1} be the vector at the previous stage. The value of *m* is determined as the point at which θ_{m} satisfies equation (3), adding one subset at a time.

$\frac{{\left\Vert {\boldsymbol{\theta}}_{m-1}-{\boldsymbol{\theta}}_{m}\right\Vert}^{2}}{{\left\Vert {\boldsymbol{\theta}}_{m-1}\right\Vert}^{2}}<0.01$

(3)

where ||θ|| denotes the Euclidean norm of the vector θ.
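
The stopping rule in equation (3) is a relative-change test on the score vector and takes only a few lines to check (sketch; `theta_prev` and `theta_curr` are θ_{m-1} and θ_{m}):

```python
import numpy as np

def converged(theta_prev, theta_curr, tol=0.01):
    """Equation (3): relative squared change of the score vector below tol."""
    theta_prev = np.asarray(theta_prev, dtype=float)
    theta_curr = np.asarray(theta_curr, dtype=float)
    num = np.linalg.norm(theta_prev - theta_curr) ** 2
    den = np.linalg.norm(theta_prev) ** 2
    return num / den < tol
```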

The ranking process is performed by ranking artificial and original features together. The use of artificial features has been demonstrated to be a useful tool for distinguishing relevant features from irrelevant ones, as in [15–17]. When a set of genes is given, we generate artificial genes and rank them together with the original ones. After ranking the set, we assign a gene-index to each original gene according to the proportion of artificial genes ranked above it, where the gene-index is a real value between 0 and 1. Then, we generate a few subset candidates from which the optimal subset is chosen. Each subset has a subset value, *p*_{i}, and contains the original genes whose indices are smaller than or equal to *p*_{i} [11]. We train an SVM on every subset and compute its validation accuracy *v*(*p*_{i}). We stop at the first *p*_{k} at which the validation accuracy is better than the baseline (i.e., the case in which all features are involved in training [11]).
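
The gene-index computation can be sketched as follows (hypothetical helper, not the authors' code; it assumes the combined ranking is a best-to-worst list in which each entry is flagged as artificial or original):

```python
def gene_indices(ranked):
    """Assign each original gene the fraction of artificial genes ranked above it.

    ranked : list of (gene_id, is_artificial) pairs, best to worst.
    Returns {gene_id: index in [0, 1]} for the original genes only.
    """
    total_artificial = sum(1 for _, art in ranked if art)
    indices, seen = {}, 0
    for gene_id, is_artificial in ranked:
        if is_artificial:
            seen += 1                 # one more artificial gene ranked above
        else:
            indices[gene_id] = seen / total_artificial if total_artificial else 0.0
    return indices
```

An original gene ranked above every artificial gene gets index 0, so thresholding the indices at *p*_{i} keeps the genes least likely to be irrelevant.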

When applying AMFES, we first divide all samples into learning samples and testing samples. Then, we randomly extract *r* training–validation pairs from the learning samples according to the heuristic rule $r=\mathrm{max}\left(5,\left(\mathit{int}\right)\frac{500}{n+0.5}\right)$, where *n* is the number of learning samples in the dataset. This heuristic was chosen empirically to balance computation time against performance. The ranking and selection processes from the previous sections correspond to one training–validation pair; to increase the reliability of validation, we generate *r* pairs to find the optimal subset.
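
The heuristic rule for the number of training–validation pairs is a one-liner (sketch of the formula above):

```python
def num_pairs(n):
    """r = max(5, int(500 / (n + 0.5))) for n learning samples."""
    return max(5, int(500 / (n + 0.5)))
```

Small datasets get many pairs (e.g., around 24 pairs for 20 learning samples), while large datasets fall back to the floor of 5 pairs.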

We calculate the validation accuracy of all pairs and the average accuracy, *av*(*p*_{i}). Then, we perform the subset search explained in the previous section to find the optimal *p*_{i} value, denoted *p**. However, *p** is a derived value and does not correspond to a unique subset. Thus, we use all training samples and repeat the entire process in order to find a unique subset.

We generate artificial genes and rank them together with the original genes. Finally, the original genes whose indices are smaller than or equal to the previously derived *p** become the selected subset for the dataset [11].

### Mutual information

To treat a complex disease or injury such as AD, an optimal approach is to discover important biomarkers for which treatments can be found. These biomarkers form a dependency network that serves as a framework for diagnosis and therapy [18]. We call such a network the target network of these biomarkers [11].

Mutual information has been used to measure the dependency between two random variables. Assume the two random variables X and Y are continuous. Their mutual information is defined as [19]:

$I\left(X,Y\right)={\displaystyle \iint f\left(x,y\right)\text{log}\left(\frac{f\left(x,y\right)}{f\left(x\right)f\left(y\right)}\right)\mathit{\text{dxdy}}}$

(7)

where *f*(x, y) denotes the joint probability density and *f*(x) and *f*(y) denote the marginal probability densities of X and Y. Using Gaussian kernel estimation, *f*(x, y), *f*(x), and *f*(y) can be represented as [20]:

$f\left(x,y\right)=\frac{1}{M}\sum _{u}\frac{1}{2\pi {h}^{2}}{e}^{-\frac{1}{2{h}^{2}}\left({\left(x-{x}_{u}\right)}^{2}+{\left(y-{y}_{u}\right)}^{2}\right)}$

(8)

$f\left(x\right)=\frac{1}{M}\sum _{u}\frac{1}{\sqrt{2\pi {h}^{2}}}{e}^{-\frac{1}{2{h}^{2}}{\left(x-{x}_{u}\right)}^{2}}$

(9)

$f\left(y\right)=\frac{1}{M}\sum _{u}\frac{1}{\sqrt{2\pi {h}^{2}}}{e}^{-\frac{1}{2{h}^{2}}{\left(y-{y}_{u}\right)}^{2}}$

(10)

where *M* is the number of samples of both X and Y, *u* = 1, 2, …, *M* indexes the samples, and *h* is a parameter controlling the width of the kernels. The mutual information *I*(*X*, *Y*) can then be represented as:

$I\left(X,Y\right)=\frac{1}{M}\sum _{w}\mathrm{log}\frac{M\sum _{u}{e}^{-\frac{1}{2{h}^{2}}\left({\left({x}_{w}-{x}_{u}\right)}^{2}+{\left({y}_{w}-{y}_{u}\right)}^{2}\right)}}{\sum _{u}{e}^{-\frac{1}{2{h}^{2}}{\left({x}_{w}-{x}_{u}\right)}^{2}}\sum _{u}{e}^{-\frac{1}{2{h}^{2}}{\left({y}_{w}-{y}_{u}\right)}^{2}}}$

(11)

where *w*, *u* = 1, 2, …, *M* are both sample indices.
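
A direct vectorized implementation of the estimator in equation (11) might look as follows (a sketch, not the authors' code; the kernel width `h` is a user-chosen parameter):

```python
import numpy as np

def mutual_information(x, y, h=0.3):
    """Gaussian-kernel estimate of I(X, Y), following equation (11)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    M = len(x)
    dx = (x[:, None] - x[None, :]) ** 2   # (x_w - x_u)^2 for all pairs
    dy = (y[:, None] - y[None, :]) ** 2   # (y_w - y_u)^2 for all pairs
    kx = np.exp(-dx / (2 * h * h))        # kernel terms for X
    ky = np.exp(-dy / (2 * h * h))        # kernel terms for Y
    joint = (kx * ky).sum(axis=1)         # sum over u of the joint kernel
    return np.mean(np.log(M * joint / (kx.sum(axis=1) * ky.sum(axis=1))))
```

For a constant Y the joint and marginal terms cancel and the estimate is exactly zero, which is a convenient sanity check.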

Computing pairwise mutual information for a microarray dataset usually involves a nested-loop calculation that is very time-consuming. Assume a dataset has *N* genes, each with *M* samples. The computation typically first finds the kernel distance between any two samples for a given gene, and then repeats this for every pair of genes in the dataset. To make it computationally efficient, two improvements are applied [21]. The first is to calculate the marginal probability of each gene in advance and reuse it throughout the process [21, 22]. The second is to move the summation over sample pairs for a given gene into the outermost loop rather than inside the nested loop over gene pairs. As a result, the kernel distance between two samples is calculated only twice instead of *N* times, saving considerable computation time. Loop nest optimization (LNO), which changes the order of nested loops, is a common time-saving technique in computer science [23].
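
The hoisting idea can be illustrated as follows (a sketch of the optimization described in [21], not the authors' code): each gene's M × M kernel matrix and its marginal sums are computed once, outside the gene-pair loop, and then reused for every one of the N − 1 pairs the gene appears in.

```python
import numpy as np

def pairwise_mi(data, h=0.3):
    """Pairwise MI for an N x M matrix (N genes, M samples).

    Kernel matrices and marginal sums are precomputed per gene,
    so the expensive kernel distances are never recomputed per pair.
    """
    N, M = data.shape
    # Precompute one M x M Gaussian kernel matrix per gene (done N times total).
    kernels = [np.exp(-(g[:, None] - g[None, :]) ** 2 / (2 * h * h)) for g in data]
    marginals = [k.sum(axis=1) for k in kernels]  # reused marginal terms
    mi = np.zeros((N, N))
    for a in range(N):
        for b in range(a + 1, N):
            joint = (kernels[a] * kernels[b]).sum(axis=1)  # cheap reuse
            val = np.mean(np.log(M * joint / (marginals[a] * marginals[b])))
            mi[a, b] = mi[b, a] = val
    return mi
```

The per-pair work reduces to an elementwise product and a logarithm, while all kernel evaluations happen in the outer precomputation pass.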