### AMFES

The COD (Curse of Dimensionality) has been a major challenge in microarray data analysis due to the large number of genes (features) and the relatively small number of samples (patterns). To tackle this problem, many gene selection methods have been developed to select only significant subsets of genes in a microarray dataset. AMFES selects an optimal subset of genes by training SVMs on adaptively generated subsets of genes [6].

When AMFES runs on a dataset, all samples are randomly divided into a training subset *S* and a testing subset *T* at a heuristic ratio of 5:1. *S* is used for ranking and selecting genes and for constructing a classifier from the selected genes; *T* is used for computing test accuracy. Given a training subset *S*, we extract *r* training-validation pairs from *S* according to the heuristic rule *r* = max(5, (int)(500/*n* + 0.5)), where *n* is the number of samples in *S*. Each pair randomly divides *S* into a training component and a validation component at a ratio of 4:1. The heuristic ratio and rule were chosen from experimental experience to balance time consumption against performance. AMFES has two fundamental processes, ranking and selection. We first explain each process in detail and then present the integrated version.
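As a minimal sketch, the heuristic 5:1 split and the pair-count rule above can be written as follows (the function names are illustrative, not part of AMFES):

```python
import random

def split_train_test(samples, rng=random.Random(0)):
    """Randomly divide samples into a training subset S and a testing
    subset T at the heuristic 5:1 ratio."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 5 // 6           # 5 parts training, 1 part testing
    return shuffled[:cut], shuffled[cut:]  # S, T

def num_pairs(n):
    """r = max(5, (int)(500/n + 0.5)) training-validation pairs,
    where n is the number of samples in S."""
    return max(5, int(500 / n + 0.5))
```

For example, a training subset with 50 samples yields *r* = 10 pairs, while a large subset with 200 samples falls back to the minimum of 5.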

### Ranking

The gene ranking process consists of several ranking stages. In the first stage, all genes are ranked by their ranking scores in descending order. In the next stage, only the top half of the ranked genes are ranked again, while the bottom half keeps its current order. The same iteration repeats recursively until only three genes remain to be ranked, which completes one ranking process. Assume that at a given ranking stage there are *k* genes, indexed from 1 to *k*. To rank these *k* genes, we follow the four steps below. (I) We first generate *m* independent subsets *S*_{1} … *S*_{m}. Each subset *S*_{i}, *i* = 1, 2, …, *m*, has *j* genes selected randomly and independently from the *k* genes, where *j* = (int)(*k*/2). (II) Let C_{i} be the SVM classifier trained on subset *S*_{i}, *i* = 1, 2, …, *m*. For each of the *k* genes, we compute the ranking score ${\theta}_{m}$(g) of the gene *g* as in equation (1). (III) We use the average weight of the gene *g*: the sum of the weights of *g* over the *m* subsets divided by the number of subsets in which *g* was randomly selected. This averaging makes the score a more robust representation of the true classifying ability of the gene *g*. (IV) Rank the *k* genes in descending order by their ranking scores.

${\theta}_{m}\left(g\right)=\frac{{\displaystyle \sum _{i=1}^{m}{I}_{\left\{g\in {S}_{i}\right\}}\,weigh{t}_{i}\left(g\right)}}{{\displaystyle \sum _{i=1}^{m}{I}_{\left\{g\in {S}_{i}\right\}}}}$

(1)

where I is an indicator function such that I_{proposition} = 1 if the proposition is true and I_{proposition} = 0 otherwise. In other words, if gene *g* is randomly selected for the subset *S*_{i}, denoted $g\in {S}_{i}$, then the indicator equals 1.

We denote the objective function of C_{i} as $ob{j}_{i}\left({v}_{1},{v}_{2},\dots ,{v}_{s}\right)$, where **v**_{1}, **v**_{2}, …, **v**_{s} are the support vectors of C_{i}. The *weight*_{i}(g) is then defined as the change in the objective function due to g, i.e.,

$weigh{t}_{i}\left(g\right)=\left|ob{j}_{i}\left({v}_{1},{v}_{2},\dots ,{v}_{s}\right)-ob{j}_{i}\left({v}_{1}^{\left(g\right)},{v}_{2}^{\left(g\right)},\dots ,{v}_{s}^{\left(g\right)}\right)\right|$

(2)

[6–8]. Note that if **v** is a vector, **v**^{(g)} is the vector obtained by dropping gene *g* from **v**. Let θ_{m} be the vector comprising the ranking scores derived from the *m* gene subsets generated thus far, and θ_{m-1} the corresponding vector from the previous stage. The value of *m* is determined by adding one subset at a time until θ_{m} satisfies equation (3).

$\frac{{\Vert {\mathbf{\theta}}_{m-1}\phantom{\rule{0.5em}{0ex}}-\phantom{\rule{0.5em}{0ex}}{\mathbf{\theta}}_{m}\Vert}^{2}}{{\Vert {\mathbf{\theta}}_{m-1}\Vert}^{2}}<\phantom{\rule{0.5em}{0ex}}0.01$

(3)

where ||θ|| denotes the Euclidean norm of the vector θ. The pseudocode of the ranking process is shown below.

#### Pseudo codes for ranking process of AMFES

*RANK-SUBROUTINE*

*INPUT: a subset of k genes to be ranked*

*Generate k artificial genes and put them next to the original genes*

*Pick an initial tentative value of m*

*DO WHILE m does not satisfy equation (3)*

*FOR each subset Si of m subsets*

*Randomly select j elements from k genes to form the subset Si*

*Train an SVM to get weight*_{i}*(g) for each gene in the subset*

*ENDFOR*

*FOR each gene of k genes*

*Compute the average score of the gene from m subsets*

*ENDFOR*

*List k genes in descending order by their ranking scores*

*ENDDO*

*OUTPUT: the k genes, ranked*
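The score of equation (1) can be sketched independently of the SVM details. In the sketch below, `weight_fn` is a stand-in for the objective-function change of equation (2), and the dummy weight function in the usage lines is purely illustrative:

```python
import random

def ranking_scores(k, m, weight_fn, rng=random.Random(0)):
    """theta_m(g) of equation (1): average each gene's weight over
    the random subsets S_1 ... S_m in which it appears."""
    j = k // 2                    # subset size j = (int)(k/2)
    totals = [0.0] * k            # running sum of weight_i(g)
    counts = [0] * k              # number of subsets containing g
    for _ in range(m):
        subset = rng.sample(range(k), j)  # S_i: j of the k genes
        weights = weight_fn(subset)       # stand-in for SVM-derived weights
        for g, w in zip(subset, weights):
            totals[g] += w
            counts[g] += 1
    # genes never sampled keep score 0
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# toy weight function: pretend gene g's weight is simply g
scores = ranking_scores(k=6, m=50, weight_fn=lambda s: [float(g) for g in s])
ranked = sorted(range(6), key=lambda g: -scores[g])  # step (IV)
```

With the toy weight function, each gene's average weight equals its index, so `ranked` lists the genes from 5 down to 0.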

### Selection

Ranking artificial features together with original features has been demonstrated to be a useful technique for distinguishing relevant features from irrelevant ones [9–11]. Our selection process also uses this technique to find the optimal subset of genes.

Assume a set of genes is given. We generate artificial genes and rank them together with the original ones. After ranking, we assign each original gene a gene-index: the proportion of artificial genes ranked above it, a real value between 0 and 1. Then, we generate a few candidate subsets from which the optimal subset is chosen. Let *p*_{1} < *p*_{2} < … be the sequence of subset-indices of the candidates, where *p*_{i} = *i* × 0.005 and *i* = 1, 2, …, 200. Let B(*p*_{i}) denote the subset corresponding to subset-index *p*_{i}; it contains the original genes whose gene-indices are smaller than or equal to *p*_{i}. Then, we train an SVM on every B(*p*_{i}) and compute its validation accuracy *v*(*p*_{i}).

We stop at the first *p*_{k} at which *v*(*p*_{k}) ≥ *v*_{baseline} and *v*(*p*_{k}) ≥ *v*(*p*_{l}) for *k* ≤ *l* ≤ *k*+10, where *v*_{baseline} is the validation accuracy of the SVM trained on the baseline, i.e., the case in which all features are involved in training. The resulting B(*p*_{k}) is then the optimal subset for the given set of genes. The pseudocode for the selection process of AMFES is listed below.

#### Pseudo codes for selection process of AMFES

*SELECTION-SUBROUTINE*

*INPUT: a few candidate subsets with their validation accuracies, v(p*_{i}*)*

*Compute the validation accuracy with all genes, v*_{baseline}

*FOR each subset given*

*IF v(p*_{k}*) ≥ v*_{baseline} *and v(p*_{k}*) ≥ v(p*_{l}*) for k ≤ l ≤ k+10 THEN*

*Resulting subset is B(p*_{k}*)*

*ENDIF*

*ENDFOR*

*OUTPUT: B(p*_{k}*)*
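The stopping rule amounts to a simple scan over the candidate accuracies. A sketch, assuming the comparison window is clipped at the end of the list (a detail the text does not spell out):

```python
def select_subset_index(v, v_baseline):
    """Return the first index k with v[k] >= v_baseline and
    v[k] >= v[l] for all l in [k, k+10]; None if no k qualifies."""
    for k in range(len(v)):
        window = v[k:k + 11]          # v(p_k) .. v(p_{k+10})
        if v[k] >= v_baseline and v[k] >= max(window):
            return k
    return None

# accuracies rise to 0.90 at index 2 and never exceed it afterwards
accs = [0.70, 0.80, 0.90] + [0.85] * 12
k = select_subset_index(accs, v_baseline=0.85)   # -> 2
```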

### Integrated version

The ranking and selection processes described above are for one training-validation pair. To increase the reliability of validation, we generate *r* pairs to find the optimal subset. We calculate the validation accuracy *v*(*p*_{q-i}) of the *q*-th pair for all subsets, where *q* denotes the pair-index and *i* the subset-index. Then, we compute *av*(*p*_{i}), the average of *v*(*p*_{q-i}) over the *r* training-validation pairs, and perform the subset search explained in the selection section on *av*(*p*_{i}) to find the optimal *p*_{i}, denoted *p**. However, *p** does not correspond to a unique subset, since each pair has its own B(*p**) and these can all differ. Thus, we use all samples of *S* as training samples to find a unique subset: we generate artificial genes, rank them together with the original genes, and select the original genes whose gene-indices are smaller than or equal to *p**. The integrated process is shown below. In the pseudocode, AMFES-ALGORITHM represents the integrated version of the whole process, while RANK-SUBROUTINE represents the ranking process and SELECTION-SUBROUTINE the selection process.

#### Pseudo codes for integrated version of AMFES

*AMFES ALGORITHM-Integrated Version*

*INPUT: a dataset*

*Divide the dataset into train samples and test samples*

*Divide the train samples into r training-validation component pairs*

*FOR each pair of the r training-validation component pairs*

*Generate 200 candidate subsets p*_{q-i}

*FOR each subset of the 200 subsets*

*CALL RANK-SUBROUTINE to rank each subset*

*Assign each original gene a gene-index*

*Train an SVM on each subset and compute the corresponding validation accuracy, v(p*_{q-i}*), for the subset*

*END FOR*

*END FOR*

*FOR each subset of the 200 subsets*

*Compute the average validation rate, av(p*_{i}*), of the subset from the r pairs*

*END FOR*

*CALL SELECTION-SUBROUTINE to search for the optimal subset by its average validation rate and denote it as p**

*CALL RANK-SUBROUTINE to rank the original genes again and select the original genes that belong to the subset B(p*)*

*OUTPUT: an optimal subset of genes B(p*)*

### Mutual information

Mutual information measures the dependency between two random variables based on their probability distributions. Given two random variables X and Y, the mutual information I(X; Y) can be expressed by these equivalent equations [12]:

$I\left(X;Y\right)=H\left(X\right)-H\left(X|Y\right)$

(4)

$=H\left(Y\right)-H\left(Y|X\right)$

(5)

$=H\left(X\right)+H\left(Y\right)-H\left(X,Y\right)$

(6)

where H(X) and H(Y) denote the marginal entropies, H(X|Y) and H(Y|X) the conditional entropies, and H(X,Y) the joint entropy of X and Y. To compute these entropies, the probability distribution functions of the random variables must be calculated first. Because gene expressions are usually continuous values, we used kernel estimation to calculate the probability distributions [13].
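As a quick numeric check of the entropy identities above, the following computes I(X;Y) for a small discrete joint distribution both via equation (6) and directly from the definition (the 2×2 distribution is an arbitrary illustrative choice):

```python
from math import log2

# illustrative joint distribution p[x][y] of two binary variables
p = [[0.4, 0.1],
     [0.1, 0.4]]

px = [sum(row) for row in p]                              # marginal of X
py = [sum(p[x][y] for x in range(2)) for y in range(2)]   # marginal of Y

def H(dist):
    """Shannon entropy in bits."""
    return -sum(q * log2(q) for q in dist if q > 0)

H_XY = H([p[x][y] for x in range(2) for y in range(2)])   # joint entropy

# equation (6): I(X;Y) = H(X) + H(Y) - H(X,Y)
I = H(px) + H(py) - H_XY

# cross-check against the direct definition of mutual information
I_direct = sum(p[x][y] * log2(p[x][y] / (px[x] * py[y]))
               for x in range(2) for y in range(2) if p[x][y] > 0)
# both give I(X;Y) ≈ 0.278 bits
```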

Assume the two random variables X and Y are continuous. The mutual information is then defined as [12]:

$I\left(X,Y\right)=\int \int f\left(x,y\right)log\left(\frac{f\left(x,y\right)}{f\left(x\right)f\left(y\right)}\right)dxdy$

(7)

where *f*(x,y) denotes the joint probability distribution, and *f*(x) and *f*(y) denote the marginal probability distributions of X and Y. Using Gaussian kernel estimation, *f*(x,y), *f*(x), and *f*(y) can be represented as the equations below [14]:

$f\left(x,y\right)=\frac{1}{M}{\displaystyle \sum _{u=1}^{M}\frac{1}{2\pi {h}^{2}}{e}^{-\frac{1}{2{h}^{2}}\left({\left(x-{x}_{u}\right)}^{2}+{\left(y-{y}_{u}\right)}^{2}\right)}}$

(8)

$f\left(x\right)=\frac{1}{M}{\displaystyle \sum _{u=1}^{M}\frac{1}{\sqrt{2\pi {h}^{2}}}{e}^{-\frac{1}{2{h}^{2}}{\left(x-{x}_{u}\right)}^{2}}}\text{,}$

(9)

where *M* is the number of samples of both X and Y, *u* = 1, 2, …, *M* is the sample index, and *h* is a parameter controlling the width of the kernels; *f*(y) is defined analogously to equation (9). The mutual information $I\left(X,Y\right)$ can then be represented as:

$I\left(X,Y\right)=\frac{1}{M}{\displaystyle \sum _{w=1}^{M}\mathrm{log}\frac{M{\displaystyle \sum _{u=1}^{M}{e}^{-\frac{1}{2{h}^{2}}\left({\left({x}_{w}-{x}_{u}\right)}^{2}+{\left({y}_{w}-{y}_{u}\right)}^{2}\right)}}}{\left({\displaystyle \sum _{u=1}^{M}{e}^{-\frac{1}{2{h}^{2}}{\left({x}_{w}-{x}_{u}\right)}^{2}}}\right)\left({\displaystyle \sum _{u=1}^{M}{e}^{-\frac{1}{2{h}^{2}}{\left({y}_{w}-{y}_{u}\right)}^{2}}}\right)}}$

(10)

where *w* and *u* are both sample indices, $w,u=1,2,\dots ,M$.

Computing the pairwise mutual information of the genes in a microarray dataset usually involves nested loops that take a dramatic amount of time. Assume a dataset has *N* genes and each gene has *M* samples. To calculate the pairwise mutual information values, the computation usually first finds the kernel distance between any two samples for a given gene; the same process is then repeated for every pair of genes in the dataset. To make the computation efficient, two improvements are applied [13]. The first is to calculate the marginal probability of each gene in advance and reuse it throughout the process [13][15]. The second is to move the summation over sample pairs for a given gene to the outermost for-loop rather than inside a nested for-loop over every gene pair. As a result, the kernel distance between two samples is calculated only twice instead of *N* times, which saves a large amount of computation time. This reordering is an instance of LNO (Loop Nest Optimization), a common time-saving technique in computer science [16].
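A direct sketch of the kernel estimate of equation (10) for one gene pair, with the per-pair kernel terms computed once up front and reused, in the spirit of the precomputation described above (the function name and bandwidth value are our illustrative choices):

```python
import math

def kernel_mi(x, y, h=0.3):
    """Gaussian-kernel estimate of I(X;Y) following equation (10).
    The kernel terms kx[w][u], ky[w][u] are computed once and reused
    for both the joint sum and the two marginal sums."""
    M = len(x)
    kx = [[math.exp(-((x[w] - x[u]) ** 2) / (2 * h * h)) for u in range(M)]
          for w in range(M)]
    ky = [[math.exp(-((y[w] - y[u]) ** 2) / (2 * h * h)) for u in range(M)]
          for w in range(M)]
    total = 0.0
    for w in range(M):
        joint = sum(kx[w][u] * ky[w][u] for u in range(M))
        total += math.log(M * joint / (sum(kx[w]) * sum(ky[w])))
    return total / M

xs = [i / 10 for i in range(10)]
kernel_mi(xs, xs)          # identical variables: strongly positive
kernel_mi(xs, [0.5] * 10)  # constant second variable: exactly 0
```

When the second variable is constant, every ratio inside the logarithm collapses to 1, so the estimate is exactly zero, which matches the intuition that a constant carries no information about X.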

### Target network

The effect of drugs with multiple components should be viewed as a whole rather than as a superposition of individual components [1][2]. Thus, the concept of synergy was formed and is considered an efficient way to design a drug [3]. In [17], mathematical models are used to measure the effect generated by multiple components; however, they do not consider practical situations such as crosstalk between pathways. Network approaches have begun to be used to analyze the interactions among multiple components [4]. Initiated by the work in [4], another systems-biology methodology, NIMS (Network-target-based Identification of Multicomponent Synergy), was proposed to measure the effect of drug agent pairs based on their gene expression data [5]. NIMS focuses on ranking drug agent pairs of Chinese Medicine components by their SS (Synergy Score).

In [5], a drug component is denoted as a drug agent, and the set of genes associated with it are denoted as the agent genes of that drug agent. For a given disease, assume there are *n* drug agents. Initially, NIMS randomly chooses two drug agents, A_{1} and A_{2}, and builds a background target network from their agent genes as a graph. From the graph, NIMS calculates the TS (Topology Score) by applying PCA (Principal Component Analysis) to form an IP value that integrates betweenness, closeness, and a variant of the eigenvalue-based PageRank [18]. The TS evaluates the topological significance of the target network for the drug agent pair A_{1} and A_{2}, and is defined as

$T{S}_{1,2}=\frac{1}{2}\times \left[\frac{{\displaystyle \sum _{i}I{P}_{1}\left(i\right)\times exp\left(-min\left({d}_{i,j}\right)\right)}}{{\displaystyle \sum _{i}I{P}_{1}\left(i\right)}}+\frac{{\displaystyle \sum _{j}I{P}_{2}\left(j\right)\times exp\left(-min\left({d}_{j,i}\right)\right)}}{{\displaystyle \sum _{j}I{P}_{2}\left(j\right)}}\right]\text{,}$

(11)

where *IP*_{1} and *IP*_{2} denote the IP values for drug agents A_{1} and A_{2}, min(*d*_{i,j}) denotes the minimum shortest-path distance from gene *i* of A_{1} to the genes of A_{2}, and min(*d*_{j,i}) denotes the corresponding distance from gene *j* of A_{2} to the genes of A_{1}.
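Given precomputed IP values and minimum shortest-path distances, equation (11) reduces to two weighted averages. A sketch (the function and argument names are ours):

```python
import math

def topology_score(ip1, ip2, dmin12, dmin21):
    """TS_{1,2} of equation (11). ip1[i] and ip2[j] are the IP values of
    the two agents' genes; dmin12[i] is the minimum shortest-path distance
    from gene i of A1 to the genes of A2, and dmin21[j] the reverse."""
    s1 = sum(ip * math.exp(-d) for ip, d in zip(ip1, dmin12)) / sum(ip1)
    s2 = sum(ip * math.exp(-d) for ip, d in zip(ip2, dmin21)) / sum(ip2)
    return 0.5 * (s1 + s2)

# two fully overlapping agents (all distances 0) give the maximum TS of 1
topology_score([1.0, 2.0], [3.0], [0, 0], [0])   # -> 1.0
```

Because each term is weighted by exp(−d), distant gene pairs contribute almost nothing, so TS rewards agent pairs whose target genes sit close together in the network.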

NIMS defines another term, the AS (Agent Score), to evaluate the similarity of disease phenotypes for a drug agent pair. For a given drug agent, if one of its agent genes has a phenotype record in the OMIM (Online Mendelian Inheritance in Man) database, the drug agent takes that phenotype as one of its phenotypes. The similarity score of a drug agent pair is defined as the cosine of the angle between the pair's feature vectors [19]. The AS is defined as:

$A{S}_{1,2}=\frac{{\displaystyle \sum _{i,j}{P}_{i,j}}}{M},$

(12)

where *P*_{i,j} denotes the similarity score of the *i*th phenotype of A_{1} and the *j*th phenotype of A_{2}, and *M* denotes the total number of phenotypes.

The SS of a pair is then defined as the product of its TS and AS. NIMS calculates the SS for all possible drug agent pairs for a disease and identifies potential pairs by ranking them by SS.