 Research
 Open Access
 Published:
A novel treebased procedure for deciphering the genomic spectrum of clinical disease entities
Journal of Clinical Bioinformaticsvolume 4, Article number: 6 (2014)
Abstract
Background
Dissecting the genomic spectrum of clinical disease entities is a challenging task. Recursive partitioning (or classification trees) methods provide powerful tools for exploring complex interplay among genomic factors, with respect to a main factor, that can reveal hidden genomic patterns. To take confounding variables into account, the partially linear treebased regression (PLTR) model has been recently published. It combines regression models and treebased methodology. It is however computationally burdensome and not well suited for situations for which a large number of exploratory variables is expected.
Methods
We developed a novel procedure that represents an alternative to the original PLTR procedure, and considered different selection criteria. A simulation study with different scenarios has been performed to compare the performances of the proposed procedure to the original PLTR strategy.
Results
The proposed procedure with a Bayesian Information Criterion (BIC) achieved good performances to detect the hidden structure as compared to the original procedure. The novel procedure was used for analyzing patterns of copynumber alterations in lung adenocarcinomas, with respect to Kirsten Rat Sarcoma Viral Oncogene Homolog gene (KRAS) mutation status, while controlling for a cohort effect. Results highlight two subgroups of pure or nearly pure wildtype KRAS tumors with particular copynumber alteration patterns.
Conclusions
The proposed procedure with a BIC criterion represents a powerful and practical alternative to the original procedure. Our procedure performs well in a general framework and is simple to implement.
Background
Recent advances in largescale genomic technologies provide an unprecedented amount of data that offer new insights into the molecular portraits of diseases. This information enables to dissect a heterogeneous disease entity into more homogeneous subentities that can be relevant for clinical purposes.
This problem is particularly appealing in oncology where molecular classification of tumors, that are based on the status of specific targeted therapy, rely mainly upon a single molecular event but overlook tumoral genomic complexity. For most solid epithelial tumors, these genomic events are primarily DNA mutations that give selective growth advantages to tumor cells.
A classical example in nonsmallcell lung cancer (NSCLC) is the activating EGFR (epidermal growth factor receptor) mutation that predicts the sensitivity to EGFR tyrosine kinase inhibitors. EGFRmutant lung adenocarcinoma is nowadays almost considered as a distinct disease entity. Such is not the case for KRAS (Kirsten Rat Sarcoma Viral Oncogene Homolog gene) mutation that represents one of the most common mutations in NSCLC. With the exception of its wellknown mutually exclusive relationship with EGFR mutation, the clinical utility of KRAS mutation status has not been clearly demonstrated [1]. Moreover, it is still unclear whether subgroups exist within KRAS wildtype or KRAS mutated tumors. The identification of more homogeneous molecular subgroups with respect to KRAS mutational status may provide new genomic taxonomy of NSCLC tumors, that may help for the advancement of personalized medicine.
The aim of the clinical study that prompted this methodological work was to decipher heterogeneity of lung adenocarcinomas with respect to KRAS mutation status based upon wholegenome copynumber alterations. Copynumber alteration (CNA) is one of the main type of genomic alterations that is linked to genome instability and represents a key feature of human carcinomas [2]. In previous cancer studies, association between specific CNAs and point mutations have been reported such as, for example, the relationship between EGFR mutations and copygains of 7p12 (which harbors EGFR gene) in lung adenocarcinomas. However, few investigations have been performed for studying the relationships between KRAS mutation and CNAs.
Identifying homogeneous subgroups, with respect to a main factor, based on the complex interplay among genomic alterations is a difficult task that cannot be easily done with standard regression models.
In contrast, recursive partitioning (or treebased) methods such as CART (Classification And Regression Tree) [3] is a flexible and powerful alternative for exploring highorder interaction between explanatory variables. From a data mining perspective, the purpose of such approach is to decompose a data space recursively into smaller areas that are defined by the set of explanatory variables and treestructured. The hypothesis space is the set of all possible hyperrectangular areas. These areas are more homogeneous with respect to the main factor as compared to the whole population. The analysis of the patterns of these areas, that are defined by the explanatory variables, can provide meaningful biological insights. In the context of nonparametric statistical methods, random forests [4] is the classical extension to treebased methods with many available R packages (for a few: VarSelRF [5], SRF [6], RF [7]). As compared to treebased methods, a forest that consists of thousands of unpruned trees is more stable and well suited for prediction. However, random forests loose the easy interpretability of CART, which represents the key objective when dissecting the clinicobiological spectrum of clinical disease entities.
For clinical epidemiology studies, an important drawback of classical treebased methodology is that it does not provide a straightforward way of adjusting for confounding variables. In practice, confounding and explanatory variables are considered in the same way. Thus, the final tree is a mixture of confounder and explanatory variables lacking of clear interpretation and whose joint effects are distorted. This problem was of particular concern in our clinical studies since our series was composed of two different subpopulations (Asian and Caucasian patients). In lung adenocarcinomas, KRAS mutation is found in about one third of the tumours in Caucasian populations, as opposed to less than one tenth is Asian populations. Thus, in our study, it was mandatory to adjust for this confounding factor.
In a pioneering a work, Chen et al. [8] have introduced a new class of regression models, called partially linear treebased regression models (PLTR). This new framework has been proposed for genetic epidemiology studies in order to assess complex joint genegene and geneenvironment effects taking into account confounding variables. In practice, PLTR models represent a new class of semiparametric regression models that integrates the advantages of generalized linear regression and treestructure models. The linear part is used to model the main effects of confounder variables and the nonparametric tree part is used to capture the distributional shape of the data relying on the complex joint effects of multiple explanatory variables. In their article, Chen et al. have proposed a fourstep selection and testing procedure for identifying the optimal tree while adjusting for linear (on the generalized linear scale) confounding variables. This methodology has been recently extended for considering multivariate outcomes [9]. However, Chen’s et al. procedure heavily relies on resampling, which is computationally burdensome and not wellsuited to situations for which a large number of explanatory variables is expected. In the present work, we propose and evaluate an alternative procedure with different selection criteria, which considers separately the objectives of selection and testing.
We first describe the novel procedure with three different selection criteria. It corresponds to a modified PLTR procedure with four steps, of which the two first are common to the one proposed by Chen et al. A simulation study with different scenarios is presented that compares the power of the proposed procedure to the original PLTR strategy. The proposed procedure is used to decipher heterogeneity of lung adenocarcinomas, with respect to KRAS mutation, based on copynumber alterations.
Methods
In the following we present our novel procedure with different selection criteria. The first two steps are similar to those of the original PLTR procedure (Chen et al.) whereas the last two steps are new. The four steps are summarized in Figure 1 and presented in details below.
Denote Y the outcome of interest (or the main factor for the application considered in this work), X the confounding variables (to be modeled linearly), and Z the explanatory variables. The PLTR model is specified by:
where g(·) is a known link function (generalized linear model), F(T(Z)) is a vector of indicator variables representing the leaves of the tree T(Z).
Step 1: Construction of a maximal tree
The maximal tree is constructed as follows:

Initialization: fit the generalized linear model (GLM) $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\theta}^{\left(0\right)}+{\beta}^{\left(0\right)}$

Iterations: iterate the following steps starting with k = 1.

fit the tree part: construct a maximal tree model T^{(k)} based on Z, using X^{′}θ^{(k1)} as offset

fit the leaves of the tree: fit the GLM $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\beta}_{T}^{\left(k\right)}F\left({T}^{\left(k\right)}\left(\mathbf{Z}\right)\right)$ using X^{′}θ^{(k1)} as offset

fit the parametric part: fit the GLM $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\theta}^{\left(k\right)}$, with ${\beta}_{T}^{\left(k\right)}F\left({T}^{\left(k\right)}\left(\mathbf{Z}\right)\right)$ using as offset


Ending conditions: the algorithm stops when the estimates of θ stabilize within a prespecified range or after a prespecified number of iterations.
In the above procedure, an offset is a predictor variable included in the model with coefficient fixed equal to one.
In the construction of the tree, the goodness of a candidate split is assessed for each node by the deviance of a generalized linear model fitted in the node by considering X^{′}θ as the offset. More precisely, the goodness of a candidate split is the deviance of the parent node minus the sum of the deviance of the two child nodes. The recursive partitioning stops when the number of cases in each terminal node is smaller than a preassigned threshold.
Step 2: Construction of the sequence of nested subtrees
At the end of the previous step, the estimated tree T_{ R } is a maximal tree which generally overfits the data. The second step constructs a sequence of nested subtrees of T_{ R }.
The PLTR model (1) obtained from the previous step is
where T_{ R }(Z) represents the maximal tree at convergence, R being its size (number of terminal nodes or leaves).
Let $D\left(\mathbf{X};{\widehat{\theta}}_{0}\right)$ be the deviance computed under the null hypothesis
and $D\left(\mathbf{X},T\left(\mathbf{Z}\right);{\widehat{\theta}}_{T}\right)$ the deviance computed under the alternative hypothesis
with a tree T(Z) ⊆ T_{ R }(Z).
Let r ≤ R be the prespecified maximal size of subtrees to be considered. A sequence of nested candidates subtrees T_{2}(Z)⊂ ⋯ ⊂T_{ r }(Z) of T_{ R }(Z) is constructed as follows:

The procedure is forward with T_{1}(Z) representing the root of the tree T_{ R }(Z). Let $\left\{{T}_{j}^{m}\left(\mathbf{Z}\right),\phantom{\rule{0.3em}{0ex}}m=1,\dots ,{n}_{j}\right\}$ be the set of subtrees of T_{ R }(Z) with j leaves, such that for all m = 1, …, n_{ j }, T_{j1} is a subtree of ${T}_{j}^{m}$ : ${T}_{j1}\left(\mathbf{Z}\right)\subset {T}_{j}^{m}\left(\mathbf{Z}\right)$.

T_{ j }(Z) is the subtree of T_{ R }(Z) with j leaves such that T_{j1}(Z) ⊂ T_{ j }(Z), chosen as ${T}_{j}={T}_{j}^{{m}^{\ast}}$ with
$${m}^{\ast}=arg\underset{m=1,\dots ,{n}_{j}}{\text{max}}\left[D\left(X;{\widehat{\theta}}_{0}\right)\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}D\left(X,\underset{j}{\overset{m}{T}}\left(\mathbf{Z}\right);{\widehat{\theta}}_{\underset{j}{\overset{m}{T}}}\right)\right].$$
Step 3: tree selection
We select one of the trees of the sequence T_{1} ⊂ T_{2}⊂ ⋯ ⊂T_{ r }. For this selection step, we use either

penalized maximum likelihood methods: the Akaïke information criterion(AIC) [10] and the Bayesian information criterion (BIC) [11],

or a crossvalidation method.
The competing models to be considered are:
with F(T_{1}(Z)) ≡ 1 representing the situation where the tree is reduced to the root node, that is the null model (3).
BIC and AIC criteria
The BIC criterion for the model $\hat{{\mathcal{M}}_{j}}$ is
N being the sample size, δ_{ j } the number of free parameters involved in the model $\hat{{\mathcal{M}}_{j}}$ (δ_{ j } = dim(θ) + j) and $\mathcal{L}\left(\hat{{\mathcal{M}}_{j}}{\widehat{\theta}}_{{T}_{j}},{\widehat{\beta}}_{{T}_{j}}\right)$ the loglikelihood for the model $\hat{{\mathcal{M}}_{j}}$.
The model selected by the BIC criterion is $\hat{{\mathcal{M}}^{\text{bic}}}=\hat{{\mathcal{M}}_{{j}^{\phantom{\rule{0.3em}{0ex}}\text{bic}}}}$, where j^{bic} is defined by
We denote ${T}^{\text{bic}}={T}_{{j}^{\text{bic}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{bic}}}$.
The AIC criterion for the model $\hat{{\mathcal{M}}_{j}}$ is
with δ_{ j } = dim(θ) + j. The model selected by the AIC criterion is $\hat{{\mathcal{M}}^{\text{aic}}}=\hat{{\mathcal{M}}_{{j}^{\phantom{\rule{0.3em}{0ex}}\text{aic}}}}$ where j^{aic} is defined by
We denote ${T}^{\text{aic}}={T}_{{j}^{\phantom{\rule{0.3em}{0ex}}\text{aic}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{aic}}}$.
Crossvalidation criterion
As an alternative to the penalized maximum likelihood criteria presented above, we propose a crossvalidation procedure on the global PLTR model for selecting the optimal tree. The competing models $\hat{{\mathcal{M}}_{j}}$ are those defined in (5).
The original sample is randomly partitioned into K equal size subsamples:
For ℓ = 1, …, K, denotes by ${\mathcal{A}}_{\ell}=\bigcup _{m\ne \ell}{\mathcal{A}}_{m}$ the ℓ^{th} training set, while ${\mathcal{A}}_{\ell}$ is the corresponding validation set.
For each ℓ = 1, …, K, the following steps are performed:

fit the PLTR model (1) with the sample ${\mathcal{A}}_{\ell}$. At the end of step 1, the fitted PLTR model is
$$g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{{T}_{R}^{\ell}}+{\widehat{\beta}}_{{T}_{R}^{\ell}}F\left({T}_{R}^{\ell}\left(\mathbf{Z}\right)\right),$$where ${T}_{R}^{\ell}\left(\mathbf{Z}\right)$ represents the maximal tree at convergence.

Construct a sequence of r  1 nested subtrees ${T}_{2}^{\ell}\left(\mathbf{Z}\right),\dots ,{T}_{r}^{\ell}\left(\mathbf{Z}\right)$ as in step 2, and determine the underlying PLTR models sequence:
$$\phantom{\rule{12.0pt}{0ex}}\hat{{\mathcal{M}}_{j}^{\ell}}:g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{\mathbf{X}}^{\prime}{\widehat{\theta}}_{{T}_{j}^{\ell}}+{\widehat{\beta}}_{{T}_{j}^{\ell}}F\phantom{\rule{0.3em}{0ex}}\left(\phantom{\rule{0.3em}{0ex}}{T}_{j}^{\ell}\phantom{\rule{0.3em}{0ex}}\left(\mathbf{Z}\right)\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}j\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1,\dots ,r$$ 
For each j = 1, …, r, use the validation sample ${\mathcal{A}}_{\ell}$ to compute the crossvalidation error ${\text{CV}}_{j}^{\phantom{\rule{0.3em}{0ex}}\ell}$ of the model $\hat{{\mathcal{M}}_{j}^{\ell}}$.
The mean crossvalidation error is
The selected model is $\hat{{\mathcal{M}}^{\text{cv}}}=\hat{{\mathcal{M}}_{{j}^{\phantom{\rule{0.3em}{0ex}}\text{cv}}}}$ where j^{cv} is defined by
We denote ${T}^{\text{cv}}={T}_{{j}^{\text{cv}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{cv}}}$.
Step 4: Testing
To test the null hypothesis (3) versus the alternative (4), we use the statistic
As the model ${\hat{\mathcal{M}}}_{{\mathcal{H}}_{1}}$ is not obtained as a maximum likelihood estimate, this statistic does not follow the “naïve” χ^{2}(j  1) distribution where j is the number of leaves of the tree used in ${\hat{\mathcal{M}}}_{{\mathcal{H}}_{1}}$. Fan et al. [12] demonstrated that for a variety of models involving non parametric estimators, such generalized likelihood ratio statistics follow a scaled chisquared distribution. In our case, this implies that for a defined number of leaves j the distribution of Λ is a scaled chisquared distribution:
As the theoretical determination of m and b is cumbersome, Fan et al. propose to simulate the null distribution for estimating the constants m and b. In the following, we use the conditional parametric bootstrap procedure described below:

Generate a new outcome Y^{b} from the fitted model $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{0}+{\widehat{\beta}}_{0}$

Fit the complete model (2) with Y^{b} as the outcome (as in step 1)
$$g\left(\mathbb{E}\left({\mathbf{Y}}^{b}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{b}+{\widehat{\beta}}_{{T}_{R}^{b}}F\left({T}_{R}^{b}\left(\mathbf{Z}\right)\right)$$ 
Repeat the previous step until the size R is greater than j

Construct a sequence of candidate optimal subtrees $\left\{{T}_{k}^{b};k=2,\dots ,j\right\}$ as in step 2 (where we take r = j) and compute
$${\mathrm{\Lambda}}^{b}=2\mathcal{L}\left({\hat{\mathcal{M}}}_{j}\right)2\mathcal{L}\left({\hat{\mathcal{M}}}_{1}\right)$$ 
Repeat this process B times

Estimate b and m from the empirical moments of sample Λ^{1},…,Λ^{B}.
Once b and m have been estimated, a pvalue is calculated as $p=\mathbb{P}(X>m\mathrm{\Lambda})$ with X∼χ^{2}(b).
Results
Simulation study
Simulation protocol
A simulation study with a binary outcome (logit link) was conducted to evaluate and compare the performances of the proposed procedure (with the three selection criteria) to the original one proposed by Chen et al.
We have considered three different scenarios for which we used PLTR logistic model similar to the one considered in Chen et al.

In scenario 1, we simulated four Bernoulli variables G_{1}, G_{2}, G_{3}, G_{4} with probabilities 0.3, 0.25, 0.18 and 0.22 respectively, and an outcome Bernoulli variable, denoted Y, according to the following model (null hypothesis):
$$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y=1{G}_{1},{G}_{2},{G}_{3},{G}_{4}\right)=& \phantom{\rule{0.3em}{0ex}}{\beta}_{1}+\theta {G}_{1}\phantom{\rule{2em}{0ex}}\end{array}$$with β_{1} = log(0.61), θ = log(2). Here G_{1} is the confounding variable and G_{2}, G_{3}, G_{4} are the explanatory variables.

In scenario 2, we introduced ten additional Bernoulli variables G_{5}, …, G_{14} with probabilities p = 0.5. The Bernoulli variable Y is simulated according to the following model:
$$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y=1{G}_{1},\dots ,{G}_{14}\right)=& \phantom{\rule{0.3em}{0ex}}{\beta}_{1}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\theta {G}_{1}+{\beta}_{2}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{0.3em}{0ex}}+{\beta}_{3}{\mathbf{1}}_{{G}_{3}=1,{G}_{4}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{0.3em}{0ex}}+{\beta}_{4}{\mathbf{1}}_{{G}_{3}=1,{G}_{4}=1}.\phantom{\rule{2em}{0ex}}\end{array}$$with β_{1} = log(0.45), θ = log(2), β_{2} = log(3.5), β_{3} = log(2) and β_{4} = log(4.5). This scenario mimics joint effects of G_{2}, G_{3}, and G_{4}. The corresponding tree is displayed in Figure (2). The variables G_{5}, …, G_{14} are noise variables unrelated to Y.

In scenario 3, we considered a deeper tree with nonindependent explanatory variables G_{2}, …, G_{5}. The model is:
$$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1{G}_{1},\dots ,{G}_{15}\right)& \phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{\beta}_{1}\phantom{\rule{0.3em}{0ex}}+\theta {G}_{1}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{\beta}_{2}{\mathbf{1}}_{{G}_{2}=0,{G}_{3}=0,{G}_{4}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{3}{\mathbf{1}}_{{G}_{2}=0,{G}_{4}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{4}{\mathbf{1}}_{{G}_{2}=1,{G}_{5}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{5}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=0,{G}_{5}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{6}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=1,{G}_{5}=1}.\phantom{\rule{2em}{0ex}}\end{array}$$with β_{1} = log(1.8), θ = log(1.35), β_{2} = log(1.50), β_{3} = log(2), β_{4} = log(0.36), β_{5} = log(2.5) and β_{6} = log(0.36). The corresponding tree is displayed in Figure (3).
The Bernoulli variables G_{1}, …, G_{5} were generated from the following hierarchical model:
$$\begin{array}{ll}{G}_{0}\sim \mathcal{B}\left(0.2\right),\phantom{\rule{1em}{0ex}}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}({G}_{i}& =1)=\text{logit}(0.2)+{G}_{0}\phantom{\rule{2em}{0ex}}\\ \text{for}\phantom{\rule{.5em}{0ex}}i=2,\dots ,5\phantom{\rule{2em}{0ex}}\end{array}$$and
$$\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}({G}_{1}=1)=\text{logit}\left(0.2\right)+log\left(2\right){G}_{0}.$$Hence the variables G_{1}, …, G_{5} are marginally dependent. The variables G_{6}, …, G_{15} are considered as noise variables and are generated independently from a Bernoulli distribution with p=0.5.
For all scenarios the sample was set to n = 2000, and 300 datasets were simulated.
Simulation results
Figures 4 and 5 display the quantilequantile plots of the observed statistics for the “naïve” theoretical χ^{2} distribution with degrees of freedom equal to the number of leaves minus 1, and for the scaled χ^{2} distribution (equation 6). These figures show that the naïve distribution is inadequate; in contrast, the scaled distribution with estimated m and b fits well the empirical distribution.
We assess whether or not the trees selected by step 3 in the sequence of nested trees have the correct number of leaves. Under scenario 1 (null hypothesis), a root tree (one leaf) is expected. As seen in Table 1, the procedure with the BIC criterion (BIC) selects the root tree for 98.3% of the simulations, whereas the Chen et al. procedure (named BOOT hereafter) succeeds for only 91% of the simulations. For the 10fold cross validation procedure (CV), this proportion goes down to 84%, and for the AIC criterion (AIC) it is only 47.3%.
Under scenario 2, the correct number is of 4 leaves. As seen from Table 2, BIC has the best performance with 44.3% of selected trees with four leaves. Moreover, it exhibits the smallest dispersion around the target value. In contrast BOOT selects a tree with only 2 leaves for all the simulations. The performances of CV are inferior to those from BIC, and the dispersion is higher. Finally, AIC selects always trees with too many leaves. Similar results are obtained with scenario 3 (where the correct number of leaves is 6), with increased quality of CV (Table 3).
For the more complex scenario (scenario 3), we computed the tenfold crossvalidation generalization error for each of the 300 simulated data sets for BIC, AIC and CV criteria. The distribution of the generalization errors are displayed in Figure 6. CV and BIC have very similar errors, while AIC have a slightly increased error.
In summary, the procedure using the BIC criterion consistently outperforms the other procedures.
We also investigated which variables are present in the splits of the trees selected by BIC under scenario 2 and 3 (Tables 4 and 5). For scenario 2, the socalled correct variables are G_{2} to G_{4}, and in scenario 3, G_{2} to G_{5}. In both scenarios, we refer as incorrect variables the ten noise variables. Under scenario 2, in 18% of the selected trees, at least one noise variable appears; however, all three correct variables are present in 44.3% of the selected trees. In all cases, at least two correct variables were selected. The BIC procedure behaves better under scenario 3, with more than 99% of trees involving all four correct variables, while noise variables appear in 20% of the trees.
Analysis of lung adenocarcinomas
Description of the data
The dataset considered in this study is based on a FrenchSingaporean study (Merlion study) of 230 patients with lung adenocarcinomas [13]. The WesternEurope series (WE) included 139 tumors and the EastAsian series (EA) included 91 tumors. Clinical characteristics were detailed in a previous published article [13]. DNA was extracted using standard protocols and stored at 80°C until use. Copy number information was issued from Affymetrix GenomeWide Human SNP 6.0 arrays. Inferences about the copy number status of each genomic segment (copy loss, copy modal, copy gain) were obtained using the modified CGHmix classification procedure [14]. In order to summarize genomic information while keeping a sufficient level of resolution, copy number status was averaged (median estimate) over the 284 main cytogenetic bands. Information about KRAS mutation was extracted from the targeted mutation profiling performed using the Sequenom Massarray 4 platform (Sequenom, San Diego, CA). Here, the KRAS mutation status was defined as the presence or absence of any mutation within KRAS gene. In this dataset, we detected 54 KRAS mutations with 44 cases (31.6%) from the WE series and 10 cases (10.9%) from the EA series.
We compared the results obtained from the Chen et al. procedure [8] to those obtained by the novel procedure with the BIC criterion. The “dependent” variable was the KRAS mutation status (mutation/wildtype). The cohort status (WE/EA) was the confounding binary variable. The 284 copynumber alterations (trinomial variable: copyloss, modal, copy gain) were considered as candidate explanatory variables. Recursive partitioning stopped as soon as the number of cases in each terminal node was below fifteen.
Results
The iterative procedure converged after 15 iterations. The trees selected by Chen’s et al. method and by our procedure with the BIC criterion are displayed in Figure 7. Chen’s et al. procedure led to two leaves that separated tumors with and without copyloss of 3q23. The global adjusted pvalue associated with the selected tree is 0.0055. This model is a simple cohortadjusted logistic regression model with 3q23 copyloss as the unique explanatory variable.
Our procedure with the BIC criterion led to seven leaves. We identified:

(i)
two pure or nearly pure wildtype KRAS leaves (with 53 tumors and only one KRAS mutation) characterized by no 3q23 copyloss and a copyloss for either 9q12 or 3p11 cytoband,

(ii)
a leave with a low rate of KRASmutated tumors (8%) characterized by no copyloss of 3q23, 3p11, 9q12, 14q23 but a copy gain of 8q23 cytoband,

(iii)
a leave with a medium rate of KRASmutated tumor (15.5%) with no copyloss of 3q23, 3p11, 14q23 and no copygain of 8q23 and 1q32 cytoband,

(iv)
the three other leaves were heterogeneous with a mixture of wildtype and KRASmutated tumors (43.5%, 52.6%, 51.2%).
For the selected tree, the split variables are copynumber aberrations of 1q32, 3p11, 3q23, 8q23, 9p12, and 14q23. The global pvalue associated with this tree is 0.011 with a tenfold crossvalidation generalization error of 0.23. These results were obtained after adjustment for a significant cohort effect (O R = 0.266, 95% Confidence interval: [ 0.120.56]) with a higher rate of KRAS for the WE series as compared to the EA series.
We also compared the characteristics of the 53 tumors arising from the two pure or nearly pure wildtype KRAS leaves as compared to the other tumors. There was no significant difference between the two groups regarding the EGFR mutation status (p = 0.94). There was a significantly higher proportion of tumors with a large fraction of genome altered (more than 50%) in the pure or nearly pure wildtype KRAS group as compared to the other groups (p = 1.7 × 10^{8}).
Discussion
Nowadays, there is a growing interest in deciphering the genomic spectrum of clinical disease entities. In this context, recursive partitioning methodology provides a powerful data mining tool for exploring complex interplay between genomic factors, with respect to a main factor, that can reveal hidden genomic patterns. The requirement of adjusting for confounding factors led Chen et al. to develop a semiparametric regression model called PLTR together with an iterative algorithm procedure to select and test the “optimal” tree. A main drawback of the procedure is that it relies on a two levels permutation strategy which can become cumbersome and computationally expensive. In this work, we propose a novel procedure with different selection criteria. As shown from the simulation study, the proposed procedure with the BIC criterion achieves good power to detect the hidden structure as compared to Chen’s et al procedure.
When investigating patterns of copynumber alterations in lung adenocarcinomas, with respect to KRAS mutation status and after adjustment for a cohort effect, our proposed strategy highlights two subgroups of pure or nearly pure wildtype KRAS tumors. These subgroups correspond to 53 lung adenocarcinomas having no 3q23 copyloss but copyloss for either 9p12 or 3p11 cytoband. It is worth noting that the 3q23 area harbors the PI3KCB gene that participates in the PI3K (Phosphatidylinositol 3kinase) signaling pathway, wellknown to be deregulated in many human cancers. Moreover, PI3K is one of the main effector pathways of RAS, regulating cell growth, cell cycle and cell survival. These wildtype KRAS subgroups are not enriched for EGFR mutation (mutually exclusive with KRAS mutation) and are composed of tumors having a proportion of copynumber changes significantly higher than expected by chance. The genomic patterns of these two wildtype KRAS subgroup are worth further investigation.
Conclusion
We have proposed a novel recursive partitioning procedure for deciphering the genomic spectrum of clinical disease entities. The proposed procedure represents a powerful and practical alternative to the partially linear treebased regression model proposed by Chen et al. [8]. Our procedure performs well, is simple to implement, less computationally demanding and can be recommended for investigating new disease taxonomy. The procedure is implemented within an R package known under the acronym ‘GPLTR’ and will be available very soon on the CRAN site.
We plan to use this novel procedure to identify new subgroups of multiple sclerosis treated with interferonbeta, with regards to the occurrence of antidrugantibody response, while adjusting for cohort effect.
References
 1.
Roberts P, Stinchcombe T: Kras mutation: should we test for it, and does it matter?. J Clin Oncol. 2013, 31 (8): 111221. 10.1200/JCO.2012.43.0454.
 2.
Rajagopalan H, Lengauer C: Aneuploidy and cancer. Nature. 2004, 432: 338341. 10.1038/nature03099.
 3.
Breiman L, Olshen JH, Stone CJ: Classification and Regression Trees. 1984, Belmont, California: Wadsworth International Group
 4.
Breiman L: Random forest. Technical Report, Department of Statistics, University of California at Berkeley. 2002
 5.
DiazUriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7 (1): 113. 10.1186/1471210571.
 6.
Guan X, Chance MR, BarnholtzSloan JS: Splitting random forest (srf) for determining compact sets of genes that distinguish between cancer subtypes. J Clin Bioinform. 2012, 2 (1): 112.
 7.
Liaw A, Wiener M: Classification and regression by randomforest. R News. 2002, 2 (3): 1822.
 8.
Chen J, Yu K, Hsing A, Therneau TM: A partially linear treebased regression model for assessing complex joint genegene and geneenvironment effects. Genet Epidemiol. 2007, 31: 238251. 10.1002/gepi.20205.
 9.
Yu K, Wheeler W, Li Q, Bergen AW, Caporaso N, Chatterjee N, Chen J: A partially linear treebased regression model for multivariate outcomes. Biometrics. 2010, 66 (1): 8996. 10.1111/j.15410420.2009.01235.x.
 10.
Akaike H: A new look at the statistical model identification. IEEE Trans Automat Control. 1974, AC19: 716723.
 11.
Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461464. 10.1214/aos/1176344136.
 12.
Fan J, Zhang C, Zhang J: Generalized likelihood ratio statistics and wilks phenomenon. Ann Stat. 2001, 29 (1): 153193. 10.1214/aos/996986505.
 13.
Broët P, Dalmasso C, Tan E, Alifano M, Zhang S, Wu J, Lee M, Régnard J, Lim D, Koong H, Agasthian T, Miller L, Lim E, CamilleriBroët S, Tan P: Genomic profiles specific to patient ethnicity in lung adenocarcinoma. Clin Cancer Res. 2011, 17 (11): 354250. 10.1158/10780432.CCR102185.
 14.
Dalmasso C, Broët P: Detection of chromosomal abnormalities using high resolution arrays in clinical cancer research. J Biomed Inform. 2011, 44 (6): 936942. 10.1016/j.jbi.2011.06.003.
Acknowledgements
This research is supported in part by the IMIfunded ABIRISK project (http://www.abirisk.eu/).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
CM and WT implemented the proposed procedure. PB coordinated the study. All authors participated in the design of the procedure and wrote the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Recursive partitioning
 Treebased regression
 Lung cancer
 Disease taxonomy
 Genomic