A novel treebased procedure for deciphering the genomic spectrum of clinical disease entities
 Cyprien Mbogning^{1, 2}Email author,
 Hervé Perdry^{2, 3},
 Wilson Toussile^{2, 3} and
 Philippe Broët^{1, 2, 3, 4}
DOI: 10.1186/2043911346
© Mbogning et al.; licensee BioMed Central Ltd. 2014
Received: 23 December 2013
Accepted: 8 April 2014
Published: 16 April 2014
Abstract
Background
Dissecting the genomic spectrum of clinical disease entities is a challenging task. Recursive partitioning (or classification trees) methods provide powerful tools for exploring complex interplay among genomic factors, with respect to a main factor, that can reveal hidden genomic patterns. To take confounding variables into account, the partially linear treebased regression (PLTR) model has been recently published. It combines regression models and treebased methodology. It is however computationally burdensome and not well suited for situations for which a large number of exploratory variables is expected.
Methods
We developed a novel procedure that represents an alternative to the original PLTR procedure, and considered different selection criteria. A simulation study with different scenarios has been performed to compare the performances of the proposed procedure to the original PLTR strategy.
Results
The proposed procedure with a Bayesian Information Criterion (BIC) achieved good performances to detect the hidden structure as compared to the original procedure. The novel procedure was used for analyzing patterns of copynumber alterations in lung adenocarcinomas, with respect to Kirsten Rat Sarcoma Viral Oncogene Homolog gene (KRAS) mutation status, while controlling for a cohort effect. Results highlight two subgroups of pure or nearly pure wildtype KRAS tumors with particular copynumber alteration patterns.
Conclusions
The proposed procedure with a BIC criterion represents a powerful and practical alternative to the original procedure. Our procedure performs well in a general framework and is simple to implement.
Keywords
Recursive partitioning Treebased regression Lung cancer Disease taxonomy GenomicBackground
Recent advances in largescale genomic technologies provide an unprecedented amount of data that offer new insights into the molecular portraits of diseases. This information enables to dissect a heterogeneous disease entity into more homogeneous subentities that can be relevant for clinical purposes.
This problem is particularly appealing in oncology where molecular classification of tumors, that are based on the status of specific targeted therapy, rely mainly upon a single molecular event but overlook tumoral genomic complexity. For most solid epithelial tumors, these genomic events are primarily DNA mutations that give selective growth advantages to tumor cells.
A classical example in nonsmallcell lung cancer (NSCLC) is the activating EGFR (epidermal growth factor receptor) mutation that predicts the sensitivity to EGFR tyrosine kinase inhibitors. EGFRmutant lung adenocarcinoma is nowadays almost considered as a distinct disease entity. Such is not the case for KRAS (Kirsten Rat Sarcoma Viral Oncogene Homolog gene) mutation that represents one of the most common mutations in NSCLC. With the exception of its wellknown mutually exclusive relationship with EGFR mutation, the clinical utility of KRAS mutation status has not been clearly demonstrated [1]. Moreover, it is still unclear whether subgroups exist within KRAS wildtype or KRAS mutated tumors. The identification of more homogeneous molecular subgroups with respect to KRAS mutational status may provide new genomic taxonomy of NSCLC tumors, that may help for the advancement of personalized medicine.
The aim of the clinical study that prompted this methodological work was to decipher heterogeneity of lung adenocarcinomas with respect to KRAS mutation status based upon wholegenome copynumber alterations. Copynumber alteration (CNA) is one of the main type of genomic alterations that is linked to genome instability and represents a key feature of human carcinomas [2]. In previous cancer studies, association between specific CNAs and point mutations have been reported such as, for example, the relationship between EGFR mutations and copygains of 7p12 (which harbors EGFR gene) in lung adenocarcinomas. However, few investigations have been performed for studying the relationships between KRAS mutation and CNAs.
Identifying homogeneous subgroups, with respect to a main factor, based on the complex interplay among genomic alterations is a difficult task that cannot be easily done with standard regression models.
In contrast, recursive partitioning (or treebased) methods such as CART (Classification And Regression Tree) [3] is a flexible and powerful alternative for exploring highorder interaction between explanatory variables. From a data mining perspective, the purpose of such approach is to decompose a data space recursively into smaller areas that are defined by the set of explanatory variables and treestructured. The hypothesis space is the set of all possible hyperrectangular areas. These areas are more homogeneous with respect to the main factor as compared to the whole population. The analysis of the patterns of these areas, that are defined by the explanatory variables, can provide meaningful biological insights. In the context of nonparametric statistical methods, random forests [4] is the classical extension to treebased methods with many available R packages (for a few: VarSelRF [5], SRF [6], RF [7]). As compared to treebased methods, a forest that consists of thousands of unpruned trees is more stable and well suited for prediction. However, random forests loose the easy interpretability of CART, which represents the key objective when dissecting the clinicobiological spectrum of clinical disease entities.
For clinical epidemiology studies, an important drawback of classical treebased methodology is that it does not provide a straightforward way of adjusting for confounding variables. In practice, confounding and explanatory variables are considered in the same way. Thus, the final tree is a mixture of confounder and explanatory variables lacking of clear interpretation and whose joint effects are distorted. This problem was of particular concern in our clinical studies since our series was composed of two different subpopulations (Asian and Caucasian patients). In lung adenocarcinomas, KRAS mutation is found in about one third of the tumours in Caucasian populations, as opposed to less than one tenth is Asian populations. Thus, in our study, it was mandatory to adjust for this confounding factor.
In a pioneering a work, Chen et al. [8] have introduced a new class of regression models, called partially linear treebased regression models (PLTR). This new framework has been proposed for genetic epidemiology studies in order to assess complex joint genegene and geneenvironment effects taking into account confounding variables. In practice, PLTR models represent a new class of semiparametric regression models that integrates the advantages of generalized linear regression and treestructure models. The linear part is used to model the main effects of confounder variables and the nonparametric tree part is used to capture the distributional shape of the data relying on the complex joint effects of multiple explanatory variables. In their article, Chen et al. have proposed a fourstep selection and testing procedure for identifying the optimal tree while adjusting for linear (on the generalized linear scale) confounding variables. This methodology has been recently extended for considering multivariate outcomes [9]. However, Chen’s et al. procedure heavily relies on resampling, which is computationally burdensome and not wellsuited to situations for which a large number of explanatory variables is expected. In the present work, we propose and evaluate an alternative procedure with different selection criteria, which considers separately the objectives of selection and testing.
We first describe the novel procedure with three different selection criteria. It corresponds to a modified PLTR procedure with four steps, of which the two first are common to the one proposed by Chen et al. A simulation study with different scenarios is presented that compares the power of the proposed procedure to the original PLTR strategy. The proposed procedure is used to decipher heterogeneity of lung adenocarcinomas, with respect to KRAS mutation, based on copynumber alterations.
Methods
where g(·) is a known link function (generalized linear model), F(T(Z)) is a vector of indicator variables representing the leaves of the tree T(Z).
Step 1: Construction of a maximal tree
The maximal tree is constructed as follows:

Initialization: fit the generalized linear model (GLM) $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\theta}^{\left(0\right)}+{\beta}^{\left(0\right)}$

Iterations: iterate the following steps starting with k = 1.

fit the tree part: construct a maximal tree model T^{(k)} based on Z, using X^{′}θ^{(k1)} as offset

fit the leaves of the tree: fit the GLM $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\beta}_{T}^{\left(k\right)}F\left({T}^{\left(k\right)}\left(\mathbf{Z}\right)\right)$ using X^{′}θ^{(k1)} as offset

fit the parametric part: fit the GLM $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\theta}^{\left(k\right)}$, with ${\beta}_{T}^{\left(k\right)}F\left({T}^{\left(k\right)}\left(\mathbf{Z}\right)\right)$ using as offset


Ending conditions: the algorithm stops when the estimates of θ stabilize within a prespecified range or after a prespecified number of iterations.
In the above procedure, an offset is a predictor variable included in the model with coefficient fixed equal to one.
In the construction of the tree, the goodness of a candidate split is assessed for each node by the deviance of a generalized linear model fitted in the node by considering X^{′}θ as the offset. More precisely, the goodness of a candidate split is the deviance of the parent node minus the sum of the deviance of the two child nodes. The recursive partitioning stops when the number of cases in each terminal node is smaller than a preassigned threshold.
Step 2: Construction of the sequence of nested subtrees
At the end of the previous step, the estimated tree T_{ R } is a maximal tree which generally overfits the data. The second step constructs a sequence of nested subtrees of T_{ R }.
where T_{ R }(Z) represents the maximal tree at convergence, R being its size (number of terminal nodes or leaves).
with a tree T(Z) ⊆ T_{ R }(Z).
Let r ≤ R be the prespecified maximal size of subtrees to be considered. A sequence of nested candidates subtrees T_{2}(Z)⊂ ⋯ ⊂T_{ r }(Z) of T_{ R }(Z) is constructed as follows:

The procedure is forward with T_{1}(Z) representing the root of the tree T_{ R }(Z). Let $\left\{{T}_{j}^{m}\left(\mathbf{Z}\right),\phantom{\rule{0.3em}{0ex}}m=1,\dots ,{n}_{j}\right\}$ be the set of subtrees of T_{ R }(Z) with j leaves, such that for all m = 1, …, n_{ j }, T_{j1} is a subtree of ${T}_{j}^{m}$ : ${T}_{j1}\left(\mathbf{Z}\right)\subset {T}_{j}^{m}\left(\mathbf{Z}\right)$.

T_{ j }(Z) is the subtree of T_{ R }(Z) with j leaves such that T_{j1}(Z) ⊂ T_{ j }(Z), chosen as ${T}_{j}={T}_{j}^{{m}^{\ast}}$ with${m}^{\ast}=arg\underset{m=1,\dots ,{n}_{j}}{\text{max}}\left[D\left(X;{\widehat{\theta}}_{0}\right)\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}D\left(X,\underset{j}{\overset{m}{T}}\left(\mathbf{Z}\right);{\widehat{\theta}}_{\underset{j}{\overset{m}{T}}}\right)\right].$
Step 3: tree selection
We select one of the trees of the sequence T_{1} ⊂ T_{2}⊂ ⋯ ⊂T_{ r }. For this selection step, we use either
with F(T_{1}(Z)) ≡ 1 representing the situation where the tree is reduced to the root node, that is the null model (3).
BIC and AIC criteria
N being the sample size, δ_{ j } the number of free parameters involved in the model $\hat{{\mathcal{M}}_{j}}$ (δ_{ j } = dim(θ) + j) and $\mathcal{L}\left(\hat{{\mathcal{M}}_{j}}{\widehat{\theta}}_{{T}_{j}},{\widehat{\beta}}_{{T}_{j}}\right)$ the loglikelihood for the model $\hat{{\mathcal{M}}_{j}}$.
We denote ${T}^{\text{bic}}={T}_{{j}^{\text{bic}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{bic}}}$.
We denote ${T}^{\text{aic}}={T}_{{j}^{\phantom{\rule{0.3em}{0ex}}\text{aic}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{aic}}}$.
Crossvalidation criterion
As an alternative to the penalized maximum likelihood criteria presented above, we propose a crossvalidation procedure on the global PLTR model for selecting the optimal tree. The competing models $\hat{{\mathcal{M}}_{j}}$ are those defined in (5).
For ℓ = 1, …, K, denotes by ${\mathcal{A}}_{\ell}=\bigcup _{m\ne \ell}{\mathcal{A}}_{m}$ the ℓ^{th} training set, while ${\mathcal{A}}_{\ell}$ is the corresponding validation set.
For each ℓ = 1, …, K, the following steps are performed:

fit the PLTR model (1) with the sample ${\mathcal{A}}_{\ell}$. At the end of step 1, the fitted PLTR model is$g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{{T}_{R}^{\ell}}+{\widehat{\beta}}_{{T}_{R}^{\ell}}F\left({T}_{R}^{\ell}\left(\mathbf{Z}\right)\right),$
where ${T}_{R}^{\ell}\left(\mathbf{Z}\right)$ represents the maximal tree at convergence.

Construct a sequence of r  1 nested subtrees ${T}_{2}^{\ell}\left(\mathbf{Z}\right),\dots ,{T}_{r}^{\ell}\left(\mathbf{Z}\right)$ as in step 2, and determine the underlying PLTR models sequence:$\phantom{\rule{12.0pt}{0ex}}\hat{{\mathcal{M}}_{j}^{\ell}}:g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X},\mathbf{Z}\right)\right)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{\mathbf{X}}^{\prime}{\widehat{\theta}}_{{T}_{j}^{\ell}}+{\widehat{\beta}}_{{T}_{j}^{\ell}}F\phantom{\rule{0.3em}{0ex}}\left(\phantom{\rule{0.3em}{0ex}}{T}_{j}^{\ell}\phantom{\rule{0.3em}{0ex}}\left(\mathbf{Z}\right)\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}j\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1,\dots ,r$

For each j = 1, …, r, use the validation sample ${\mathcal{A}}_{\ell}$ to compute the crossvalidation error ${\text{CV}}_{j}^{\phantom{\rule{0.3em}{0ex}}\ell}$ of the model $\hat{{\mathcal{M}}_{j}^{\ell}}$.
We denote ${T}^{\text{cv}}={T}_{{j}^{\text{cv}}}$ the tree used in the model $\hat{{\mathcal{M}}^{\text{cv}}}$.
Step 4: Testing
As the theoretical determination of m and b is cumbersome, Fan et al. propose to simulate the null distribution for estimating the constants m and b. In the following, we use the conditional parametric bootstrap procedure described below:

Generate a new outcome Y^{ b } from the fitted model $g\left(\mathbb{E}\left(\mathbf{Y}\mathbf{X}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{0}+{\widehat{\beta}}_{0}$

Fit the complete model (2) with Y^{ b } as the outcome (as in step 1)$g\left(\mathbb{E}\left({\mathbf{Y}}^{b}\mathbf{X},\mathbf{Z}\right)\right)={\mathbf{X}}^{\prime}{\widehat{\theta}}_{b}+{\widehat{\beta}}_{{T}_{R}^{b}}F\left({T}_{R}^{b}\left(\mathbf{Z}\right)\right)$

Repeat the previous step until the size R is greater than j

Construct a sequence of candidate optimal subtrees $\left\{{T}_{k}^{b};k=2,\dots ,j\right\}$ as in step 2 (where we take r = j) and compute${\mathrm{\Lambda}}^{b}=2\mathcal{L}\left({\hat{\mathcal{M}}}_{j}\right)2\mathcal{L}\left({\hat{\mathcal{M}}}_{1}\right)$

Repeat this process B times

Estimate b and m from the empirical moments of sample Λ^{1},…,Λ^{ B }.
Once b and m have been estimated, a pvalue is calculated as $p=\mathbb{P}(X>m\mathrm{\Lambda})$ with X∼χ^{2}(b).
Results
Simulation study
Simulation protocol
A simulation study with a binary outcome (logit link) was conducted to evaluate and compare the performances of the proposed procedure (with the three selection criteria) to the original one proposed by Chen et al.
We have considered three different scenarios for which we used PLTR logistic model similar to the one considered in Chen et al.

In scenario 1, we simulated four Bernoulli variables G_{1}, G_{2}, G_{3}, G_{4} with probabilities 0.3, 0.25, 0.18 and 0.22 respectively, and an outcome Bernoulli variable, denoted Y, according to the following model (null hypothesis):$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y=1{G}_{1},{G}_{2},{G}_{3},{G}_{4}\right)=& \phantom{\rule{0.3em}{0ex}}{\beta}_{1}+\theta {G}_{1}\phantom{\rule{2em}{0ex}}\end{array}$
with β_{1} = log(0.61), θ = log(2). Here G_{1} is the confounding variable and G_{2}, G_{3}, G_{4} are the explanatory variables.

In scenario 2, we introduced ten additional Bernoulli variables G_{5}, …, G_{14} with probabilities p = 0.5. The Bernoulli variable Y is simulated according to the following model:$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y=1{G}_{1},\dots ,{G}_{14}\right)=& \phantom{\rule{0.3em}{0ex}}{\beta}_{1}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\theta {G}_{1}+{\beta}_{2}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{0.3em}{0ex}}+{\beta}_{3}{\mathbf{1}}_{{G}_{3}=1,{G}_{4}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{0.3em}{0ex}}+{\beta}_{4}{\mathbf{1}}_{{G}_{3}=1,{G}_{4}=1}.\phantom{\rule{2em}{0ex}}\end{array}$
with β_{1} = log(0.45), θ = log(2), β_{2} = log(3.5), β_{3} = log(2) and β_{4} = log(4.5). This scenario mimics joint effects of G_{2}, G_{3}, and G_{4}. The corresponding tree is displayed in Figure (2). The variables G_{5}, …, G_{14} are noise variables unrelated to Y.

In scenario 3, we considered a deeper tree with nonindependent explanatory variables G_{2}, …, G_{5}. The model is:$\begin{array}{ll}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}\left(Y\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1{G}_{1},\dots ,{G}_{15}\right)& \phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{\beta}_{1}\phantom{\rule{0.3em}{0ex}}+\theta {G}_{1}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{\beta}_{2}{\mathbf{1}}_{{G}_{2}=0,{G}_{3}=0,{G}_{4}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{3}{\mathbf{1}}_{{G}_{2}=0,{G}_{4}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{4}{\mathbf{1}}_{{G}_{2}=1,{G}_{5}=0}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{5}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=0,{G}_{5}=1}\phantom{\rule{2em}{0ex}}\\ \phantom{\rule{1em}{0ex}}+{\beta}_{6}{\mathbf{1}}_{{G}_{2}=1,{G}_{3}=1,{G}_{5}=1}.\phantom{\rule{2em}{0ex}}\end{array}$
with β_{1} = log(1.8), θ = log(1.35), β_{2} = log(1.50), β_{3} = log(2), β_{4} = log(0.36), β_{5} = log(2.5) and β_{6} = log(0.36). The corresponding tree is displayed in Figure (3).
The Bernoulli variables G_{1}, …, G_{5} were generated from the following hierarchical model:$\begin{array}{ll}{G}_{0}\sim \mathcal{B}\left(0.2\right),\phantom{\rule{1em}{0ex}}\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}({G}_{i}& =1)=\text{logit}(0.2)+{G}_{0}\phantom{\rule{2em}{0ex}}\\ \text{for}\phantom{\rule{.5em}{0ex}}i=2,\dots ,5\phantom{\rule{2em}{0ex}}\end{array}$and$\text{logit}\phantom{\rule{0.3em}{0ex}}\mathbb{P}({G}_{1}=1)=\text{logit}\left(0.2\right)+log\left(2\right){G}_{0}.$Hence the variables G_{1}, …, G_{5} are marginally dependent. The variables G_{6}, …, G_{15} are considered as noise variables and are generated independently from a Bernoulli distribution with p=0.5.
For all scenarios the sample was set to n = 2000, and 300 datasets were simulated.
Simulation results
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 1
Leaves  1  2  3  4  5  6 

BOOT  273  8  8  7  1  3 
CV  252  0  18  10  10  10 
BIC  295  4  1  0  0  0 
AIC  142  64  54  30  5  5 
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 2
Leaves  1  2  3  4  5  6  7  8  9  10 

BOOT  0  300  0  0  0  0  0  0  0  0 
CV  0  18  83  61  36  32  21  19  13  17 
BIC  0  0  112  133  46  7  2  0  0  0 
AIC  0  0  0  0  0  0  3  8  24  265 
Number of trees by number of leaves, for the 300 trees selected by the different methods under scenario 3
Leaves  1  2  3  4  5  6  7  8  9  10 

BOOT  0  300  0  0  0  0  0  0  0  0 
CV  0  0  41  16  22  154  22  12  17  16 
BIC  0  0  0  1  89  162  36  9  3  0 
AIC  0  0  0  0  0  0  0  1  2  297 
In summary, the procedure using the BIC criterion consistently outperforms the other procedures.
Variables selected by the procedure using BIC criterion under scenario 2, with global percentages between brackets
Incorrect variables  

0  1  2  3  4  5  
Correct  0  0  0  0  0  0  0 
Variables  1  0  0  0  0  0  0 
2  134 (44.66%)  22 (7.33%)  10 (3.33%)  0  0  1 (0.33%)  
3  112 (37.33%)  19 (6.33%)  2 (0.66%)  0  0  0 
Variables selected by the procedure using BIC criterion under scenario 3, with global percentages between brackets
Incorrect variables  

0  1  2  3  4  5  
Correct  0  0  0  0  0  0  0 
variables  1  0  0  0  0  0  0 
2  0  0  0  0  0  0  
3  1 (0.33%)  1 (0.33%)  0  0  0  0  
4  239 (79.66%)  50 (16.66%)  8 (2.66%)  1 (0.33%)  0  0 
Analysis of lung adenocarcinomas
Description of the data
The dataset considered in this study is based on a FrenchSingaporean study (Merlion study) of 230 patients with lung adenocarcinomas [13]. The WesternEurope series (WE) included 139 tumors and the EastAsian series (EA) included 91 tumors. Clinical characteristics were detailed in a previous published article [13]. DNA was extracted using standard protocols and stored at 80°C until use. Copy number information was issued from Affymetrix GenomeWide Human SNP 6.0 arrays. Inferences about the copy number status of each genomic segment (copy loss, copy modal, copy gain) were obtained using the modified CGHmix classification procedure [14]. In order to summarize genomic information while keeping a sufficient level of resolution, copy number status was averaged (median estimate) over the 284 main cytogenetic bands. Information about KRAS mutation was extracted from the targeted mutation profiling performed using the Sequenom Massarray 4 platform (Sequenom, San Diego, CA). Here, the KRAS mutation status was defined as the presence or absence of any mutation within KRAS gene. In this dataset, we detected 54 KRAS mutations with 44 cases (31.6%) from the WE series and 10 cases (10.9%) from the EA series.
We compared the results obtained from the Chen et al. procedure [8] to those obtained by the novel procedure with the BIC criterion. The “dependent” variable was the KRAS mutation status (mutation/wildtype). The cohort status (WE/EA) was the confounding binary variable. The 284 copynumber alterations (trinomial variable: copyloss, modal, copy gain) were considered as candidate explanatory variables. Recursive partitioning stopped as soon as the number of cases in each terminal node was below fifteen.
Results
 (i)
two pure or nearly pure wildtype KRAS leaves (with 53 tumors and only one KRAS mutation) characterized by no 3q23 copyloss and a copyloss for either 9q12 or 3p11 cytoband,
 (ii)
a leave with a low rate of KRASmutated tumors (8%) characterized by no copyloss of 3q23, 3p11, 9q12, 14q23 but a copy gain of 8q23 cytoband,
 (iii)
a leave with a medium rate of KRASmutated tumor (15.5%) with no copyloss of 3q23, 3p11, 14q23 and no copygain of 8q23 and 1q32 cytoband,
 (iv)
the three other leaves were heterogeneous with a mixture of wildtype and KRASmutated tumors (43.5%, 52.6%, 51.2%).
For the selected tree, the split variables are copynumber aberrations of 1q32, 3p11, 3q23, 8q23, 9p12, and 14q23. The global pvalue associated with this tree is 0.011 with a tenfold crossvalidation generalization error of 0.23. These results were obtained after adjustment for a significant cohort effect (O R = 0.266, 95% Confidence interval: [ 0.120.56]) with a higher rate of KRAS for the WE series as compared to the EA series.
We also compared the characteristics of the 53 tumors arising from the two pure or nearly pure wildtype KRAS leaves as compared to the other tumors. There was no significant difference between the two groups regarding the EGFR mutation status (p = 0.94). There was a significantly higher proportion of tumors with a large fraction of genome altered (more than 50%) in the pure or nearly pure wildtype KRAS group as compared to the other groups (p = 1.7 × 10^{8}).
Discussion
Nowadays, there is a growing interest in deciphering the genomic spectrum of clinical disease entities. In this context, recursive partitioning methodology provides a powerful data mining tool for exploring complex interplay between genomic factors, with respect to a main factor, that can reveal hidden genomic patterns. The requirement of adjusting for confounding factors led Chen et al. to develop a semiparametric regression model called PLTR together with an iterative algorithm procedure to select and test the “optimal” tree. A main drawback of the procedure is that it relies on a two levels permutation strategy which can become cumbersome and computationally expensive. In this work, we propose a novel procedure with different selection criteria. As shown from the simulation study, the proposed procedure with the BIC criterion achieves good power to detect the hidden structure as compared to Chen’s et al procedure.
When investigating patterns of copynumber alterations in lung adenocarcinomas, with respect to KRAS mutation status and after adjustment for a cohort effect, our proposed strategy highlights two subgroups of pure or nearly pure wildtype KRAS tumors. These subgroups correspond to 53 lung adenocarcinomas having no 3q23 copyloss but copyloss for either 9p12 or 3p11 cytoband. It is worth noting that the 3q23 area harbors the PI3KCB gene that participates in the PI3K (Phosphatidylinositol 3kinase) signaling pathway, wellknown to be deregulated in many human cancers. Moreover, PI3K is one of the main effector pathways of RAS, regulating cell growth, cell cycle and cell survival. These wildtype KRAS subgroups are not enriched for EGFR mutation (mutually exclusive with KRAS mutation) and are composed of tumors having a proportion of copynumber changes significantly higher than expected by chance. The genomic patterns of these two wildtype KRAS subgroup are worth further investigation.
Conclusion
We have proposed a novel recursive partitioning procedure for deciphering the genomic spectrum of clinical disease entities. The proposed procedure represents a powerful and practical alternative to the partially linear treebased regression model proposed by Chen et al. [8]. Our procedure performs well, is simple to implement, less computationally demanding and can be recommended for investigating new disease taxonomy. The procedure is implemented within an R package known under the acronym ‘GPLTR’ and will be available very soon on the CRAN site.
We plan to use this novel procedure to identify new subgroups of multiple sclerosis treated with interferonbeta, with regards to the occurrence of antidrugantibody response, while adjusting for cohort effect.
Declarations
Acknowledgements
This research is supported in part by the IMIfunded ABIRISK project (http://www.abirisk.eu/).
Authors’ Affiliations
References
 Roberts P, Stinchcombe T: Kras mutation: should we test for it, and does it matter?. J Clin Oncol. 2013, 31 (8): 111221. 10.1200/JCO.2012.43.0454.View ArticlePubMedGoogle Scholar
 Rajagopalan H, Lengauer C: Aneuploidy and cancer. Nature. 2004, 432: 338341. 10.1038/nature03099.View ArticlePubMedGoogle Scholar
 Breiman L, Olshen JH, Stone CJ: Classification and Regression Trees. 1984, Belmont, California: Wadsworth International GroupGoogle Scholar
 Breiman L: Random forest. Technical Report, Department of Statistics, University of California at Berkeley. 2002
 DiazUriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7 (1): 113. 10.1186/1471210571.View ArticleGoogle Scholar
 Guan X, Chance MR, BarnholtzSloan JS: Splitting random forest (srf) for determining compact sets of genes that distinguish between cancer subtypes. J Clin Bioinform. 2012, 2 (1): 112.Google Scholar
 Liaw A, Wiener M: Classification and regression by randomforest. R News. 2002, 2 (3): 1822.Google Scholar
 Chen J, Yu K, Hsing A, Therneau TM: A partially linear treebased regression model for assessing complex joint genegene and geneenvironment effects. Genet Epidemiol. 2007, 31: 238251. 10.1002/gepi.20205.View ArticlePubMedGoogle Scholar
 Yu K, Wheeler W, Li Q, Bergen AW, Caporaso N, Chatterjee N, Chen J: A partially linear treebased regression model for multivariate outcomes. Biometrics. 2010, 66 (1): 8996. 10.1111/j.15410420.2009.01235.x.PubMed CentralView ArticlePubMedGoogle Scholar
 Akaike H: A new look at the statistical model identification. IEEE Trans Automat Control. 1974, AC19: 716723.View ArticleGoogle Scholar
 Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
 Fan J, Zhang C, Zhang J: Generalized likelihood ratio statistics and wilks phenomenon. Ann Stat. 2001, 29 (1): 153193. 10.1214/aos/996986505.View ArticleGoogle Scholar
 Broët P, Dalmasso C, Tan E, Alifano M, Zhang S, Wu J, Lee M, Régnard J, Lim D, Koong H, Agasthian T, Miller L, Lim E, CamilleriBroët S, Tan P: Genomic profiles specific to patient ethnicity in lung adenocarcinoma. Clin Cancer Res. 2011, 17 (11): 354250. 10.1158/10780432.CCR102185.View ArticlePubMedGoogle Scholar
 Dalmasso C, Broët P: Detection of chromosomal abnormalities using high resolution arrays in clinical cancer research. J Biomed Inform. 2011, 44 (6): 936942. 10.1016/j.jbi.2011.06.003.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.