Combined analysis of chromosomal instabilities and gene expression for colon cancer progression inference

Cava, Claudia; Zoppis, Italo; Gariboldi, Manuela; Castiglioni, Isabella; Mauri, Giancarlo; Antoniotti, Marco

doi:10.1186/2043-9113-4-2

Research
Open access
Published: 24 January 2014

Journal of Clinical Bioinformatics volume 4, Article number: 2 (2014)

Abstract

Background

Copy number alterations (CNAs) represent an important component of genetic variation. Such alterations are associated with certain types of cancer, including those of the pancreas, colon, and breast, among others. CNAs have been used as biomarkers for cancer prognosis in multiple studies, but few works report on the relation between CNAs and disease progression. Moreover, most studies do not consider the following two important issues. (I) The identification of CNAs in genes that are responsible for expression regulation is fundamental to defining the genetic events leading to malignant transformation and progression. (II) Most real domains are best described by structured data, where instances of multiple types are related to each other in complex ways.

Results

Our main interest is to check whether colorectal cancer (CRC) progression inference benefits from considering both (I) the expression levels of genes with CNAs, and (II) relationships (i.e., dissimilarities) between patients due to expression level differences of the altered genes. We first evaluate the accuracy of a state-of-the-art inference method (support vector machine) when subjects are represented only through sets of available attribute values (i.e., gene expression levels). Then we check whether the inference accuracy improves when explicitly exploiting the information mentioned above. Our results suggest that CRC progression inference improves when the combined data (i.e., CNA and expression level) and the considered dissimilarity measures are applied.

Conclusions

Through our approach, classification is intuitively appealing and can be conveniently obtained in the resulting dissimilarity spaces. Different public datasets from Gene Expression Omnibus (GEO) were used to validate the results.

Background

Colorectal cancer (CRC) is the third most common cancer worldwide. The life expectancy of individuals with CRC depends mainly on the clinical stage, which may be characterized, e.g., according to the following tumor progression (Dukes' stage classification) system [1].

Stage I: CRC is confined to the innermost lining of the colon or rectum, or has grown only slightly into the muscle layer;
Stage II: CRC has extended through the muscular wall of the colon but does not affect the lymph nodes;
Stage III: CRC has spread outside the colon to one or more lymph nodes;
Stage IV: CRC has spread outside the colon to other parts of the body, commonly the liver or the lungs.

Stage-I patients have a 5-year survival rate of approximately 93%, which decreases to approximately 80% for patients with stage II, 60% for patients with stage III, and 8% for patients with stage IV [2]. The development and progression of CRC (as for most other solid cancers) is a multi-step process that also leads to the accumulation of chromosomal instability (CIN) over the lifetime of a tumor. Three major forms of genetic instability in CRC have been described: microsatellite instability (MIN), epigenetic changes (such as DNA methylation), and chromosomal instability, which leads to gains and losses of chromosomal segments [3–5]. CINs include DNA copy number alterations (CNAs), i.e., regions of aberrantly increased or decreased DNA (see Figure 1). Such alterations ultimately lead to malignant transformation and progression [6].

The need to better understand tumorigenesis and its relationship with CNAs has led many studies to attack the problem from different perspectives, many of which have been enabled recently by an increasing and multifarious set of tools and techniques in cancer research [7]. For example, Leslie et al. [8] investigated the aberration frequency in colorectal neoplasia, providing significant evidence of both (aberration) gains at chromosomes 20q, 13q, 7p, and 8q and (aberration) losses at 18q, 17p, and 8p.

In contrast, Bomme et al. [9] showed the relationship of tumor progression and metastases with CNA positions over the chromosomes. They observed that chromosome 7 amplification is one of the earliest genetic abnormalities acquired during CRC progression. Moreover, Ghadimi et al. [10] reported the potential role of chromosome 8q amplification in the development of lymph node metastases.

Most studies concerning CNAs investigate the use of aberrations as biomarkers for cancer prognosis (e.g., [11, 12]), but few works report on the relationship of CNAs with the disease progression [13–17]. Moreover, most of these studies do not consider the following two important issues.

(I) The identification of CNAs in genes that are responsible for expression regulation is fundamental to defining the key genetic events leading to malignant transformation and disease progression. By combining gene expression and copy number data, these regulators can be revealed. Only a limited number of studies apply this approach, for instance in breast cancer prognosis [18, 19]. Other authors used high-resolution oligonucleotide comparative genomic hybridization arrays and, by matching gene expression array data, showed a correlation between DNA copy number alteration and mRNA levels [20].
(II) Most real domains are best described by structured data, where instances of multiple types are related to each other in complex ways. For example, scientific papers are related through citations and authors, web pages are interconnected by hyperlinks, and telephone accounts are linked by calls. Nevertheless, in clinical investigation, classification is generally obtained assuming that case or control subjects are independent and identically distributed (IID). Numerous algorithms have been designed to work under what we will call in this paper the “standard approach”, where instances (e.g., patients) can be represented as fixed-length vectors of attribute values (see [21] for a survey). Actually, the CNAs within a patient group might be related to each other, and this property in turn may change when the relationship is defined over different groups. Moreover, when the relationships are addressed through dissimilarities [22], the resulting patient representation (i.e., the dissimilarity representation) is intuitively appealing and is supported by the fact that classification (and clustering) methods can be suitably applied in the resulting “dissimilarity space” [22].

The main issue of our investigation is to check whether the accuracy of the CRC progression inference benefits when considering the following types of information.

(1) Expression levels of altered genes, and
(2) relationships (i.e., dissimilarities) among patients due to expression level differences of the altered genes.

In the first case, only the expression levels of altered genes are used with standard inference mechanisms (we call this the “combined approach”, COMB for short). In the second case, we define dissimilarities among patients due to differences among the COMB data associated with each subject, and evaluate the “inference accuracy” when using this new type of representation; we call this the “relational approach” (RA for short). Specifically, our inference is based on “control vs. case” classification tasks. In other words, given a patient x whose stage is stage(x), we evaluate the ability of an inference mechanism to classify that patient either in the same stage (i.e., stage(x)) or in a more advanced stage, say stage′ > stage(x). Our evaluation (provided through comparisons) is empirical: we first observe the accuracy of a state-of-the-art inference method (namely, the Support Vector Machine) in forecasting CRC stage progression when patients are represented only through the set of available attribute values given by the gene expression levels. As mentioned above, we call this the standard approach (SA for short), since it reflects a typical way of representing IID subjects. Then we check whether the inference accuracy improves when explicitly exploiting the information provided through COMB and RA, respectively.

In order to obtain the expression level of genes with CNAs, we first identify differentially expressed genes by evaluating their expression levels from different datasets (see below in the text). Similarly, altered genes (i.e., genes with amplification or deletion) are identified by analyzing their CNAs from different datasets. Then, by considering the results of both the gene expression analysis and the CNA analysis, we obtain up-regulated genes with CNA gains and down-regulated genes with CNA losses.

Moreover, in order to quantify relationships between patients which can express, as stated above, the CRC progression, we define a dissimilarity over both an “advanced-stage” patient group and a specific “representative” base group, e.g. patients with the lowest stage (which we will refer to as “prototype” group). As previously mentioned, the considered dissimilarities quantify, by construction, subject differences due to different expression levels of altered genes (as obtained via the previous analysis) belonging to each subject.

While in an SA subjects are discriminated on their own set of attribute values, in the dissimilarity-based classification considered here we employ pairwise comparisons between patients, i.e., an N×N dissimilarity matrix D(T,P). Each entry of D(T,P) is a dissimilarity value computed between a pair of patients; that is, each patient x within the group T is represented by a vector of dissimilarities D(x,P) to the patients of a representative (prototype) group P.

Dissimilarities have been used in pattern recognition for many years, leading to many different known algorithms and important questions. For example, the idea of “template matching” is based on dissimilarities: objects are given the same class label if their difference is sufficiently small [23]. This is identical to the nearest neighbor rule used in vector spaces [21]. Many procedures for cluster analysis also make use of dissimilarities instead of the standard feature space representation [24]. A use of dissimilarity measures to reconstruct dynamic temporal models of biological processes can be found in [25]. A detailed description, providing the mathematical foundation, designed procedures, and real-world examples for building pattern recognition systems based on the dissimilarity representation, may also be found in [22].

Materials and methods

The description of the material and methods we used in our study can be conveniently organized according to the type of analysis conducted, as listed hereafter.

1. Gene expression analysis.
2. Copy number analysis.
3. Combined gene expression and CNA analysis.
4. Dissimilarity-based representation.
5. Inference procedure.
6. Statistical evaluations.

Table 1 shows the classification tasks that we defined as the “drivers” of our study.

Table 1 Inference tasks

That is, the disease progression inference is based on control vs. case classification tasks. Note that, for each task, we used the patients with the lowest stage as the control group (e.g., stage II when considering stage-II vs. stage-III). In this work all control groups (i.e., tumor progression negatives) are labeled 0, while the remaining (i.e., positive) groups are labeled 1. Moreover, we point out that the dissimilarity-based representation is based on the work of Pekalska et al. [22] and is adapted here to obtain our results. For this reason, we will detail the description (i.e., formulation) of this representation.
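As an illustration of the labeling convention just described, the following minimal Python sketch builds one control vs. case task from a list of patient stages. The function name `make_task` and the toy stage list are assumptions introduced here for illustration; they are not part of the original workflow.

```python
def make_task(stages, control_stage, case_stage):
    """Return (indices, labels) for one control vs. case task.

    Patients in `control_stage` (the lower stage) get label 0, patients in
    `case_stage` get label 1; patients in any other stage are left out.
    """
    indices, labels = [], []
    for i, stage in enumerate(stages):
        if stage == control_stage:
            indices.append(i)
            labels.append(0)
        elif stage == case_stage:
            indices.append(i)
            labels.append(1)
    return indices, labels


# Hypothetical stage annotations for six patients.
toy_stages = ["II", "III", "IV", "II", "III", "II"]
task_indices, task_labels = make_task(toy_stages, "II", "III")
```

In this toy stage-II vs. stage-III task the stage-IV patient is dropped, and the remaining patients carry the labels 0 (control) or 1 (case).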

Gene expression analysis

In this phase, differentially expressed genes (up- or down-regulated) were selected by evaluating their expression levels on different datasets [26, 27]. For this, we used two public CRC microarray datasets from Gene Expression Omnibus (GEO) [28]: GSE27854 and GSE17536. From the first dataset three groups of patients were selected: 41 patients with stage II, 35 patients with stage III, and 23 with stage IV. Similarly, from the second dataset the following three groups of patients were selected: 57 patients with stage II, 57 with stage III, and 39 with stage IV.

Given any dataset and a specific task in Table 1, we say that a gene is differentially expressed for that dataset if it is up- (down-) expressed in the highest-stage patients in comparison to the lowest-stage patients of that dataset. When a gene is differentially expressed in both datasets (i.e., GSE27854 and GSE17536), we consider that gene as differentially expressed and use it in the combined data analysis reported in the following paragraphs. In other words, we use more than one dataset to obtain more evidence for a gene being up/down-regulated. This procedure is summarized as follows (we also represent this analysis in Figure 2):

Expression values from the Affymetrix Human Genome U133 Plus 2.0 array were calculated for both datasets. For this, we used the robust multi-array average (RMA) [29] method available in the R statistical software. Our aim was to select significant genes based on differential expression between patient stages.
RankProd [30] was applied to identify differentially expressed (up/down-regulated) probes based on the estimated percentage of false predictions (pfp). We fixed the significance cut-off using p-values by setting the (default) α parameter required by the software to 0.01, cf. [31]. More specifically, the RankProd analysis was used as a first step on both datasets. We thus obtained the DNA probes that are up/down-expressed in the highest-stage patients with respect to the lowest-stage patients.
Finally, up/down-expressed genes were identified by submitting the probe IDs (obtained through RankProd) to the NetAffx tool [32].
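The statistic underlying the RankProd step above can be sketched in a few lines. The following Python fragment is a simplified illustration, not the RankProd implementation: it assumes that a log fold change is already available for each gene in each of k comparisons, ranks the genes within each comparison, and takes the geometric mean of the ranks. The permutation step that RankProd uses to estimate the pfp is omitted, and the gene names and values are made up.

```python
import math


def rank_products(log_fold_changes, up=True):
    """Geometric mean of per-comparison ranks for every gene.

    log_fold_changes maps a gene name to a list of k log fold changes
    (higher stage vs. lower stage), one per comparison. With up=True the
    largest fold change gets rank 1, so a small rank product flags a gene
    that is consistently up-regulated; up=False flags down-regulation.
    """
    genes = list(log_fold_changes)
    k = len(log_fold_changes[genes[0]])
    log_rank_sums = {gene: 0.0 for gene in genes}
    for i in range(k):
        ordered = sorted(genes, key=lambda g: log_fold_changes[g][i], reverse=up)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sums[gene] += math.log(rank)
    # exp(mean(log r)) is the k-th root of the product of the ranks.
    return {gene: math.exp(total / k) for gene, total in log_rank_sums.items()}


# Made-up log fold changes for three placeholder genes in two comparisons.
toy_fold_changes = {
    "gene_a": [2.0, 1.5],
    "gene_b": [0.1, 0.3],
    "gene_c": [-1.0, -2.0],
}
toy_up = rank_products(toy_fold_changes, up=True)
toy_down = rank_products(toy_fold_changes, up=False)
```

Genes with the smallest rank product in `toy_up` are the candidates for up-regulation, and those with the smallest value in `toy_down` are the candidates for down-regulation; in practice a significance threshold would be applied to permutation-based estimates rather than to the raw statistic.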

Copy number analysis

As in the previous analysis, in this phase we use more than one dataset to obtain more supporting evidence for a gene amplification/deletion. To this aim, we used three public CRC microarray datasets from GEO: GSE16125, GSE11417 and GSE27910.

The first dataset was provided by the Fondazione IRCCS Istituto Nazionale dei Tumori (INT) and deposited on GEO (GSE16125) [6]. In this dataset, tissue specimens from 53 consecutive sporadic CRCs were obtained from previously untreated patients who underwent surgical resection at INT between 1998 and 2000. 51 DNA samples were hybridized to Affymetrix GeneChip® Human Mapping 250K NspI (SNP) arrays. Some samples were excluded due to poor-quality hybridizations or unknown tumor progression stage. Stage-I patients were also excluded because of the lack of instances in the considered data. The analyzed samples can be summarized as follows: 10 stage-II patients, 10 stage-III patients and 23 stage-IV patients.

The second dataset was the GEO CRC dataset GSE11417 [33]. Tumor samples and paired normal tissues were hybridized to Affymetrix Mapping 50K Xba 240 arrays. CNAs for each sample were obtained by comparing paired tumor and normal samples. The dataset is composed of 94 patients (42 with lymph node metastasis): 3 patients with stage 1 (Dukes system), 46 patients with stage 2, 37 patients with stage 3 and 8 patients with stage 4.

Further analysis was conducted on the GEO CRC GSE27910 [34]. We investigated 122 patients with CRC from Affymetrix DNA Sty array: 18 patients with stage 1, 42 with stage 2, 37 with stage 3 and 25 with stage 4.

We summarize the CNA analysis procedure (see Figure 3) as follows.

For each dataset, we applied CNAG [35] to identify both the sets of amplified and deleted genes.
Finally, we selected those genes whose alterations were verified on at least two input datasets. Such genes were considered as altered.
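The "at least two input datasets" rule above amounts to a simple vote across datasets. A minimal Python sketch follows, assuming that the per-dataset lists of amplified (or deleted) genes have already been produced by the CNA calling step; the function name and the placeholder gene names are hypothetical.

```python
from collections import Counter


def altered_in_at_least(gene_sets, min_datasets=2):
    """Genes that appear in at least `min_datasets` of the given gene sets."""
    # set(genes) guards against a gene being counted twice within one dataset.
    counts = Counter(gene for genes in gene_sets for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_datasets}


# Placeholder amplified-gene calls from three hypothetical datasets.
toy_amplified_calls = [
    {"gene_a", "gene_b"},
    {"gene_b", "gene_c"},
    {"gene_a", "gene_b", "gene_d"},
]
toy_amplified = altered_in_at_least(toy_amplified_calls, min_datasets=2)
```

The same function would be applied separately to the amplified and to the deleted gene lists, so that a gene supported by a gain in one dataset and a loss in another is not counted as confirmed.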

Combination of gene expression levels and copy number alterations

In this phase, we identified differentially expressed genes with CNA gains/losses (see Figure 4).

In particular, by considering the results of the gene expression analysis (i.e., up and down-regulated genes) and the CNA analysis (i.e., amplified and deleted genes), we selected the following genes.

Up-regulated genes with CNA gains (by selecting genes common to the set of up–regulated and the set of amplified genes).
Down-regulated genes with CNA losses (by selecting genes common to the set of down–regulated and the set of deleted genes).
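The two selections above are set intersections. The following Python sketch shows the idea under the assumption that the four gene lists are available as collections of gene identifiers; the function name, dictionary keys and gene names are placeholders introduced for illustration.

```python
def combine_expression_and_cna(up, down, amplified, deleted):
    """Keep only genes whose expression change agrees with their CNA."""
    return {
        "up_with_gain": set(up) & set(amplified),
        "down_with_loss": set(down) & set(deleted),
    }


# Placeholder gene lists; gene_c is down-regulated but amplified, so it is dropped.
toy_combined = combine_expression_and_cna(
    up={"gene_a", "gene_b"},
    down={"gene_c", "gene_d"},
    amplified={"gene_b", "gene_c"},
    deleted={"gene_d", "gene_e"},
)
```

Genes whose expression change and copy number change point in opposite directions, such as `gene_c` in the toy input, do not enter either set.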

Dissimilarity-based representation

In the previous sections, we selected differentially expressed genes with CNAs over the chromosomes. Here, we consider relationships among patients: i.e., we define the dissimilarity representation among patients.

As noted above, a typical way of representing instances (to be classified) is through the selection of a vector of available attribute values (e.g., gene expression levels). Our goal is to give a dissimilarity representation which can express, through a function $D(x, y)$, the dissimilarity between the expression levels of altered genes for the pair of patients $x$ and $y$. By extending $D(x, y)$ to all patient pairs, we can construct a dissimilarity matrix whose rows can also be obtained by representing any patient $x \in X$ through the mapping $\varphi(\cdot, P): X \to R^{n}$ defined as $\varphi(x, P) = [D(x, y_{1}), D(x, y_{2}), \dots, D(x, y_{n})]$, where $X$ and $P$ respectively denote a set of case/control patients and a set of $n$ prototype patients. Here the difference between $X$ and $P$ reflects the need to discriminate case/control patients in $X$ as compared to a common set of $n$ prototype patients in $P$. For instance, this function should be applied to discriminate a stage-III patient $x_{1} \in X$ from a stage-IV patient $x_{2} \in X$, mainly on the basis of the sequences of differences $\varphi(x_{1}, P) = [D(x_{1}, y_{1}), D(x_{1}, y_{2}), \dots, D(x_{1}, y_{n})]$ and $\varphi(x_{2}, P) = [D(x_{2}, y_{1}), D(x_{2}, y_{2}), \dots, D(x_{2}, y_{n})]$, which concern, respectively, (i) the dissimilarities between the patient $x_{1} \in X$ and the prototype patients $y_{i} \in P$, and (ii) the dissimilarities between the patient $x_{2} \in X$ and the prototype patients $y_{i} \in P$. The choice of a correct prototype set can be critical in this approach, and may change the results being investigated. Here we do not study the best possible prototype set; instead, we employ the group with the lowest stage. As our data do not provide a sufficient number of stage-I patients, we use the stage-II patients as the prototype set. Another critical aspect of this representation concerns the definition of a well-discriminating dissimilarity function $D$ for a non-trivial learning problem.
The following ordinary distances (from the R bioDistance package [36]) are considered: Euclidean distance, Manhattan distance, Kendall’s τ distance and Kullback-Leibler distance.

Using this formulation, classification (or clustering) algorithms can be applied to the resulting dissimilarity space ( $R^{n}$ ), in which each dimension expresses a dissimilarity with a prototype patient. Figure 5 gives a simple example of the representation for the Euclidean plane (n=2).
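To make the mapping into the dissimilarity space concrete, the sketch below computes the vector of dissimilarities from a patient to each prototype patient, for two of the distances listed above. It is a plain-Python illustration rather than the R code used in the study; the function names are assumptions, the expression vectors are made up, and the Kendall and Kullback-Leibler distances are omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (city-block) distance between two expression vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def dissimilarity_vector(x, prototypes, dist):
    """Represent patient x by [D(x, y_1), ..., D(x, y_n)] over the prototypes."""
    return [dist(x, y) for y in prototypes]


def dissimilarity_matrix(patients, prototypes, dist):
    """One row per patient, one column per prototype patient."""
    return [dissimilarity_vector(x, prototypes, dist) for x in patients]


# Made-up expression levels of two altered genes.
toy_prototypes = [[0.0, 0.0], [3.0, 4.0]]   # n = 2 prototype (e.g. stage-II) patients
toy_patients = [[3.0, 0.0], [6.0, 8.0]]     # patients to be classified
toy_euclidean = dissimilarity_matrix(toy_patients, toy_prototypes, euclidean)
toy_manhattan = dissimilarity_matrix(toy_patients, toy_prototypes, manhattan)
```

Each row of the resulting matrix is a point in a space with one dimension per prototype patient (a plane in this toy case with n = 2), and any standard classifier can then be trained on these rows instead of on the raw expression vectors.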

Inference procedure and validation datasets

In order to construct the disease progression inference on the basis of the classification tasks listed in Table 1, we designed a RapidMiner (RM) workflow (WF) [37]. RM is a software environment for rapid prototyping of machine learning and knowledge discovery (KD) processes. It is currently used for classification, clustering, and also data integration tasks, cf. [38]. RM is modeled by a complex nested chain of objects called operators. These operators implement several KD processes, such as data pre-processing, performance evaluation, learning algorithms, etc. The user is supported by graphical interfaces, where operators can be dropped as nodes onto the working pane and the data flow is specified by connecting the operator nodes. In other words, RM workflows represent conceptual sequences of operational steps used for specific data mining experiments. Figure 6 shows the RM workflow designed for our evaluation and inference procedures. Basically, it implements standard Support Vector Machine (SVM) algorithms to forecast the patient stage. SVMs are used as “black box” inference processes to score each input dataset according to the inference performance of the algorithm [39].

The main components of the WF, expressed as “RapidMiner operators”, encode the following processes:

Parameter optimization operator. Learning models often have many parameters, and it is not clear which values are best for the learning task at hand. In order to perform as well and as homogeneously as possible, we optimized the AUC index over a given space of feasible SVM learning parameters. Thus, for each input, the best SVM learning parameters are found over the same space of values. The Parameter Optimization operator allows us to iteratively cycle through its nested operators and change their parameters to optimize the performance of the learning scheme. In our case, the nested operator is a cross-validation process, which in turn trains and tests the SVM algorithm. In other words, we used this technique to find the best parameter combination for the SVM learning process.
Cross-validation operator. This operator encapsulates a 10-fold cross-validation process. Cross-validation is a two-step process: in the first step, a classifier is built describing a predetermined set of data classes. In the second step, the model (a trained SVM) is used to classify new examples; the generalization performance of the classifier is estimated using a new test set. The input data set S is split into subsets {S₁,S₂,…,S_k}; in our case k=10. The first inner operator (SVM) realizes the learning step described above. SVM is applied 10 times, using at each iteration i the set S_i as the test set and S∖S_i as the training set. The second inner operator (model applier) realizes the second step described above. The predictive accuracy (and the other performance measures) of the classifier is then estimated using the performance operator.
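The nesting of the two operators above (a parameter search wrapped around a k-fold cross-validation) can be sketched generically. The Python fragment below is an illustration under several assumptions and is not the RapidMiner workflow: a one-dimensional threshold classifier with a hypothetical `shift` parameter stands in for the SVM, accuracy stands in for the AUC, folds are formed by interleaving indices, and the data are made up.

```python
import itertools


def k_fold_indices(n, k):
    """Partition range(n) into k disjoint folds S_1..S_k by interleaving."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(X, y, k, fit, score):
    """Mean test score when each fold S_i is held out once."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)


def grid_search(X, y, k, grid, make_fit, score):
    """Return (best_params, best_score) over all parameter combinations."""
    best = None
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        result = cross_validate(X, y, k, make_fit(params), score)
        if best is None or result > best[1]:
            best = (params, result)
    return best


def make_threshold_fit(params):
    """Stand-in learner: class midpoint plus a tunable shift (not an SVM)."""
    def fit(X_train, y_train):
        class0 = [x for x, label in zip(X_train, y_train) if label == 0]
        class1 = [x for x, label in zip(X_train, y_train) if label == 1]
        midpoint = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
        return midpoint + params["shift"]
    return fit


def accuracy(threshold, X_test, y_test):
    """Fraction of test points on the correct side of the threshold."""
    hits = sum((x > threshold) == (label == 1) for x, label in zip(X_test, y_test))
    return hits / len(y_test)


# Made-up one-dimensional data: five controls (0) and five cases (1).
toy_X = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 10.0]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
best_params, best_score = grid_search(
    toy_X, toy_y, k=5, grid={"shift": [-4.0, 0.0, 4.0]},
    make_fit=make_threshold_fit, score=accuracy,
)
```

Because every parameter combination is scored with the same folds, the candidates are compared on equal terms, which mirrors the requirement that the best parameters be searched over the same space of values for each input.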

In this analysis we used the following (expression level) datasets:

GSE27854: previously described in Section Materials and methods, Subsection Gene expression analysis.
GSE17536: ibid.
GSE14333: Expression values from Affymetrix Human Genome U133 Plus 2.0 array were calculated using robust multi-array average (RMA) [29]. Three groups of patients were selected: 94 patients with stage II, 91 patients with stage III, and 61 with stage IV.

From these datasets, we obtained the following datatypes^a, according to the analysis provided in the previous paragraphs.

Standard data (referred to as SA datatype): from each dataset, the expression levels of selected up/down-regulated genes (provided through the gene expression analysis) are considered.
Combined data (referred to as COMB datatype): from each dataset, the expression levels of selected up-regulated genes with amplification and down-regulated genes with deletion (provided through the combined gene expression and CNA analysis) are considered.
Relational data (referred to as RA datatype): from each dataset, the dissimilarities (provided through the dissimilarity representation) between the expression levels of both the up-regulated genes with amplification and the down-regulated genes with deletion are considered.

In order to evaluate the inference performance of each datatype (thus providing an evaluation of the tumor progression inference when different types of information are used), we finally applied the RM-WF as reported above.

Statistical evaluation

In order to statistically evaluate the results of combined and/or relational information for this application we divided AUC values according to cutoff points (60% and 80%). We then evaluated two sets:

set S0: observed successes (AUC value >60% and AUC value >80%), and
set F0: observed failures (AUC value ≤60% and AUC value ≤80%), as reported in Figure 7.
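For readers who want to reproduce this kind of bookkeeping, the sketch below shows one common way to compute an AUC value from classifier scores (the fraction of case/control pairs ranked correctly, with ties counted as one half) and to split a list of AUC values at a cutoff. The function names and all numbers are illustrative assumptions, not results from the study.

```python
def auc(scores, labels):
    """Probability that a random case (label 1) outscores a random control (label 0)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def split_by_cutoff(auc_values, cutoff):
    """Successes are AUC values above the cutoff; failures are the rest."""
    successes = [a for a in auc_values if a > cutoff]
    failures = [a for a in auc_values if a <= cutoff]
    return successes, failures


# Made-up classifier scores and AUC values.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
toy_successes_60, toy_failures_60 = split_by_cutoff([0.55, 0.62, 0.81, 0.90], 0.60)
toy_successes_80, toy_failures_80 = split_by_cutoff([0.55, 0.62, 0.81, 0.90], 0.80)
```

Raising the cutoff from 60% to 80% moves AUC values between the two thresholds from the success set to the failure set, which is why the two cutoffs are evaluated separately.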

We then defined two other sets:

set Se: expected successes (AUC value ≥75%), and
set Fe: expected failures (AUC value <25%).

We compared observed (S0 and F0) and expected (Se and Fe) frequencies with the χ² “Goodness of Fit” test, in order to answer the question whether two models (e.g., COMB and NOCOMB) are different with respect to a successes/failures composition with a defined probability of success (75%) or failures (25%).

We finally computed the residuals for each comparison criterion (|Se−S0|, |Fe−F0|).
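A goodness-of-fit comparison of this kind can be sketched as follows. The fragment assumes, as one possible reading of the procedure above, that the expected counts are 75% and 25% of the total number of observations; the function name and the observed counts are hypothetical. With two categories the test has one degree of freedom, for which the p-value can be written with the complementary error function.

```python
import math


def chi_square_gof(observed_successes, observed_failures, p_success=0.75):
    """Chi-square goodness-of-fit test for a success/failure composition.

    Returns (statistic, p_value, residuals), where residuals holds the
    absolute differences |Se - S0| and |Fe - F0| between expected and
    observed counts.
    """
    n = observed_successes + observed_failures
    expected_successes = n * p_success
    expected_failures = n * (1.0 - p_success)
    statistic = (
        (observed_successes - expected_successes) ** 2 / expected_successes
        + (observed_failures - expected_failures) ** 2 / expected_failures
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    residuals = (
        abs(expected_successes - observed_successes),
        abs(expected_failures - observed_failures),
    )
    return statistic, p_value, residuals


# Hypothetical counts: 12 observed successes and 8 observed failures.
toy_statistic, toy_p_value, toy_residuals = chi_square_gof(12, 8)
```

A small p-value would indicate that the observed success/failure composition departs from the assumed 75%/25% split, and the residuals show in which direction and by how many observations.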

Ethical approval

This study was approved by the institutional review board of the Fondazione IRCCS Istituto Nazionale dei Tumori of Milan, Italy, and each patient provided written informed consent to donate the tissues left over after diagnostic procedures.