Multiple samples aCGH analysis for rare CNVs detection

Background DNA copy number variations (CNV) constitute an important source of genetic variability. The standard method used for CNV detection is array comparative genomic hybridization (aCGH). Results We propose a novel multiple sample aCGH analysis methodology aiming in rare CNVs detection. In contrast to the majority of previous approaches, which deal with cancer datasets, we focus on constitutional genomic abnormalities identified in a diverse spectrum of diseases in human. Our method is tested on exon targeted aCGH array of 366 patients affected with developmental delay/intellectual disability, epilepsy, or autism. The proposed algorithms can be applied as a post–processing filtering to any given segmentation method. Conclusions Thanks to the additional information obtained from multiple samples, we could efficiently detect significant segments corresponding to rare CNVs responsible for pathogenic changes. The robust statistical framework applied in our method enables to eliminate the influence of widespread technical artifact termed ‘waves’.

logratio signal relative to other samples, i.e. for any fixed probe to replace the logratios by their ranks in all samples. The highly beneficial effect of the algorithm is illustrated on Figure 1 (a) and (b), which present the fragment of the genome with hybridization signal coded by logratios and their rankings, respectively. One can observe that both wave pattern (causing spurious segment calls) and disrupted probes are eliminated, while keeping the true positive segments (in this genome fragment one large deletion is visible).
Our procedure analyzes aCGH data from all samples (logratio matrix) to detect short fragments of k consecutive probes (k-mers) being the markers of rare CNVs. The idea of markers is based on the definition of rare pathogenic CNVs, which are nearly absent in control population and present in 1% or less of affected individuals. Hence, we seek for markers in the set of k-mers for all samples (presented results were obtained for a parameter k = 7). Outlier detection in high dimensional spaces is a non-trivial task. In our solution, we follow the recommendation from a survey of outlier detection methods by Gogoi, et al. to use a distancebased approach with a suitable choice of metrics [2].
We apply sliding window approach on a ranking transformed logratios matrix. For each window spanning the range of k columns, we calculate the distances between the k-mers from all samples. For each k-mer, we compare the average distance to all others in the same window. Then we approximate the distribution of average distances and classify the k-mer as a marker if it lies in a 1% tail of this distribution.
More formally, consider a log 2 ratio matrix L and one of its k-windows L S Q , containing log 2 ratio data coming from a set of patients S = {1, . . . , n}, and from consecutive probes from the set Q = {p, . . . , p+k −1} (here probe ordering respects probes positions on the reference genome). The transformation of each of k columns into ranks and division of resulting ranks by |S| + 1 yields pseudo-ranks matrix R S Q with elements: Let us consider, that S is a patient group sampled from a large group of all patients S, and that rows of R S contained in [0, 1] k , are in fact pseudo-ranks in columns of S, respectively. Now, R s q , taken from a random patient s and probe q, has uniform distribution. Hence, However, observe, that if one or more patients in the sample exhibit CNV segment, columns R S q∈Q are correlated with each other, hence D p is not uniform on [0, 1] k . In statistics, distributions with uniform marginals on a hyper-cube [0, 1] k are commonly described using copulas. C is a k-dimensional copula if C is a joint cumulative distribution function of a k-dimensional random vector on the unit cube [0, 1] k with uniform marginals. Several families of copulas (Gaussian copulas, t-copulas, Archimedean copulas), and their properties, were thoroughly studied in literature.
Our method for discriminating outliers is based on a statistics computed for each of n patients: mean L q distance to other rank vectors.
For the purpose of this work we selected L 1 distance measure, both for simplicity and greater robustness than L 2 .
In the case of one dimension k = 1 and in the continuous limit |S| → inf, the value of the µ 1 statistics for a patient with pseudo-rank z ∈ [0, 1] is given by: 1 2 ], and symmetric with respect to 1 2 , z has uniform distribution. Substituting u = 2|z − 1 2 | we obtain the inverse cumulative distribution function, and further the cdf and the density of the null distribution for k = 1.
For k > 1 the value of the µ 1 statistics for a patient with pseudo-ranks z = (z 1 , . . . , z k ) ∈ [0, 1] k is given by: This signifies that the µ 1 statistics converges in limit |S| → ∞ to the sum of squared euclidean distances from two extreme corners of hypercube: 0 k and 1 k (a k-mer in each of these corners has extreme ranks on every probe).
For k > 1 if we undertake the independence of pseudo-ranked columns the null distribution D k µ 1 of µ 1 can be computed as a sum of independent variables. This underlines the adequacy of statistics µ 1 as it converges to the sum of squared euclidean distances from two extreme corners of hypercube: 0 k and 1 k (a k-mer in each of the corners has extreme ranks on every probe). Figure 2 presents µ 1 limit |S| → ∞ null distributions for various dimensions k for the dimension independence case.
On the other hand, the null hypothesis may assume a certain structure of column correlations, e.g. corresponding to a larger group of patients with CNV segments inside a particular window, and a null distribution may reflect that. First approach we've taken is to fit as a null distribution Beta(α, β) shifted to the appropriate interval (min(µ 1 ), max(µ 1 )). This outlier detection procedure is considered less conservative since Beta has a lighter tail than the D k µ 1 for small k. Second approach presupposes that the distribution of k-mers of pseudo-ranks is described by a certain copula C. In case the rank distribution is a certain copula D p = C, the c.d.f. of the null distribution F µ is estimated through approximation of the following integral, by either computing it numerically, or through sampling from the fitted copula C: Parameters of copula C are fitted for each window, the null distribution is obtained by integration of the µ 1 statistics over copula C. However, classical families of copulas (Gaussian, t-copula, Archimedean) are not suited to model multidimensional k-mers with asymmetric dimensional dependencies, a copulas mixture approach is more adequate [3]. Then, the mixture approach suffers from huge dimensionality -obtained solutions are only locally optimal, dependent on a mixture fitting starting point.
In either approach, k-mers with p-value less than 0.01 (suggested frequency of pathogenic CNVs) are selected as markers. Results presented in this paper originate from the first, Beta fit, approach.
Selected markers are lined up on the considered segmentation. We sieve out segments without any markers inside and sort segments that remain according to the density of coverage by markers (best scoring segments are most densely covered). We call the score assigned to reported segments density score in the sequel, as it corresponds to the percent of the segment covered by markers.