### Datasets

The dataset comes from 366 arrays hybridized with DNA from subjects with epilepsy, autism, or other neurodevelopmental disorders (developmental delay/intellectual disability) examined at the IMC Cytogenetics Labs. Each experiment was performed on the 180 K custom whole–genome microarray with an exonic–coverage for over 1700 known and candidate genes for neurodevelopmental disorders [5].

Microarrays were prepared on Agilent platform, hybridized and scanned by Agilent scanner. We used Agilent Feature Extraction software with default settings, which performs back-ground subtraction, array spatial detrending, dye normalization and logratio calculations from each microarray [23, 24].

For further analysis we used outputted logratios – each sample consists of a set of ∼180 K logratio intensities mapped to loci in the reference genome hg18 human assembly.

FISH, Multiplex ligation-dependent probe amplification (MLPA), or Polymerase chain reaction (PCR) methods were used for experimental validation.

### Overview

The overall architecture of our proposed method consists of three main phases: segmentation call, rare CNVs filtering, and segment significance testing.

For the segmentation phase, we decided to use DNACopy algorithm, based on recursively applying a statistical test to detect significant CNV breakpoints [25]. The algorithm, often proposed as a standard approach, processes aCGH data obtained from only a single experiment but is highly efficient and enables to analyze a new sample within a few minutes.

During the second phase, we apply the robust outliers detection procedure to identify aberration associated segment corresponding to the potentially pathogenic changes. In this stage, we simultaneously process all accessible samples to strengthen the information about the rearrangement patterns. We analyze the matrix of logratios, wherein the columns correspond to probes and rows to samples. The logratio signals are transformed into rankings in each column separately; this eliminates the influence of low quality probes and experimenter’s bias. Then, we calculate the distances between vectors from contiguous ranges of probes (called *k*-mers, i.e. windows of size *k*), based on the *L*_{1} (Manhattan) metrics. We model the distribution of the mean distances to all other *k*-mers in each window separately. We classify given *k*-mer as an outlier when it belongs to the distribution tail. Outliers correspond to markers of significantly outstanding DNA rearrangements and are mapped to the pre-existing segmentation. In the last step, we filter out polymorphic segments. If a specific threshold (based on density of coverage by markers and absolute value of segment’s mean logratio) is met, a segment along the patient genome is reported and selected to validation.

During the validation phase, following the methodology from Koolen, et al, we conduct exhausting comparison with three main databases storing information about genomic rearrangements and diseases: International Standards for Cytogenomic Arrays database (ISCA), Genetic Association Database (GAD) and Database of Genomic Variants (DGV) [7, 20–22].

### Outstanding CNVs detection

Although logratio data is already normalized by microarray extraction software, we observe noisy patterns in it: wave bias and experimenter’s bias (Figure 1, also see Results and discussion). Wave bias has been documented in the literature before [19].

To overcome these two pertaining obstacles we propose an intuitive solution: the idea is to work with logratio signal relative to other samples, i.e. for any fixed probe to replace the logratios by their ranks in all samples. The highly beneficial effect of the algorithm is illustrated on Figure 1(a) and (b), which present the fragment of the genome with hybridization signal coded by logratios and their rankings, respectively. One can observe that both wave pattern (causing spurious segment calls) and disrupted probes are eliminated, while keeping the true positive segments (in this genome fragment one large deletion is visible).

Our procedure analyzes aCGH data from all samples (logratio matrix) to detect short fragments of *k* consecutive probes (*k*–mers) being the markers of rare CNVs.

The idea of markers is based on the definition of rare pathogenic CNVs, which are nearly absent in control population and present in 1% or less of affected individuals. Hence, we seek for markers in the set of *k*–mers for all samples (presented results were obtained for a parameter *k*=7). Outlier detection in high dimensional spaces is a non–trivial task. In our solution, we follow the recommendation from a survey of outlier detection methods by Gogoi, et al. to use a distance-based approach with a suitable choice of metrics [26].

We apply sliding window approach on a ranking transformed logratios matrix. For each window spanning the range of *k* columns, we calculate the distances between the *k*–mers from all samples. For each *k*–mer, we compare the average distance to all others in the same window. Then we approximate the distribution of average distances and classify the *k*–mer as a marker if it lies in a 1% tail of this distribution.

More formally, consider a logratio matrix

*L* and one of its

*k*-windows

${L}_{Q}^{S}$, containing logratio data coming from a set of patients

$S=\{1,\dots ,n\}$, and from consecutive probes from the set

*Q*={

*p*,…,

*p*+

*k*−1} (here probe ordering respects probes positions on the reference genome). The transformation of each of

*k* columns into ranks and division of resulting ranks by |

*S*|+1 yields

*pseudo–ranks* matrix

${R}_{Q}^{S}$ with elements:

${R}_{q}^{s}=\frac{\text{rank of}\phantom{\rule{1em}{0ex}}{L}_{q}^{s}\phantom{\rule{1em}{0ex}}\text{in}\phantom{\rule{1em}{0ex}}{L}_{q}^{S}}{\left|S\right|+1},\phantom{\rule{1em}{0ex}}s\in S,q\in Q$

Notice, that each *k*-mer (row of ${R}_{Q}^{S}$) belongs to *k*-dimensional hypercube, i.e. ${R}_{Q}^{s}\in {(0,1)}^{k}$. The marginal distributions of pseudo-ranks for each one of *k* dimensions can be modeled as uniform^{1}, while the join distribution of *k*-mers is not uniform on hypercube (0,1)^{
k
}, because of correlated logratios from consecutive probes (e.g. the analyzed window covers a region of rearrangements).

Distributions with uniform marginals on hypercube (0,1)^{
k
} are commonly described using copulas [27]. *C* is a *k*-dimensional copula if it is a joint cumulative distribution function of a *k*-dimensional random vector on (0,1)^{
k
} with uniform marginals.

Our method for discriminating outliers is based on a statistics computed for each

*k*-mer:

*mean* *L*_{
d
}*distance to other k-mers*. For

*s*∈

*S*,

*d*∈(0, inf]:

${\mu}^{d}\left(s\right)=\frac{1}{\left|S\right|}\sum _{j\in S}{\left(\sum _{l\in Q}{\left|{R}_{l}^{s}-{R}_{l}^{j}\right|}^{d}\right)}^{\frac{1}{d}}$

For Manhattan distance

*d*=1 and in the simplified case of one dimension

*k*=1 the value of the

*μ* statistics for a patient with pseudo-rank

*z*∈[ 0,1] converges with |

*S*|→

*∞* to:

${\mu}^{1}\left(z\right)={\int}_{0}^{1}\phantom{\rule{0.3em}{0ex}}\left|t-z\right|\mathit{\text{dt}}={z}^{2}+{(1-z)}^{2}$

For *k*>1 if we undertake the independence of pseudo-ranked columns the null distribution ${D}_{{\mu}^{1}}^{k}$ of *μ*^{1} can be computed as a sum of independent variables. This underlines the adequacy of statistics *μ*^{1} as it converges to the sum of squared euclidean distances from two extreme corners of hypercube: 0^{
k
} and 1^{
k
} (a *k*-mer in each of the corners has extreme ranks on every probe, see Additional file 1 for details).

On the other hand, the null hypothesis may assume a certain structure of column correlations, e.g. corresponding to a larger group of patients with CNV segments inside a particular window, and a null distribution may reflect that. First approach we’ve taken is to fit as a null distribution Beta(*α*,*β*) shifted to the appropriate interval (min(*μ*^{1}),max(*μ*^{1})). This outlier detection procedure is considered less conservative since Beta has a lighter tail than the ${D}_{{\mu}^{1}}^{k}$ for small *k* (see Additional file 1).

Second approach presupposes that the distribution of *k*-mers of pseudo-ranks is described by a certain copula *C*. Parameters of copula *C* are fitted for each window, the null distribution is obtained by integration of the *μ*^{1} statistics over copula *C* (see Additional file 1 for details). However, classical families of copulas (Gaussian, t-copula, Archimedean) are not suited to model multidimensional *k*-mers with asymmetric dimensional dependencies, a copulas mixture approach is more adequate [28]. Then, the mixture approach suffers from huge dimensionality – obtained solutions are only locally optimal, dependent on a mixture fitting starting point.

In either approach, *k*-mers with p-value less than 0.01 (suggested frequency of pathogenic CNVs) are selected as markers. Results presented in this paper originate from the first, Beta fit, approach.

Selected markers are lined up on the considered segmentation. We sieve out segments without any markers inside and sort segments that remain according to the density of coverage by markers (best scoring segments are most densely covered). We call the score assigned to reported segments *density score* in the sequel, as it corresponds to the percent of the segment covered by markers.

### Polymorphic regions filtering

Outlier statistic based method returns markers of rare CNVs, but also of some segments in highly polymorphic regions, where background distribution strongly diverges from the null distribution. Thus, from the set of outstanding segments covered by markers, we sieve out those corresponding to the non–pathogenic polymorphisms. To this aim we construct a so called polymorphic profile as follows.

The segmentation is mapped into probes, i.e. to each probe we assign the value of the mean logratio in the segment containing it. The signal is considered as not–noisy if its absolute value exceeds 0.07. This corresponds to log2ratio=0.24 which is a conservative (high) threshold for aberration detection. We count the number of signals above this threshold in each column *i* and if there are more than three (1% of 366 samples) we set the *i*-th coordinate in the polymorphic profile to 1, otherwise it is set to 0.

Next, we scan the profile vector and identify all *k*-mers (*k*=7) of consecutive ones. At the end, the set of outlier *k*-mers from the previous step is intersected with the full set of *k*-mers, that indicates the polymorphism. Outlier *k*-mers that overlap with polymorphic profile are excluded from further analysis.

### Validation

To validate segments selected as rare CNVs according to our density score we automate the process of a manual validation of segments based on UCSC [29], i.e. the protocol by which geneticians usually act. Lastly, we compare resulting sets of segments with the set produced manually by geneticists from IMC.

Manual validation by geneticists involves inspecting reported CNVs segments, overlaying them on UCSC tracks. This purposes to filter out known polymorphisms and, by interrogation of all known syndrome regions, to try to narrow down the segment set to only those clinically relevant. This step is followed by FISH or PCR confirmation of the CNVs existence in patient’s DNA [30, 31].

For the automated process we decided to focus on three main databases storing the information related to genomic variations and diseases resulting from it: ISCA, DGV and GAD.

ISCA is a group of clinical cytogenetics laboratories committed to improve the quality of patients care related to clinical genetic testing using aCGH [20]. ISCA database contain very high quality copy number data (i.e. deletions and duplications) from clinical laboratories. Currently, ISCA stores 1.7 Gb of CNVs data classified as pathogenic.

The objective of the DGV is to provide a comprehensive summary of structural variation in the human genome identified in healthy control samples [22]. Currently DGV is used to justify the rarity of identified rearrangement. However, this approach is criticized because DGV contain many false positives (e.g. small aberrations detected by poor–resolution technology are considered to be larger) [7]. On the other hand, its content is not representative for analysis of specific group of patients. Therefore, our procedure for polymorphisms filtering is based on in house database, as suggested [7].

The GAD is an archive of human genetic association studies of complex diseases and disorders allowing for rapid identification of medically relevant polymorphisms from the large volume of polymorphism data [21].

In our validation approach, we investigate the correlation of marker coverage density score for segments with the contents of described databases. For medically relevant CNVs, we expect insignificant intersection with DGV and a non–empty intersection with ISCA and/or GAD.

### Ethical approval

Informed consents approved by the Institutional Review Board of the Bioethics Committee at the IMC (12/2007) were obtained in all cases.