Biostatistics (2014), pp. 1–15. doi:10.1093/biostatistics/kxt060. Advance Access published January 28, 2014.

Parametric modeling of whole-genome sequencing data for CNV identification

SARAN VARDHANABHUTI
Harvard School of Public Health, 651 Huntington Avenue, Boston, MA 02115, USA

X. JESSIE JENG
Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

YINGHUA WU, HONGZHE LI*
Division of Biostatistics, University of Pennsylvania, Philadelphia, PA 19104, USA
[email protected]
*To whom correspondence should be addressed.

SUMMARY

Copy number variants (CNVs) constitute an important class of genetic variants in the human genome and have been shown to be associated with complex diseases. Whole-genome sequencing provides an unbiased way of identifying all the CNVs that an individual carries. In this paper, we consider parametric modeling of the read depth (RD) data from whole-genome sequencing with the aim of identifying the CNVs, including both Poisson and negative binomial modeling of such count data. We propose a unified approach that uses a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of Gaussian data. We apply the optimal sparse segment identification procedure to the transformed data in order to identify the CNV segments. This provides a computationally efficient approach for RD-based CNV identification. Simulation results show that this approach often results in a small number of false identifications of the CNVs and has similar or better performance in identifying the true CNVs when compared with other RD-based approaches. We demonstrate the methods using the trio data from the 1000 Genomes Project.

Keywords: Natural exponential family; Sparse segment identification; Variance stabilization.

1. INTRODUCTION

One important class of genome structural variation is copy number variants (CNVs), which include insertions and deletions with size greater than 1 kb. In various genetic studies, CNVs have been found to be associated with diseases such as cancer (Diskin and others, 2009). There are a number of experimental methods for CNV discovery, including array comparative genomic hybridization (Urban and others, 2006), single nucleotide polymorphism arrays (Redon and others, 2006), and sequencing-based approaches



(Mills and others, 2011). Each of these technologies has its advantages and disadvantages, and each requires different computational and statistical methods for identifying the CNVs. Most of the methods for CNV identification were developed for data that are assumed to be normally distributed. These include methods based on circular binary segmentation (Olshen and others, 2004), hidden Markov models (Wang and others, 2007), and scan statistics (Siegmund and others, 2010). Jeng and others (2010) developed an optimal sparse segment identification procedure for CNV identification for data with Gaussian noise.

In this paper, we focus on CNV identification based on whole-genome sequencing data generated from next generation sequencing (NGS) platforms. NGS enables researchers to obtain high-coverage short-read data for whole human genomes and provides better resolution for CNVs than intensity-based array data. In the analysis of NGS data, there are a few mapping features that are informative for detecting the presence of CNVs. One feature is the read depth (RD) along the genomic locations. The basic idea is that regions with more reads are likely to indicate copy number gains, while regions with fewer reads are indicative of copy number losses in the sequenced genome compared with the reference. Methods based on the RD data usually partition the genome into non-overlapping windows, and the read counts in the windows are taken as measures of RD. The approach is then to identify consecutive read-count windows that significantly deviate from the expected distribution. One basic assumption of RD methods is that the read density of a given genomic window should be proportional to the local copy number. Yoon and others (2009) developed an event-wise-testing (EWT) algorithm in which they summarized read counts in non-overlapping windows and converted them into Z-scores by subtracting the mean of all windows and dividing by the standard deviation. CNVs are identified by computing upper-tail and lower-tail probabilities under a normality assumption on the RD data; the windows are then selected based on the extreme values of these probabilities, controlling the genome-wide false-positive rate. Another approach, CNVnator (Abyzov and others, 2011), also partitions the genome into non-overlapping windows and uses a mean-shift approach for CNV discovery and genotyping. However, RD data are rarely normally distributed (Cai and others, 2012), and violation of the normality assumption can lead to false CNV identifications.

We consider parametric modeling of RD data from whole-genome sequencing for the problem of identifying sparse segments. One such model assumes a constant read sampling rate across the genome and a Poisson distribution for the read count data (Xie and Tammi, 2009). Alternatively, the negative binomial (NB) distribution can be used to model the RD data in CNV detection (Miller and others, 2011). We consider modeling the sequencing RD data using a natural exponential family with a quadratic variance function. This class of distributions includes the Poisson and NB distributions as special cases. The key to our approach is to use a mean-matching variance stabilizing transformation to turn the relatively complicated problem of sparse segment identification for count data into a sparse segment identification problem for a sequence of data with Gaussian noise.

We propose to apply the optimal sparse segment identification procedure of Jeng and others (2010) to the transformed data in order to identify the segments. This provides a computationally efficient approach to identify the CNVs using whole-genome sequencing data. The proposed method is similar in spirit to the approach of Cai and others (2012), who used a local median transformation to transform the count data into approximately normally distributed data. However, the method of Cai and others (2012) assumes additive and symmetric noise with a constant variance throughout the genome, an assumption that can be violated due to differences in guanine-cytosine (GC) content, repetitive DNA elements, and sequencing errors. The approach proposed in this paper allows the variances of the RDs to depend on their mean values and effectively eliminates the constant variance assumption. In addition, the new parametric transformation does not require symmetric error terms.

The rest of the paper is organized as follows. We present a natural exponential family with a quadratic variance function for modeling RD data and the mean-matching variance stabilizing transformation in Section 2, where we also present details for the Poisson and NB transformations. We then present the procedure of


applying the likelihood ratio selector (LRS) of Jeng and others (2010) for identifying the CNVs based on the transformed data in Section 3. We conduct simulations in Section 4 and present the results of an application to CEU trio data from the 1000 Genomes Project in Section 5. We conclude the paper with a brief discussion in Section 6.

2. PARAMETRIC MODELING OF RD DATA AND TRANSFORMATION

2.1 Natural exponential family with a quadratic variance function

Before introducing the parametric model for the RD data, we first briefly review the natural exponential family with a quadratic variance function (the NQ family). We say that X ∼ NQ(γ) if its density function can be written as $f(x \mid \eta) = e^{\eta x - \psi(\eta)} h(x)$, where η is called the natural parameter, and the mean γ and variance V of X satisfy $V(\gamma) = a_0 + a_1 \gamma + a_2 \gamma^2$. Both the NB and Poisson distributions belong to the NQ family.

Suppose that we have m i.i.d. observations X_1, ..., X_m from NQ(γ). Define $Y = \sum_{i=1}^{m} X_i$, and let G be a variance stabilizing transformation function such that $G'(\gamma) = V^{-1/2}(\gamma)$. The standard delta method and central limit theorem then yield
\[
\sqrt{m}\left\{ G\!\left(\frac{\sum_{i=1}^{m} X_i}{m}\right) - G(\gamma) \right\} \rightarrow_{L} N(0, 1) \quad \text{as } m \to \infty. \tag{2.1}
\]

The variance stabilizing properties can often be further improved by using a transformation of the form
\[
Y = G\!\left(\frac{\sum_{i=1}^{m} X_i + a}{m + b}\right), \tag{2.2}
\]
with suitable choices of the constants a and b. It is shown in Brown and others (2010) that the optimal choices of a and b can be achieved by matching the mean of Y and G(γ), which results in
\[
a = \tfrac{1}{4} a_1 \quad \text{and} \quad b = -\tfrac{1}{2} a_2. \tag{2.3}
\]
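For instance, for the Poisson distribution the variance function is V(γ) = γ, so that a_0 = a_2 = 0 and a_1 = 1, and (2.3) together with $G'(\gamma) = V^{-1/2}(\gamma)$ gives
\[
a = \tfrac{1}{4}, \qquad b = 0, \qquad G(\gamma) = 2\sqrt{\gamma},
\qquad Y = G\!\left(\frac{\sum_{i=1}^{m} X_i + 1/4}{m}\right) = 2\sqrt{\frac{\sum_{i=1}^{m} X_i + 1/4}{m}},
\]
which is exactly the Poisson transformation used later in Section 2.3.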

The following lemma summarizes part of the results in Brown and others (2010, Lemmas 1–3).

LEMMA 2.1 Assume that X_i, i = 1, ..., m, are i.i.d. NQ random variables with mean γ and variance $V = a_0 + a_1\gamma + a_2\gamma^2$. Define G and Y as in (2.1) and (2.2) with (a, b) chosen as in (2.3), and let $\epsilon = E(Y) - G(\gamma)$. Then Y can be written as
\[
Y = G(\gamma) + \epsilon + m^{-1/2} Z + \xi,
\]
where Z is a standard normal random variable, $|\epsilon| \leq c\, m^{-2}$, and ξ is a stochastically small random error with $E|\xi|^{l} \leq C_l\, m^{-l}$ for any integer l ≥ 1 and some constant C_l > 0.

Lemma 2.1 says that, after the mean-matching variance stabilization transformation, the transformed data Y are approximately normal.

2.2 NQ model for RD data and transformation

Suppose that the genome of a given person is sequenced and the short reads are mapped to the reference genome. Let X_1, ..., X_n be the long sequence of observed RD counts at genomic locations i = 1, ..., n. Suppose that there are q = q_n CNVs, with q possibly increasing with n, and let I = {I_1, ..., I_q} be the collection of the corresponding true CNV segments. We assume that the number, locations, and mean values of the CNV segments are unknown. We aim to model the data X_1, ..., X_n parametrically using a natural exponential family distribution with a quadratic variance function and to develop methods for identifying the CNV segments I based on a variance stabilizing transformation. We particularly consider the Poisson and NB distributions for the RD data.

For RD data along the chromosome, X_1, ..., X_n, we assume that
\[
X_i \sim \mathrm{NQ}(\gamma_i), \qquad \gamma_i = \mu_0 + \sum_{j=1}^{q} \mu_j\, 1_{(i \in I_j)}, \tag{2.4}
\]
where μ_0 is the baseline mean RD that depends on the sequencing depth in the normal genomic regions, and μ_j is the mean RD change that corresponds to the jth CNV region, where μ_j > 0 implies a duplication and μ_j < 0 implies a deletion. We assume that all the parameters μ_0, q, I_1, ..., I_q, μ_1, ..., μ_q are unknown.

For the sequencing-based CNV identification problem, n is usually of very high dimension, and the methods developed for Gaussian data cannot be directly applied. We therefore propose the following data transformation. We first equally divide the n observations into T = T_n groups with m = m_n observations in each group. Define the set of indices in the kth group as J_k = {i : (k − 1)m + 1 ≤ i ≤ km}. We then generate the transformed dataset as
\[
Y_k = G\!\left(\frac{\sum_{i \in J_k} X_i + a}{m + b}\right), \qquad 1 \leq k \leq T. \tag{2.5}
\]
By Lemma 2.1, Y_k has the following distribution:
\[
Y_k = G(\theta_k) + \epsilon_k + m^{-1/2} Z_k + \xi_k,
\]
where
\[
\theta_k \begin{cases}
= \mu_0 + \mu_j, & J_k \subseteq I_j \text{ for some } I_j, \\
\in [\mu_0,\, \mu_0 + \mu_j], & J_k \cap I_j \neq \emptyset \text{ for some } I_j \text{ and } J_k \nsubseteq I_j, \\
= \mu_0, & \text{otherwise.}
\end{cases} \tag{2.6}
\]
These results indicate that the transformed data Y_k follow approximately a normal distribution with variance 1/m and with mean depending on whether the transformed bins sit within CNVs, overlap with parts of CNVs, or lie outside CNVs entirely. We can therefore apply the LRS of Jeng and others (2010) to Y_k, taking the baseline distribution to be N(G(μ_0), 1/m). The optimality of LRS for identifying short CNV segments in Gaussian data has been established in Jeng and others (2010). Note that the function G depends on the specific distribution of the original RD data.
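To make the binning and transformation step in (2.5) concrete, the following Python sketch (not from the original paper; the function name and the use of numpy are our own choices) groups the RD counts into bins of size m and applies a user-supplied variance stabilizing function G:

```python
import numpy as np

def bin_and_transform(x, m, G, a, b):
    """Group the RD counts x into T consecutive bins of size m and apply the
    mean-matching variance stabilizing transformation (2.5):
    Y_k = G((sum_{i in J_k} x_i + a) / (m + b))."""
    x = np.asarray(x, dtype=float)
    T = len(x) // m                      # number of complete bins
    bin_sums = x[:T * m].reshape(T, m).sum(axis=1)
    return G((bin_sums + a) / (m + b))

# Example with the Poisson version of G (G(theta) = 2*sqrt(theta), a = 1/4, b = 0):
rng = np.random.default_rng(0)
x = rng.poisson(lam=10, size=4_000_000)      # simulated baseline RD counts
y = bin_and_transform(x, m=400, G=lambda t: 2.0 * np.sqrt(t), a=0.25, b=0.0)
print(y.shape, round(y.std(), 3))            # std close to 1/sqrt(m) = 0.05
```

Under the model above, the resulting Y_k are approximately N(G(θ_k), 1/m), which is what the standard-deviation check in the example illustrates for the baseline case.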



2.3 Poisson and NB models for RD data

To model the RD data using a Poisson distribution, we assume that the RD over the n nucleotide positions, X_1, ..., X_n, follows
\[
X_i \sim \text{Poisson}(\lambda_i), \qquad \lambda_i = \lambda_0 + \sum_{j=1}^{q} d_j\, 1_{(i \in I_j)},
\]
where λ_0 is the baseline rate parameter that depends on the RD in the normal genomic regions, and d_j is the rate parameter change that corresponds to the jth CNV region, where d_j > 0 implies a duplication and d_j < 0 implies a deletion. Under the Poisson model, $G(\theta) = 2\sqrt{\theta}$, a = 1/4, b = 0, and the data transformation (2.5) becomes
\[
Y_k = 2\sqrt{\frac{\sum_{i \in J_k} X_i + 1/4}{m}}, \qquad 1 \leq k \leq T. \tag{2.7}
\]

We can also assume that the RD data X_1, ..., X_n follow an NB distribution,
\[
X_i \sim \text{Negative Binomial}(r, p_i), \qquad p_i = p_0 + \sum_{j=1}^{q} d_j\, 1_{(i \in I_j)}, \tag{2.8}
\]
where the parameters r and p_i are related to the mean E[X_i] = μ_i and variance Var[X_i] = σ_i^2 of the NB random variable through
\[
\mu_i = \frac{r p_i}{1 - p_i}, \qquad \sigma_i^2 = \frac{\mu_i^2}{r} + \mu_i.
\]
The NB distribution allows us to model the frequently observed over-dispersion of the RD data. Under the NB model, $G(\theta) = 2\sqrt{r}\,\ln\!\big(\sqrt{\theta} + \sqrt{r + \theta}\big)$, a = 1/4, and b = −1/(2r), and the data transformation (2.5) becomes
\[
Y_k = 2\sqrt{r}\,\ln\!\left(\sqrt{\frac{\sum_{i \in J_k} X_i + 1/4}{mr - 1/2}} + \sqrt{1 + \frac{\sum_{i \in J_k} X_i + 1/4}{mr - 1/2}}\right), \qquad 1 \leq k \leq T. \tag{2.9}
\]
For the NB transformation, note that r in (2.9) is unknown and needs to be estimated from the observed data X_i, i = 1, ..., n. Specifically, we can estimate r and p_0 using the moment estimator by solving the following equations:
\[
\hat{\mu}_0 = r p_0/(1 - p_0), \qquad \hat{\sigma}_0^2 = \hat{\mu}_0^2/r + \hat{\mu}_0,
\]
where $\hat{\mu}_0 = \text{median}(X_i, i = 1, \ldots, n)$ is the median and $\hat{\sigma}_0 = \text{MAD}(X_1, \ldots, X_n)$ is the median absolute deviation, a robust estimate of the standard deviation. Using the general results in Section 2.1, the NB transformed data (2.9) have the following distribution:
\[
Y_k = 2\sqrt{r}\,\ln\!\big(\sqrt{\theta_k} + \sqrt{r + \theta_k}\big) + \epsilon_k + m^{-1/2} Z_k + \xi_k,
\]


where
\[
\theta_k \begin{cases}
= r(p_0 + d_j)/(1 - p_0 - d_j), & J_k \subseteq I_j \text{ for some } I_j, \\
\in [\,r p_0/(1 - p_0),\; r(p_0 + d_j)/(1 - p_0 - d_j)\,], & J_k \cap I_j \neq \emptyset \text{ for some } I_j \text{ and } J_k \nsubseteq I_j, \\
= r p_0/(1 - p_0), & \text{otherwise,}
\end{cases}
\]
and ε_k and ξ_k are stochastically small and Z_k ∼ N(0, 1).
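To make the two transformations concrete, here is a minimal Python sketch (illustrative only; the function names are ours, and the MAD-to-standard-deviation scaling constant 1.4826 is an assumption, since the text only states that the MAD is used as a robust estimate of the standard deviation) implementing (2.7), (2.9), and the moment estimation of r and p_0:

```python
import numpy as np

def estimate_nb_params(x):
    """Moment estimates of (r, p0) from the RD counts using the median and
    the MAD, as described in Section 2.3. The 1.4826 factor converting MAD
    to a standard-deviation estimate is an assumption."""
    x = np.asarray(x, dtype=float)
    mu0 = np.median(x)
    sigma0 = 1.4826 * np.median(np.abs(x - mu0))   # robust SD estimate
    r = mu0 ** 2 / max(sigma0 ** 2 - mu0, 1e-8)    # from sigma0^2 = mu0^2/r + mu0
    p0 = mu0 / (r + mu0)                           # from mu0 = r*p0/(1 - p0)
    return r, p0

def poisson_transform(bin_sums, m):
    """Mean-matching VST for Poisson RD data, equation (2.7)."""
    s = np.asarray(bin_sums, dtype=float)
    return 2.0 * np.sqrt((s + 0.25) / m)

def nb_transform(bin_sums, m, r):
    """Mean-matching VST for NB RD data, equation (2.9)."""
    s = np.asarray(bin_sums, dtype=float)
    u = (s + 0.25) / (m * r - 0.5)
    return 2.0 * np.sqrt(r) * np.log(np.sqrt(u) + np.sqrt(1.0 + u))
```

In practice, r and p_0 would be estimated once from the genome-wide counts before transforming the binned sums obtained as in Section 2.2.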

3. LRS, DATA PROCESSING AND COMPUTATIONAL COMPLEXITY

After the data transformation, the transformed data Y_1, ..., Y_T can be well approximated by N(G(θ_k), 1/m), k = 1, ..., T, where G and θ_k are defined as in (2.1) and (2.6) for the general NQ family. The baseline of G(θ_k) is G(μ_0), where μ_0, as defined in (2.4), is unknown. We can estimate G(μ_0) by G(μ̂_0) using the sample mean or median of the Y_k, given that the CNV segments are sparse. We then apply LRS to Z_k = Y_k − G(μ̂_0) with window size L and threshold λ_n, where L and λ_n are the two parameters of the LRS algorithm. Specifically, the LRS procedure scans the genome with intervals of length at most L and, for each interval Ĩ, calculates the likelihood ratio statistic
\[
Z(\tilde I) = \sum_{k \in \tilde I} Z_k \Big/ \sqrt{|\tilde I|}.
\]
The algorithm then finds all the local maxima among the intervals whose likelihood ratio statistics pass the threshold λ_n. As shown in Jeng and others (2010), the criterion for the selection of L is s̄ ≤ L < d and log L = o(log n), where s̄ is the longest length of the CNV segments and d is the shortest distance between two adjacent segments. Such a condition is easy to satisfy since CNVs are relatively short and sparse. The threshold parameter λ_n, on the other hand, is set as $\lambda_n = \sqrt{2 \log n}/\sqrt{m}$. An analysis similar to that of Cai and others (2012) implies that the genome-wide error rate can be well controlled by this threshold. LRS selects a set of intervals based on the transformed data; these intervals are then transformed back to the original base-pair scale to identify the genomic locations of the CNVs.

After the parametric transformation of the data for fixed-length intervals, our algorithm performs CNV detection based on n/m observations, which greatly increases the computational efficiency, with a computational complexity of O(n/m × L). In real applications to whole-genome sequencing data, n is close to 3 billion base pairs. We choose m = 400 and L = 150. Since CNVs are usually longer than 1200 bps, choosing m = 400 allows the CNVs in the transformed data to span at least three observations. The choice L = 150 assumes that the maximum CNV based on the pre-processed data is 400 × 150 = 60 000 base pairs; L = 150 is also chosen for computational reasons. If a true CNV is longer than the maximum allowable interval length, it is often divided into several contiguous segments; a simple post-processing step can then merge these segments into longer ones.

In the analysis of real datasets and in our software package, RSICNV (https://github.com/yhwu/rsicnv), we perform additional data processing steps to eliminate possible artifacts due to local wave effects and to mapping difficulty in repetitive genomic regions. First, the RD data are GC-content adjusted using the same procedure as in CNVnator and EWT (Yoon and others, 2009; Abyzov and others, 2011) before applying our method. This helps to reduce possible false-positive CNVs due to local waves. Specifically, for each nucleotide position, we create a bin of 100 bps centered at that position and count the Gs and Cs in the bin. The average RD is then calculated for each specific value of the GC count. For each bin, the RD signal is multiplied by a weight defined as the global RD average divided by the average RD over all bins with the same GC count, and is then rounded to the closest integer. Secondly, we filter out CNVs for which over 50% of the covering reads have mapping quality scores of zero. This helps to reduce false-positive CNVs caused by reads mapped to repetitive regions. Such quality score-based filtering was also recommended in several RD segmentation methods (Abyzov and others, 2011; Szatkiewicz and others, 2013).
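A simplified sketch of the GC adjustment just described is given below (illustrative Python; it assumes that the per-position RD values and the GC counts of the 100-bp windows centered at each position have already been computed):

```python
import numpy as np

def gc_adjust(rd, gc_count):
    """Adjust RD values for GC content: each value is multiplied by
    (global mean RD) / (mean RD over all positions with the same GC count)
    and rounded to the nearest integer, as described in Section 3."""
    rd = np.asarray(rd, dtype=float)
    gc_count = np.asarray(gc_count)
    global_mean = rd.mean()
    adjusted = rd.copy()
    for g in np.unique(gc_count):
        mask = gc_count == g
        local_mean = rd[mask].mean()
        if local_mean > 0:
            adjusted[mask] = rd[mask] * (global_mean / local_mean)
    return np.rint(adjusted).astype(int)
```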


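Returning to the scanning step described at the beginning of this section, the following Python sketch illustrates the LRS selection on the centered transformed data Z_k (this is an illustration rather than the authors' implementation; in particular, the local-maximum step is simplified to a greedy selection of non-overlapping intervals):

```python
import numpy as np

def lrs_scan(z, m, L, n):
    """Scan all intervals of length <= L over z_k = Y_k - G(mu0_hat),
    compute Z(I) = sum_{k in I} z_k / sqrt(|I|), and keep non-overlapping
    intervals whose |statistic| exceeds lambda_n = sqrt(2*log(n))/sqrt(m)."""
    z = np.asarray(z, dtype=float)
    T = len(z)
    lam = np.sqrt(2.0 * np.log(n)) / np.sqrt(m)
    csum = np.concatenate(([0.0], np.cumsum(z)))
    candidates = []
    for start in range(T):
        for length in range(1, min(L, T - start) + 1):
            stat = (csum[start + length] - csum[start]) / np.sqrt(length)
            if abs(stat) > lam:
                candidates.append((abs(stat), start, start + length))
    # greedy substitute for the local-maximum step: take the strongest
    # candidate, drop every candidate overlapping it, and repeat
    selected = []
    for stat, lo, hi in sorted(candidates, reverse=True):
        if all(hi <= s_lo or lo >= s_hi for _, s_lo, s_hi in selected):
            selected.append((stat, lo, hi))
    # map bin indices back to base-pair coordinates
    return [(lo * m, hi * m) for _, lo, hi in sorted(selected, key=lambda t: t[1])]
```

The double loop over starting positions and interval lengths makes the cost of the statistic computation O(T × L) = O(n/m × L), consistent with the complexity stated above.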

4. SIMULATION STUDIES

4.1 A comparison of transformations for Poisson and NB data

We generated data from Poisson and NB distributions and then applied the transformations to test the accuracy of identification of the CNV regions. We set the sample size at n = 1 × 10^6 and the number of segments at |I| = 3. We simulated datasets under three scenarios using different bin sizes m (m = 400, 100, and 20) and different segment lengths |I_1|, |I_2|, and |I_3|. After the respective transformations, we applied the LRS algorithm to the transformed data. As in Jeng and others (2010), the identification accuracy of LRS is measured by two quantities, D_j and #O, where D_j measures how well the signal segment I_j is estimated and #O counts the number of over-selections. In other words, for each CNV segment I_j, define $D_j = \min_{\hat I \in \hat{\mathcal{I}}} D(\hat I, I_j)$, where $D(\hat I, I)$, the dissimilarity between any pair of $\hat I \in \hat{\mathcal{I}}$ and $I \in \mathcal{I}$, is defined as
\[
D(\hat I, I) = 1 - \frac{|\hat I \cap I|}{\sqrt{|\hat I|\,|I|}}.
\]
It is clear that 0 ≤ D(Î, I) ≤ 1, with D(Î, I) = 1 indicating disjointness and D(Î, I) = 0 indicating complete identity. The simulations are repeated 100 times to calculate D_j and #O. The medians of D_1, ..., D_q and #O are reported in the tables with estimated standard errors. To estimate the standard errors of the medians, we generate 500 bootstrap samples out of the 100 replication results and calculate a median for each bootstrap sample. The estimated standard error is the standard deviation of the 500 bootstrap medians.

Table 1 shows the comparisons between the two transformations when the data are generated from the Poisson and NB distributions, respectively. For data generated from the Poisson distribution, both methods performed well, and the over-selection errors are well controlled with a median value of 0 for all models considered. For data generated from the NB distribution, the performance of the two methods is almost identical, except that in one scenario LRS was able to identify a relatively small segment (two transformed binned regions) using the NB transformation but not the Poisson transformation. For both methods, the accuracy of the identified CNV regions, as measured by D_j, is best when the CNV segment sizes are large and the bin size m is large. These results are expected since the transformation based on the NB assumption is more flexible than that based on the Poisson assumption and is robust even when the data are generated from the Poisson distribution.

4.2 A comparison with local median transformation

In the second set of simulations, we compare the performance of LRS using two different transformations: the local median transformation and the parametric transformation based on the NB distribution. We extended the simulations to include different means and variances for the baseline and CNV regions. The results are shown in Table 2. In the majority of scenarios, the local median and the parametric (NB) transformations performed quite similarly; however, in a few cases, particularly when m = 100, there are a few over-selections on average for the local median transformation. In general, when the simulated data have high coverage (A1: μ = 40), the accuracy of the identified CNVs is quite similar between the local median and parametric NB transformations. For low-coverage simulated data (B1: μ = 10), the identified CNV segments based on the parametric (NB) transformation tend to have higher accuracy. As the bin size gets smaller and the CNV segments get shorter, it becomes harder for both methods to identify the CNVs accurately, particularly for low coverage (B1), as in the case of m = 20 in Table 2.
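The two accuracy measures can be computed directly from interval end points; the short Python sketch below is illustrative (here an over-selection is taken to be an identified segment that is disjoint from every true segment, which is one natural reading of #O):

```python
import numpy as np

def dissimilarity(a, b):
    """D(A, B) = 1 - |A ∩ B| / sqrt(|A| |B|) for half-open index intervals
    given as (start, end) tuples."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return 1.0 - overlap / np.sqrt((a[1] - a[0]) * (b[1] - b[0]))

def accuracy_measures(identified, truth):
    """D_j = min over identified segments of D(I_hat, I_j) for each true
    segment I_j; the over-selection count is the number of identified
    segments disjoint from every true segment."""
    D = [min(dissimilarity(ihat, ij) for ihat in identified) if identified else 1.0
         for ij in truth]
    n_over = sum(all(dissimilarity(ihat, ij) == 1.0 for ij in truth)
                 for ihat in identified)
    return D, n_over
```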


Table 1. Simulation comparisons of parametric transformations based on Poisson (P) and NB distributions for different bin sizes m and segment lengths |I1|, |I2|, and |I3|

                       Poisson data                                   NB data
              D1             D2             D3             D1             D2             D3

Long CNVs: |I1| = 1200, |I2| = 4000, |I3| = 8000 (m = 400)
  P     0.098 (0.08)   0.025 (0.003)  0.013 (0.001)  0.098 (0.08)   0.025 (0.003)  0.013 (0.001)
  NB    0.088 (0.010)  0.026 (0.003)  0.014 (0.002)  0.088 (0.010)  0.026 (0.003)  0.014 (0.002)

Medium CNVs: |I1| = 300, |I2| = 500, |I3| = 1000 (m = 100)
  P     0.107 (0.007)  0.048 (0.005)  0.025 (0.002)  0.107 (0.007)  0.048 (0.005)  0.025 (0.002)
  NB    0.083 (0.007)  0.054 (0.005)  0.026 (0.003)  0.083 (0.007)  0.054 (0.005)  0.026 (0.003)

Short CNVs: |I1| = 20, |I2| = 40, |I3| = 100 (m = 20)
  P     0.293 (0.006)  0.150 (0.015)  0.050 (0.005)  0.293 (0.003)  0.175 (0.012)  0.065 (0.011)
  NB    1.000 (0.000)  1.000 (0.182)  0.070 (0.010)  1.000 (0.000)  0.350 (0.288)  0.070 (0.011)

D1, D2, and D3 are the average dissimilarity measurements based on 100 replications, with the standard errors given in the parentheses. Baseline non-CNV regions of the Poisson data are generated from a Poisson distribution with λ_i = 10, and CNV regions from λ_i = 5 for i ∈ I_j, j = 1, 2, 3. Baseline non-CNV regions of the NB data are generated from an NB distribution with μ = 10, σ² = 20, and CNV regions from μ = 6.7 and σ² = 11.1.

4.3 Whole-genome simulation and comparison with other RD-based methods

To further demonstrate the practical applicability of the proposed methods, we also evaluate their performance based on simulated sequence reads that incorporate the CNVs published by the 1000 Genomes Project (Mills and others, 2011). Our simulation follows a simple pipeline: 40X whole-genome sequence reads are simulated and mapped to the reference genome using the Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009); the CNVs are then called using our algorithm and compared with the "true" CNV set. The set of true CNVs is taken from the "gold standard SV set" for the HapMap individual NA12878 as reported by Mills and others (2011). We first convert the coordinates from hg18 to hg19 using liftOver and then remove CNVs with break points on different chromosomes, as well as CNVs that overlap as a result of the conversion. The final set includes a total of 547 deletion and duplication CNVs that are longer than 1000 bases.

We use the short read simulator program WGSIM (Li and Durbin, 2009) to simulate the reads on two sets of chromosomes modified from the hg19 reference. One set of chromosomes is the same as the reference except that the bases in the deletion regions are removed; the other set has the bases repeated in the duplication regions. We simulate 20X reads for each set, roughly adding up to 40X for the whole genome. The only exception is the X chromosome, for which only 20X reads are simulated and both deletions and duplications are introduced into the reference. Specifically, we use the following parameters in the simulation: the read length is set to 100, the average insert size is 500 with a standard deviation of 50, the mutation rate is 0.01, the fraction of indels is 15%, the fraction of extended indels is 30%, and the maximum no-read ratio is 5%. The simulated reads are aligned using BWA's paired-end algorithm (Li and Durbin, 2009).

The "gold standard SV set" for NA12878 has been used in several simulation studies (Teo and others, 2012; Jiang and others, 2012) to evaluate different CNV identification methods. We compare results from



Table 2. Simulation results for the local median transformation and the NB transformation

                      Median transformation                           NB transformation
              D1             D2             D3             D1             D2             D3

Long CNVs: |I1| = 1200, |I2| = 4000, |I3| = 8000 (m = 400)
  A1, A2   0.072 (0.010)  0.025 (0.003)  0.014 (0.002)  0.072 (0.010)  0.025 (0.003)  0.014 (0.002)
  A1, A3   0.079 (0.006)  0.028 (0.003)  0.012 (0.001)  0.079 (0.006)  0.028 (0.003)  0.012 (0.001)
  B1, B2   0.123 (0.017)  0.026 (0.003)  0.013 (0.001)  0.103 (0.008)  0.026 (0.003)  0.013 (0.001)
  B1, B3   0.0091 (0.008) 0.022 (0.003)  0.012 (0.001)  0.100 (0.013)  0.058 (0.004)  0.023 (0.003)

Medium CNVs: |I1| = 300, |I2| = 500, |I3| = 1000 (m = 100)
  A1, A2   0.087 (0.007)  0.052 (0.004)  0.026 (0.002)  0.087 (0.006)  0.052 (0.004)  0.026 (0.002)
  A1, A3   0.092 (0.006)  0.055 (0.004)  0.032 (0.002)  0.092 (0.006)  0.055 (0.005)  0.032 (0.002)
  B1, B2   0.102 (0.014)  0.060 (0.006)  0.029 (0.004)  0.087 (0.006)  0.057 (0.005)  0.025 (0.003)
  B1, B3   0.105 (0.013)  0.060 (0.005)  0.023 (0.003)  0.100 (0.013)  0.058 (0.004)  0.023 (0.003)

Short CNVs: |I1| = 20, |I2| = 40, |I3| = 100 (m = 20)
  A1, A2   0.293 (0.013)  0.125 (0.020)  0.060 (0.005)  0.293 (0.018)  0.125 (0.016)  0.050 (0.005)
  A1, A3   0.250 (0.029)  0.125 (0.015)  0.040 (0.005)  0.293 (0.021)  0.125 (0.018)  0.040 (0.004)
  B1, B2   1.000 (0.000)  1.000 (0.000)  0.110 (0.080)  1.000 (0.000)  0.184 (0.017)  0.080 (0.009)
  B1, B3   1.000 (0.000)  0.293 (0.164)  0.080 (0.009)  1.000 (0.000)  0.184 (0.016)  0.060 (0.007)

D1, D2, and D3 are the average dissimilarity measurements based on 100 replications, with the standard errors given in the parentheses. Baseline non-CNV regions are generated from an NB distribution with A1: μ = 40, σ² = 80 or B1: μ = 10, σ² = 20, and CNV regions are generated with A2: μ = 26.7, σ² = 44.4; A3: μ = 60, σ² = 150; B2: μ = 6.7, σ² = 11.1; and B3: μ = 15, σ² = 37.5.

our methods with those from the two most commonly used RD-based methods, EWT (RDXplorer) (Yoon and others, 2009) and CNVnator (Abyzov and others, 2011). CNVnator is based on a mean-shift algorithm with multiple-bandwidth partitioning, and RDXplorer is based on the EWT algorithm. Both adopt a default bin size of 100, and we use the same bin size for our parametric and local median transformations. The RDs are further adjusted according to GC content as described in Yoon and others (2009). Our analysis used less than 2.5 GB of memory on a single Xeon 2.6 GHz CPU and took less than 3 h, with more than 95% of the time spent on reading the alignments. Due to mapping difficulties, we observe that the RDs are far from the ideal uniform distribution, which can lead to both false-positive and false-negative identifications for any RD-based method. Since RD methods are intrinsically more reliable for longer CNVs, we only report CNVs longer than 1000 bases. For a concordant identification, we require an identified CNV to reciprocally overlap one of the "gold standard" CNVs by at least 50% of length, defined as the overlap between the two CNVs being longer than 50% of the longer one. To check break-point accuracy, we use a more



Table 3. Simulation comparison of four methods based on the 1000 Genomes Project "gold standard set" of 547 CNVs for individual NA12878, including our proposed method with the NB transformation, the local median transformation (MED), CNVnator, and EWT

                          NB                      MED                   CNVnator                 EWT
Concordance        TP          FP          TP         FP          TP         FP          TP          FP
50% overlap (All)  501 (92%)   65 (11%)    476 (83%)  50 (10%)    515 (94%)  468 (47%)   494 (90%)   1154 (70%)
50% overlap (Del)  343 (92%)   56 (13%)    330 (88%)  48 (13%)    365 (97%)  460 (56%)   364 (97%)   859 (70%)
50% overlap (Dup)  158 (90%)   9 (5%)      146 (84%)  2 (1%)      150 (86%)  8 (5%)      130 (75%)   295 (69%)
±100 (All)         425 (77%)   141 (24%)   378 (69%)  148 (28%)   397 (69%)  586 (60%)   65 (12%)    1583 (96%)
±100 (Del)         300 (80%)   99 (24%)    289 (77%)  89 (24%)    298 (79%)  527 (63%)   46 (12%)    1177 (96%)
±100 (Dup)         125 (71%)   42 (25%)    89 (51%)   59 (40%)    99 (57%)   59 (37%)    19 (11%)    406 (96%)
±200 (All)         472 (86%)   94 (16%)    429 (78%)  97 (18%)    461 (84%)  522 (53%)   144 (26%)   1504 (91%)
±200 (Del)         337 (92%)   62 (16%)    321 (86%)  57 (15%)    340 (91%)  485 (58%)   109 (29%)   1114 (91%)
±200 (Dup)         135 (77%)   32 (19%)    108 (62%)  40 (27%)    121 (70%)  37 (23%)    35 (20%)    390 (92%)

Three different concordance measurements are used. 50% overlap: the degree of reciprocal overlap is over 50%; ±100: identified CNV break points are within 100 bases of those in the "gold standard set"; ±200: identified CNV break points are within 200 bases of those in the "gold standard set". For each method and concordance criterion, the numbers (percentages) of true-positive CNVs (TP) and false-positive CNVs (FP) are shown for all CNVs (All), deletions (Del), and duplications (Dup).

stringent criterion that requires both break points of an identified CNV to be within 100 or 200 bases of those in the "gold standard" set. Table 3 lists the numbers of concordant and false-positive CNVs obtained from the four methods using the three different criteria for overlapping CNVs. For both deletions and duplications, the NB transformation has much lower false-positive rates and better or comparable true-positive rates. We also observe that the sensitivities are higher for detecting deletions than duplications for all methods considered. Clearly, our proposed algorithm with the NB transformation performs the best in terms of both true discovery and false-positive rates, 92% and 11%, respectively, when reciprocal overlap greater than 50% is required. Such rates are comparable with those of methods that combine pair-read informed split-read mapping and RD information (Wu and others, 2013; Teo and others, 2012; Jiang and others, 2012). In contrast, both CNVnator and EWT have very high false-positive rates, 47% and 70%, respectively, with comparable true discovery rates of 94% and 90%. Our simulation results also highlight the advantage of using the NB transformation over the local median transformation.
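The 50% reciprocal overlap criterion used for Table 3 can be expressed in a few lines of Python (illustrative only; the break-point criteria of ±100 and ±200 bases are not shown):

```python
def reciprocal_overlap_50(call, truth):
    """Return True if the overlap between the two intervals (start, end) is
    longer than 50% of the longer interval, the concordance criterion used
    for Table 3."""
    overlap = max(0, min(call[1], truth[1]) - max(call[0], truth[0]))
    longer = max(call[1] - call[0], truth[1] - truth[0])
    return overlap > 0.5 * longer

def concordance_counts(calls, truth_set):
    """Count concordant (TP) and non-concordant (FP) calls against a
    'gold standard' set of CNVs on the same chromosome."""
    tp = sum(any(reciprocal_overlap_50(c, t) for t in truth_set) for c in calls)
    return tp, len(calls) - tp
```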

5. APPLICATION TO THE TRIO DATA OF THE 1000 GENOMES PROJECT

To demonstrate and compare results of the proposed methods, we analyzed the RD data of chromosome 19 of the HapMap CEU trio samples (NA12878, NA12891, NA12892) from the 1000 Genomes Project.


[Figure 1: six panels. Top row: transformed data (MED transformation, NB transformation) plotted against position. Middle row: histograms of the standardized transformed data without GC correction (MED transformation w/o GC, NB transformation w/o GC), each with a normal reference density. Bottom row: the corresponding histograms after GC correction (MED transformation w GC, NB transformation w GC).]

Fig. 1. Illustration of RD data of NA12878 of the CEU trio from the 1000 Genomes Project. Top: transformed data based on the local median (MED) and the NB transformations of the first 4 million observations. Middle: histograms of the transformed data without correcting for guanine-cytosine (GC) content. Bottom: histograms of the transformed data after correcting for GC content.

[Figure 2: bar plot of percent concordance with the parental genomes for the top 20, 50, 100, 200, and 300 CNVs identified in the daughter, shown for the Median, NB, and Poisson transformations.]

Fig. 2. Concordance with parental genomes for CNVs identified in the daughter (NA12878) for methods using the NB, Poisson, and local median transformations. The CNVs are ranked based on the likelihood ratio statistics calculated using the transformed data.

We denote by X_i the number of short reads that cover base-pair position i in the genome, for i = 1, ..., 55,747,332. The average coverage of the read counts is 40. The top plots of Figure 1 show a comparison of the local median and the parametric NB-based transformations of the first 4 million observations using bin size m = 400, which results in 10 000 transformed data points. The middle and bottom plots of Figure 1 show histograms of the 10 000 transformed data points generated from the local median and the NB transformations, without and with correction for GC content. It is clear from these plots that the data based on either the NB transformation or the local median transformation can be approximated by normal distributions.

The method using the NB transformation with m = 400 and L = 150 identifies 368, 342, and 339 CNVs in the daughter and the two parents, respectively. In contrast, the method using the median transformation identifies 467, 322, and 429 CNVs, respectively. The method using the Poisson distribution identifies far fewer CNVs: 184, 159, and 159 in the daughter and the two parents. To assess the specificity of the identified CNVs, we examine the concordance between CNVs called in the daughter (NA12878) and CNVs called in the parental genomes (NA12891 and NA12892). A CNV identified in the daughter is considered concordant if its degree of reciprocal overlap with the CNVs identified in either parent is greater than 50%. We then rank the identified CNVs using the likelihood ratio statistics $Z(\tilde I) = \sum_{k \in \tilde I} Z_k/\sqrt{|\tilde I|}$ for interval Ĩ based on the transformed data Z_k, which account for both the length and the magnitude of the CNVs. Figure 2 shows the percent concordance for the top CNV regions identified by the three methods (for the parametric transformation based on the Poisson distribution, we only show concordances up to the top 100 CNVs, since the numbers of CNVs identified by this transformation are below 200 in all three individuals). For example, the top 50 CNVs identified by the NB and Poisson transformations both have 100% concordance, whereas the local median transformation has a concordance of 88%. Among the top 100, CNVs identified by the NB transformation have the highest concordance at 97%, whereas the concordance of CNVs based on the Poisson and local median transformations is only 88% and 89%, respectively. For the top 300, the concordance is 90% and 85% for the NB and the local median transformations, respectively. These results provide evidence that using the NB distribution to model the RD data can increase the concordance rate of the CNVs called between the daughter and her two parents compared with both the Poisson transformation and the non-parametric median-based transformation.

We also compared the CNVs identified by the different methods to the "gold standard set" of 547 CNVs longer than 1000 bps provided by the 1000 Genomes Project (Mills and others, 2011) for individual NA12878. We used the default parameters for CNVnator and EWT and considered the CNVs that are longer than 1000 bps. To make the results comparable with the default setup of CNVnator, we filtered out the CNVs with more than 50% of the reads having mapping quality scores of zero. The NB transformation, the local median transformation, CNVnator, and EWT identify a total of 1037, 1137, 1816, and 15 499 CNVs throughout the genome, respectively. Among these CNVs, 306, 305, 355, and 366 were concordant with the "gold standard" set with greater than 50% reciprocal overlap, resulting in a sensitivity of 0.56, 0.56, 0.65, and 0.67, respectively. Our method with the NB transformation detected far fewer CNVs than CNVnator and EWT, resulting in a specificity of 30%, compared with 20% and 2% for CNVnator and EWT and 27% for the local median transformation. We again observe that the NB transformation results in fewer CNVs but a higher concordance rate with the "gold standard" set than the local median transformation.


6. DISCUSSION AND CONCLUSION

We have considered the problem of CNV identification based on RD data from whole-genome sequencing. By assuming that the RD data follow a parametric distribution such as the Poisson or the NB distribution, we propose to use a parametric transformation to first transform the count data in non-overlapping bins into data that are approximately normal. We then apply the LRS method of Jeng and others (2010) to identify the CNVs based on the transformed data. We have performed extensive simulations to study the characteristics of NGS RD data and observed advantages of using the parametric transformation as opposed to the non-parametric local median transformation (Cai and others, 2012). Results from the simulations showed that the proposed method with the NB transformation resulted in much lower false-positive rates with comparable or better true-positive rates than other RD-based methods such as CNVnator and EWT. The proposed method also resulted in 779 fewer CNV calls than CNVnator for individual NA12878 from the 1000 Genomes Project, leading to a potential reduction of false-positive CNV calls in real applications.

In our simulations, we used the short read simulator program WGSIM (Li and Durbin, 2009) to simulate Illumina reads, allowing uniform substitution errors, indels, and dummy quality values. Other simulators such as ART (Huang and others, 2012) and GemSIM (McElroy and others, 2012) can also be applied. Both ART and GemSIM use empirically derived, sequence-context-based error models to simulate reads, with empirical fragment length and quality score distributions. Since our proposed method and other RD-based methods such as CNVnator allow post-processing of the identified CNVs based on quality scores, we expect similar conclusions when these read simulators are used. However, it would be interesting to further evaluate these RD segmentation methods using such simulators.

RD-based approaches for CNV calling can be affected by bias originating from read alignment. In the alignment step, some reads can be mapped to multiple positions due to the short read length and the presence of repetitive regions in the reference genome. Our method and CNVnator randomly assign an ambiguous read to one of all possible positions to which the read is mapped by the aligner. This strategy enables us to identify CNVs in repetitive genomic regions, but it can suffer from false-positive detections as a consequence of the random placement of ambiguously mapped reads. Like most RD-based methods, we suggest filtering out the CNVs for which over 50% of the covering reads have mapping quality scores of zero. This can, however, reduce the sensitivity of detecting CNVs in repetitive regions. Alternatively, one can combine the RD data with other data such as pair-end mapping


distances and pair-read informed split-read mapping information to improve both sensitivity and specificity for CNV detection. Finally, our proposed method uses the sample mean RD to approximate the baseline depth, which is a reasonable assumption only for whole-genome sequencing data. However, the methods cannot be applied directly to RD data from capture or exome sequencing. For such data, it is necessary to pool information across multiple samples in order to adjust for differential capture efficiencies in different exons. Extending our approach to such data is an important topic for future research.

ACKNOWLEDGMENTS

We thank Tony Cai for pointing out the paper by Brown, Cai, and Zhou (2010), which led to the development of this paper. We thank both reviewers for their insightful comments. Conflict of Interest: None declared.

FUNDING

This research is supported in part by the NIH grants CA127334, GM097525, and EY021451.

REFERENCES

ABYZOV, A., URBAN, A., SNYDER, M. AND GERSTEIN, M. (2011). CNVnator: an approach to discover, genotype and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research 21, 974–984.

BROWN, L. D., CAI, T. T. AND ZHOU, H. (2010). Nonparametric regression in exponential families. The Annals of Statistics 38, 2005–2046.

CAI, T., JENG, J. AND LI, H. (2012). Robust detection and identification of sparse segments in ultra-high dimensional data analysis. Journal of the Royal Statistical Society, Series B 74(5), 773–797.

DISKIN, S. J., HOU, C., GLESSNER, J. T., ATTIYEH, E. F., LAUDENSLAGER, M., BOSSE, K., COLE, K., MOSSÉ, Y. P., WOOD, A., LYNCH, J. E. and others. (2009). Copy number variation at 1q21.1 associated with neuroblastoma. Nature 459, 987–991.

HUANG, W., LI, L., MYERS, J. R. AND MARTIN, G. T. (2012). ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594.

JENG, J., CAI, T. AND LI, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. Journal of the American Statistical Association 105, 1156–1166.

JIANG, Y., WANG, Y. AND BRUDNO, M. (2012). PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants. Bioinformatics 28, 2576–2583.

LI, H. AND DURBIN, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.

MCELROY, K. E., LUCIANI, F. AND THOMAS, T. (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13, 74.

MILLER, C. A., HAMPTON, O., COARFA, C. AND MILOSAVLJEVIC, A. (2011). ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS One 6(1), e16327.

MILLS, R. E., WALTER, K., STEWART, C., HANDSAKER, R. E., CHEN, K., ALKAN, C., ABYZOV, A., YOON, S. C., YE, K., CHEETHAM, R. K. and others. (2011). Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65.


OLSHEN, A. B., VENKATRAMAN, E. S., LUCITO, R. AND WIGLER, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572.

REDON, R., ISHIKAWA, S., FITCH, K. R., FEUK, L., PERRY, G. H., ANDREWS, T. D., FIEGLER, H., SHAPERO, M. H., CARSON, A. R., CHEN, W. and others. (2006). Global variation in copy number in the human genome. Nature 444, 444–454.

SIEGMUND, D. O., YAKIR, B. AND ZHANG, N. R. (2010). Tail approximations for maxima of random fields by likelihood ratio transformations. Sequential Analysis 29, 245–262.

SZATKIEWICZ, J. P., WANG, W., WANG, W., SULLIVAN, P. F. AND SUN, W. (2013). Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation. Nucleic Acids Research 41, 1519–1532.

TEO, S. M., PAWITAN, Y., KU, C. S., CHIA, K. S. AND SALIM, A. (2012). Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics 28, 2711–2718.

URBAN, A. E., KORBEL, J. O., SELZER, R., RICHMOND, T., HACKER, A., POPESCU, G. V., CUBELLS, J. F., GREEN, R., EMANUEL, B. S., GERSTEIN, M. B. and others. (2006). High resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proceedings of the National Academy of Sciences 103, 4534–4539.

WANG, K., LI, M., HADLEY, D., LIU, R., GLESSNER, J., GRANT, S., HAKONARSON, H. AND BUCAN, M. (2007). PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17, 1665–1674.

WU, Y., TIAN, L., PIRASTU, M., STAMBOLIAN, D. AND LI, H. (2013). MATCHCLIP: locate precise breakpoints for copy number variation using CIGAR string by matching soft clipped reads. Frontiers in Genetics 4, 157.

XIE, C. AND TAMMI, M. (2009). CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics 10, 80.

YOON, S., XUAN, Z., MAKAROV, V., YE, K. AND SEBAT, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19, 1586–1592.

[Received March 18, 2013; revised December 11, 2013; accepted for publication December 12, 2013]
