G Model

ARTICLE IN PRESS

CBAC-6515; No. of Pages 6

Computational Biology and Chemistry xxx (2016) xxx–xxx

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Research Article

Copy number variants calling for single cell sequencing data by multi-constrained optimization Bo Xu, Hongmin Cai ∗ , Changsheng Zhang, Xi Yang, Guoqiang Han School of Computer Science & Engineering, South China University of Technology, Guangzhou, China

a r t i c l e

i n f o

Article history: Received 16 January 2016 Accepted 1 February 2016 Available online xxx Keywords: Copy number variants Read depth Sparsity Poisson distribution Total variation

a b s t r a c t Variations in DNA copy number carry important information on genome evolution and regulation of DNA replication in cancer cells. The rapid development of single-cell sequencing technology allows one to explore gene expression heterogeneity among single-cells, thus providing important cancer cell evolution information. Single-cell DNA/RNA sequencing data usually have low genome coverage, which requires an extra step of amplification to accumulate enough samples. However, such amplification will introduce large bias and makes bioinformatics analysis challenging. Accurately modeling the distribution of sequencing data and effectively suppressing the bias influence is the key to success variations analysis. Recent advances demonstrate the technical noises by amplification are more likely to follow negative binomial distribution, a special case of Poisson distribution. Thus, we tackle the problem CNV detection by formulating it into a quadratic optimization problem involving two constraints, in which the underling signals are corrupted by Poisson distributed noises. By imposing the constraints of sparsity and smoothness, the reconstructed read depth signals from single-cell sequencing data are anticipated to fit the CNVs patterns more accurately. An efficient numerical solution based on the classical alternating direction minimization method (ADMM) is tailored to solve the proposed model. We demonstrate the advantages of the proposed method using both synthetic and empirical single-cell sequencing data. Our experimental results demonstrate that the proposed method achieves excellent performance and high promise of success with single-cell sequencing data. Crown Copyright © 2016 Published by Elsevier Ltd. All rights reserved.

1. Background Genomic variation consists of single nucleotide variants (SNVs), copy number variations (CNVs), small insertions or deletions (indels), and structural variants (SVs). The variants are ranging from low level of single base changes to high level of large chromosomal alterations. CNV refers to a type of intermediate-scale SVs with copy number changes involving a DNA fragment that is typically greater than one kilo bases (Kb) and less than five mega bases (Mb). Recent studies have revealed a strong correlation between CNVs and certain diseases, including Mendelian diseases, autism, schizophrenia, and a few tumor types (Lee et al., 2007; Stefansson et al., 2008; Sebat et al., 2007). Various laboratory techniques have been developed to measure the DNA copy number. Traditional major approaches include array-based comparative genomic hybridization (aCGH) and single-nucleotide polymorphism (SNP) array approaches (Pinkel and Albertson, 2005). The

∗ Corresponding author. E-mail address: [email protected] (H. Cai).

next-generation sequencing (NGS) is adopted as a popular strategy for genotyping and has included comprehensive characterization of CNVs by generating hundreds of millions of short reads in a single run (Metzker, 2010). Since there exists high heterogeneities between individual single-cells at the molecular level, NGS has been successfully tailored to sequence entire genomes at the singlecell level. The main difference between single-cell sequencing and traditional bulk sequencing is that single-cell sequencing needs an extra step that amplifies the genome from a single-cell. Currently there are two major amplification schemes, multiple displacement amplification (MDA) (Spits et al., 2006) and multiple annealing looping-based amplification cycle (MALBAC) (Zong et al., 2012). As a revolutionary technology, single-cell sequencing quickly received popularity and has been employed in many biological and medical fields. However, amplification of low genome volume to rich one will inevitably introduce substantial level of technical noises, with some genomic regions being amplified more than others. It is such technical noises that make the CNV analysis for single-cell sequencing data challenging. Rather than developing new method to address this pitfalls, most of the current schemes were directly borrowed from bulk sequencing with an extra step

http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007 1476-9271/Crown Copyright © 2016 Published by Elsevier Ltd. All rights reserved.

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

G Model CBAC-6515; No. of Pages 6

ARTICLE IN PRESS B. Xu et al. / Computational Biology and Chemistry xxx (2016) xxx–xxx

2

of bias correction. In bulk sequencing analysis, most of the methods assumed that the raw signals are corrupted by Gaussian noise to ease the computation. However, it has been validated experimentally that the technical and biological variability in the absolute numbers of mRNA molecules in single-cell sequencing experiments break the noise assumption. For example, the author proved that the technical noises could be accurately characterized by negative binomial distribution (Grün et al., 2014). In this paper, we propose to fill the gap of finding the CNVs in single-cell sequencing data by assuming that the technical noises follow Poisson distribution, a simplified version of negative binomial distribution. Due to the popularity of read depth (RD) scheme in CNV calling in bulk sequencing, we propose to carry CNV analysis for RD data. The underlying hypothesis of RD-based methods is that the depth of coverage in a genomic region is correlated with the copy number of the region. Therefore, we assume that the copy number is piecewise constant satisfying two fundamental characteristics: sparsity because of few CNVs after normalization, and smoothness because of contiguous positions possessing similar CNs. Under such settings, the CNV detection problem is formulated as a piecewise linear signal recovering optimization. An efficient numerical solution within the framework of alternating direction minimization method is tailored to solve our problem. Experiments on both synthetic and empirical single-cell sequencing data demonstrated its superior performance. The rest of this article is organized as follows. In Section 2, we formulate an original mathematical model for copy number reconstruction from raw RD signal, corrupted by Poisson noises. Numerical solution is elaborated here too. In Section 3, experiments on both synthetic and empirical data were conducted to demonstrate the performance of the proposed model. Finally, we conclude our article with some discussions in this Section 4. 2. Methods The proposed method for analyzing single-cell sequencing data consists of four steps. In the first step, the short reads were aligned to a reference genome using standard tools such as MAQ (Li et al., 2009) and Bowtie (Langmead et al., 2009). The data were further processed for GC content correction and segmented into bins with variable lengths to reduce read mapping bias (Baslan et al., 2012). In the second step, the proposed method was used to reconstruct the copy number by the raw reads in each bin. Finally, statistical testing was borrowed to find the CNVs. 2.1. Problem formulation Mathematically, let y = (y1 , y2 , . . ., yn ) be the observed RD signals in a bin, and x = (x1 , x2 , . . ., xn ) is the corresponding reconstructed copy number. We wish to determine the copy number (CN) x that is most likely given by the sequencing reads y. Bayes’s Law says that:

P(x|y) =

P(y|x)P(x) P(y)

Thus, we wish to maximize P(y|x)P(x). Assuming Poisson noise corrupted at each genome position t, we have: y

P(yt ) = Pxt (yt ) =

e−xt xt t yt !

Suppose that the values of y at the position t are independent, then:

P(y|x) =

 e−xt xyt t

t

yt !

Considering the characteristics of sequencing reads, we further impose that the prior distribution on copy number (after standardization by subtracting its mean value) satisfies two assumptions: Smoothness copy numbers at contiguous chromosome positions are similar except for abrupt changes between different segments; Sparsity copy number variants are less common than invariants. Mathematically, the above two characteristics could be penalized by:

 

P(x) = exp



(1 |∇ x| + 2 |x|)d˝



where 1 and 2 are trade-off parameters for controlling the sparsity and smoothness of copy number function, respectively. The integration operation takes value along the genome on each bin ˝. Finally, we minimize − log(P(y|x)P(x)) to seek the maximum a posterior probability on x: min

 

x



 (xt − yt log xt+ ) +

(1 |∇ x| + 2 |x|)d˝

(1)

t

where x+ = max {0, x}. Once we obtained the legitimate copy number signal, its variants can be easily derived by simple threshold. 2.2. Numerical solution The minimization problem (1) is actually a quadratic optimization constrained both by a total variational (TV) norm and a l1 norm. Such minimization problems are widely seen in various areas, including signal processing and image recovering (Ng et al., 2010). Since the optimization problem (1) is convex, there exists multiple choices to solve it by employing standard optimization methods, such as majority minimization (Zhang et al., 2010, 2012), and Lasso approach (Duan et al., 2013; Zhou et al., 2013). Unfortunately, due to the high volume of the sequenced data, efficient numerical solution is desirable for practical usage. In this paper, we propose to solve (1) within the framework of the Alternating Direction Method of Multipliers (ADMM) (Ng et al., 2010; Zhou et al., 2013). The most admiring characteristic of ADMM lies in its capability of decomposing a complex problem into favorably separable subproblems, thus allowing it to be efficiently solved individually. Let g1 (x) = x − y log(x), g2 = 1+ (·), g3 = 1  ·  1 , g4 = 1  · − c  1 , where 1+ is the indicator function for positive real numbers:



1+ (x) =

0,

x>0

+∞,

otherwise

Let G = [I ; I ; ∇ ; I]T with I being identity matrix and ∇ be usual difference matrix. Then the minimization equation (1) could be accordingly rewritten as: L(x) = min

m 

x

{(xi − yi log(xi ) + 1+ (xi )) + 1 ∇ x1 + 2 x − c1 }

i

= minf1 (x) + f2 (Gx) x

4

where f1 (x) = 0, f2 (x) = g (x). j=1 j By introducing a slack variable of u = Gx, the augmented Lagrange function for L(x) is: L = minf1 (x) + f2 (u) + x

 Gx − u22 2

(2)

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

G Model

ARTICLE IN PRESS

CBAC-6515; No. of Pages 6

B. Xu et al. / Computational Biology and Chemistry xxx (2016) xxx–xxx

where  is Lagrange multiplier. The above minimization problem could be now fitted into the ADMM framework, and thus could be decoupled into the following two subproblems:

3

Its solution is given by a hard threshold on :

v∗ = max{, 0}

(8)

For i = 3, the minimization problem (4) has form: Subproblem 1: xk+1 = arg minf1 (x) +

 Gx 2

Subproblem 2: uk+1 = arg minf2 (u) +

 Gxk+1 2

x

u



− uk − dk 22 − u − dk 22

Updating: dk+1 ← dk − (Gxk+1 − uk+1 ).



It remains to solve the two subproblems and we shall demonstrate that they could be solved elegantly by using standard methods after simple algebraic transformation. In particular to subproblem 1, one can verify that: xk+1 ∈ arg min x

= (GT G)

−1

−1

1 −1

0

···

⎢ 0 1 −1 · · · ⎢ D=⎢ . . ⎣ .. . . . . . . . . ...

0

0

1

−1

−1

(3)

where F is 2-D discrete Fourier transformation matrix and K = diag(FD)/n. The subproblem 2 amounts to solving:

u

 ||Gxk+1 − u − dk ||22 2

T

each u(i) in u = [u(1) , u(2) , u(3) , u(4) ] for i = 1, 2, 3, 4, and then be solved separately. The decomposed problems are in form: (j)

v

 (j) v − sk 22 2

(4)

(j)

(j)

for j = 1, . . ., 4, with sk = (Gxk+1 )(j) − dk . For i = 1, the minimization problem (4) has form:

 argmin v

 v − y log(v) + (v − )2 2

 1 v (i) = 2

1 + (i) − 





1 ,0 

2 v − c1 +

 (v − )2 2



2 



(9)

(10)

1 (i) − 

(5)

2

2.3. Parameter pruning There are two parameters involved in our model. The parameter of 1 is used to penalize the total variational term and 2 is used to control the sparsity of the recovered signal. When we consider the parameter pruning for 1 alone, one can minimize the energy function with respect to x. By using DÄlembert convergence condition lim

n→inf

x(n+1)  x(n) 

 4yi + 

(6)

= 1, we have:

y 1 x x 1 + 1 div ∇ x (n) ∇ x

(11)

∇ x | < 1 to ensure the Therefore, one should impose that |1 + 1 div ∇ x convergence. Furthermore, in the higher gradient areas, it implies a possible changing position and thus we need to preserve the information. Since the divergence in Eq. (11) is large, a small value of 1 should be adopted. In contrary, in flat area with lower gradient values, a large value of 1 should be imposed to encourage the noise smoothing. Incited by our earlier works (Cai et al., 2008; Zhang et al., 2007), we are proposing to use the following spatially adaptive scheme in setting the value of 1 :

1 = a



It has an analytic solution, given by:



|| −

Finally, the auxiliary sequences of {dk } is updated by:

x(n+1) =

One can verify that it could be further decoupled with respect to

uk+1 ← argmingj (v) +



Once have successfully reconstructed the copy number from RD signals, a simple threshold scheme could be used to find the variants.

(3I + K)F(1 + DT 2 + 3 + 4 )

uk+1 ∈ argminf2 (u) +



argmin

dk+1 ← dk − (Gxk+1 − uk )

Since DT D is a circular matrix and thus can be diagonalized by discrete Fourier transform. By series of simple calculations, the analytical solution for the subproblem 1 is given by: ∗ xk+1 = F T (3I + K 2 )

= sign(v) max

For i = 4, the minimization problem (4) has form:

v∗ = soft  − c,

⎥ ⎥ ⎥ ⎦

0





(1 + DT 2 + 3 + 4 )



0

1 

Similarly, its solution is also given by the Moreau proximity operator:

where  k = uk + dk , and D is the total variational difference matrix:



v∗ = soft ,

v

 ||Gx − uk − dk ||22 2

GT k = (3I + DT D)



 (v − )2 2 v Its solution is given Moreau proximity operator, which is a simple soft threshold: 1 v1 +

argmin

1 1 + b∇ x

(12)

where a is a scaling factor and b is a shape factor. The two parameters are decided empirically. In our experiment, we selected a = 1, b = 10 as default value. In terms of the sparsity penalizing term of 2 , it is a trade-off constant within soft thresholds, shown in Eq. (9). Therefore, a small value is desired to have a convergent outcome. 3. Results and discussion

For i = 2, the minimization problem (4) has form:  argmin{1+ (v) + (v − )2 } 2 v

(7)

We conducted two experiments to demonstrate the performance of the proposed model. The tested data sets were categorized into two types: synthetic and empirical data. The synthetic experiments were used to test the performance of the proposed method

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

G Model

ARTICLE IN PRESS

CBAC-6515; No. of Pages 6

B. Xu et al. / Computational Biology and Chemistry xxx (2016) xxx–xxx

4 5

5.5 Proposed Method CNV−TV

4 3.5 3 2.5 2

Proposed Method CNV−TV

Normalized Copy Number

5

Normalized Copy Number

Normalized Copy Number

4.5

6 Proposed Method CNV−TV

4.5 4 3.5 3 2.5 2

5

4

3

2

1 1.5 1

1.5

0

500

1000

1500

2000

1

0

500

Bins

(a) High Amplification (x50)

1000

1500

2000

0

0

500

1000

1500

Bins

Bins

(b) Mediate Amplification (x30)

(c) Low Amplification (x10)

2000

Fig. 1. Experiment with a synthetic example. A raw piecewise constant RD signal highlighted in green was degraded by various levels of the Poisson noises due to amplification levels. The results after the proposed method and CNV-TV on the degraded RD signals denoted in light black with (a) low noises (50× amplification level); (b) mediate noises (30× amplification level) and (c) high noises (10× amplification level) were shown for comparison. The result after the proposed method is highlighted in red, while the result after CNV-TV was in blue color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

on CNV detection from raw simulated signals under various Poisson noise corruptions. We further tested the proposed method on an empirical single-cell sequencing data set (Navin et al., 2011). 3.1. Synthetic experiments The first experiment is used to test the capabilities of the proposed method on CNV detection from the RD data, which were corrupted by the Poisson noises with various variances. A raw piecewise constant RD signal was firstly created to simulate the copy numbers. We simulated the single-cell sequencing process by using three different amplification levels (10×, 30× and 50×). It resulted in various noise degradations of the Poisson type. The higher the amplification level the lower the noise variance is. We categorized the three types as low, mediate and high noises, shown in Fig. 1(a–c) respectively. We then applied the proposed method on this corrupted data to recover the copy number. For comparison, a recently reported method, called CNV-TV (Duan et al., 2011) was employed here. CNV-TV was designed for analyzing CNV in bulk sequencing data. It shares with the same assumptions of sparsity and smoothness of the CNV as our method does. For CNV-TV, CNVs detection is modeled as a change-point detection on RD signals, which is fitted with a total variation (TV) penalized least squares model (Duan et al., 2011). The result after the proposed method is highlighted in red, while the result after CNV-TV was in blue color. Both of the results were overlapped with degraded RD signals, denoted in light black, for comparison. One may observed that both of the two methods performed stably in three cases. However, the CNV-TV is less superior to our methods in that it failed to detect less prominent variants. For example, the RD signals at position from 600 to 700 were not been correctly detected by CNV-TV under all three cases. Similarly, the variants at position 1500 were also falsely detected by CNV-TV. The proposed method could accurately detect all the variants by perfectly reconstructing the copy number even when the noise degradation was high, as shown in Fig. 1(c). 3.2. Application in single-cell sequencing data To further assess the performance of the proposed method on empirical applications, a single-cell sequencing data set of 100 cells was downloaded from NCBI and tested (Navin et al., 2011). The sequencing samples were from a high-grade(III), triple-negative (ER− , PR− , HER2− ) ductal carcinomas(T10). Each single-cell of the tumor was sequenced based on single-nucleus sequencing by flow

sorting of single nuclei, followed by whole-genome amplification and next-generation sequencing to characterize its copy number alterations. The characteristic of the data allows us to have a comprehensive view of the tumor evolutionary process. The data have been analyzed by an newly reported protocol (Baslan et al., 2012). The protocol consists of three main steps, reads mapping, bias correction and copy number profile detection. The copy number profiles were obtained by using standard CBS method (Olshen et al., 2004). Circular binary segmentation (CBS), a modification of binary segmentation, can translate noisy intensity measurements into regions of equal copy number for bulk sequencing data (Olshen et al., 2004). The analysis on these single-cells revealed three distinct clonal subpopulations that probably represent sequential clonal expansions (Navin et al., 2011). The proposed method was applied on this data set and results after three typical single-cells with different ploidies were shown in Fig. 2. The RD signals, highlighted in light gray, were overlapped with the reconstructed copy numbers after CBS (blue broken line), Ginkgo (Garvin et al., 2014) (green broken line) and the proposed method (red broken line) respectively. Ginkgo (http://qb.cshl.edu/ ginkgo) is a user-friendly, open-source web platform for the CNVs detection of single-cell sequencing data. To have a quantitative evaluations of the three methods, we applied the proposed method on all 100 cells to have their copy number profiles. Then the profiles were clustered by using neighbor-joining tree. For comparison, both the protocol proposed in Baslan et al. (2012) and Ginkgo were employed. The experimental results, shown in Fig. 3, clearly show three major branches of ploidy: a diploid fraction in green, a hypodiploid fraction in yellow, and a aneuploid fraction in red. The tree nicely indicated three subpopulations by demonstrating their common ancestors and evolutionary distance among them. The tumor grades of 1–4 were indexed by its ploidy (Navin et al., 2011). To quantify how “goodness” of each clustering result, two evaluation criteria were defined and calculated. The first criterion evaluated the closeness of the cluster labels to its known clinical grades. The smaller the value is, the better of the genome profile matching with its clinical status is. The distance after CBS and Ginkgo are 0.19 and 0.3, respectively. Both of them are dramatically larger than that of the proposed method with 0.15, showing a better performance and higher accuracy than the other two. The second criterion evaluated the “closeness” of the clusters themselves. The value is defined by the fraction of averaged Euclidian distances between inter-group and intra-group. The results are 0.2917/0.236, 0.28/0.1182 and 0.4/0.078 for the proposed method, CBS and Ginko, respectively. Our method preformed suboptimal to the CBS in that

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

G Model CBAC-6515; No. of Pages 6

ARTICLE IN PRESS B. Xu et al. / Computational Biology and Chemistry xxx (2016) xxx–xxx

5

Fig. 2. Copy number detection on three typical cancer cells with single-cell sequencing. The selected cells consisted of one diploid cells (SRR053623), one hypodiploid cells (SRR054569), and one aneuploid cells (SRR089401). The raw signal was denoted in light gray as background, different colors were used to distinguish the CNVs detected by the three methods. The result by CBS, the proposed method and the Ginkgo were highlighted in blue, red and green broken line respectively. For visual comparison, partial segment (highlighted in black) along the genome was zoomed-in. For each cell, the zooming segment was cropped from the results, bounded by black lines and aligned to its right. Euclidean distance between RD signals and copy number detected was used to evaluate the performance of the proposed method by comparing with CBS and Ginkgo. The smaller Euclidean distance the better the CNVs predicted result is. One may observed that three methods performed consistently when the cell is diploid, as shown in (a, b). However, the performance of Ginkgo deteriorated when the copy number variations tend to increase, which has a greater Euclidean distance to RD signals. In comparison, both the CBS and the proposed method perform stably and robustly, even when the copy number variations were high. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Results of 100 cells clustering by neighbor-joining tree on copy number profiles after (a) the proposed method; (b) Ginkgo; and (c) CBS. Different colors are used to highlight the cluster labels. We used two criteria to compare the performances after the three methods.

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

G Model CBAC-6515; No. of Pages 6

ARTICLE IN PRESS B. Xu et al. / Computational Biology and Chemistry xxx (2016) xxx–xxx

6

the inter-group distances were comparable, yet the intra-group distances were larger than after the CBS. Ginko performed the least satisfactory. 4. Conclusion We have proposed an original optimization model for identifying CNVs from single-cell sequencing data in this article. The new formulation imposes two constraints of sparsity and smoothness on the reconstructed copy number. Due to the amplification step specific to single-cell sequencing, technical noises are major roadblock for accurate detection of CNVs. Existing researches have proved that the noises could be accurately characterized by the negative binomial distribution. To this end, we tailored the proposed model to recover the copy numbers from RD signals corrupted by Poisson noises, which is a simplified version of the negative binomial distribution. An exact numerical solution for the convex formulation has been provided by using the classical ADMM framework to guarantee a global optimal solution. We have demonstrated the capability of our method to separate CNVs from other variations in wide data types, including synthetic and empirical sequencing data. Acknowledgements This work was partially supported by BGI-SCUT Innovation Fund Project (SW20130803), National Nature Science Foundation of China (61372141), the Fundamental Research Fund for the Central Universities (2015ZZ025) and theOpen Fund by State Key Laboratory of Oncology in South China (HN2014-01). References Baslan, T., Kendall, J., Rodgers, L., Cox, H., Riggs, M., Stepansky, A., Troge, J., Ravi, K., Esposito, D., Lakshmi, B., Wigler, M., Navin, N., Hicks, J., 2012. Genome-wide copy number analysis of single cells. Nat. Protoc. 7 (6), 1024–1041. Cai, H., Xu, X., Lu, J., Lichtman, J., Yung, S., Wong, S.T., 2008. Using nonlinear diffusion and mean shift to detect and connect cross-sections of axons in 3D optical microscopy images. Med. Image Anal. 12 (6), 666–675. Duan, J., Zhang, J., Lefante, J., Deng, H., Wang, Y., 2011. Detection of copy number variation from next generation sequencing data with total variation penalized least square optimization. In: IEEE International Conference on Bioinformatics and Biomedicine Workshops, pp. 3–12. Duan, J., Zhang, J.-G., Deng, H.-W., Wang, Y.-P., 2013. Cnv-tv: a robust method to discover copy number variation from short sequencing reads. BMC Bioinform. 14 (1), 150.

Garvin, T., Aboukhalil, R., Kendall, J., Baslan, T., Atwal, G.S., Hicks, J., Wigler, M., Schatz, M., 2014. Interactive analysis and quality assessment of single-cell copy-number variations. bioRxiv, 011346. Grün, D., Kester, L., van Oudenaarden, A., 2014. Validation of noise models for singlecell transcriptomics. Nat. Methods 11 (6), 637–640. Langmead, B., Trapnell, C., Pop, M., Salzberg, S., 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10 (3), R25. Lee, C., Iafrate, A.J., Brothman, A.R., 2007. Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat. Genet. 39 (7), S48–S54. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., et al., 2009. The sequence alignment/map format and samtools. Bioinformatics 25 (16), 2078–2079. Metzker, M.L., 2010. Sequencing technologies the next generation. Nat. Rev. Genet. 11 (1), 31–46. Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., Cook, K., Stepansky, A., Levy, D., Esposito, D., Muthuswanmy, L., Kransnitz, A., McCombie, W., Hicks, J., Wigler, M., 2011. Tumour evolution inferred by single-cell sequencing. Nature 472 (7341), 90–94. Ng, M.K., Weiss, P., Yuan, X., 2010. Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods. SIAM J. Sci. Comput. 32 (5), 2710–2736. Olshen, A.B., Venkatraman, E., Lucito, R., Wigler, M., 2004. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 (4), 557–572. Pinkel, D., Albertson, D.G., 2005. Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 37, S11–S17. Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C., Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J., Leotta, A., Pai, D., Zhang, R., Lee, Y., Hicks, J., Spence, S., Lee, A., Puura, K., Lehtimaki, T., Ledbetter, D., Gregersen, P., Bregman, J., Sutcliffe, J., Jobanputra, V., Chung, W., Warburton, D., King, M., Skuse, D., Geschwind, D., Gilliam, T., Ye, K., Wigler, M., 2007. Strong association of de novo copy number mutations with autism. Science 316 (5823), 445–449. Spits, C., Le Caignec, C., De Rycke, M., Van Haute, L., Van Steirteghem, A., Liebaers, I., Sermon, K., 2006. Whole-genome multiple displacement amplification from single cells. Nat. Protoc. 1 (4), 1965–1970. Stefansson, H., Rujescu, D., Cichon, S., Pietilä inen, O.P., Ingason, A., Steinberg, S., Fossdal, R., Sigurdsson, E., Sigmundsson, T., Buizer-Voskamp, J.E., et al., 2008. Large recurrent microdeletions associated with schizophrenia. Nature 455 (7210), 232–236. Zhang, Y., Xu, X., Cai, H., Yung, S., Wong, S.T.,2007. A new nonlinear diffusion method to improve image quality. In: IEEE International Conference on Image Processing, 2007. ICIP 2007, vol. 1. IEEE, pp. 1–329. Zhang, Z., Lange, K., Ophoff, R., Sabatti, C., 2010. Reconstructing DNA copy number by penalized estimation and imputation. Ann. Appl. Stat. 4 (4), 1749. Zhang, Z., Lange, K., Sabatti, C., 2012. Reconstructing DNA copy number by joint segmentation of multiple sequences. BMC Bioinform. 13 (1), 205. Zhou, X., Yang, C., Wan, X., Zhao, H., Yu, W., 2013. Multisample aCGH data analysis via total variation and spectral regularization. IEEE/ACM Trans. Comput. Biol. Bioinform. 10 (1), 230–235. Zong, C., Lu, S., Chapman, A.R., Xie, X.S., 2012. Genome-wide detection of singlenucleotide and copy-number variations of a single human cell. Science 338 (6114), 1622–1626.

Please cite this article in press as: Xu, B., et al., Copy number variants calling for single cell sequencing data by multi-constrained optimization. Comput. Biol. Chem. (2016), http://dx.doi.org/10.1016/j.compbiolchem.2016.02.007

Copy number variants calling for single cell sequencing data by multi-constrained optimization.

Variations in DNA copy number carry important information on genome evolution and regulation of DNA replication in cancer cells. The rapid development...
1MB Sizes 0 Downloads 9 Views