METHODS

An Evaluation of Copy Number Variation Detection Tools from Whole-Exome Sequencing Data

Renjie Tan,1,2 Yadong Wang,1∗ Sarah E. Kleinstein,2,3 Yongzhuang Liu,1,2 Xiaolin Zhu,2 Hongzhe Guo,1,2 Qinghua Jiang,1 Andrew S. Allen,2,4 and Mingfu Zhu2,5∗

1 Center for Biomedical Informatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China; 2 Center for Human Genome Variation, Duke University School of Medicine, Durham, North Carolina; 3 Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, North Carolina; 4 Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina; 5 Tute Genomics, Provo, Utah

Communicated by Anthony J. Brookes Received 7 November 2013; accepted revised manuscript 21 February 2014. Published online 5 March 2014 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22537

ABSTRACT: Copy number variation (CNV) has been found to play an important role in human disease. Next-generation sequencing technology, including whole-genome sequencing (WGS) and whole-exome sequencing (WES), has become a primary strategy for studying the genetic basis of human disease. Several CNV calling tools have recently been developed on the basis of WES data. However, the comparative performance of these tools using real data remains unclear. An objective evaluation study of these tools in practical research situations would be beneficial. Here, we evaluated four well-known WES-based CNV detection tools (XHMM, CoNIFER, ExomeDepth, and CONTRA) using real data generated in house. After evaluation using six metrics, we found that the sensitive and accurate detection of CNVs in WES data remains challenging despite the many algorithms available. Each algorithm has its own strengths and weaknesses. None of the exome-based CNV calling methods performed well in all situations; in particular, compared with CNVs identified from high-coverage WGS data from the same samples, all tools suffered from limited power. Our evaluation provides a comprehensive and objective comparison of several well-known detection tools designed for WES data, which will assist researchers in choosing the most suitable tools for their research needs. © 2014 Wiley Periodicals, Inc. Hum Mutat 00:1–9, 2014.

KEY WORDS: copy number variation; whole-exome sequencing; whole-genome sequencing; evaluation studies

Additional Supporting Information may be found in the online version of this article.

∗Correspondence to: Yadong Wang, Center for Biomedical Informatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China. E-mail: [email protected]; Mingfu Zhu, Center for Human Genome Variation, Duke University School of Medicine, Durham, NC. E-mail: mingfu@tutegenomics.com

Contract grant sponsors: The Natural Science Foundation of China (61102149 and 61173085); The China Scholarship Council.

Introduction

Copy number variation (CNV) is an important form of genetic variation in the human genome [Iafrate et al., 2004; Sebat et al., 2004; Conrad et al., 2010]. Many studies have shown that CNV is at least partially responsible for human evolution, phenotypic diversity between individuals, and a rapidly increasing number of human diseases, including autism, schizophrenia, and obesity, among others [Cook and Scherer, 2008; Stefansson et al., 2008; Glessner et al., 2009; Bochukova et al., 2010; Heinzen et al., 2010; Pinto et al., 2010]. In the past, researchers used several approaches to discover copy number variants (CNVs), including fluorescence in situ hybridization, comparative genomic hybridization (CGH), array comparative genomic hybridization (aCGH), and SNP arrays. These traditional methods could detect structural variants ranging from one kilobase (kb) to several megabases in size [Stankiewicz and Lupski, 2010]. In recent years, next-generation sequencing (NGS) technology has been widely employed to detect CNVs in large-scale genetic research studies [Medvedev et al., 2009; Mills et al., 2011]. Researchers have gained the capability to routinely identify small variants as short as 50 bp [Alkan et al., 2011]. In most cases, CNVs are identified from whole-genome sequencing (WGS) data. Although the cost of WGS is plummeting, it is still generally considered too expensive for research involving large cohorts. Whole-exome sequencing (WES) is becoming widely adopted as an alternative, cost-effective strategy [Ng et al., 2009]. Compared with WGS, WES focuses on protein-coding regions (“exomes”) or customer-defined target regions, which encompass only about 1% of the entire genome. To date, many advanced methods have been developed for CNV discovery from WGS data [Chen et al., 2009; Abyzov et al., 2011; Handsaker et al., 2011; Zhu et al., 2012].
There are five main methods for detecting CNVs with paired-end NGS data: (1) read depth (RD), (2) paired-end mapping (PEM), (3) split read (SR), (4) assembly (AS), and (5) a combination of the above [Alkan et al., 2011; Xi et al., 2012]. Although PEM-based methods are effective at calling deletions and SR methods offer higher resolution for breakpoints, these methods depend on paired-end reads spanning the whole CNV region or a read mapped across a CNV breakpoint for CNV detection. Because the size of most target regions (exon plus flanking regions, hereafter referred to as “exons”) is small, typically around 100–300 bp, discovering and genotyping CNVs from WES data is more challenging than from WGS data. As the noncontiguous nature of exons means that most paired-end reads from WES data map to a single exon, existing WGS-based CNV calling methods cannot be successfully applied to WES data. So far, only the RD approach has been successfully carried over from WGS-based to WES-based CNV calling methods. RD-based methods for WES data first count the sequenced reads aligned to each base of an exon target, and then average these per-base values over the target to obtain a raw RD signal that can be used for further analysis [Fromer et al., 2012]. This approach makes the straightforward assumption that the RD signal is proportional to the number of copies of the chromosomal segment. However, the systematic noise associated with WES data complicates the situation, making the sensitive and specific identification of CNVs from WES data extremely difficult.

To date, fewer CNV identification methods are available for WES data than for WGS data (Table 1). These methods differ on almost every level, including statistical model, implementation of algorithms, programming language, operating system, and I/O format. Some methods are specific for detecting either germline or somatic structural variation. Recently, a review study provided a comprehensive analysis focusing on methods of detecting somatic CNVs from NGS data [Liu et al., 2013]. Here, we restricted our evaluations to germline structural variation, and thus we excluded tools that were specific for inferring somatic CNVs. Four of the latest and most frequently used tools were selected for evaluation: XHMM [Fromer et al., 2012], CoNIFER [Krumm et al., 2012], ExomeDepth [Plagnol et al., 2012], and CONTRA [Li et al., 2012]. Although some reports have shown that optimized parameters contribute to a better result [Krumm et al., 2013; Poultney et al., 2013], it is still difficult to balance the tradeoff between sensitivity and specificity. Because of the differences between methods, we evaluated all methods using default parameter settings in order to provide an unbiased evaluation of how the methods performed in comparison with one another.

XHMM uses principal-component analysis (PCA) to reduce the noise in the raw RD signal for each exon and employs a hidden Markov model (HMM) to identify CNVs at the exon level. An important advantage of XHMM is the breakpoint quality score, which provides useful information for further refining the called CNVs and facilitates more informed downstream analysis. CoNIFER uses a singular value decomposition (SVD) method to correct systematic biases and makes a CNV call if the corrected signal reaches a predefined threshold at no fewer than three consecutive exons. ExomeDepth uses a beta-binomial model to fit the read count data and build a reference set (baseline). ExomeDepth then generates a likelihood value under three different copy number states (deletion, diploid, duplication) for each exon, and finally employs an HMM to combine the likelihoods across multiple exons. CONTRA uses base-level log-ratios to remove GC-content bias, and then estimates the log-ratio variations via binning and interpolation.

In this article, we comprehensively evaluate the performance of WES-based CNV calling tools across six metrics: (1) CNV size and distribution, (2) CNV concordance with WGS data, (3) common CNV discovery by tagSNPs, (4) CNV Mendelian error rates, (5) a heterozygosity check for deletions, and (6) concordance across CNV-calling algorithms. Our study shows the strengths and limitations of each algorithm, which will help researchers to make informed choices as to which tools are the most suitable for their own projects.

Table 1. CNV Calling Methods for WES Data

| Tools | References | Methodology characteristics | Output format | Input format | Control/control set required | Language | OS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XHMM | Fromer et al. (2012) | PCA, HMM | VCF, tab-delimited | GATK’s Depth-of-Coverage file | No | C++, R | Linux, Mac OS |
| CoNIFER | Krumm et al. (2012) | SVD | Tab-delimited | BAM/RPKM | No | Python | Linux, Mac OS |
| ExomeDepth | Plagnol et al. (2012) | Beta-binomial model, HMM | CSV | BAM | Yes | R | Linux, Mac OS, Windows |
| CONTRA | Li et al. (2012) | Base-level log-ratio | VCF, tab-delimited | BAM, SAM | Yes | Python, R | Linux, Mac OS |
| FishingCNV | Shi and Majewski (2013) | PCA, CBS | Seg (open in IGV), seg.p (tab-delimited), PDF | RPKM/BAM/coverage file (tab-delimited) | Yes | Java, R | Linux, Mac OS, Windows |
| Control-FREEC | Boeva et al. (2011); Boeva et al. (2012) | GC-content normalization, LASSO, GMM | Regions of gains, losses and LOH; copy number and BAF profiles | SAM, BAM, SAMtools pileup, Eland, BED, SOAP, arachne, psl (BLAT) and Bowtie formats | Both | C++ | Linux, Windows for low version |
| ExomeCNV | Sathirapongsasuti et al. (2011) | Log coverage ratio, CBS | TXT, PNG | BAM/pileup/GTF | Yes | R | Linux, Mac OS, Windows |
| CONDEX | Arthi (2011) | HMM | Tab-delimited | BAM/pileup | Yes | Java | Linux, Mac OS, Windows |
| SeqGene | Deng (2011) | CBS | Seg (tab-delimited) | SAM/pileup/wig/RPKM | Yes | Python, R | Linux, Mac OS, Windows |
| OptimalCaptureSegmentation | Rigaill et al. (2012) | Regression approach | R environment variable | R environment variable | Yes | R, C | Linux, Mac OS, Windows |
| VarScan2 | Koboldt et al. (2012) | CMDS, CBS | Tab-delimited | BAM/pileup | Yes | Java | Linux, Mac OS, Windows |
| ExoCNVTest | Coin et al. (2012) | PCA, cnvHap, HMM | R environment variable | (gzipped) tab-delimited or space-delimited | Yes | Java, R | Linux, Mac OS, Windows |
| EXCAVATOR | Magi et al. (2013) | HSLM, FastCall | TXT, PDF | BAM | Yes | Perl, R, Fortran | Linux, Mac OS |
| cn.MOPS | Klambauer et al. (2012) | Bayesian, Poisson distribution | R environment variable | BAM/read count matrices (R environment variable) | No | R | Linux, Mac OS, Windows |

Materials and Methods

Samples and Ethics Statement

In total, 33 exome samples were sequenced, including nine trios (27 individuals). Of the 33 samples, 13 were also sequenced using whole-genome technology. All samples were collected under the requirements and with the approval of both local and multiregional academic ethics committees.

Table 2. A Comparison of the Number of Identified CNVs in Unrelated Samples and the Median CNV Size in Different Size Ranges Between the Four Exome CNV Calling Tools: XHMM, CoNIFER, ExomeDepth, and CONTRA

WES/WGS and CNV Detection

All exome samples were captured using the Agilent SureSelect Human All Exon 50Mb Kit (Agilent Technologies, Santa Clara, CA), and raw sequencing data (FASTQ format) were produced on the Illumina HiSeq2000 platform (Illumina, Inc., San Diego, CA) at the Center for Human Genome Variation, Duke University. To keep the comparison impartial, all paired-end reads were aligned to the Human Reference Genome (NCBI build 37, hg19) with the Burrows-Wheeler Alignment tool (BWA) v0.5.10 [Li and Durbin, 2009]. As a quality-control step, potential PCR duplicates were removed with Picard v1.59. WES data (mean coverage of 73× overall) were analyzed with four published CNV calling tools: (1) XHMM (eXome-Hidden Markov Model) v1.0 [Fromer et al., 2012], (2) CoNIFER (Copy Number Inference From Exome Reads) v0.2.2 [Krumm et al., 2012], (3) ExomeDepth v0.9.7 [Plagnol et al., 2012], and (4) CONTRA (Copy Number Analysis for Targeted Resequencing) v2.0.4 [Li et al., 2012]. The default software parameters were used for all tools. XHMM and CoNIFER used a pooled-sample calling approach with all 33 samples as the input, whereas ExomeDepth and CONTRA called CNVs sample by sample after generating a baseline or optimized reference set. CNVnator v0.2.7 [Abyzov et al., 2011] and ERDS (Estimation by Read Depth with SNVs) v1.1 [Zhu et al., 2012] were used with default parameters to infer CNVs in the 13 high-coverage (38×) WGS samples.
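The pooled-sample calling approach relies on comparing each exon's read depth across all samples at once. As a rough illustration only, the Python sketch below removes the strongest shared component from a samples × exons matrix of mean read depths with an SVD and then reports runs of at least three consecutive exons beyond a threshold, in the spirit of the SVD-and-threshold strategy attributed to CoNIFER. It is not the implementation of any of the evaluated tools: the function names `svd_normalize` and `call_runs`, the number of removed components, and the threshold value of 1.5 are assumptions chosen for the example.

```python
import numpy as np


def svd_normalize(depth, n_remove=1):
    """Z-score each exon across samples, then drop the top singular components.

    depth: 2-D array, rows = samples, columns = exons (mean read depth).
    n_remove: number of leading components treated as systematic noise
              (an assumption; real tools choose this from the data).
    """
    depth = np.asarray(depth, dtype=float)
    std = depth.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant exons
    z = (depth - depth.mean(axis=0)) / std
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0  # remove the strongest shared (batch-like) signal
    return (u * s) @ vt


def call_runs(signal, threshold=1.5, min_exons=3):
    """Find runs of >= min_exons consecutive exons beyond +/- threshold.

    signal: 1-D normalized signal for one sample, ordered by exon position.
    Returns a sorted list of (start, end, kind) with end exclusive and
    kind either "dup" or "del". The threshold is a hypothetical value.
    """
    signal = np.asarray(signal, dtype=float)
    calls = []
    masks = (("dup", signal >= threshold), ("del", signal <= -threshold))
    for kind, mask in masks:
        start = None
        # A trailing False closes any run that reaches the last exon.
        for i, hit in enumerate(np.append(mask, False)):
            if hit and start is None:
                start = i
            elif not hit and start is not None:
                if i - start >= min_exons:
                    calls.append((start, i, kind))
                start = None
    return sorted(calls)
```

With a depth matrix `depth` for all pooled samples, candidate calls for the first sample would be obtained as `call_runs(svd_normalize(depth)[0])`. Requiring several consecutive exons trades sensitivity for specificity, which is consistent with the limited power for small CNVs discussed in this evaluation.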

