Expert Review of Molecular Diagnostics Downloaded from informahealthcare.com by Nyu Medical Center on 06/22/15 For personal use only.

Review

Exome sequencing and whole genome sequencing for the detection of copy number variation

Expert Rev. Mol. Diagn. Early online, 1–10 (2015)

Jayne Y Hehir-Kwa*1, Rolph Pfundt1,2, Joris A Veltman1–3

1 Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
2 Radboud Institute for Molecular Life Sciences and Donders Centre for Neuroscience, Nijmegen, The Netherlands
3 Department of Clinical Genetics, Maastricht University Medical Centre, Maastricht, The Netherlands
*Author for correspondence: Tel.: +31 024 361 3864; Fax: +31 024 366 8752; [email protected]


Many laboratories now use genomic microarrays as their first-tier diagnostic test for copy number variation (CNV) detection. In addition, whole exome sequencing is increasingly being offered as a diagnostic test for heterogeneous disorders. Although mostly used for the detection of point mutations and small insertion–deletions, exome sequencing can also be used to call CNVs, allowing combined small and large variant analysis. In addition to these advantages, whole genome sequencing offers the potential to characterize CNVs to unprecedented levels of accuracy, providing position and orientation information. In this review, we discuss the clinical potential of CNV identification in whole exome sequencing and whole genome sequencing data and the implications this has for diagnostic laboratories.

KEYWORDS: clinical sequencing • copy number variation • next-generation sequencing • structural variation

Copy number variants (CNVs) are an important and abundant source of both normal and pathogenic variation in the human genome. Within a single genome, CNVs are estimated to result in a 1.2% difference from the reference human genome [1]. Furthermore, CNVs can have a relatively high locus-specific formation rate of up to 1 in 7000 newborns [2]. Clinically relevant, dominant de novo CNVs have been identified in 10–20% of individuals with intellectual disability and/or multiple congenital anomalies [3–6]. Likewise, pathogenic CNVs have been associated with numerous common [7–10] and rare [11–14] diseases including autism, schizophrenia, congenital heart abnormalities and specific forms of hearing loss. In the clinic, CNV detection is limited to the use of genomic microarrays as first-tier diagnostics for neurodevelopmental disorders and congenital abnormalities [15–17]. The current generation of genomic microarrays contains hundreds of thousands to millions of probes that are able to detect CNVs ranging in size from 1 to 10 kilobases (kb) up to several megabases [7]. Many smaller pathogenic (intragenic) CNVs are known to occur but often remain beyond the detection limit of most clinical genomic microarray analyses [18,19].

10.1586/14737159.2015.1053467

While genomic microarrays have specifically revolutionized the detection of CNVs throughout the genome, the development of next-generation sequencing (NGS) technologies has dramatically improved our capability to detect all types of genomic variation, from single nucleotide variations and small insertion-deletions through to CNVs and other forms of structural variation (SV) [20,21]. NGS is most powerful for the detection of all types of variation when it is applied to sequence entire genomes. A popular cost-effective NGS approach is whole exome sequencing (WES), in which the exons that represent the 1–2% of the human genome that is coding are first captured and then sequenced by NGS [22,23]. Since the introduction of NGS, the discovery rate of small deletions and duplications has dramatically increased [24,25]. As a consequence, the operational definition of CNVs has widened to include much smaller events (i.e., all deletions and duplications >50 bp in size) [26,27]. Large-scale community efforts, including the 1000 Genomes Project [25,28], the Genome Denmark project [29] and the Genome of the Netherlands [24,30] have begun to investigate the extent of heritable structural variants,

© 2015 Informa UK Ltd

ISSN 1473-7159


[Figure 1 image not reproduced. Labels recovered from the panels: breakpoint accuracy; CNV size; assembly; split read; discordant read pairs; depth of coverage; exome sequencing; whole genome sequencing.]

Figure 1. Sequencing and detection strategies for CNV identification in NGS data. (A) Four different approaches are used for CNV detection in NGS data. Depth of coverage methods are predominantly used in WES. Generally each approach specializes in detecting a specific form or size range of CNV and results in a trade-off in breakpoint accuracy. (B) Depth of coverage detection of CNVs in WES and WGS differs, in that in WES data the coverage of the exons is determined and in WGS the genome is divided into windows for which the coverage is calculated.

including CNVs in hundreds of unaffected individuals using NGS [27,31,32]. The clinical implementation of CNV detection in WES and whole genome sequencing (WGS) holds the potential to drastically reduce the number of genomic assays required per patient to reach a diagnosis, and to analyze the combined effects of single nucleotide variants (SNVs) and CNVs within an individual using a single test [33,34]. Despite reports that clinically relevant CNVs can be detected in NGS data, the clinical implementation of CNV detection remains limited [35,36] and many studies still employ microarrays as the first-tier test for CNV detection in combination with WES analysis for mutation detection [37,38].

Identifying CNVs in NGS data is still in its infancy, with no currently accepted standard protocols or quality control measures [39]. The accurate identification of CNVs requires investigation of three genomic features: copy number, content and positional information [40]. The detection of events is largely influenced by the read lengths of the NGS technology, the raw accuracy of base calls and the local sequence context. The direct, reference-free assembly of an individual’s complete genome from raw sequence read data is theoretically the best method for CNV identification. However, both simple and long tandem repeats make this difficult [41–43]; instead, the reads must first be mapped to the reference genome. Furthermore, the accuracy of mapping in repetitive regions is influenced by the length of the sequence reads. For example, sequencing with a read length >5–10 kb makes it possible to span repetitive regions, and inversions occurring at SV breakpoints, but such technology currently remains beyond the reach of clinical laboratories [42,43]. If reference-free assembly were possible, the identification of CNVs and other forms of SV could occur via direct comparison between genomes.
At present, this approach is not feasible; instead SV detection relies on the identification of distinct patterns in sequence reads occurring as a result of an event [44]. Many different detection algorithms exist, each of

which falls into one of four broad approaches, namely depth of coverage, discordant sequence read pair, split-read and assembly-based methods [26]. The different detection approaches are usually complementary and specialize in detecting specific types or size ranges of structural variants, and vary in breakpoint accuracy (FIGURE 1A). Depth of coverage (DOC) methods are more suitable for detecting larger events and more tolerant of repetitive regions, but lack breakpoint accuracy. While assembly-based methods cannot be applied genome wide, local assembly can be used to achieve base pair breakpoint accuracy of events, as well as to identify novel inserted sequence. Split-read methods have high breakpoint accuracy, but are only suitable for detecting small CNVs. Finally, discordant read pair methods rely on the insert size of the read pairs to detect events, and are limited to the detection of small CNVs for this reason. Read pair methods can, however, be used to detect inversions and translocations by using information on the orientation in which the read pairs are mapped to the reference [45]. More recently, several algorithms have implemented a hybrid strategy combining several of these approaches [46–48]. All four approaches are employed for CNV detection in WGS data, whereas DOC approaches are most suitable for CNV detection in WES data. Still, there are many challenges associated with accurate identification of CNVs regardless of the NGS method and algorithm used [39]. We discuss the clinical implications of CNV detection in both WES and WGS, as well as its current limitations.

Depth of coverage

DOC algorithms are most often used for CNV detection in WES data, because they do not require continuous data or rely on detecting the breakpoints of a CNV event. DOC methods are based on the assumption that when sequencing is uniform, the number of reads mapping to a region is proportional to the number of times the region appears in the genome.


Therefore, if a region is deleted or duplicated in a sample, the NGS data from this sample will have either fewer or more reads mapping to this region in comparison to NGS data from reference samples. Sequencing, however, is not always uniform, and samples are often normalized against large reference pools containing hundreds of samples to compensate for variability in coverage [49]. Similar to reference pools used for single dye microarrays, these pools are created in silico using available datasets. The reference pools are most effective when they contain samples run within a similar time frame on the same NGS equipment as the test samples. The coverage of WES is also influenced by systematic biases introduced during the exon capture steps, resulting, for example, in a guanine and cytosine (GC) nucleotide bias [20]. Both WGS and WES are affected by regions with poor mappability due to GC-rich content, pseudogenes and segmental duplications. Normalization is used to correct for systematic biases but may still result in false-positive CNV calls, for example, because of incorrect or ambiguous mapping of sequence reads to the reference genome. DOC methods provide a relative measure of gain or loss, which, depending on the underlying statistics used by the algorithm, can make accurate genotyping difficult; they also cannot provide information on the location and orientation of duplicated sequence. Only paired-sample DOC methods, more commonly used for tumor-normal samples, exist for detecting mosaic CNVs [50]. It is estimated that mosaic CNVs affect close to 1% of patients with developmental disorders [51]. Despite these limitations, DOC methods are effective for identifying large events and are relatively tolerant of repetitive regions and systematic biases [44].

Breakpoint resolution
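As a minimal sketch of how a DOC call translates into breakpoint uncertainty, the Python example below compares per-target read counts of a test sample against an in silico reference pool, calls runs of consecutive targets with a consistent loss or gain, and reports the interval in which each breakpoint may lie. It is an illustration under stated assumptions, not the method of any cited algorithm: the function names, the median scaling, the log2 thresholds, the minimum of three targets and all counts and coordinates are hypothetical.

```python
import math
from statistics import median

def log2_ratios(test_counts, reference_pool, eps=0.01):
    """Per-target log2 ratio of a test sample against a reference pool."""
    def scale(counts):
        # Divide by the sample median so that samples sequenced to
        # different depths become comparable (assumes a non-zero median).
        m = median(counts)
        return [c / m for c in counts]

    test = scale(test_counts)
    pool = [scale(sample) for sample in reference_pool]
    ratios = []
    for i, t in enumerate(test):
        ref = median(sample[i] for sample in pool)
        # eps keeps the logarithm defined for targets with zero coverage.
        ratios.append(math.log2((t + eps) / (ref + eps)))
    return ratios

def call_runs(ratios, loss=-0.5, gain=0.4, min_targets=3):
    """Return (first, last, state) for runs of >= min_targets targets."""
    states = ["loss" if r <= loss else "gain" if r >= gain else None
              for r in ratios]
    runs, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            if states[start] is not None and i - start >= min_targets:
                runs.append((start, i - 1, states[start]))
            start = i
    return runs

def breakpoint_intervals(targets, first, last):
    """Conservative intervals containing the left and right breakpoints.

    targets: sorted (start, end) data points on one chromosome, either
    exon targets (WES) or fixed-size windows (WGS). Each breakpoint is
    bounded by the flanking unaffected data point and the outermost
    affected one; None marks a run that reaches the first or last target.
    """
    left_lo = targets[first - 1][0] if first > 0 else None
    right_hi = targets[last + 1][1] if last + 1 < len(targets) else None
    return (left_lo, targets[first][1]), (targets[last][0], right_hi)

# Hypothetical exon targets and read counts (ten targets, one chromosome).
exon_targets = [(1000, 1200), (3000, 3150), (8000, 8300), (15000, 15100),
                (40000, 40200), (42000, 42150), (47000, 47250),
                (90000, 90100), (93000, 93200), (99000, 99150)]
reference_pool = [
    [100, 104, 98, 101, 99, 102, 97, 103, 100, 96],
    [200, 210, 195, 205, 198, 202, 190, 208, 199, 201],
    [50, 52, 49, 51, 50, 48, 51, 53, 50, 49],
]
# Test sample with roughly half the expected reads on targets 4 to 6.
test_sample = [150, 155, 148, 152, 75, 77, 73, 154, 151, 147]

ratios = log2_ratios(test_sample, reference_pool)
for first, last, state in call_runs(ratios):
    print(state, breakpoint_intervals(exon_targets, first, last))
```

With exon targets as data points, the breakpoint intervals span the full distance to the neighbouring exons, so the size of the called event is only bounded, not measured. Passing contiguous fixed-size windows to the same function instead yields intervals of two window lengths.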

DOC methods for WGS generally partition the genome into windows of fixed size, whereas WES uses exon targets as the base unit for windows, resulting in an unequal spacing of the datapoints (FIGURE 1B). The breakpoint accuracy of DOC methods is determined either by the size of the windows in the case of WGS, or by the spacing of exons in WES. Furthermore, the breakpoint accuracy for WES is limited as the majority of CNV breakpoints will lie outside the exons [52–54]. A recent WGS study of 130 breakpoints associated with genomic duplications identified using a hybrid detection strategy found that most (83%) duplications were located in tandem in direct orientation. The remainder were triplications embedded within duplications (8.4%), adjacent duplications (4.2%), distal duplications (2.5%) or other complex rearrangements (1.7%). Moreover, six in-frame fusion genes were predicted [55]. For clinical applications, breakpoint resolution is arguably not required for deletion events and can be determined, when needed, by fine mapping with a separate technique. Moreover, recent developments in expanding exome target kits with baits for non-coding regions and analysis tools/approaches that include ‘off-target’ reads in exome data may partly overcome issues with breakpoint inaccuracy [56–58]. Breakpoint fine mapping is required to identify and determine


the exact breakpoints of gene-fusion events, the role of which in congenital disorders is yet to be determined [59].

Sensitive and specific DOC-based detection of CNVs containing three or more exons is possible in WES data [35,49]. Although this is in itself very powerful, it means that deletions or duplications affecting fewer than three exons are generally not detected. Given the size frequency distribution of CNVs [24], it can be deduced that many potentially pathogenic single exon CNVs remain undiscovered. It should be noted that what may appear as a small exonic deletion in WES data may in practice encompass a large intronic sequence, making it easier to detect by WGS. Additional CNV detection strategies in WGS data such as discordant read pair, split read and de novo assembly methods make it possible to detect the full size spectrum of CNVs. In combination with local read realignment, these approaches can provide nucleotide resolution of breakpoints of CNVs in WGS data.

As an example, WGS detected a partial duplication of TENM3 on chromosome 4 in a patient with severe intellectual disability. The duplication of TENM3 itself was not considered to explain the patient’s phenotype (FIGURE 2A). Detailed analysis of these WGS data using discordant read pairs and local realignment revealed, however, that this duplicated sequence was inverted and inserted into IQSEC2 on chromosome X. The disruption of IQSEC2 as revealed by the duplication’s position information does explain the patient’s phenotype (FIGURE 2B & C). This example illustrates that WGS can be used to identify previously missed small CNVs, determine the position and orientation of duplicated sequence and provide information about the mechanisms by which SVs are formed through exact breakpoint detection [60]. This makes WGS the ultimate genetic test by combining the detection of all types of genomic variations and offering the possibility to interpret these variants within their actual genomic context.
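The read-pair signatures on which discordant read pair methods rely can be sketched as follows. This is a simplified Python illustration, assuming a paired-end library in which concordant mates map to the same chromosome in forward-reverse orientation at roughly the expected insert size; the class, function, thresholds and coordinates are hypothetical, and the span between leftmost mapping positions is used only as a rough proxy for insert size. It is not the implementation of any cited tool.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int        # leftmost mapped position of mate 1
    forward1: bool   # True if mate 1 maps to the forward strand
    chrom2: str
    pos2: int        # leftmost mapped position of mate 2
    forward2: bool   # True if mate 2 maps to the forward strand

def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=3):
    """Return the structural variant signature suggested by one read pair."""
    if pair.chrom1 != pair.chrom2:
        # Mates on different chromosomes: translocation or insertion of
        # sequence from another chromosome.
        return "interchromosomal"
    # Order the two mates by genomic position.
    (left_pos, left_fwd), (right_pos, right_fwd) = sorted(
        [(pair.pos1, pair.forward1), (pair.pos2, pair.forward2)])
    if left_fwd == right_fwd:
        # Both mates on the same strand: one mate lies in inverted sequence.
        return "inversion"
    if not left_fwd and right_fwd:
        # Mates point away from each other: consistent with a tandem duplication.
        return "tandem_duplication"
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"    # mates map further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"   # mates map closer together than expected
    return "concordant"

# Hypothetical read pairs; the coordinates are invented for illustration.
pairs = [
    ReadPair("chr4", 1000, True, "chr4", 1400, False),
    ReadPair("chr4", 1000, True, "chr4", 6000, False),
    ReadPair("chr4", 1000, True, "chr4", 1400, True),
    ReadPair("chr4", 1000, True, "chrX", 5000, False),
]
for p in pairs:
    print(classify_pair(p))
```

In practice a single discordant pair is weak evidence; callers typically require a cluster of pairs with the same signature before reporting an event. A cluster of interchromosomal pairs flanking a duplicated segment is the kind of position information that, in the example above, placed the duplicated TENM3 sequence within IQSEC2.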
CNV detection using WES & WGS

Recent large studies using WES in cohorts of developmental delay [61] and autism spectrum disorder [62,63] have reported multiple de novo SNVs but relatively few CNVs. The depletion of pathogenic CNVs in these studies could be a result of patient selection and prior CNV screening with microarrays, but also likely reflects a lack of power to detect the full size spectrum of CNVs in WES data. A recent large-scale study in patients with developmental delay used a combined approach of microarray and WES for CNV detection [36,61]. This study reported a significant burden of de novo CNVs in 1133 children with developmental disorders. Although most patients had clinical microarray testing prior to the study, a diagnostic CNV yield of 7% was obtained [36,61,64]. A similar diagnostic yield was reported in a large cohort of 2500 individuals with autism spectrum disorder [63]. In contrast, a higher diagnostic CNV yield (18%) was reported in a small cohort of 50 individuals with ID studied by WGS, reflecting the broader spectrum of events detectable [33]. In this last study, three pathogenic deletions were
