ORIGINAL ARTICLE

hg19K: addressing a significant lacuna in hg19-based variant calling

Savita Karthikeyan(a), Pushpinder S. Bawa(a,b) & Subhashini Srinivasan

Institute of Bioinformatics and Applied Biotechnology, Biotech Park, Electronic City Phase I, Bangalore 560100, India

Keywords
1000 genomes project, alternate reference, hg19, major allele, minor allele

Correspondence
Subhashini Srinivasan, Institute of Bioinformatics and Applied Biotechnology, Biotech Park, Electronics City Phase I, Bangalore 560100, India. Tel: +9180 285 289 00, +9180 285 289 01; Fax: +9180 285 289 04; E-mail: [email protected]

Funding Information
This work was supported by Department of Biotechnology (Grant number: BTPR12422/MED/31/287/2014).

Received: 1 June 2016; Revised: 29 August 2016; Accepted: 5 September 2016

Molecular Genetics & Genomic Medicine 2017; 5(1): 15–20

doi: 10.1002/mgg3.251

(a) Both these authors have contributed equally to the project.
(b) This author is a registered graduate student at Manipal University, Manipal, Karnataka, India.

Abstract

Background
The hg19 assembly of the human genome is the most heavily annotated and most commonly used reference for making variant calls in individual genomes. Based on the phase 3 report of the 1000 genomes project (1000G), it is now well known that many positions in the hg19 genome represent minor alleles. Since commonly used variant calling methods were developed under the assumption that the hg19 reference harbors major alleles at all ~3 billion positions, these methods mask the calls whenever an individual is homozygous for the minor allele at the respective positions. Hence, it is important to address the extent and impact of these minor alleles in hg19 from the point of view of individual genomes.

Method
We have created a reference genome, hg19K, in which all the positions in the hg19 reference harboring a minor allele were replaced by the major allele from the phase 3 report of the 1000 genomes project. The genomes of five individuals, downloaded from the public repository, were analyzed using both hg19 and hg19K, and the results were compared.

Results
Out of the 81 million SNPs in the phase 3 report from the 1000 genomes project, the alternate allele was found to be the major allele relative to hg19 at 1.9 million positions, many with an allele frequency of >0.9. We observed that ~30% of the SNVs found in individual genomes are confined to these 1.9 million positions. Also, ~8% of the SNVs predicted using the hg19K-based approach are unique to it, and these are also confined to the 1.9 million positions.

Conclusion
We report that the presence of minor alleles in hg19 alone results in ~8% false negatives and ~30% false positives during variant calling. Also, among the variant calls unique to the hg19K-based method, which are missed by hg19-based prediction in individuals homozygous for the minor alleles, some are deleterious missense mutations at sites conserved across diverse species.

Introduction

The hg19 version of the human genome assembly is the most widely used and most heavily annotated human genome to date, and it continues to serve as a strong foundation for deciphering the role of gene expression, genetic variation, disease predisposition, and population diversity. The hg19 assembly is now routinely used in profiling variants from genomes of individuals for use in forensics, diagnostics, genetic disorders, and disease management. The underlying assumption is that hg19 harbors major alleles at all 3 billion positions. However, as more genomes of individuals of diverse ethnicity are sequenced, as in the 1000 genomes project, it is becoming clear that a significant number of positions on the hg19 reference assembly, which is a mosaic of genomes from six diverse individuals

© 2016 The Authors. Molecular Genetics & Genomic Medicine published by Wiley Periodicals, Inc. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.


(Venter et al. 2001), do not carry the major allele. This scenario is changing as genomes of more individuals are sequenced.

The 1000 genomes project was launched in 2008 to identify and catalog all SNPs in the human population (1000 Genomes Project Consortium et al. 2015). The individuals nominated for sequencing were selected from 26 different geographical locations around the globe, weighted by the size of the populations. By the close of the project in 2015, genomes and exomes of 2692 individuals of diverse ethnicity had been sequenced and analyzed, and the results were made available in public repositories. Not surprisingly, many of the alleles in hg19 turned out to be minor alleles based on the variant calls from the 1000 genomes project.

The concept of major allele reference genomes representing different ethnic populations has been proposed (Dewey et al. 2011). In that work, the authors used the major alleles from the 1000 genomes project to create ethnically concordant reference sequences to improve genotype accuracy. They also showed how ethnicity-based approaches can help interpret genetic variation in the context of disease-risk prediction. However, the impact of minor alleles in hg19 on ongoing efforts to call variants from individual genomes has not yet been addressed.

The major assumption in any variant calling method is that the reference genome harbors major alleles at all three billion positions on the human chromosomes. In other words, positions where an individual genome is homozygous for the allele present in the reference genome are assumed to carry no information about the individual's ethnicity, ancestry, predisposition, or disease condition. Under this assumption, minor alleles in the reference genome will mask reporting of the minor alleles present in individual genomes, resulting in false negatives. The extent of false negatives will depend directly on the number of minor alleles among the three billion positions in the hg19 assembly and on the extent of diversity at these positions within a given genome or population. Here, we address the extent of false discovery in commonly used variant calling approaches.
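The masking effect described above can be illustrated with a toy example. The sketch below is not the variant caller used in this work; it is a minimal stand-in, written under the assumption that a caller simply reports any allele that differs from the reference base. The function name, positions, and genotypes are hypothetical.

```python
def call_variants(reference, genotypes):
    """Report, per position, the alleles in a sample that differ from the reference.

    reference: dict mapping position -> reference base
    genotypes: dict mapping position -> (allele1, allele2) of a diploid sample
    """
    calls = {}
    for pos, alleles in genotypes.items():
        non_ref = sorted(set(alleles) - {reference[pos]})
        if non_ref:
            calls[pos] = non_ref
    return calls


# Hypothetical toy data: at position 100 the population major allele is G,
# but the original reference carries the minor allele A.
original_ref = {100: "A", 200: "C"}
major_allele_ref = {100: "G", 200: "C"}

# A sample homozygous for the minor allele A at position 100.
minor_homozygote = {100: ("A", "A"), 200: ("C", "C")}
# A sample homozygous for the major allele G at position 100.
major_homozygote = {100: ("G", "G"), 200: ("C", "C")}

# Against the original reference the minor allele is masked (a false negative),
# while the common major allele is reported as a variant.
masked = call_variants(original_ref, minor_homozygote)
common = call_variants(original_ref, major_homozygote)

# Against the major-allele reference the situation is reversed.
recovered = call_variants(major_allele_ref, minor_homozygote)
silent = call_variants(major_allele_ref, major_homozygote)
```

In this toy setting, the same sample produces no call at position 100 against one reference and a call against the other, which is the effect quantified in the rest of the paper.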

Materials and Methods

The hg19 reference sequence, as used by the 1000 genomes project, was downloaded directly from the 1000 genomes ftp repository (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/). The variant file from phase 3 of the 1000 Genomes Project was downloaded from the 1000 genomes project website (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). The coordinates in the 1000 Genomes variant call-set that had an overall alternate allele frequency of more than 0.5 were extracted. These positions were changed in hg19 to the major allele reported at these positions by the 1000 genomes project, using GATK's FastaAlternateReferenceMaker tool, to create a new reference called hg19K. Positions that had multiple alternate alleles were discarded.

Whole genome sequences from five individuals in the ERP006077 dataset in the public repository were selected based on similar depth of sequencing. These samples are from prostate tissues of individuals with cancer. Samples were mapped onto both hg19 and hg19K using Bowtie2, duplicates were removed using Picard Tools, and single-nucleotide variants were called using SAMtools mpileup and BCFtools. The variant calls were filtered to retain those with a read depth of at least five and a variant quality of at least 30.

The filtered set of SNPs for each sample was annotated against the reference sequences using the SnpEff software (Cingolani et al. 2012). For annotating against hg19, we used the database available with SnpEff. For annotating the SNVs from hg19K, however, a new SnpEff database was built using the hg19K genome. The hg19K reference and the associated SnpEff database are available upon request to the corresponding author.

To validate the extent of false negatives with hg19 as the reference in another ethnicity and in normal tissue, we repeated the pipeline on genome sequences of similar depth from normal samples of two Indians from a remote corner of Karnataka state. For these samples, we used a modified pipeline in which mapping was done with BWA instead of Bowtie2.
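The substitution step described above can be sketched in a few lines. This is only an illustration of the selection rule (alternate allele frequency above 0.5, multi-allelic sites discarded) and is not the GATK tool used to build hg19K. The function name and the record layout are hypothetical, and restricting the replacement to single-base substitutions is an assumption made for this sketch.

```python
def build_major_allele_reference(sequence, variants, threshold=0.5):
    """Return a copy of `sequence` with minor reference alleles replaced.

    sequence: reference bases of one chromosome as a string
    variants: iterable of (position, ref, alts, afs) records, where position
              is 1-based, alts is a list of alternate alleles, and afs is the
              list of their overall allele frequencies
    """
    bases = list(sequence)
    replaced = []
    for pos, ref, alts, afs in variants:
        if len(alts) != 1:
            continue  # positions with multiple alternate alleles are discarded
        alt, alt_af = alts[0], afs[0]
        if len(ref) != 1 or len(alt) != 1:
            continue  # assumption of this sketch: single-base substitutions only
        if alt_af <= threshold:
            continue  # the reference already carries the major allele
        if bases[pos - 1].upper() != ref.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt
        replaced.append(pos)
    return "".join(bases), replaced


# Hypothetical toy input: a six-base "chromosome" and four variant records.
toy_sequence = "ACGTAC"
toy_variants = [
    (2, "C", ["T"], [0.9]),            # alternate allele is major: replaced
    (4, "T", ["A"], [0.3]),            # alternate allele is minor: kept
    (5, "A", ["C", "G"], [0.4, 0.3]),  # multi-allelic: discarded
    (6, "C", ["G"], [0.5]),            # not more than 0.5: kept
]
new_sequence, replaced_positions = build_major_allele_reference(
    toy_sequence, toy_variants
)
```

The reference-mismatch check is a design choice of the sketch: it guards against applying a call-set to a different assembly or coordinate system than the one it was produced on.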

Results

According to the genomes and exomes of 2692 individuals, sequenced, analyzed, and reported as part of the 1000 genomes project, there are 81 million SNPs compared to the hg19 reference, out of which hg19 harbors minor alleles at 1.9 million positions. Figure 1 shows the allele frequencies and distribution of the 1.9 million variants in the hg19 reference. Figure 1A shows both the overall distribution of allele frequencies at the 81 million positions (red) and the frequency distribution of the 1.9 million minor alleles in hg19 (inset). The number of minor alleles with frequencies below 0.5 in hg19 is uniformly distributed across various frequency levels. For example, the number of minor alleles in hg19 with allele frequencies between 0 and 0.1 is 374607, between 0.1 and 0.2 is 2856603, between 0.2 and 0.3 is 340144, between 0.3 and 0.4 is 400727, and between 0.4 and 0.5 is 479231. In Figure 1B the distribution of these 1.9 million SNPs across the human chromosomes is shown. Except in regions marked blue (low density) and red (high density) in the bottom of each chromosome in Figure 1B, the 1.9 million minor alleles are distributed uniformly over the lengths of all the chromosomes.
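The frequency binning reported above can be sketched as follows. This is a minimal illustration that assumes the reference-allele frequency at each position is already available as a number between 0 and 1; the function name and the input values are hypothetical and are not data from this study.

```python
def bin_minor_reference_alleles(ref_allele_freqs, width=0.1, upper=0.5):
    """Count reference alleles with frequency below `upper` in bins of `width`.

    ref_allele_freqs: iterable of reference-allele frequencies (0 to 1)
    Returns a dict mapping a bin label such as "0.1-0.2" to a count.
    """
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for freq in ref_allele_freqs:
        if not 0.0 <= freq < upper:
            continue  # the reference allele is not the minor allele here
        index = min(int(freq / width), n_bins - 1)
        counts[index] += 1
    return {
        f"{i * width:.1f}-{(i + 1) * width:.1f}": count
        for i, count in enumerate(counts)
    }


# Hypothetical reference-allele frequencies at five variant positions.
toy_freqs = [0.05, 0.15, 0.17, 0.45, 0.62]
toy_bins = bin_minor_reference_alleles(toy_freqs)
```

Values that fall exactly on a bin edge are subject to floating-point rounding in this sketch, so a production version would more safely bin on integer allele counts.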


Figure 1. Provides statistics of the SNPs and major alleles from the 1000 genomes project. (A) shows the distribution of allele frequencies in hg19 corresponding to the 81 million SNPs reported by phase 3 of the 1000 genomes project with cyan showing allele frequencies …

Lists 34 nonsynonymous mutations with allele frequencies …

Genomic change     Gene       Protein change    Allele frequency
…                  TUBA3E     p.(Glu449Ala)     0.020567
…                  OR12D2     p.(Phe104Ser)     0.021166
…>G                PHF1       p.(Lys304Arg)     0.022364
g.231149108A>G     SP140      p.(Lys516Glu)     0.026158
g.37101380C>T      FBXO47     p.(Arg209Gln)     0.027955
g.24901276T>G      KHNYN      p.(Leu270Trp)     0.040535
g.1389153A>C       BAIAP3     p.(Thr87Pro)      0.04373
g.29704010C>A      UQCRFS1    p.(Ala6Ser)       0.046126
g.22989256G>A      GGTLC2     p.(Gly70Glu)      0.046925
g.1370597C>G       UBE2I      p.(Ser164Arg)     0.049521
g.1370614G>C       UBE2I      p.(Arg170Thr)     0.049521
g.37617630G>T      DOPEY2     p.(Gly1118Cys)    0.050719
g.140130606T>A     SLC34A3    p.(Val513Glu)     0.053914
g.29141743G>A      OR2J2      p.(Ala111Thr)     0.058506
g.42666145T>C      PRPH2      p.(Lys310Arg)     0.058706
g.180810264T>A     CWC22      p.(Arg773Ser)     0.06869
g.80895933G>A      TBCD       p.(Gly1135Glu)    0.069089
g.96795857C>T      ASTL       p.(Arg222Gln)     0.072883
g.36789729G>T      MBIP       p.(Ser22Arg)      0.073083
g.12046364C>G      SYN2       p.(Pro37Ala)      0.073682
g.29911256G>T      HLA-A      p.(Glu185Asp)     0.074281
g.111749349T>A     FDXACB1    p.(Asn87Ile)      0.07528
g.44104925C>T      ABCG8      p.(Ala632Val)     0.077077
g.65382685C>T      HEPH       p.(Ala39Val)      0.079205
g.56047448G>A      SBK2       p.(Arg72Cys)      0.082069
g.88008423C>A      CPXCR1     p.(Ser3Try)       0.086358
g.36497358G>C      SYNE4      p.(His278Gln)     0.08726
g.84563135A>T      POF1B      p.(Leu349Met)     0.087947
g.72938100C>T      OTOP3      p.(Pro119Ser)     0.09385
g.741736T>G        ZDHHC11B   p.(Asp314Ala)     0.095248
g.140559596G>T     PCDHB8     p.(Val661Leu)     0.097644
g.155026942C>A     ADAM15     p.(Thr191Lys)     0.098043
g.34330067C>A      HMGB4      p.(Ala92Glu)      0.098842
g.150500729G>A     TMEM176A   p.(Ala122Thr)     0.09984
