Identification of Mutations in Zebrafish Using Next-Generation Sequencing

UNIT 7.13

Katrin Henke,1 Margot E. Bowen,1 and Matthew P. Harris1 1

Department of Genetics, Harvard Medical School, and Department of Orthopedics, Boston Children’s Hospital, Boston, Massachusetts

ABSTRACT Whole-genome sequencing (WGS) has been used in many invertebrate model organisms as an efficient tool for mapping and identification of mutations affecting particular morphological or physiological processes. However, the application of WGS in highly polymorphic, larger genomes of vertebrates has required new experimental and analytical approaches. As a consequence, a wealth of different analytical tools has been developed. As the generation and analysis of data stemming from WGS can be unwieldy and daunting to researchers not accustomed to many common bioinformatic analyses and Unix-based computational tools, we focus on how to manage and analyze next-generation sequencing datasets without an extensive computational infrastructure and in-depth bioinformatic knowledge. Here we describe methods for the analysis of WGS for use in mapping and identification of mutations in the zebrafish. We stress key elements of the experimental design and the analytical approach that allow the use of this method across different sequencing platforms and in different model organisms with annotated genomes. Curr. C 2013 by John Wiley & Sons, Inc. Protoc. Mol. Biol. 104:7.13.1-7.13.33.  Keywords: whole-genome sequencing r WGS r mutation mapping r zebrafish r next-generation sequencing

INTRODUCTION Whole-genome sequencing (WGS) has successfully been used as a fast and efficient method for localization and identification of mutations in organisms with relatively small, inbred genomes, such as C. elegans and A. thaliana (Schneeberger et al., 2009; Cuperus et al., 2010; Doitsidou et al., 2010; Zuryn et al., 2010; Austin et al., 2011; Uchida et al., 2011). Through further advancements in next-generation sequencing technologies, it has become feasible to sequence considerably larger genomes at a relatively low expense. As larger genomes require significantly more sequencing to obtain sufficient depth of coverage across the genome for analysis of allele frequency and high-resolution mapping, alternative methods need to be generated to allow high-throughput and affordable mutation mapping by use of low-coverage, next-generation sequencing. The establishment of methods to efficiently detect signatures of linkage and to identify causative mutations in organisms with large genomes permits genetic analysis in many non-model organisms that have previously not been used for forward genetic analysis. The basic requirements for the successful use of next-generation sequencing for mapping and mutation identification are that homozygous mutant F2 progeny can be generated from an outcross of a mutant founder to a polymorphic strain, and that an annotated genome exists for the organisms under study. Here, we focus on the use of this method in the zebrafish. Zebrafish is a powerful model for the study of vertebrate development. Its utility is in part due to the ease and efficiency of mutational analysis and its suitability for doing largescale mutagenesis screens. In these screens, large numbers of mutants can be generated DNA Sequencing Current Protocols in Molecular Biology 7.13.1-7.13.33, October 2013 Published online October 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/0471142727.mb0713s104 C 2013 John Wiley & Sons, Inc. Copyright 

7.13.1 Supplement 104

that subsequently require the identification of the altered gene underlying the mutant phenotype. The zebrafish model, however poses some unique challenges to the use of WGS for identifying mutations. First, there are no inbred strains available that are useful for mutagenesis screening. Therefore, the lines used in mutagenesis screening maintain a very high level of variation in and between fish coming from the same population and strain (Stickney et al., 2002; Guryev et al., 2006; Bradley et al., 2007; Coe et al., 2009; Bowen et al., 2012). The existing methods for WGS in the zebrafish use this high level of polymorphism as a tool to efficiently localize or map the genetic region underlying the mutant phenotype (Bowen et al., 2012; Leshchiner et al., 2012; Voz et al., 2012). These methods differ significantly in experimental design and in how causative mutations are identified. Often, paired sequence data from mutant siblings are used to differentiate between segregating SNPs and fixed SNPs within any particular mating pair. This is a powerful method to identify polymorphic SNPs within the population; however, this method is cumbersome, as it requires twice the amount of sequence. With the use of WGS to establish databases of common SNP variation existing in laboratory strains, the number of identified common SNPs has increased over 10-fold. Additionally, this makes all downstream evaluations of potential candidate mutations within linked intervals much easier, as they can be excluded as phenotype-causing if they can be found in wild-type fish. We outline methods to use these databases of common variants as a tool to allow mapping and mutation detection requiring only mutant sequence. Although library preparation and sequencing can be routine, the analysis of the enormous amount of data produced by many NGS platforms, even from only one single lane of sequencing, poses a challenge for many researchers. This is especially difficult because, until recently, there were no analysis pipelines available that would be easily adjustable to the different methods of analysis and make use of the data for different experimental situations and needs. Consequently, researchers are confronted with a wealth of available tools for different aspects of data analysis and their specific requirements for data formats. The different published methods for mapping and sequence analysis (Bowen et al., 2012; Leshchiner et al., 2012; Obholzer et al., 2012; Voz et al., 2012) each use slightly different strategies; however, the end result for each method is the same: a researcher will be able to (1) identify a region potentially linked to the mutation, and (2) identify SNP(s) within the linked interval as potential phenotype-causing mutations. The question that needs to be raised in the beginning of such experiments is the scope of the project, and thus costs.

Identification of Zebrafish Mutations Using NGS

In this unit, a detailed protocol is provided for the experimental design and execution of mapping through analysis of WGS data in the zebrafish. The protocol starts with (i) planning the mapping experiment (Strategic Planning), (ii) library preparation for nextgeneration sequencing (Basic Protocol 1), (iii) raw data analysis (Basic Protocol 2), and (iv) the stepwise protocol to map and identify candidate mutations (Basic Protocol 3, 4, and 5). Software and datasets used in this protocol are described in Support Protocol 1. In addition, a short protocol for verification of linkage using PCR-based methods is provided (see Support Protocol 2). This outlined protocol is modular and adaptable. In zebrafish, this method can be used with different strains and without comparison to sibling data. Analysis can also be done to identify mutations in WGS data in cases where the map position is already determined by alternate methods. The sets of protocols are directed at researchers with little or no experience in analysis of next-generation sequencing data. The protocol requires a significant amount of NGS data analysis; however, the increased efficiency of the use of WGS for mapping and identification of mutations easily justifies the time invested in learning these new techniques. Lastly, with knowledge of common SNPs in reference and mapping strains, the outlined method of mapping by use of lowcoverage sequence depth can be used in most vertebrates with an available genome assembly.

7.13.2 Supplement 104

Current Protocols in Molecular Biology

STRATEGIC PLANNING OF MAPPING EXPERIMENTS As a first step in planning, a mapping experiment, it is advisable to become familiar with the different approaches that can be taken: (i) Establishing a high level of polymorphism within the mutants for use in analysis of linkage. In linkage analysis, it is advisable to establish high levels of polymorphism within the analyzed mutants to be able to identify strain-specific alleles that cosegregate with the mutation. While zebrafish have a high inherent level of variation, often areas of shared haplotypes exist between individuals that complicate analysis. Thus, a separate strain other than the one used for mutagenesis can provide a sufficient level of polymorphic SNPs for linkage analysis and to establish regions harboring the phenotype-causing mutation. Such analysis requires at least one round of meiotic recombination between chromosomes of the different strains. The simplest and most often utilized strategy is to use F2 progeny of intercrosses of a mutant and a particular mapping strain (e.g., WIK (Geisler et al., 2007; Fig. 7.13.1A). Using this cross design provides the most robust and time efficient material for mapping and serves as the basis of the analyses described below. (ii) Next-generation sequencing (NGS) strategies: Sequencing of genome (WGS) or transcriptome (RNAseq). Once mutants have been isolated in the map cross for the particular mutant, a decision has to be made concerning the readout of the NGS analysis.

B

A T T

A A

PO

wt +/+

m +/ T A

F1

pooled mutant DNA

fragmentation

T A m +/

m +/ F2

siblings +/

mutants +/

adapter ligation

unlinked

T A

T T

A A

T A

T A

T T

A A

T A

linked

A A

A A

T A

A A

T A

T T

T A

T A

next-generation sequencing

Figure 7.13.1 Crossing scheme and subsequent library preparation for mapping zebrafish mutations by next-generation sequencing. (A) Crossing scheme to facilitate mapping of zebrafish mutations based on homozygosity-by-descent. Heterozygous carriers for the mutation are crossed to a polymorphic strain to enable tracking of recombination events in the F2 generation. In-crosses between mutant carriers are performed in the F1 generation, and their progeny are sorted into homozygous mutants and wild-type (wt) siblings. If an SNP allele (here A is the mutant and T is the polymorphic strain) is linked to the mutation, the F2 mutant progeny will be homozygous for the allele found in the mutant strain (A). Depending on the distance between the SNP and the mutation, occasional recombination events can be detected, as they appear heterozygous in mutants (∗ ). Importantly, siblings will either be heterozygous or homozygous for the alternate allele (here T) for a linked marker. An unlinked SNP will show random distribution of alleles in mutants and siblings. (B) Basic steps for library preparation for next-generation sequencing. For library preparation, genomic DNA is first fragmented into small fragments (∼200 bp). To permit next-generation sequencing, short adapters consisting of annealed and therefore double-stranded oligonucleotides are ligated onto the DNA fragments. These adapters can then be used as sequencing primers in next-generation sequencing.

DNA Sequencing

7.13.3 Current Protocols in Molecular Biology

Supplement 104

A common approach is to use WGS to identify SNP variants within genomic DNA for linkage and as candidates for mutation (Bowen et al., 2012; Leshchiner et al., 2012; Obholzer et al., 2012; Voz et al., 2012). Recent studies, however, have used analysis of transcript expression and variation by RNAseq (Hill et al., 2013; Liu et al., 2012; Miller et al., 2013; UNIT 4.11). Whereas WGS provides unbiased, low-depth coverage of a genome, RNAseq provides often greater coverage but with a bias of coverage based on expression levels—highly expressed transcripts will be over-represented while messages expressed at low levels, at specific time-points, or in specific tissues might not be detected. Analysis by RNAseq in the mutant can be an asset to identifying regulatory mutations, as it can facilitate detection of the effects of promoter/enhancer or splicing mutations within the interval that may be more difficult to identify by WGS analysis. However, a priori assumptions of specific tissues or developmental timing in which a particular mutation is affecting the phenotype is necessary for such analysis, and thus may lead to difficulties in finding causative mutations in cases where this is not known. (iii) Basic strategies to establish linkage. There are two basic ways to identify linkage of the causative mutation to a specific region in the genome by NGS: (1) by directly comparing variation between a mutant pool and their wild-type siblings, or (2) by comparing variants in a mutant pool to variants found in wild-type strains. Both strategies have advantages and drawbacks. The advantage of comparing mutants to their siblings is that analysis of segregation of existing variation within the siblings can assist in identification of linkage. However, a comparison of mutant sequence to common SNP variants in wild-type fish allows both linkage analysis and the exclusion of common variants within the mapping interval as putative candidate mutations. One functional drawback of comparing mutant to sibling data is that, in addition to sequence of the mutant, the sibling sequence needs to be acquired for each mutant to be mapped. This essentially doubles the cost per mutant analyzed. Furthermore, datasets of common SNP variants within wild-type strains continue to grow as more investigators utilize WGS as a tool. Thus, this SNP database will serve as an expanding reference tool. The accuracy of mapping by WGS, as well as the ability to identify a mutation, is correlated with the extent of genome coverage obtained by sequencing. Currently, the authors obtain an average of 180 million reads in one lane of single-end sequencing using an Illumina HiSeq2000 machine. This results in ∼10× average genome coverage of the zebrafish genome, and 98% of the exome is covered by at least two reads. We have found that identification of unique homogeneous SNPs present in two or more reads is sufficient for identification of true variants over sequencing errors. The size of the identified interval linked to the mutation is also defined by the extent of coverage. Given the limited sampling of genetic diversity through use of low-genome coverage, we found that a pool of 20 mutant fish is sufficient to maximize diversity within the sequenced pool. The higher the sequencing coverage, the higher the possibility that a single recombination event can be detected in the mutant DNA pool, and thus the resolution of linkage increases. Therefore, addition of more fish to the mutant pool does not necessarily increase the resolution of mapping if the pool is sequenced only to low coverage. The number of fish analyzed could be increased and sequenced to a higher coverage to identify a smaller interval containing the mutation; however, this advantage would not outweigh the costs. As one lane of sequencing (∼180 million reads; 100SE) provides sufficient data to define a candidate interval, and in addition provides coverage of over 98% of the exome, the causative mutation can be identified within this sequence even if a slightly larger interval needs to be analyzed. Identification of Zebrafish Mutations Using NGS

While more sequence coverage provides more robust detection of variants and potential recombinants, the added informational content quickly becomes saturated. The costs of

7.13.4 Supplement 104

Current Protocols in Molecular Biology

individual sequencing runs can vary widely depending on the platform and institution used. The decision regarding a specific sequencing strategy/platform should be influenced by current costs and experimental needs. The design of the protocols described here is chosen as a cost-effective means to allow high-throughput mutation identification. A further difference between mapping strategies is the way the data are analyzed. Whereas some protocols perform linkage analysis solely based on homozygosity (homogeneity) and physical distance (Leshchiner et al., 2012; Voz et al., 2012), we have found that the analysis of the origin of particular SNP alleles from particular parental strains as well as the consideration of recombination rates are required for robust and accurate detection of linkage. This is discussed further in Bowen et al. (2012) and in the Commentary below. (iv) How to gather sufficient computational resources. The analysis of next-generation sequence data requires a large amount of computer memory and is computationally intensive. Therefore, a workstation configured to work with Unix, with at least 500 GB of free disk space, a 64-bit CPU, and 8 GB or more RAM should be acceptable. As an alternative to having a workstation in the laboratory, many research institutes provide computer clusters for data analysis. A second alternative is cloud computing. Here, computational resources are accessible via a network, for example with a standard Web browser. These provide the entire bioinformatic infrastructure needed to analyze nextgeneration sequence data. One of these tools, Galaxy (http://galaxyproject.org), is described in UNIT 19.10 (Blankenberg et al., 2010) and Goecks et al. (2010). A summary and description of other resources can be found in Afgan et al. (2012). As the analysis of next-generation sequence data is rather complex, Leshchiner et al. (2012) have provided a Web-based platform for mapping and mutation identification in zebrafish. Here, the raw sequence files from a mutant and a sibling pool can be uploaded to their Web site; mapping information will be provided and candidate mutations will be identified. The Leshchiner et al. (2012) method requires additional sequencing data from siblings and differs in their analysis in several ways from the protocol described here, as discussed later in this protocol. Similarly, a mapping tool provided by Obholzer et al. (2012) bundles the necessary components needed for mapping and can be used either in a Galaxy-based Web interface or can be installed on a local workstation. After decisions have been made concerning the various means and approaches that can be taken to map a mutant using next-generation sequencing data, the specifics of sequencing need to be defined. There are several options. The first consideration is the length of the sequencing reads. Currently, read-length options vary based on the sequencing facility, but are generally 50, 75, or 100 bases. Whereas the quality of the reads decreases slightly with an increase in length, the overall genome coverage will increase with longer reads, as the number of reads obtained is independent of their length. Furthermore, single- or paired-end sequencing can be performed. Paired-end sequencing has the advantage of providing positional information that can help to identify deletions and insertions as well as potential errors in the reference genome assembly. Paired-end sequencing, however, will also increase the overall costs significantly. For a detailed description of sequencing technologies see UNIT 7.1. The procedure introduced here utilizes WGS data from a single lane of Illumina HiSeq2000 100 bp single-end sequencing. We analyze pooled genomic DNA from twenty F2 mutants derived from a genetic cross with commonly used polymorphic strains. As discussed above, the number of mutants to be pooled is flexible. Linkage analysis is performed by analyzing SNP variation within the mutant pool and identifying common SNPs within the parental or mapping strain. This protocol provides a modular approach that can be modified depending on specific experimental designs, as discussed above.

DNA Sequencing

7.13.5 Current Protocols in Molecular Biology

Supplement 104

BASIC PROTOCOL 1

PREPARATION OF THE DNA LIBRARY FOR NEXT-GENERATION SEQUENCING The basic steps necessary to prepare a DNA library for next-generation sequencing specific to this unit are summarized in this protocol. For general preparation of DNA libraries for next-generation sequencing, see Son and Taylor (2011), which outlines this in more detail.

Materials F2 generation of a genetic cross, sorted by phenotype into mutants and siblings (Fig. 7.13.1A) Reagents for DNA extraction: e.g., DNeasy Blood & Tissue Kit (Qiagen, cat. no. 69504) Optional: Kit for library preparation for next-generation sequencing, e.g., TruSeq DNA Sample Preparation kit (Illumina, cat. no. CES FC-121-2001) Spectrophotometer, e.g., NanoDrop (see APPENDICES 3D & 3J) Additional reagents and equipment for DNA extraction (UNIT 2.1A), quantitation of nucleic acids (APPENDICES 3D & 3J), and library preparation for Illumina sequencing (Son and Taylor, 2011) 1. After a suitable number of mutants have been identified, extract and quantify genomic DNA. Depending on the age of the fish, the extraction can be done either from whole embryos or from dissected adult fins. The genomic DNA used for preparing a library for nextgeneration sequencing needs to be of high quality. The Qiagen DNeasy Blood & Tissue kit reproducibly provides high-quality genomic DNA suitable for library preparation. If genomic DNA is extracted from embryos, it is advisable to first pool these, as DNA extraction from single embryos via column purification results in low DNA concentrations, given the minimal volume needed for elution. It makes sense to extract genomic DNA from fins for each mutant separately, as analysis of single mutants is later helpful to verify linkage. After isolation, DNA can be frozen and stored at −20◦ C. The protocol has been tested for DNA pooled from 20 mutants. DNA from siblings is required for verification of linkage, as described in Support Protocol 2, and should be extracted at this stage of the protocol.

2. Measure and normalize DNA concentrations for each individual. Pool equal amounts of DNA from each mutant in a single microcentrifuge tube. Each fish used in the analysis carries its own unique recombination events. As linkage is based on the detection of recombination frequencies, it is important that each fish be represented at an equal level in the DNA pool. Therefore, if DNA is extracted from fins and not from a pool of embryos, the DNA concentrations should be measured and normalized for each individual. This can be done by spectrophotometric quantification—e.g., use of a NanoDrop spectrophotometer (APPENDIX 3J). For alternative methods, see APPENDIX 3D. The minimal amount of genomic DNA needed for preparation of sequencing libraries is dependent on the platform used for sequencing. We generally use 3 μg of genomic DNA.

3. Prepare library.

Identification of Zebrafish Mutations Using NGS

Many sequencing facilities offer sequencing as well as library preparation services. However, if a research group is planning on using next-generation sequencing for more than just a single experiment, it is advisable to prepare the library within the lab (Fig. 7.13.1B). There are a number of kits available from different companies for library preparation for next-generation sequencing, such as the Illumina TruSeq DNA Sample Preparation kit. Alternatively, reagents can be purchased separately and libraries can be prepared as described in Bowen et al. (2012). A detailed protocol for library preparation for Illumina sequencing can be found in Son and Taylor (2011).

7.13.6 Supplement 104

Current Protocols in Molecular Biology

4. After preparation, submit the library for next-generation sequencing. The analysis of the resulting sequence data to permit mapping and identification of mutations in zebrafish is described below in Basic Protocols 2 to 5.

SEQUENCE DATA ALIGNMENT AND VARIANT IDENTIFICATION This protocol describes how to align sequence reads to a reference genome using Novoalign. The Novoalign software is designed for the alignment of sequencing reads stemming from an Illumina sequencer. In case a different platform was chosen for sequencing, an alternate alignment program needs to be used. Different alignment algorithms are discussed in Li and Homer (2010). Subsequently, variants are identified simultaneously in mutant and wild-type data using SAMtools. The created files are then used for mapping in Basic Protocol 3 and identification of candidate mutations and indels in Basic Protocols 4 and 5.

BASIC PROTOCOL 2

Materials Hardware Workstation (Unix, Linux, Mac OS X, Windows) with 64-bit CPU, 8 GB or more RAM, and 500 GB or more free disk space (alternatives to setting up a workstation for whole-genome data analysis in the lab are described in Strategic Planning of Mapping Experiments, above) Software The following software applications are freely available: Novoalign (http://www.novocraft.com/) SAMtools (http://samtools.sourceforge.net/) Picard (http://sourceforge.net/projects/picard/) Genome Analysis Toolkit (GATK; http://www.broadinstitute.org/gsa/wiki/ index.php/The Genome Analysis Toolkit) Integrative Genomics Viewer (IGV; http://www.broadinstitute.org/igv/) A short description of the different software packages as well as an introduction on how to install them can be found in Support Protocol 1 Datasets A raw sequence file with next-generation sequence data from a pool of phenotypically identical mutants (result of Basic Protocol 1) Files with the aligned sequences of wild-type strains in the BAM format, paired with an index file (.bai): a description of the file format can be found on the SAM tools Web site, http://samtools.sourceforge.net, or in Li et al. (2009). Aligned sequence files of wild-type strains can be downloaded from the Resources section at http://fishyskeleton.com (see Support Protocol 1 for more description) A fasta file and index files of the reference genome (Table 7.13.1) Importantly, Support Protocol 1 provides a detailed description on how to generate these different files for the zebrafish genome. We have chosen to use the software listed above, but there are many alternatives available for every task. A list with available software, including a short description of the main function of each package, can be found at http://seqanswers.com/wiki/software/list. NOTE: In the following protocol, use of the command line as well as verbatim user input and machine output and file/directory names are indicated by the courier font. File names, or locations of files that need to be replaced by the user are indicated with brackets {filename}. For example in line run {filename.txt} for file testsample1.txt, the finished command line should read:

run testsample1.txt Current Protocols in Molecular Biology

DNA Sequencing

7.13.7 Supplement 104

Table 7.13.1 Files of the Zebrafish Genome Needed for Analysis

Filename

Description

danRer7 novoindex

Index file of the zebrafish genome, indexed for use with Novoalign

danRer7.fasta

Fasta file of the zebrafish genome

danRer7.fasta.fai

Index file of the zebrafish genome

NOTE: For command-line text entry, please note that simple ‘copy and paste’ from the PDF may introduce unexpected characters and hard returns that will result in processing errors. 1. Uncompress the sequence file:

gunzip {filename.txt.gz} 2. Split sequence file into smaller files (to speed up the alignment process):

split -l 40000000 {filename.txt} splitseq A series of output files will be created named splitseq aa, splitseq ab, splitseq ac. . . . Each file will contain 40 million lines, which equals 10 million reads.

3. Create a folder for the aligned reads:

mkdir aligned reads This command creates a folder in the location where you execute the command.

4. Align sets of reads to the reference genome. Align the read sets to the reference genome using the Novocraft algorithm. This step is computationally intensive. If this command is executed on a computer with 8 GB RAM, it can take a whole week to finish. Therefore, for this step, it is advisable to use the alternatives described in the Strategic Planning section on gathering computer resources for analysis.

for f in splitseq ∗ ; do echo $f ; {../novocraft/novoalign} -f $f -d {../danRer7 novoindex} -o SAM -F ILMFQ -a {AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCG TCTTCTGCTTG} > aligned reads/"$f".sam ; done Note 1: The {../novocraft/novoalign} part of the command describes the location of the program, in this case Novoalign, to be run. The ../ at the beginning of the folder name indicates that the folder novocraft is located in a folder one hierarchy level up from where you are right now. This part of the command needs to be modified by the user according to the position of the novoalign program. Note 2: Similarly, the position of the danRer7 novoindex file needs to be specified by the user. Note 3: -F ILMFQ specifies the format of the sequencing file. It is important to specify the correct fastq format of the sequence file so that the qualityscore values can be interpreted correctly. The fastq format of the reads is dependent on the sequencer used. A list of different fastq formats can be found at http://www.novocraft.com/userfiles/file/NovoBarcode.pdf. Identification of Zebrafish Mutations Using NGS

Note 4: -a specifies that 3 adaptor trimming should be performed. Adaptor trimming is necessary if it is expected that a significant fraction of library fragments will have an insert size less than the Illumina read length. For example, libraries prepared from degraded genomic DNA in our laboratory had a significant fraction of fragments with an insert size below 100 bp. When Illumina 100-bp sequencing is performed on these

7.13.8 Supplement 104

Current Protocols in Molecular Biology

Figure 7.13.2 Novoalign output. Screen shot of a typical Novoalign output during read alignment. The first line indicates the file currently being processed. The next four lines list the Novoalign version used. The command that is being executed is given in line 6. The number of reads in the file and the number of reads which could be aligned to the genome are given in lines 13 and 14, respectively.

fragments, the 3 end of some reads will correspond to the adaptor sequence, which should be removed before alignment to the reference genome. The sequence listed above is the sequence corresponding to the Illumina paired end adaptor used in our laboratory (which can be used for both paired and single end sequencing).

5. Alignment statistics. When running novoalign, the program will list statistics of the alignment (see Fig. 7.13.2). 6. Go to the folder with the .sam files:

cd aligned reads 7. Convert .sam files to .bam files:

for f in splitseq ∗.sam ; do echo $f ; {../../samtools} import {../../danRer7.fasta} $f "$f".bam ; done 8. Sort reads based on their location in the genome:

for f in splitseq ∗.sam.bam ; do echo $f ; {../../samtools} sort $f "$f".sort ; done 9. Make ‘read group’ file. In the next step, the .bam files will be merged and for each read; a read group tag will be added. This tag will specify which input file contained a specific read. Some programs, such as the GATK used later, require that all reads contain a read group tag.Therefore, a readgroup.txt file must first be created as a text file, for example, using Excel, and the file saved as Tab Delimited Text (.txt) file. The readgroup file should contain one line for each input file. Shown here are the first three lines of an example file:

@RG ID:{splitseq aa.sam.bam.sort} SM:{mutant1} LB:{mutant1} PL:Illumina @RG ID:{splitseq ab.sam.bam.sort} SM:{mutant1} LB:{mutant1} PL:Illumina @RG ID:{splitseq ac.sam.bam.sort} SM:{mutant1} LB:{mutant1} PL:Illumina . . .. Current Protocols in Molecular Biology

DNA Sequencing

7.13.9 Supplement 104

Notes: Each individual tag should be separated by Tabs or written in separate columns. The ID tag must contain the input file name (excluding the .bam extension). The SM tag must contain the name of the sample (for example mutant1). The LB tag lists the name of the library. If only one library was made for each mutant, the library name will be the same as the sample name. The PL tag contains the name of the sequencing platform.

10. Merge the .bam files:

{../../samtools} merge -rh {rgMut1.txt} {merge mutant1.bam} splitseq ∗.sam.bam.sort.bam Notes: {rgMut1.txt} should be replaced by the name of the read group file created in step 8. {merge mutant1.bam} is the name of the output file that will be created. In general names of output files are defined by the user and only need to have the appropriate file extension.

11. Index the merged BAM files:

{../../samtools} index {merge mutant1.bam} 12. Remove PCR duplicates:

java -jar {../../picard-tools-1.73/MarkDuplicates. jar} I= {merge mutant1.bam} O= {merge mutant1. PCR.bam} METRICS FILE= {merge mutant1 PCRmetrics} ASSUME SORTED=TRUE REMOVE DUPLICATES=true In addition to a .bam file containing aligned sequence reads without PCR duplicates, the PCRmetrics file that will also be created as an output, will contain information on how many reads were found to be PCR duplicates.

13. Index the merged PCR duplicate removed file:

{../../samtools} index {merge mutant1.PCR.bam} 14. Create a file for each chromosome:

for f in {1..25} ; do echo $f ; {../../samtools} view -b {merge mutant1.PCR.bam} chr"$f" > {mutant1 chr "$f".bam} ; done This command creates a .bam file for each chromosome, containing all the reads that could be aligned to this particular chromosome.

15. Index .bam files for each chromosome:

for f in {mutant1 chr∗.bam} ; do echo $f ; {../../samtools} index $f ; done Identification of Zebrafish Mutations Using NGS

7.13.10

With completion of this step, a file for each chromosome containing all reads aligned to this chromosome (.bam) with the associated index file is created (.bai). The index file is necessary to be able to visualize the bam files in IGV for analysis, see Support Protocol 2.

Supplement 104

Current Protocols in Molecular Biology

16. Move all mutant .bam and .bai files to the folder with the files from the wild-type strains:

mv {mutant1 chr∗} {../wt reads/mutant1 chr∗} 17. Go to the folder with the .bam and .bai files:

cd ../wt reads/ 18. Call SNP variants:

for f in {1..25} ; do echo $f ; {../../samtools mpileup} -ugf {../../danRer7.fasta} {∗ chr"$f".∗bam} | {../../bcftools} view -vcg - > {mutant1 chr"$f".vcf} ; done The variant calling is performed on the mutant and all wild-type strains simultaneously (all files present in the folder the command is executed in will be used). This increases the accuracy of variant calling on low-coverage sequence data. It is also necessary for the downstream analyses of the parental origin of SNP alleles used for calculating the mapping score. The output file of this command is in a .vcf format. A description of this format can be found at: http://vcftools.sourceforge.net/specs.html.

19. Change the header to make the ‘variant file’ compatible with GATK:

for f in {mutant1 chr∗.vcf} ; do echo $f ; sed 's/##fileformat=VCFv4.1/##fileformat=VCFv4.0/g' $f > "$f" header ; done 20. Separate the identified indels from the SNPs:

for f in {mutant1 chr∗vcf header} ; do echo $f ; grep -v INDEL $f > "$f" noINDEL ; done for f in {mutant1 chr∗vcf header} ; do echo $f ; grep INDEL $f > "$f" INDEL ; done Due to the inaccuracy of aligning reads with insertions or deletions, only SNPs, and not indels, are used for mapping purposes. However, once the linked interval is identified, indels can then be analyzed to identify candidates for the causative mutation.

21. Identify high-quality SNPs for mapping:

for f in {1..25} ; do echo $f ; java -jar {../../GenomeAnalysisTK-1.6-13g91f02df/GenomeAnalysisTK. jar} -T VariantFiltration -R {../../danRer7.fasta} -V {mutant1 chr"$f".vcf header noINDEL} -o {mutant1 chr"$f".vcf noINDEL filtered} --mask {../../scriptsandtext 2/repeat mask/danRer7 repeatmask chr"$f".bed} --maskName RepeatMask --clusterWindowSize 10 --filterExpression "QUAL < 30" --filterName "QualFilter" --filterExpression "DP < 8" --filterName "LowDepth" --filterExpression "DP > 60" --filterName "HighDepth" --filterExpression "MQ < 40" --filterName "MapQual" ; done The zebrafish genome is highly repetitive and contains a high number of low-complexity regions. Due to the frequent misalignment of reads in these regions, SNPs lying in interspersed repeats or low-complexity sequences are excluded from the analysis. This is defined by --mask {../../scriptsandtext 2/repeat mask/danRer7 repeatmask chr"Sf".bed} in the command. The filtering criteria used to identify SNPs can be adjusted as needed. See http://www.broadinstitute.org/gsa/wiki/ index.php/VariantFiltrationWalker for details on the parameters used for filtering. Current Protocols in Molecular Biology

DNA Sequencing

7.13.11 Supplement 104

The file created in this step enables mapping (Basic Protocol 3) and the identification of potential candidate mutations in the mutant (see Basic Protocols 4 and 5, as well as Support Protocol 3). For mutants with already established linkage, users can directly proceed to Basic Protocols 4 and 5. SUPPORT PROTOCOL 1

SOFTWARE AND DATASETS USED FOR DATA ANALYSIS This support protocol describes the basic tools needed to be able to follow and run the steps described in Basic Protocols 2, 3, 4, and 5. Information is provided on the software used and the datasets and scripts needed for mapping mutations in zebrafish. NOTE: In the following protocol, use of the command line as well as verbatim user input and machine output and file/directory names are indicated by the courier font. File names, or locations of files that need to be replaced by the user are indicated with brackets {filename}. For example in line run {filename.txt} for file testsample1.txt, the finished command line should read:

run testsample1.txt NOTE: For command-line text entry, please note that simple ‘copy and paste’ from the PDF may introduce unexpected characters and hard returns that will result in processing errors. 1. Become familiar with Unix and next-generation sequence data analysis. To analyze next-generation sequencing data using the protocol presented here, it is necessary to use Unix and Perl. Users do not need to be able to write their own scripts, but it is helpful to understand some basic commands. Therefore, it is advisable to take a short introduction into Unix. A free online tutorial can be found at http://www.ee.surrey.ac.uk/Teaching/Unix/unixintro.html. Wild cards and For Loops. Since it is often required to run the same command on multiple files, ‘wild cards’ and ‘for loops’ are used in this protocol. The wild card symbol ∗ can be used to represent any character or characters in a file name, while the wild card symbol ? can be used to represent any single character in a file name. ‘For Loops’ can, for example, be used to run the same command on files from all 25 chromosomes (for f in {1..25} ; do echo $f ; command chr"$f" ; done). A short tutorial about next-generation sequencing data analysis is available at https://sites.google.com/site/cuevolutionarygeneticsngs/. This tutorial also describes how to install software for sequence analysis, for example SAMtools and BCFtools. 2. Become familiar with software used for next-generation sequence data analysis. There are many different software packages available for next-generation sequence data analysis. An extensive list of software can be found at http://seqanswers. com/wiki/Software/list. The software listed here is freeware and can be downloaded from the Web sites mentioned. Some downloads require registration. It is recommended to create a folder containing all software and datasets to make the analysis easier. Identification of Zebrafish Mutations Using NGS

a. Alignment software, such as Novoalign (http://www.novocraft.com/). The free version includes all necessary options. Instructions on how to install the software can be found on the Novocraft Web site (see above). A compressed

7.13.12 Supplement 104

Current Protocols in Molecular Biology

version of the program is downloaded. To uncompress the file, open the terminal on a Mac (Applications>Terminal). Go to the folder containing the compressed file: cd Desktop/software Uncompress the file: tar -xzf novocraftV2.08.02.MacOSX.tar.gz This command can be used to uncompress all other tar.gz compressed software files accordingly.

b. SAMtools software package for analyzing next-generation sequence data in the .sam format (http://samtools.sourceforge.net or https://sites.google.com/site/cuevolutionarygeneticsngs/ for Mac users; Li et al., 2009). SAMtools needs to be compiled to be used. A samtools and bcftools version already compiled for MacOS X can be downloaded from https://sites.google.com/site/cuevolutionarygeneticsngs/. Installation of the software directly downloaded from the samtools Web site requires installation of additional software on a Mac.

c. Picard package for working with next-generation sequencing data in the .bam format (http://sourceforge.net/projects/picard/). d. Genome Analysis Toolkit (GATK) used for variant filtering (http://www.broadinstitute.org/gatk; McKenna et al., 2010). e. Annovar for annotating variants (http://www.openbioinformatics.org/annovar/; Wang et al., 2010). f. Integrative Genomics Viewer (IGV) to visualize the sequence data (http://www.broadinstitute.org/igv/; Robinson et al., 2011). 3. Datasets used for analysis. The following datasets and scripts should be downloaded from the Resources section on http://fishyskeleton.com.

a. Aligned sequences of wild-type strains. This file contains the aligned Illumina sequencing reads (100-bp single end), for the T¨u, WIK, AB, and TLF zebrafish strains (Bowen et al., 2012). The alignment was done against the Zv9 assembly of the zebrafish reference genome (http://www.ensembl.org). For the T¨u and WIK strains, sequencing was performed separately on fish maintained in T¨ubingen, Germany (referred to as TUG and WKG), and fish maintained at Boston Children’s Hospital (referred to as TUB and WKB). Each aligned sequence file (in the .bam file format) is paired with an index file (.bai). Example file: AB aligned sequences.tar: The files can be uncompressed using the command introduced in step 2: tar -xzf AB aligned sequences.tar The uncompressed folder will contain one file with the aligned sequence reads for each chromosome with the associated index file. Example file: AB chr1.bam AB chr1.bam.bai Make sure that the complete file is downloaded from the Web site. The estimated size of each file can be found on the Web site for comparison. If incomplete files are downloaded an error will occur when uncompressing. Current Protocols in Molecular Biology

DNA Sequencing

7.13.13 Supplement 104

b. Scripts and associated text files: This folder contains several subfolders with the following files: File 1: parental variants.tar (SNP datatsets) The uncompressed files will look like this: Example file: chr1 parental PASS.vcf.gz This file can be uncompressed by double-clicking on the file. The resulting chr1 parental PASS.vcf file can be opened in Excel if exact positions of polymorphic SNPs on certain chromosomes need to be identified. File 2: scriptsandtext.gz Uncompress file by double-clicking. This file contains the following folders: Folder name: coding exon bases Example file name: coding bases chr1 The above files contain the genomic coordinates of each coding exon base as well as bases up to15 bp upstream and downstream of each coding exon. This list was constructed from the merged dataset of all Refseq and Ensemble coding exons. Folder name: MGH map Example file: chr1 cM The above files contain the estimated physical coordinates for genetic markers at 0.25-cM intervals in the genome. These estimates are based on the genetic distance of markers in the MGH meiotic map (http://zfin.org/zf info/downloads.html#map; Knapik et al., 1998) that have been mapped to the Zv9 version of the zebrafish reference genome. Folder name: perl Example file: classifySNPs.pl The perl folder contains Perl scripts written by the author (M.E. Bowen) used for mapping and candidate mutation identification. Folder name: repeat mask Example file: danRer7 repeatmask chr1.bed The repeat mask file contains the positional coordinates of all interspersed repeats and low-complexity sequences in the zebrafish genome. The files are the output of RepeatMasker (http://www.repeatmasker.org/) that have been converted to the .bed file format. Folder name: zebradb Example file: danRer7 dbSNPs.txt These files contain the physical coordinates and sequences of Ensembl and Refseq zebrafish genes (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/ database/) as well as files with coordinates of dbSNPs (http://useast.ensembl. org/info/data/ftp/index.html). All files have been formatted for use with the Annovar software. c. Zebrafish genome files. Identification of Zebrafish Mutations Using NGS

Lastly, in order to be able to make use of the reference genome, several files need to be prepared. Some programs, such as the GATK, require the reference genome to have the .fasta extension, and that a dictionary file (.dict) and a fasta index file (.fai) be present in the same folder as the reference genome.

7.13.14 Supplement 104

Current Protocols in Molecular Biology

See the GATK Web site for more details: http://www.broadinstitute.org/gsa/ wiki/index.php/Preparing the essential GATK input files: the reference genome. 4. Download the zebrafish genome sequence from the UCSC Genome Bioinformatic Web site: http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/bigZips/. Download the latest version of the reference genome: danRer7.fa.gz. A description of the UCSC genome browser can be found in UNIT 19.9.

5. Unzip the file by double-clicking. 6. Rename the reference genome to .fasta (command line).

mv danRer7.fa danRer7.fasta 7. Create a fasta sequence dictionary file (.dict), using the CreateSequence Dictionary.jar file from Picard

java -jar {picard-tools-1.73/CreateSequence Dictionary.jar} R=danRer7.fasta O=danRer7.dict Note if an error appears:

Exception in thread "main" java.lang. OutOfMemoryError: Java heap space Use the following command instead to allow usage of more memory:

java -Xmx2048m -jar {picard-tools-1.73/Create SequenceDictionary.jar} R=danRer7.fasta O=danRer7.dict 8. Create a fasta index file (.fai) using SAMtools

{./samtools} faidx danRer7.fasta The tools described here summarize essential and available software and datasets for analysis of next-generation sequence data in zebrafish, not only for variant detection, but also for mapping and mutation identification as described in Basic Protocols 2, 3, 4, and 5.

LINKAGE MAPPING BASED ON HOMOZYGOSITY-BY-DESCENT This protocol describes how to use the variant files created in Basic Protocol 2, which contain high-quality SNPs identified in the mutant and the analyzed wild-type strains, to calculate a mapping score and how to analyze these mapping scores to assess linkage. The mapping score described here is a measure of the level of heterogeneity within the mutant pool and the number of mapping strain–specific alleles in 20-cM windows over the entire genome. The mapping score output will highlight the linked region harboring the mutation.

BASIC PROTOCOL 3

Materials Hardware Workstation (Unix, Linux, Mac OS X, Windows) with 64-bit CPU, 8 GB or more RAM, and 500 GB or more free disk space (alternatives to setting up a workstation for whole-genome data analysis in the lab are described in Strategic Planning of Mapping Experiments, above) Current Protocols in Molecular Biology

DNA Sequencing

7.13.15 Supplement 104

Table 7.13.2 Perl Scripts Provided for Mapping

Script name

Function

classifySNPs.pl

Identification of SNPs useful for mapping, selection of SNPs specific for the parental strains

convert bp to cM.pl mapping score.pl

Conversion of physical coordinates of SNPs to genetic coordinates Calculation of the mapping score according to the formula described in Support Protocol 2

Scripts See Table 7.13.2 (Perl scripts can be downloaded from the Resources section at http://fishyskeleton.com; see Support Protocol 1) Data A variant file (.vcf format) for each chromosome containing high-quality SNPs (created in Basic Protocol 2) NOTE: In the following protocol, use of the command line as well as verbatim user input and machine output and file/directory names are indicated by the courier font. File names, or locations of files that need to be replaced by the user are indicated with brackets {filename}. For example in line run {filename.txt} for file testsample1.txt, the finished command line should read:

run testsample1.txt NOTE: For command-line text entry, please note that simple ‘copy and paste’ from the PDF may introduce unexpected characters and hard returns that will result in processing errors. 1. Identify SNPs useful for mapping:

for f in {chr∗.vcf noINDEL filtered} ; do echo $f ; perl {./perl/classifySNPs.pl} $f {mutant1} {TU} {WK} > {"$f" mutant1 TU WK.cn} ; done Here, the user has to indicate which strains were used in the map cross. It is important that the exact strain names used in the .vcf file be indicated. The strain names can be looked up by using the following command:

head -100 {mutant1 chr1.vcf header noINDEL} This command will show the first 100 lines in the specified file. A typical output of this command is shown in Figure 7.13.3. For both the T¨u as well as the WIK strain, data from two separate pools of fish are available. One pool from each strain contained fish maintained at the Max-Planck-Institute for Developmental Biology in T¨ubingen, Germany, indicated as TUG and WKG. The other pools were collected from fish maintained at Boston Children’s Hospital, referred to as TUB and WKB. These strain data can be used either separately—in this case, the above names would be used in the command—or they can be combined. In this case, the strain abbreviations TU or WK would be used.

Three output files for each chromosome will be created: Identification of Zebrafish Mutations Using NGS

Example file 1: chr1.vcf noINDEL filtered mutant1 TU WK.cn This file is used for calculating the mapping score (step 3) and analyzing the linked interval (Basic Protocol 4)

7.13.16 Supplement 104

Current Protocols in Molecular Biology

Figure 7.13.3 VCF file content. Represented are the first 100 lines of an example .vcf created in step 19 of Basic Protocol 2. The first 21 lines provide information about the format of the file content. The header of the file is boxed in red; highlighted in blue are the strain names of the four strains used in this example, NC31, AB, TUG, and WKG. The lines below the header list the genotypes of SNPs in each strain. A detailed description of this file format can be found at http://samtools.sourceforge.net/mpileup.shtml. For the color version of this figure, go to http:/www.currentprotocols.com/protocol/mb0713.

Example file 2: chr1.vcf noINDEL filtered mutant1 homnovelSNPs This second file contains all homogeneous SNPs that are unique to the mutant pool. These will later be used to identify candidate causative mutations (Basic Protocol 4).

Example file 3: chr1.vcf noINDEL filtered mutant1 hetSNPs The last file lists the coordinates and quality scores (in the SAMtools/BCFtools output format) of all SNPs classified as heterogeneous in the mutant pool.

2. Convert physical coordinates to cM coordinates:

for f in {1..25} ; do echo $f ; perl {./perl/convert bp to cM.pl} {./MGH map/chr"$f" cM} {chr"$f".vcf noINDEL filtered mutant1 TU WK.cn} > {chr"$f" mutant1 TU WK cM.cn} ; done In this step, the SNPs in the chr∗.vcf noINDEL filtered mutant1 TU WK.cn files are added up in 0.25 cM intervals.

3. Calculate the mapping score:

for f in {1..25} ; do echo $f ; perl {./perl/mapping score.pl} {chr"$f" mutant1 TU WK cM.cn} 20 > {chr"$f" mutant1 TU WK mappingscore 20cM} ; done The script used in this command will calculate a mapping score for 20-cM sliding windows, with 0.25-cM intervals. The size of the interval is specified by the number after the filename, here 20. In the case where more than 20 mutants are pooled and sequenced to higher coverage (>4×), the size of the interval can be adjusted, as a smaller homogeneous interval would be expected. This is an important point in the analysis in which the user can shape the analysis to fit tparticular experimental conditions and data.

4. Combine the mapping scores from each chromosome into one file:

cat chr1 mutant1 TU WK∗20cM chr2 mutant1 TU WK∗20cM chr3 mutant1 TU WK∗20cM chr4 mutant1 TU WK∗20cM chr5 mutant1 TU WK∗20cM chr6 mutant1 TU WK∗20cM

DNA Sequencing

7.13.17 Current Protocols in Molecular Biology

Supplement 104

chr7 mutant1 TU WK∗20cM chr8 mutant1 TU WK∗20cM chr9 mutant1 TU WK∗20cM chr10 mutant1 TU WK∗20cM chr11 mutant1 TU WK∗20cM chr12 mutant1 TU WK∗20cM chr13 mutant1 TU WK∗20cM chr14 mutant1 TU WK∗20cM chr15 mutant1 TU WK∗20cM chr16 mutant1 TU WK∗20cM chr17 mutant1 TU WK∗20cM chr18 mutant1 TU WK∗20cM chr19 mutant1 TU WK∗20cM chr20 mutant1 TU WK∗20cM chr21 mutant1 TU WK∗20cM chr22 mutant1 TU WK∗20cM chr23 mutant1 TU WK∗20cM chr24 mutant1 TU WK∗20cM chr25 mutant1 TU WK∗20cM > mutant1 TU WK mappingscore20cMall The output file will contain the following data: Column1: chromosome number Column2: genomic coordinate of the start of 20 cM windows, in bp Column3: mapping score Column4: ratio of homogeneous to heterogeneous SNPs Column5: ratio of reference alleles to mapping strain alleles. Any suitable graphing software can be used to create plots for the visualization of the mapping score across the genome. High mapping scores indicate possible linkage of the region to the mutation. A simple way to analyze the mapping score file is to open it in Excel and sort the data in column 3 in descending order. Now, the first row will show the chromosome and the genomic coordinate of the start of the 20-cM window with the highest mapping score. The first couple of rows should indicate the same chromosome. If a high mapping score is only found in one window on a chromosome, this may indicate a false positive result caused by the number of SNPs in the interval analyzed. As ratios are used to calculate the mapping score, the score will depend on the number of SNPs analyzed. Regions close to the telomeres are particularly prone to false positives, as the 20-cM windows are small due to the high frequency of recombination, and hence fewer SNPs are contained in these regions. One example on how to visualize the data used for calculating the mapping score can be found in Support Protocol 2. SUPPORT PROTOCOL 2

VERIFICATION OF LINKAGE Although we found that in all the mutants that we have mapped with this method, the interval with the highest mapping score always identified the region linked to the mutation, it is advisable, and important, to verify linkage by alternative methods.

Materials Software Integrative Genomics Viewer (IGV; http://www.broadinstitute.org/igv/) Files Variant file, chr1.vcf noINDEL filtered mutant1 TU WK.cn, in the .cn format, created in Basic Protocol 3. This file summarizes the genotypes and allele identities of high-quality SNPs in 20-kb intervals over the entire chromosome. Additional materials Identification of Zebrafish Mutations Using NGS

7.13.18 Supplement 104

Single mutant and sibling DNA from the F2 generation of a mapping cross Primers for SSLP or SNP markers Locations in the genome and primer information for SSLP markers can be found on the Ensembl Web site (http://useast.ensembl.org/Danio rerio/Info/Index) Current Protocols in Molecular Biology

1. Visualize the mapping data and generate a mapping score to identify linked regions. As a first step, the data used for calculating the mapping score can be analyzed in more detail, to define a region of homogeneity as a region potentially linked to the mutation. This can be done with the chr1.vcf noINDEL filtered mutant1 TU WK.cn file created in Basic Protocol 3, which contains all the raw data used to calculate the mapping score. This data can be visualized in IGV to get an idea about the genetic architecture of the potentially linked region.

a. Open IGV and select Zebrafish (Zv9/danRer7) from the genome drop-down menu in the top left corner. b. Select a chromosome from the drop-down menu next to the genome menu. c. Load the variant file into IGV by selecting File > Load from File and choose the file of interest. For example, chr1.vcf noINDEL filtered mutant1 TU WK.cn if you want to analyze chromosome 1. d. Convert heatmaps to line plots. This can be done either by a right-click on the line labeled as DATA TYPE, to change all tracks at the same time, or by a right-click on the track name, if only single tracks are modified. Select in the pop-up menu as Type of Graph > Line Plot. The default setting for this type of graph in IGV is a heatmap.

e. Change the track height. Similarly to the Type of Graph, the track height can be changed either by a right-click on the track or by selecting Tracks > Set Track Height from the menu in the top left corner. f. Analyze the data. The file opened contains seven tracks all of which represent numbers of specific SNPs contained in the previously identified SNP database, in 20-kb intervals over an entire chromosome (Bowen et al., 2012). For description of individual tracks see Table 7.13.3. The first track named {mutant1}het represents all heterogeneous SNPs from the mutant pool across the chromosome. This track can be used to identify the homogeneous region on the chromosome. As the number of recombination events decreases when getting closer to the mutation, a gradual decrease in the level of heterogeneity is expected at the borders of the interval (Fig. 7.13.4). If the region is near the ends of the chromosome, the decrease in heterogeneity may be quite sharp due to the increase in recombination. To zoom in on a region, the plus sign in the top right corner can be used, or one can click on the region marked in red on the top. The homogeneous region identified here, can after verification of linkage, be used in step 1 of Basic Protocol 4 for the identification of candidate mutations. A linked region should be characterized by a low level of heterogeneous SNPs and mapping strain specific alleles (tracks 1 and 6) and a high level of homogeneous SNPs and alleles specific to the mutagenized strain (tracks 2 and 4). These characteristics will also be represented by a high mapping score. The mapping score is calculated as follows in 20-cM sliding windows: # homogenous SNPs # mapping strain–specific alleles not found in the mutant × # mapping strain–specific alleles founnd in the mutant # heterogenousSNPs

For estimating the parental origin of a specific region, the ratio of mutagenized strain–specific alleles to mapping strain–specific alleles is most accurate. Most SNPs identified in the parental strains were heterogeneous, making the assignment of

DNA Sequencing

7.13.19 Current Protocols in Molecular Biology

Supplement 104

Table 7.13.3 Individual Tracks Contained in a chr1.vcf noINDEL filtered mutant1 TU WK.cn file.

Track name

Content

{mutant1}het

All heterogeneous SNPs

{mutant1}hom refnonref

All homogeneous SNPs

{mutant1}hom nonref

All SNPs homogeneous for the non-reference allele

{TU}SNP sharedmutant1

SNPs specific to the mutagenized strain that show an mutagenized strain specific allele, here a T¨u-specific allele, either heterogeneous or homogeneous

{TU}SNP notsharedmutant1

SNPs that are specific to the mutagenized strain and do not show the mutagenized strain–specific allele, here the T¨u-specific allele

{WK}SNP sharedmutant1

SNPs specific to the mapping strain that show a mapping strain–specific allele in the mutant, here a WIK-specific allele, either heterogeneous or homogeneous

{WK}SNP notsharedmutant1

SNPs specific to the mapping strain that do not show a mapping–strain specific allele in the mutant, here the WIK-specific allele

a Given are the names of each track as well as the data represented in the individual tracks.

strain-specific alleles difficult. For these reasons, we decided to define the parental origin of a region by the presence or absence of a mapping strain–specific allele. 2. Verify linkage. Linkage can be verified quite easily once a region with a high mapping score has been identified. For verification of linkage, microsatellite or SNP markers can be analyzed to address frequency of mapping strain–specific alleles and recombination events in pooled and single mutants (Nusslein-Volhard and Dahm, 2002).

a. Identify markers for linkage analysis. The most frequently used markers for linkage analysis are simple sequence length polymorphisms (SSLPs), as they are easily scored using standard agarose gel electrophoresis. SSLP markers in the candidate region can be found in the UCSC (UNIT 19.9) or Ensembl genome browsers (Flicek et al., 2012). Alternatively, SNP markers can be utilized. Here, the second file created in Basic Protocol 3, step1, chr1.vcf noINDEL filtered mutant1 homnovelSNPs, can be used to identify unique homogeneous changes in the region. Following Basic Protocol 4, these can be classified as being in coding or non-coding regions. So, these novel homogeneous changes can be used to verify linkage and potential candidate mutations simultaneously by capillary sequencing.

b. Establish marker polymorphism.

Identification of Zebrafish Mutations Using NGS

After identification of markers in the region, these markers need to be tested to find out if they are polymorphic in the particular cross used for mapping. The simplest way to test this is by analyzing pooled sibling DNA. As siblings are expected to be heterogeneous individuals, a pool of sibling DNA should show two different alleles if the marker is polymorphic. This can be seen in an SSLP marker as a double band on an agarose gel, or for a SNP marker as a double peak in capillary sequencing (Fig. 7.13.5).

7.13.20 Supplement 104

Current Protocols in Molecular Biology

Figure 7.13.4 Visualizing data used for mapping in IGV. Shown in the figure is a screen shot of an example file, chr16.vcf noINDEL filtered NC31 TU WK.cn, created, in step 1 of Basic Protocol 3, in IGV. A description of the single tracks seen in the figure can be found in table 7.13.3. This file can be used to cross check the results of the mapping score and to define a homogeneous region for the identification of potential candidate mutations. A linked interval (high mapping score) is characterized by a low level of heterogeneity (track 1) and a high level of homogeneity (track 2). In this example, two regions showing these characteristics can be found on the chromosome. One between 19 and 25 Mb and one between 40 and 51 Mb. A linked region is also defined as having a low level of SNPs with identity to the mapping strain (e.g., region between 40 and 51 Mb; track 6). The high level of mapping strain alleles thus excludes the region between 19 and 25 Mb; this is reflected in a low mapping score for this region. Within the linked interval defined by the mapping score, a more defined region of reduced/absent heterogeneity can be detected between 42 and 50 Mb. This region is a defined interval for harboring the candidate phenotype causing mutation.

C

A

homozygous sibling S

M

S

B

M

S

M siblings heterozygous sibling

mutants homozygous mutant

Figure 7.13.5 SSLP and SNP marker analysis. (A) SSLP marker analysis on pooled DNA samples from siblings (S) and mutants (M). Examples for a non-polymorphic (left), polymorphic unlinked (middle), and a polymorphic and linked marker (right). For a linked marker, one of the marker alleles co-segregates with the mutation. (B) Verification of linkage on single embryo DNA. Siblings show roughly the expected ratio of homozygous to heterozygous individuals. The band pattern indicates that the upper band is specific for the mapping strain, as all mutants are homozygous for the lower band. One exception is the mutant showing both bands (red asterisk), which indicates a recombination event. (C) Allele distribution for an SNP (highlighted in red) analyzed by capillary sequencing. Typically, an SNP linked to the mutation will either be heterozygous or homozygous for one allele. The mutants will be homozygous for the alternate allele. For the color version of this figure, go to http:/www.currentprotocols.com/protocol/mb0713.

DNA Sequencing

7.13.21 Current Protocols in Molecular Biology

Supplement 104

c. Identify recombination events in single mutants. Polymorphic markers can now be used on genomic DNA of single mutants, to establish linkage. A linked marker should show no or only few recombination events, indicated by the presence of two alleles (Fig. 7.13.5). The closer a marker to the mutation, the fewer recombination events should be detected. A recombination event should not be detectable in the same individual for markers on both ends of the linked area; this is evidence of a mis-sorted sibling fish in the mutant pool. An unlinked marker on the other hand should show random distribution of marker alleles, suggesting a false positive mapping score. BASIC PROTOCOL 4

IDENTIFICATION OF CANDIDATE MUTATIONS This protocol describes the steps necessary to identify candidate mutations in a specific genomic interval. First, all novel homogeneous variants in the linked interval are identified and then classified according to their effect on gene function. Similarly small deletions and insertions can be identified (Basic Protocol 5).

Materials Hardware Workstation (Unix, Linux, Mac OS X, Windows) with 64-bit CPU, 8 GB or more RAM, and 500 GB or more free disk space (alternatives to setting up a workstation for whole-genome data analysis in the lab are described in Strategic Planning of Mapping Experiments, above) Software Annovar (http://www.openbioinformatics.org/annovar/) Files Variant file containing homogeneous novel SNPs in the mutant, created in Basic Protocol 3 Text file containing all known SNPs annotated in the dbSNP database (danRer7 dbSNPs.txt) The danRer7 dbSNPs.txt file can be downloaded from the Resources section at http://fishyskeleton.com (see Support Protocol 1) NOTE: In the following protocol, use of the command line as well as verbatim user input and machine output and file/directory names are indicated by the courier font. File names, or locations of files that need to be replaced by the user are indicated with brackets {filename}. For example in line run {filename.txt} for file testsample1.txt, the finished command line should read:

run testsample1.txt NOTE: For command-line text entry, please note that simple ‘copy and paste’ from the PDF may introduce unexpected characters and hard returns that will result in processing errors. 1. Select homogeneous novel SNPs in a genomic interval:

awk '($2 >= 42000000)&&($2 = 42000000)&&($2 = 42000000)&&($2

Identification of mutations in zebrafish using next-generation sequencing.

Whole-genome sequencing (WGS) has been used in many invertebrate model organisms as an efficient tool for mapping and identification of mutations affe...
807KB Sizes 2 Downloads 0 Views