Molecular Phylogenetics and Evolution 80 (2014) 137–144

Contents lists available at ScienceDirect

Molecular Phylogenetics and Evolution journal homepage: www.elsevier.com/locate/ympev

Application of RAD-based phylogenetics to complex relationships among variously related taxa in a species flock Tetsumi Takahashi a, Nobuaki Nagata b, Teiji Sota a,⇑ a b

Department of Zoology, Graduate School of Science, Kyoto University, Sakyo, Kyoto 606-8502, Japan Division of Ecology and Evolutionary Biology, Graduate School of Life Sciences, Tohoku University, Aoba, Sendai 980-8578, Japan

a r t i c l e

i n f o

Article history: Received 15 April 2014 Revised 30 June 2014 Accepted 24 July 2014 Available online 7 August 2014 Keywords: Carabus Maximum-likelihood analysis Missing data Ohomopterus Orthologous sequence

a b s t r a c t Restriction site-associated DNA (RAD) sequences from entire genomes can be used to resolve complex phylogenetic problems. However, the processed data matrix varies depending on the strategies used to determine orthologous loci and to filter loci according to the number of taxa with sequence data for the loci, and often contains plenty of missing data. To explore the utility of RAD sequences for elucidating the phylogenetics of variously related species, we conducted RAD sequencing for the Ohomopterus ground beetles and attempted maximum-likelihood phylogenetic analyses using 42 data matrices ranging from 1.6  104 to 8.1  106 base pairs, with 11–72% missing data. We demonstrate that robust phylogenetic trees, in terms of bootstrap values, do not necessarily result from larger data matrices, as previously suggested. Robust trees for distantly related and closely related taxa resulted from different data matrices, and topologically different robust trees for distantly related taxa resulted from various data matrices. For closely related taxa, moderately large data matrices strongly supported a topology that is incompatible with morphological evidence, possibly due to the effect of introgressive hybridization. Practically, exploring variously prepared data matrices is an effective way to propose important alternative phylogenetic hypotheses for this study group. Ó 2014 Elsevier Inc. All rights reserved.

1. Introduction The recent development of restriction site-associated DNA (RAD) analyses using high-throughput sequencing technologies provides new, powerful approaches to various studies of nonmodel organisms without prior genomic information (e.g., Davey et al., 2011). In RAD sequencing, genome samples are digested using a restriction enzyme and sequenced from the restriction sites (Baird et al., 2008). This method can yield a vast number of short sequences using a next generation sequencer such as the Illumina HiSeq. Development of bioinformatic tools for RAD sequence data (e.g., Stacks, Catchen et al., 2011; PyRAD, Eaton, 2014) will promote its utility further. RAD sequencing has been successfully applied to linkage and quantitative trait locus (QTL) analyses using cross families (e.g., Baird et al., 2008; Baxter et al., 2011; Anderson et al., 2012; Everett et al., 2012; Hecht et al., 2012; Heliconius Genome Consortium, 2012; Martin et al., 2012; Takahashi et al., 2013), association studies using wild populations (Parchman et al., 2012; ⇑ Corresponding author. Address: Department of Zoology, Graduate School of Science, Kyoto University, Kitashirakawa-Oiwake-cho, Sakyo, Kyoto 606-8502, Japan. E-mail address: [email protected] (T. Sota). http://dx.doi.org/10.1016/j.ympev.2014.07.016 1055-7903/Ó 2014 Elsevier Inc. All rights reserved.

Hecht et al., 2013), population genomics (e.g., Hohenlohe et al., 2010; Bruneaux et al., 2012; Keller et al., 2013), phylogeny (e.g., Emerson et al., 2010; Heliconius Genome Consortium, 2012; Eaton and Ree, 2013; Jones et al., 2013; Wagner et al., 2013; Hipp et al., 2014; Cruaud et al., 2014), and identification of simple sequence repeat (SSR) markers (Barchi et al., 2011). However, practical questions remain about applying this method to phylogenetic studies of a group that includes variously related species (Rubin et al., 2012; McCormack et al., 2013). Some of these questions are common to phylogenetic approaches with other types of data, including phylogenomics (Wiens, 2003; Lemmon et al., 2009; Philippe et al., 2011; Wiens and Morrill, 2011). In phylogenetic analysis with RAD sequencing, major difficulties exist in establishing criteria for determining orthologous RAD sequences between samples and filtering RAD loci. Rubin et al. (2012) conducted an extensive in silico test using known genomic sequences for evaluating these criteria. They suggest that less strict criteria for clustering orthologous loci (i.e., higher allowance in sequence difference between samples) and filtering RAD loci (i.e., inclusion of loci shared by fewer samples) result in larger data matrices including more informative loci, and that the largest matrix results in robust topologies despite a large proportion of missing data. Overall, RAD sequences are considered useful for

138

T. Takahashi et al. / Molecular Phylogenetics and Evolution 80 (2014) 137–144

phylogenetic reconstruction in younger clades in which sufficient numbers of orthologous restriction sites are retained across species (Rubin et al., 2012). The application of RAD sequencing to the phylogeny of organisms from natural populations has just begun (McCormack et al., 2013). Some empirical studies have successfully resolved complex relationships among closely related species using RAD sequences (e.g., Heliconius Genome Consortium, 2012; Eaton and Ree, 2013). However, the manners in which different criteria for orthology searches and filtering of RAD loci affect phylogenetic inference have not been addressed in previous studies, except for Wagner et al. (2013), who compared the phylogenetic resolution of trees resulting from matrices with different RAD loci filtering thresholds with regard to the number of taxa recovered, and Cruaud et al. (2014), who examined four data sets with varying allowance of mismatches between taxa in determining orthologous RAD loci at a fixed minimum number of taxa recovered. Thus, further study is needed to understand the influence of different criteria for orthology searches and locus filtering on phylogenetic results to promote the utility of RAD-based phylogenetics. The subgenus Ohomopterus of the genus Carabus is one of the best models of rapid diversification resulting in a species flock, with 15 or more species on the Japanese islands in which diverged body size and genital morphology play a major role in reproductive isolation between species (Sota and Nagata, 2008). Ohomopterus diverged from its sister subgenus Isiocarabus in East Asia 2.14 million years ago during the Pleistocene (Sota and Nagata, 2008). Molecular phylogenetic studies using nuclear DNA revealed that this subgenus diverged into several species groups during the early Pleistocene, followed by differentiation within the species groups leading to the extant species (Sota and Vogler, 2003; Sota and Nagata, 2008). Among these species groups, the iwawakianus–insulicola group (Sota and Vogler, 2003) exhibits the most marked diversification in genital morphology. The six species are estimated to have diverged 0.76 million years ago, representing the most rapid speciation in this subgenus (Sota and Nagata, 2008). Previous molecular phylogenetic studies revealed that phylogenetic reconstruction of this subgenus is extremely difficult owing to introgressive hybridization between species and/or incomplete lineage sorting of ancestral polymorphism (Sota and Vogler, 2001, 2003; Sota and Sasabe, 2006; Sota and Nagata, 2008). In this study, we examined how varying thresholds for determining orthology and filtering loci affected the phylogenetic resolution using RAD sequence data from Ohomopterus beetles. We also examined whether the largest data matrix results in the most robust topology, as Rubin et al. (2012) suggested. Based on the specific pattern of radiation in Ohomopterus, we focused on both the iwawakianus–insulicola group, which represents the most extreme diversification in species-specific genitalia, and the remaining divergence mostly concerning splits of species groups (hereafter, ‘‘basal radiation’’), and we examined whether these radiations could be resolved in parallel in the same data matrix.

2. Materials and methods 2.1. Sampling We used 31 individuals from 15 species of Carabus (Ohomopterus) as the ingroup and one individual of Carabus (Isiocarabus) fiduciarius as the outgroup, as in a previous analysis (Sota and Nagata, 2008; Table 1). Only male adult beetles were used. Because the monophyly of each species within the iwawakianus–insulicola group was hard to recover in previous studies (Sota and Vogler, 2003; Sota and Nagata, 2008), we slightly increased the sample size for this species group. Samples were stored at 30 °C in either

an ethanol-fixed or fresh condition. The haploid genome size of Ohomopterus is estimated to be 260–280 million base pairs (Sota et al., 2013). 2.2. RAD sequencing RNA-free total genomic DNA was extracted from the testes of specimens using a DNeasy Blood and Tissue Kit (Qiagen). The concentration of DNA was determined using a Nanodrop 1000 (Thermo Fisher Scientific) and adjusted to 25 ng/ll. The RAD libraries were prepared according to Etter et al. (2011). PstI was selected as the restriction enzyme as it was expected to yield approximately 55,000 tags based on the estimated genome size of Ohomopterus and GC content of 35%. Each individual sample (10 ll containing 250 ng genomic DNA) was digested with the high-fidelity restriction enzyme PstI (New England Biolabs; NEB) in NEB Buffer 4 for 60 min, and barcoded with a unique five-nucleotide sequence. The barcodes were designed to be different at least two bases between samples. The library construction was conducted at Kyoto University until the P1 adaptor ligation and subsequently at Hokkaido System Science, Sapporo, Japan. All samples were run in a single lane of single-end 101-bp sequencing on a HiSeq2000 sequencer (Illumina) at Hokkaido System Science. 2.3. Data processing In the processing of sequence data for phylogenetic analyses, we focused on two important aspects following Rubin et al. (2012): the search for orthologous RAD sequences between samples, and the filtering of orthologous RAD loci according to the number of taxa with sequence data for the loci. For processing the de novo RAD sequences of Ohomopterus, we used PyRAD 1.621 (Eaton and Ree, 2013; Eaton, 2014). This software package allows for indels when clustering sequence reads into orthologous loci and probably performs better in orthology identification among distantly related taxa than the Stacks program (Catchen et al., 2011), which does not consider indels. The key parameter for identifying orthologous RAD sequences in PyRAD is the clustering threshold (hereafter, Wclust), which is the minimum proportion of identical nucleotides allowed for identification of orthologous RAD loci between samples. When Wclust is large, the identification is strict, but RAD loci with variable sequences among taxa can be excluded. When Wclust is small, various RAD loci from conservative to highly variable sequences will be included, with some misidentification of orthologs. Thus, Wclust can greatly affect phylogenetic inferences. The loci were filtered by choosing the minimum number of taxa that must recover a sequence of a particular RAD locus (hereafter Mintaxa, which is equivalent to ‘‘min. taxa’’ of Rubin et al., 2012 and ‘‘min individuals’’ of Wagner et al., 2013). At Mintaxa = 4, for example, a locus is included in the data matrix only when its allelic sequences are shared by four or more samples. A sequence of a particular RAD locus may not be recovered in some taxa due to evolution of the restriction site per se (i.e., absence of the sequence data on a particular loci in the sample), evolution of RAD loci sequences that exceeds Wclust (i.e., the sequence exists in the raw data but is not identified as orthologous to the particular RAD locus), misidentification of orthologous loci with three or more alleles in a sample (i.e., the sequence exists in the raw data but is removed from the data matrix), and experimental errors (e.g., the orthologous sequence exists in the genome but is not read or is misread during sequencing). With higher Mintaxa, the data contain a higher proportion of RAD loci associated with conserved restriction sites recovered from more taxa, but the data matrix size is reduced. With smaller Mintaxa, numerous RAD loci recovered within limited

139

T. Takahashi et al. / Molecular Phylogenetics and Evolution 80 (2014) 137–144

Table 1 Sample code, taxon, locality and year of collection, the number of restriction site-associated DNA (RAD) sequences that passed quality filtering, and the range of mean coverage depth of loci, which vary with clustering threshold (Wclust), of the samples examined. Code

Taxon

Ingroup: Carabus (Ohomopterus) CR01 C. albrechti albrechti CR02 C. lewisianus awakazusanus CR03 C. kimurai CR04 C. yamato CR05 C. japonicus tsushimae CR06 C. japonicus japonicus CR07 C. daisen daisen CR08 C. dehaanii CR09 C. dehaanii CR10 C. tosanus tosanus CR11 C. tosanus kawanoi CR12 C. yaconinus yaconinus CR13 C. yaconinus maetai CR14 C. iwawakianus kiiensis CR15 C. iwawakianus iwawakianus CR16 C. iwawakianus narukawai CR17 C. iwawakianus shima CR18 C. maiyasanus maiyasanus CR19 C. maiyasanus shigaraki CR20 C. maiyasanus takiharensis CR21 C. maiyasanus hokurikuensis CR22 C. uenoi CR23 C. uenoi CR24 C. arrowianus arrowianus CR25 C. arrowianus nakamurai CR26 C. arrowianus komiyai CR27 C. arrowianus komiyai CR28 C. esakii esakii CR29 C. esakii suruganus CR30 C. insulicola insulicola CR31 C. insulicola shinano Outgroup: Carabus (Isiocarabus) CR32 C. fiduciarius saishutoicus

Locality

Year

RAD sequences

Coverage depth

Shimamaki, Hokkaido Mt. Tomiyama, Boso, Chiba Tajima, Shimizu, Shizuoka Mt. Kongosan, Osaka Kamimisaka, Tsushima Is. Kabeshima I., Saga Kyusho-koen, Tottori Gokase, Miyazaki Yonezawa, Chino, Nagano Madotoge, Saijo, Ehime Mt. Tsurugi, Mima, Tokushima Mt. Yoshida, Sakyo, Kyoto Takahashi, Okayama Mt. Sanjogatake, Ohmine, Nara Mt. Kongosan, Osaka Minamihata, Suzuka, Mie Tozu, Watarai, Mie Mt. Uryu, Sakyo, Kyoto Mt. Aboshi, Ritto, Shiga Ohgasho, Ohdai, Mie Mt. Monju, Fukui, Fukui Mt. Kongosan, Osaka Mt. Katsuragi, Nara Iwakura, Toyota, Aichi Shimojo, Nagano Mt. Ogasayama, Shizuoka Mt. Akibayama, Shizuoka Izu, Shizuoka Maruno, Nirasaki, Yamanashi Yamakita, Kanagawa Toyoshina, Azumino, Nagano

2007 2011 2010 2011 2008 2011 2003 2008 2007 2008 2009 2011 2007 2005 2011 2002 2003 2009 2011 1999 2004 2011 2004 2004 2004 2002 2002 2006 2006 2009 2007

9,421,372 7,182,945 1,045,270 2,407,306 4,674,219 4,789,792 1,923,894 4,561,445 2,303,695 4,877,683 5,870,600 4,511,516 96,962 5,083,924 808,723 4,946,625 4,363,845 8,381,231 6,147,279 5,336,660 4,397,618 2,054,664 153,989 8,581,763 7,297,268 8,041,877 694,529 1,373,386 130,972 301,941 118,265

95–140 76–111 20–22 39–42 55–73 55–73 30–33 58–74 41–43 54–73 58–83 50–67 8.7–9.2 52–74 17–18 53–73 48–64 75–113 62–90 55–76 52–78 35–38 9.8–10 79–120 72–102 71–108 16–17 24–26 9.6–9.9 11 9.6–10

Mt. Hallasan, Jeju I., South Korea

1999

741,158

38–47

taxa are also included; hence the proportion of missing data increases. Using PyRAD, 42 data matrices with six Wclust values (50%, 60%, 70%, 80%, 90%, and 95%) and seven Mintaxa values (4, 8, 12, 16, 20, 24, and 28) were made. The raw RAD sequence data obtained from the HiSeq sequencer were processed under given Wclust and Mintaxa values as follows: The RAD sequences (101 base pairs in length) were sorted by their barcode sequences into individual samples. Bases with a Phred quality score lower than 20 were changed into ‘‘N’’s, and sequence reads with more than four ‘‘N’’s were discarded (i.e., quality filtering). Then, the barcode sequence (1st to 5th sites) and the restriction site (6th to 10th sites) were removed from each RAD sequence. In each sample, the processed sequences (11th to 101st sites; 91 base pairs in length) were clustered into orthologous loci under the given Wclust value. Note that we used the same clustering threshold to identify orthologous loci within samples and among samples; otherwise, sequences that were identified as different loci within a sample could be combined, undesirably, into the same loci in the later orthologous search among samples. In each sample, loci with coverage depths of less than three, four or more undetermined sites, four or more heterozygous sites, or three or more alleles were discarded. From these data, orthologous RAD loci were identified among samples under the given Wclust value and were filtered under the given Mintaxa value. The resulting orthologous loci, including both variable and invariable loci, were concatenated into a data matrix. 2.4. Tree search and evaluation Maximum-likelihood (ML) analysis was performed for each data matrix using RAxML ver. 8.0.0 (Stamatakis, 2014), and a rapid bootstrapping analysis with 100 bootstrap replicates was

conducted (Stamatakis et al., 2008). Throughout the ML analyses, we used a general-time-reversible model with gamma distributed rate variation (GTR+), which is the most inclusive model, because simpler evolution models may greatly reduce the accuracy of phylogenetic inference when data matrices contain large proportions of missing data (Roure et al., 2012). 3. Results 3.1. RAD sequencing and sequence data matrices We obtained a total of 142 million RAD sequences [DNA Data Bank of Japan (DDBJ): accession No. DRA001025]. The number of RAD sequences that passed quality filtering per sample ranged from 9.7  104 to 9.4  106 (123 million sequences in total; Table 1). The mean coverage depth of loci (the mean number of sequences per locus with three or more sequences) greatly varied among samples, from 8.7 to 141, partly due to low quality of some DNA samples. The size of data matrices increased as Wclust increased from 50% to 60% and decreased as Wclust increased from 90% to 95% (Fig. 1a). The size of the matrix at Wclust = 80% and Mintaxa = 4 was the largest among the matrices obtained. The proportion of missing data was naturally higher with smaller Mintaxa, and at a given Mintaxa, it was low at intermediate Wclust values (60– 90%) when Mintaxa 6 12 (Fig. 1b). In each data matrix, the proportions of missing data were high in the ingroup samples with mean coverage depth lower than 30 (Fig. 2) as well as in the outgroup, despite its reasonable mean coverage depth. The proportion of gaps in data matrices (0.1–1.5%; Fig. 1c) was much smaller than the proportion of missing data (11–72%). The proportion of single nucleotide polymorphisms (SNPs) increased with decreasing Wclust and Mintaxa when Wclust P 70% and Mintaxa P 12 (Fig. 1d).

140

T. Takahashi et al. / Molecular Phylogenetics and Evolution 80 (2014) 137–144

(a)

9

(b)

80

1.8

(c)

30

(d)

Min taxa = 4 1.6

70

6 5 8 4 12 3 16 2

20 24

1

25 1.4

60 50 40 30 20

Percentage of SNPs

7

Percentage of gaps

Percentage of missing sites

Length of matrix (M bp)

8

1.2 1.0 0.8 0.6

20

15

10

0.4 5

10

0.2

28 0

0 50 60 70 80 90 95

0.0 50 60 70 80 90 95

0 50 60 70 80 90 95

50 60 70 80 90 95

Clustering threshold (Wclust %) Fig. 1. Properties of the 42 data matrices used in the present study. (a) Length of concatenated sequence. (b) Percentage of missing sites. (c) Percentage of gaps. (d) Percentage of single nucleotide polymorphisms (SNPs). Lines are coloured according to Mintaxa value as shown in (a).

Percentage of missing sites

100 80

60

40

20

0 0

20

40

60

80

100

120

140

160

Mean coverage depth Fig. 2. Percentage of missing sites at each mean coverage depth of loci in the matrix with Wclust = 80% and Mintaxa = 4. Open circles indicate ingroup samples and the grey circle indicates the outgroup sample. Matrices with the other Wclust and Mintaxa values exhibited the same tendency, i.e., high proportion of missing data in samples with mean coverage depth of loci

Application of RAD-based phylogenetics to complex relationships among variously related taxa in a species flock.

Restriction site-associated DNA (RAD) sequences from entire genomes can be used to resolve complex phylogenetic problems. However, the processed data ...
2MB Sizes 0 Downloads 8 Views