J Mol Evol (2014) 78:148–162 DOI 10.1007/s00239-013-9603-y

ORIGINAL ARTICLE

Multiple ITS Haplotypes in the Genome of the Lichenized Basidiomycete Cora inversa (Hygrophoraceae): Fact or Artifact? Robert Lu¨cking • James D. Lawrey • Patrick M. Gillevet • Masoumeh Sikaroodi Manuela Dal-Forno • Simon A. Berger



Received: 5 September 2013 / Accepted: 20 November 2013 / Published online: 17 December 2013 Ó Springer Science+Business Media New York 2013

Abstract The internal transcribed spacer region (ITS) of the nuclear rDNA cistron represents the barcoding locus for Fungi. Intragenomic variation of this multicopy gene can interfere with accurate phylogenetic reconstruction of biological entities. We investigated the amount and nature of this variation for the lichenized fungus Cora inversa in the Hygrophoraceae (Basidiomycota: Agaricales), analyzing base call and length variation in ITS1 454 pyrosequencing data of three samples of the target mycobiont, for a total of 16,665 reads obtained from three separate repeats of the same samples under different conditions. Using multiple fixed alignment methods (PaPaRa) and maximum likelihood phylogenetic analysis (RAxML), we assessed phylogenetic relationships of the obtained reads, together with Sanger ITS sequences from the same samples. Phylogenetic analysis showed that all ITS1 reads belonged to a single species, C. inversa. Pyrosequencing data showed 266 insertion sites in addition to the 325 sites expected from Sanger sequences, for a total of 15,654 insertions Electronic supplementary material The online version of this article (doi:10.1007/s00239-013-9603-y) contains supplementary material, which is available to authorized users. R. Lu¨cking (&) Science & Education, Integrative Research Center, The Field Museum, 1400 South Lake Shore Drive, Chicago, IL 60605-2496, USA e-mail: [email protected] J. D. Lawrey  P. M. Gillevet  M. Sikaroodi  M. Dal-Forno Department of Environmental Science and Policy, George Mason University, Fairfax, VA 22030-4444, USA S. A. Berger The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany

123

(0.94 insertions per read). An additional 3,279 substitutions relative to the Sanger sequences were detected in the dataset, out of 5,461,125 bases to be called. Up to 99.3 % of the observed indels in the dataset could be interpreted as 454 pyrosequencing errors, approximately 65 % corresponding to incorrectly recovered homopolymer segments, and 35 % to carry-forward-incomplete-extension errors. Comparison of automated clustering and alignment-based phylogenetic analysis demonstrated that clustering of these reads produced a 35-fold overestimation of biological diversity in the dataset at the 95 % similarity threshold level, whereas phylogenetic analysis using a maximum likelihood approach accurately recovered a single biological entity. We conclude that variation detected in 454 pyrosequencing data must be interpreted with great care and that a combination of a sufficiently large number of reads per taxon, a set of Sanger references for the same taxon, and at least two runs under different emulsion PCR and sequencing conditions, are necessary to reliably separate biological variation from 454 sequencing errors. Our study shows that clustering methods are highly sensitive to artifactual sequence variation and inadequate to properly recover biological diversity in a dataset, if sequencing errors are substantial and not removed prior to clustering analysis. Keywords Basidiomycota  Dictyonema  Environmental sequencing  Next-generation sequencing

Introduction The internal transcribed spacer (ITS) of the nuclear ribosomal DNA cistron is the barcoding locus for Fungi

J Mol Evol (2014) 78:148–162

(Schoch et al. 2012), because of its versatility in resolving the species-level clades and the large amount of ITS sequence data already available (Pinto et al. 2004; Rossman 2007; Seifert 2008; Taylor and McCormick 2008; Eberhardt 2010; Begerow et al. 2010; Del-Prado et al. 2010; Hibbett et al. 2011; Schoch et al. 2012). Nuclear ribosomal DNA occurs in multiple copies in the genome and is easy to amplify from small quantities of template material. This is also disadvantageous, since multiple copies open up the possibility of paralogs and pseudogenes that do not reflect species phylogenies (Simon and Weiss 2008; Nilsson et al. 2008; Kova´cs et al. 2011a, b; Kiss 2012). It is assumed that the evolution of paralogs and pseudogenes in nuclear ribosomal DNA is prevented by concerted evolution (Hurst and Smith 1998; Liao 1999, 2008; Ganley and Kobayashi 2007). Yet, an increasing number of cases are being reported that suggest ITS paralogs in Fungi (Hughes and Petersen 2001; Ko and Jung 2002; Gomes et al. 2002; Smith et al. 2007; Simon and Weiss 2008; Nilsson et al. 2008; James et al. 2009; Lindner and Banik 2011; Kova´cs et al. 2011a, b; Li et al. 2013; Lindner et al. 2013). In plants, such cases were mostly interpreted as the result of hybridization (Hughes and Petersen 2001; Won and Renner 2005; Gulyas et al. 2005). On the other hand, it is unclear to what extent intragenomic variation detected through cloning, restriction fragment length polymorphism, and next-generation sequencing (NGS) (Gomes et al. 2002; Smith et al. 2007; Lindner and Banik 2011; Lindner et al. 2013) is real or artifactual. Critical screening of ITS Sanger sequences for selected taxa reveals that sequencing errors are frequent (Lu¨cking et al. 2012; Moncada et al. 2013). On the other hand, approaches using specific primers for ITS pseudogenes (Li et al. 2013) leave little doubt that such pseudogenes are a reality at least in some fungal lineages. Next-generation sequencing offers a powerful approach to the analysis of potential intragenomic variation on multicopy genes. It produces a large amount of sequence data in short time at low cost and allows for discrimination of biological entities within a single source without cloning (Margulies et al. 2005; Ahmadian et al. 2006; Rothberg and Leamon 2008; Nilsson et al. 2011). Apart from genome sequencing, these approaches are useful for analyzing microbial and fungal communities at the environmental level (O’Brien et al. 2005; Sogin et al. 2006; Geml et al. 2008; Taylor et al. 2008; Bue´e et al. 2009; Amend et al. 2010; Lumini et al. 2010; Wallander et al. 2010; Hibbett et al. 2011; Unterseher et al. 2011; Dai et al. 2012; McGuire et al. 2012). On the downside, NGS presents a number of problems, the most important being the huge amount of data generated, which requires approaches to data analysis that rely mostly on automated clustering (Li and Godzik 2006; Schloss et al. 2009; Edgar 2010, 2013; Caporaso et al. 2010a; Huang et al. 2010; Huse et al. 2010;

149

Kumar et al. 2011; Nilsson et al. 2011). Since clusters and consensus or seed sequences are based on overall similarity derived from pairwise alignments, testing the quality and consistency of individual reads becomes a challenge, because potential errors are masked. One of the most problematic issues is the overestimation and misinterpretation of biological units present in a sample, be it taxa or gene copies (Quince et al. 2009; Engelbrektson et al. 2010; Kunin et al. 2010; Porter and Golding 2011; Powell et al. 2011; Unterseher et al. 2011; Zhou et al. 2011). Improved filtering algorithms have reduced the number of errors, but they cannot guarantee that sequence reads are error-free, and a large amount of reads already generated and deposited in public databases such as the sequence read archive (SRA) have not undergone rigorous error testing. Various papers have reported and analyzed potential errors generated by 454 pyrosequencing, including duplication, chimaeras, substitution errors caused by over- and under-calls, and length variation of individual reads due to indels not present in corresponding Sanger sequences (Margulies et al. 2005; Huse et al. 2007; Quince et al. 2008; Hoff 2009; Gomez-Alvarez et al. 2009; Kunin et al. 2010; Niu et al. 2010; Tedersoo et al. 2010; Balzer et al. 2011). To filter potential sequencing errors or erroneous reads, flow values from the 454 pyrosequencing process are transformed to quality scores and used to detect lowquality reads prior to data analysis or to weight reads in clustering approaches (Huse et al. 2010). Other denoising methods use rank-abundance distribution of reads or a likelihood approach (Reeder and Knight 2010; Quince et al. 2011). Studies using duplicated 454 pyrosequencing runs from the same samples found no significant differences in the allocation of sequences to taxonomic units, suggesting that at least part of the detected variation is biological (Kauserud et al. 2012; Kumar et al. 2011). This was also supported by cloning techniques (Tedersoo et al. 2010; Kauserud et al. 2012), suggesting that 454 pyrosequencing errors are low relative to biological variation. However, the assumption that length variation in non-singleton clusters is biological, while in singletons it is artifactual (Kunin et al. 2010; Tedersoo et al. 2010; Unterseher et al. 2011), is ill-defined, since clustering methods are unable to distinguish between biological and artifactual sources of length variation, and hence cluster sizes cannot be used to categorize such variation. If the frequency of sequence reads representing separate taxa approximately reflects the abundance of the taxa themselves, with variation depending on PCR settings, it would in fact be natural to assume that a large number of biological taxa in a NGS dataset are represented by singletons only, a notion pointed out by Unterseher et al. (2011). A major problem of 454 pyrosequencing is the accurate detection of homopolymer segments. Since these are

123

150

sequenced in a single step, the only way to determine the number of bases per segment is by measuring the chemiluminescent signal strength; the stronger the signal, the more bases were added in a given step. Signal strength is thereby calibrated against an initial key of four bases. In order to produce a signal that correctly represents the homopolymer length at a given position, the fragments present in a well must be completely extended, which depends on the number of DNA templates in the well, homopolymer length, and the amount of fluorescent base molecules flooded during each cycle. In reality, the extension follows a Poisson distribution for a given well; most fragments will be fully extended, whereas a small portion will only be partially extended. This leads to a suboptimal light signal for the well which will then be interpreted as a homopolymer of shorter length (Margulies et al. 2005; Huse et al. 2007; Gomez-Alvarez et al. 2009; Kunin et al. 2010; Niu et al. 2010; Tedersoo et al. 2010; Balzer et al. 2011). In addition, the DNA templates in the well that are not fully complemented become desynchronized and will be completed during the next cycle with the same flow base. Depending on calibration, this additional signal may disappear as background noise or the sequences are filtered out, since the additional signal is below the calibrated threshold. However, if the proportion of desynchronized templates is sufficiently large, the well can produce an additional signal interpreted as an extra base call. Since flow cycles of the same base are separated by other base flow cycles, this extra base call appears several bases after is corresponding homopolymer and hence is not detectable as an error. This is known as carry-forwardincomplete-extension (CaFIE) error; it is rarely mentioned or accounted for correctly in 454 studies and usually not considered a significant source of error, and sometimes even assumed to be automatically corrected by the analytical software filtering the reads (454 Life Sciences Corporation 2009; Quince et al. 2009, 2011; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011; Keegan et al. 2012). Two recent studies proposed automated pipelines including corrections for CaFIE errors (RC454: Henn et al. 2012; ICC: Deng et al. 2013); however, upon testing, these either did not work properly (RC454) or did not provide a full set of corrected reads as results (ICC). Also, both methods rely on quality scores obtained from the sequencing process to perform corrections, although CaFIE errors technically cannot be detected using quality scores. As part of a 454 study using the ITS1 region of the fungal ITS barcoding locus to detect fungal inhabitants and contamination in thalli of the lichenized basidiomycete genus Dictyonema and its allies (Lawrey et al. 2009; DalForno et al. 2013), we obtained a large set of target mycobiont reads from three samples belonging to a single

123

J Mol Evol (2014) 78:148–162

species, Cora inversa, but with a large amount of variation of initially undetermined origin. Clustering techniques suggested this variation to represent a large number of biological entities, whereas multiple-alignment-based phylogenetic methods using a maximum likelihood approach clustered them into a single species. To resolve this conflict, we analyzed the same samples repeatedly under different conditions, to discriminate between biological and artifactual variation and to quantify the proportion of each within the studied samples.

Materials and Methods Both Sanger and 454 sequence data were obtained from three samples of C. inversa from Colombia (Table 1), as part of a larger set of samples run together in two sequencing runs, MTPS25 and MTPS57. The samples were targeted using fungal-specific primers for the entire ITS (Sanger) and the ITS1 region (454), the latter producing reads of a target length of 325 bases length for this taxon. The ITS was selected since it is the fungal barcoding locus (Hebert et al. 2003; Schoch et al. 2012) and since we used it successfully to address genus and species delimitation in the Dictyonema clade (Lawrey et al. 2009; Dal-Forno et al. 2013). Genomic DNA was extracted from lichenized thalli using the Bio 101 Fast DNA Spin Kit for tissue (Qbiogen, Illkirch, France). About 10 ng of extracted DNA were subjected to a standard PCR in a 25 lL reaction volume using either Taq Gold polymerase (Applied Biosystems, Foster City, California, USA) or Bio-X-Act Long Mix (Bioline USA, Taunton, Massachusetts, USA). Sequence data were obtained from the ITS using the universal ITS primers ITS5 (forward) and ITS4 (reverse) for Sanger sequencing (ITS1, 5.8S, ITS2), and the primers ITS1F (forward) and ITS2 (reverse) (ITS1 region) for 454 pyrosequencing (http://www.biology.duke.edu/fungi/myco lab/primers.htm). After visualizing the PCR products with ethidium bromide on a 1 % agarose gel, the products were purified with magnetic beads (Agencourt Biosciences, Beverly, MA, USA) and used in standard sequencing reactions for Sanger sequencing with BigDye Terminator Ready Reaction Mix v3.1 (Applied Biosystems). The sequencing reactions were then purified using Sephedex G-50 (Sigma-Aldrich, St. Louis, MO, USA), dried in a speedvac, denatured in HiDi Formamide (Applied Biosystems) and run on an ABI3130-xl capillary sequencer (Applied Biosystems). The data collected were analyzed using ABI Sequence Analysis software and sequences were assembled with Sequencher v4.7 software (GeneCodes, Ann Arbor, MI, USA). Sanger ITS sequences were aligned with sequences of other species of Cora obtained from GenBank (same dataset as in Dal-Forno et al. 2013).

J Mol Evol (2014) 78:148–162

151

Table 1 Genbank and SRA and BioProject accession numbers for sequences used in this study. The SRA accession number for the entire dataset is SRP033721 Sequences

DIC149

DIC150

DIC154

Sanger ITS barcoding gene

KF443236

KF917526

KF443237

454 pyrsequencing ITS barcoding gene (consensus)

KF922370

KF922371

KF922372

454 pyrsequencing ITS barcoding gene (full dataset)

SRX391441, SRS513899

SRX391469, SRS513900

SRX391486, SRS513901

For 454 pyrosequencing, the samples were initially amplified with ITS primers, ITS1F (forward) and ITS2 (reverse) primers attached to a FAM label, allowing us to fingerprint the products to confirm the size and distribution of the community (Sikaroodi and Gillevet 2012). The forward and reverse primers had different custom tags and an A and B adapter, respectively (titanium adapters from the 454 pyrosequencing protocol), making it possible to pool different samples into one emulsion PCR. After confirming the PCR products on a 1 % agarose gel with ethidium bromide and fingerprinting them, the products were pooled and subjected to an emulsion PCR using the 454 emPCR_LibL kit, with varying number of cycles (see below). Pyrosequencing was done on the recovered beads from the emulsion PCR using 454 sequencing kits. The sequence results were separated into bins for different tags using a custom made perl script (Sikaroodi and Gillevet 2012). Three different combinations of settings and conditions were used for each repeat emulsion PCR and 454 pyrosequencing run, all based on the same community PCR and hence on the same biological templates for the three samples of C. inversa: (1) initial emulsion PCR plus initial 454 pyrosequencing settings (repeat 1 in sequencing run MTPS25); (2) initial emulsion PCR plus modified 454 pyrosequencing settings (repeat 2A in sequencing run MTPS57); (3) de-novo emulsion PCR plus the same modified 454 pyrosequencing settings (repeat 2B in the same sequencing run MTPS57). The modified settings mainly concerned optimization between repeats 1 and 2A/ B. Since tests showed that the suggested standard 50 cycles for emulsion PCR resulted in less reads passing the quality filters and hence less yield and higher proportion of sequening errors, we modified the settings to 45 cycles of emulsion PCR for repeats 1 and 2A (initial emulsion PCR) versus 40 cycles for repeat 2B (de-novo emulsion PCR). As a result, repeats 1 and 2A produced approximately 5 % bead recovery, with all beads loaded on the sequencing plate, whereas repeat 2B resulted in approximately 10 % bead recovery, half of which were loaded on the sequencing plate. In addition, between sequencing runs 1 (MTPS25) and 2A/B (MTPS57), the valves of the GS Junior were replaced. Since yield and proportional read quality improved consistently between repeats 1 and 2A/B (from less than 40 % to over 70 % yield), the overall

quality of the sequencing results was categorized as moderate (repeat 1) and good (2A/B). As a consequence, substantial differences in the porportional frequency of read variation patterns (e.g. the same indel at the same position) detected between repeats 1 and 2A would have to be attributed to systematic sequencing errors, whereas differences between repeats 2A and 2B would be interpreted as caused by variation in the emulsion PCR settings. In contrast, patterns of real biological variation would result in comparable proportional frequencies across repeats or (slightly) increased frequency in repeats 2A/B, since it can be assumed that repeats 2A/B would yield a higher proportion of accurate reads compared to repeat 1. Since, the 454 pyrosequencing amplifies and sequences all fungi present in the samples, the resulting reads were subjected to a filtering process to select only those ITS1 reads corresponding to C. inversa. The reads were ‘‘predigested:’’ by merging all reads from the three samples into a single fasta-formatted file. This dataset was analyzed using a combined USearch-Perl-Windows script in batch format (Online Appendix A). The script uses USearch (Edgar 2010, 2013) and the Perl script fasta-to-tab-delimited format (and viceversa) provided by the FAS Center for Systems Biology of Harvard University (http://sysbio.har vard.edu/csb/resources/computational/scriptome/Windows/ Tools/Change.html), as well as the Windows findstr command. It arranges all reads according to length and then clusters the reads at a prespecified ID level, producing a combined fasta-formatted file containing the reference read (the longest read) for each cluster. We used 10.0, 0.1, 1.0, and 0.1 as internal and external gap opening and gap extension penalties (Edgar 2010), to minimize the formation of separate clusters based on read length rather than base sequence patterns. The ID level was set to 0.90 to minimize splitting of closely related taxa into separate clusters, while at the same time making the clusters consistent enough to allow for de-novo alignment of the contained reads. The script generates a modified fasta file of all reads, sorted by their cluster assignments from the previous step, which is aligned using the quick alignment algorithm provided by USearch StarAlign (Edgar 2010) and then converted to a tabdelimited text file, and the contained clusters are parsed out into separate fasta files. In addition to parsing the entire dataset into separate clusters at ID = 0.90, we also analyzed cluster formation of the dataset at sequential ID levels

123

152

(0.50, 0.60, 0.70, 0.80, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99) to obtain the number of potential OTUs obtained at each level. To provide taxonomic identifications of the reads and extract those representing C. inversa, each of the clusters obtained at the ID = 0.90 level was batch-blasted (i.e. with blast results for each individual read) against a curated database combined from IAB and UNITE (Ko˜ljalg et al. 2005, 2013; Nilsson et al. 2005, 2006, 2008, 2009; Abe et al. 2011) plus a Sanger sequence database obtained from a large number of Dictyonema s.lat. samples including C. inversa. The clusters were also batch-blasted against GenBank sequences using a threefold approach: (1) including, (2) excluding uncultured/environmental sample sequences, and (3) if the best hit from (1) was an uncultured/ environmental sequence that sequence was blasted again excluding uncultured/environmental sample sequences. Thus, we obtained four separate blast results for each read and its taxonomic identity was determined based on consistency of these results. The reads were then separated into (A) those blasting with C. inversa and other species of Dictyonema, (B) those blasting with closely related Basidiomycota, and (C) those blasting with other Basidiomycota and Ascomycota or lower fungi. While the reads of group (C) were not further considered, those of group (B) were subjected to a more specific local blast including only Dictyonema and closely related Basidiomycota (Agaricales) as reference sequences, to detect further reads representing Dictyonema which were then added to group (A). All reads of group (A) were then aligned to a Sanger ITS reference alignment representing all sequenced species of Dictyonema s.lat. (including Cora; Dal-Forno et al. 2013) using PaPaRa 2.0 (Berger and Stamatakis 2011; Berger et al. 2011). The alignment was analyzed phylogenetically with a maximum likelihood approach using the RAxMLHPC BlackBox 7.3.2 on the Cipres Gateway server (Stamatakis 2006; Stamatakis et al. 2005, 2008; Miller et al. 2010), employing a GTR-Gamma model, with parametric bootstrapping using an automatically determined number of replicates as implemented by the RAxML-HPC BlackBox by means of a saturation assessment. A total of 16,665 reads was found to cluster with the Sanger reference sequences of C. inversa and these were retained for further study. The retained 454 ITS1 reads of C. inversa were realigned with a Sanger ITS reference alignment of the same species to determine indel sites. MAFFT (Katoh et al. 2002, 2009) was unable to accurately align the length variable ITS1 reads, failing to recognize all homologous base and indel positions. Instead, accurate results were achieved with PaPaRa 2.0 (Berger and Stamatakis 2011), which aligns length variable 454 reads to a known Sanger reference alignment, introducing gaps into the reference alignment where the 454 reads contain additional base

123

J Mol Evol (2014) 78:148–162

calls. For this purpose, PaPaRa uses a probabilistic gap model based on the fixed alignment and the best-scoring maximum-likelihood tree derived via RAxML (Stamatakis 2006) from that alignment. This method produces accurate alignments superior to those generated with MAFFT, and the position of insertions is easily discernible. In addition, PaPaRa allows for the generation of insertion-free alignments of the 454 reads by optionally deleting all insertion sites not present in the Sanger reference sequences. PaPaRa was also used to compute read lengths with and without insertions for each 454 read, and to calculate the relative proportion of insertions across all reads. The ‘‘Positional Nucleotide Numerical Summary’’ function in BioEdit 7.0.9 (Hall 1999) was used to compute the base composition at each alignment position to determine the nature of each of the indels across all the reads overall, separated by sample and by run. Indels were classified into two major site categories, homopolymer extensions and heteropolymer insertions. The latter were further analyzed for site-specific variation among reads. In order to test the effect of indels in the 454 ITS reads on the estimation of taxonomic diversity by means of automated clustering methods, we used another combined USearch-Perl-Windows script (Online Appendix B), which automatically computes the number of taxonomic units based on a predefined threshold similarity level by parsing out the seed sequences for each assumed taxonomic unit into a fasta-formatted file. The script was applied to the complete set of reads corresponding to C. inversa and the following threshold similarity levels were applied: 0.50, 0.60, 0.70, 0.80, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.00. We also randomly selected different numbers of reads (16, 32, 64, 125, 250, 500, 1000, 2000, 4000, 8000, 16000) and ran the script for each set to compute the relationship between read frequency and estimate of biological diversity.

Results The total yield of the two sequencing runs MTPS25 and MTPS57 was 41,465 and 84,368 reads, respectively, for the samples corresponding the Cora and Dictyonema. After separation of reads corresponding to other species of (endolichenic) fungi or contaminants, and separating the remaining reads according to genera and species, the number of ITS1 reads for the target mycobiont of the three samples of C. inversa and three repeats (1, 2A, 2B) was 16,665, with a range of 1,269–2,465 reads per sample and repeat (Table 2). The largest number of reads was generated for sample DIC154 and repeat 2B; for all samples, repeats 2A/B had a substantially higher yield than repeat 1. The target length for the ITS1 reads according to Sanger

J Mol Evol (2014) 78:148–162

153

Table 2 Number of 454 ITS1 reads obtained for each sample of C. inversa in each of the runs Run

DIC149

DIC150

DIC154

Total

Run 1 (setting 1)

1,269

1,668

1,341

4,278

Run 2 (setting 2, community PCR of run 1)

1,830

1,943

2,318

6,091

Run 3 (setting 2, new community PCR)

1,588

2,243

2,465

6,296

Total

4,687

5,854

6,124

16,665

Fig. 1 Histogram of read lengths of the 16,665 454 ITS1 reads corresponding to C. inversa generated from the three samples and three runs

sequences from the same samples was 325 bases. The 454 ITS1 reads varied in length from 100 bases (predetermined lower length cut-off) to 349 bases (Fig. 1), with the 22 longest reads found in repeat 1 of samples DIC150 and DIC154. The majority of the reads (88 %) was complete to near-complete, with 300–349 bases length; only 5.7 % of the reads had less than 200 bases (Fig. 1). The number of insertion sites per read varied from 0 to 25, with most of the larger values found in reads of repeat 1 of samples DIC150 and DIC154; however, the largest value was found in a read of repeat 1 of sample DIC149 (Table 3). On average, reads from repeats 2A/B had a significantly lower number of insertions (0.20–0.44 per sample) than those from repeat 1 (2.30–3.33 per sample; Kruskal–Wallis ANOVA, p \ 0.001). Relative to read length, this difference was even more pronounced, with 0.07–0.15 % (range 0–5.37 %) in reads of repeats 2A/B, and 0.84–1.23 % (range 0–14.4 %) in reads of repeat 1 (Table 4). While some reads had up to 14.4 % of their bases corresponding to insertions, the large majority of

reads even in repeat 1 had less than 1 % of insertion sites, corresponding to about 2–3 bases per read. The total length of the entire PaPaRa alignment of all 16,665 454 ITS1 reads plus nine corresponding Sanger reference sequences was 591 bases, including 325 sites corresponding to the Sanger reference sequences and 266 insertion sites obtained from the 454 ITS1 reads (Online Appendix C). Thus, 45 % of the PaPaRa alignment represented insertion sites. Of the 591 sites, positions 1–15 corresponded to the nuSSU primer and positions 505–591 to the highly conserved 5.8S region including the 5.8S primer. The insertion sites could be classified into three categories: (1) homopolymer extensions; (2) heteropolymer insertions; and (3) mixed sites which, depending on the base call in individual reads, can be interpreted as either homopolymer extensions or heteromer insertions (Online Appendix D). A total of 115 positions, five corresponding to the ITS5 forward primer and ten to the 5.8S region, were classified as homopolymer extensions; 47 sites represented heteropolymer insertions (one in the primer and seven in

123

154

J Mol Evol (2014) 78:148–162

Table 3 Number of insertions (mean and range) per read obtained for each sample of C. inversa in each of the runs Run

DIC149

DIC150

DIC154

Run 1 (setting 1)

2.30 (0–25)

3.33 (0–24)

3.11 (0–24)

Run 2 (setting 2, community PCR of run 1)

0.44 (0–15)

0.20 (0–18)

0.31 (0–16)

Run 3 (setting 2, new community PCR)

0.44 (0–12)

0.24 (0–14)

0.26 (0–14)

Table 4 Relative proportion of insertions (mean and range) per read (relative to total read length including insertions) obtained for each sample of C. inversa in each of the runs DIC149 (%)

DIC150 (%)

DIC154 (%)

Run 1 (setting 1)

0.84 (0–8.04)

1.23 (0–14.4)

1.20 (0–9.96)

Run 2 (setting 2, community PCR of run 1)

0.15 (0–4.80)

0.08 (0–4.15)

0.09 (0–4.14)

Run 3 (setting 2, new community PCR)

0.15 (0–4.44)

0.07 (0–5.37)

0.11 (0–5.06)

the 5.8S region); 52 sites were mixed sites which could not be unambiguously classified as either homopolymer extensions or heteropolymer insertions (one in the ITS5 forward primer and seven in the 5.8S region); and 53 positions had insertions in no more than two reads and were classified as random (one in the ITS5 forward primer and 15 in the 5.8S region). Since homopolymer extensions can be explained by incorrectly calibrated chemiluminescent sensor sensitivity (base overcall or undercall), we focused further analysis on heteropolymer insertions. Of the 47 heteropolymer insertion sites, 31 were present in ten or more reads, providing a sound basis for statistical analysis (Table 5). A further 13 were present in three or more (but less than ten) reads. All of these 44 insertion sites had either a single, uniform base call in all reads across all three samples and repeats (31 sites) or a strongly dominating base call with a single exception in no more than one or two reads (13 sites; Online Appendix D). This implies that the nature of the insertion was practically the same in all reads across all samples and repeats for a given site, independent of the emulsion PCR and sequencing parameters, as well as independent of the well position corresponding to individual reads. This pattern was found to be highly significantly different from random (Chi Square test; p \ 0.001). Heteropolymer insertions were most frequently found in repeat 1 (all three samples), with sample DIC150 having the highest proportion of reads with insertions present (Table 5). In most cases, the proportion of reads with insertions was significantly lower in repeats 2A/B, although the same insertions were found in these reads across all samples. In some cases, insertions were about equally distributed across reads from all samples and repeats (positions 174, 180, 259, 389, 424, 455), and in a few others, they were more frequent in reads from repeats 2A/B (positions 264, 297, 515, 547). Overall, we found 15,654 insertions in the 16,665 reads (0.94 per read on

123

average), 8,066 of which corresponded to homopolymer extensions, and 3,437 to heteropolymer insertions; 4,053 corresponded to mixed sites (both homopolymer extensions and heteropolymer insertions) and 106 to random sites. Of the 3,437 heteropolymer insertions, 3,368 (98 %) corresponded to the 31 insertion sites present in ten ore more reads. Without exception, all of these occurred 1–3 bases after a homopolymer segment of the same base call, corresponding to partial deletions in the homopolymer segment and hence representing CaFIE errors. The remaining heteropolymer insertions, as well as those in mixed sites, showed the same patterns, even in those cases where insertions were equal in frequency between repeats or more frequent in repeats 2A/B (total of 263 out of 3,368). Thus, of the total of 15,654 insertions, up to 15,548 (99.3 %) could be attributed to either homopolymer extension or CaFIE errors. In addition to insertions, we also detected 3,279 substitutions in 187 sites aligned to those present in the Sanger references sequences, corresponding to 0.06 % of all base calls. The number of reads with substitutions ranged from 1 to 806, with 184 sites having 31 or less reads with substitutions (less than 0.2 % of all reads) and three sites having a larger number of reads with substitutions (124, 129, 806). A total of 118 substitution sites had 2,839 substitutions. These were highly consistent across all reads for each position and apparently not artifacts, since 2,677 substitutions (94 %) exhibited exactly the same base call independent of read and run. Inspection of individual reads showed that they might in part represent masked CaFIE errors, incorrectly aligned with non-homologous sites in the Sanger reference sequences due to adjacent deletions; however, we found no significant differences in relative frequency of these substitutions among repeats, suggesting that they might also represent real biological substitutions. For example, the site with the largest number of reads with substitutions, 806 out of 16,665 (4.8 %) at position 125,

J Mol Evol (2014) 78:148–162

155

Table 5 Summary of 31 heteropolymer insertion sites present in ten or more reads, sorted in descending order by number of reads in which insertion was present

Numbers are given as totals and also parsed out according to run and sample. Dominant and alternative base calls are indicated, as well as total number of reads with gaps at (insertion absent) the corresponding position

123

156

J Mol Evol (2014) 78:148–162

invariably had G in the Sanger reference sequences and A in 805 out of 806 reads with substititions: 224 out of 4,278 reads in repeat 1 (5.2 %), 301 out of 6,091 reads in repeat 2A (4.9 %), and 280 out of 6,296 reads in repeat 2B (4.4 %). The next two positions had much fewer reads with substitutions (129 and 124, respectively, or 0.74–0.77 %), but with similarly consistent substitution patterns, replacing G with A (position 193) and A with T (position 256). The number of OTUs present in the dataset of 16,665 reads was estimated between 3 and 5,945 depending on the threshold level, with 37 at 95 % (Table 6). After correcting the reads using PaPaRa (insertions automatically removed), the corresponding estimates were between 1 and 3,425, with 4 at 95 % (Table 6). Since, phylogenetic analysis demonstrated that all reads belonged to C. inversa, the application of the cluster analysis on the original reads resulted in a 11-fold, 35-fold, and 137-fold overestimation of biological diversity using the commonly applied threshold ID levels of 90, 95, and 97 %. For the corrected reads (insertions removed but substitutions retained), biological diversity was estimated correctly up to 80 % ID level, whereas at the 90, 95, and 97 % ID level, overestimation by the factor 2, 4, and 8 occurred (due to the presence of substitutions). This represented a 5–16-fold improvement compared to uncorrected reads.

Table 6 Number of assumed biological units derived from the complete set of 454 ITS1 reads of C. inversa resulting from the three samples and runs, using the original reads and reads after removal of insertions with PaPaRa 2.0 (Berger and Stamatakis 2011) Threshhold ID level

No. of biological units estimated using original reads

No. of biological units estimated using reads with insertions removed

There was a highly significant linear correlation between read frequency and estimate of biological diversity at all threshold levels, with R = 0.86 at 90 %, R = 0.94 at 95 %, and R = 0.96 at 97–100 % ID threshold level (Table 7). For up to 16 reads, threshold levels of up to 97 % predicted the biological diversity correctly, but this value dropped substantially with higher read frequency (Fig. 2). Using 95 % threshold, 32–125 reads overestimated the biological diversity twofold, and at 90 %, more than 1,000 reads overestimated the biological diversity at least threefold. At 1,000 reads, only a threshold level of 80 % or lower predicted the biological diversity correctly.

Discussion Our findings of substantial amount of variation in 454 pyrosequencing data of the lichenized fungus C. inversa at first glance appears in line with an increasing number of NGS studies report intragenomic variation in the ITS and other portions of the nuclear rDNA cistron (Lindner et al. 2013). However, in contrast to other studies, we found no evidence for underlying biological causes for most of this variation. Instead, while potentially real biological substitutions accounted for only 0.06 % of all corresponding base calls, we attribute more than 99 % of the observed insertions to either homopolymer extension or CaFIE errors (Margulies et al. 2005; Rothberg and Leamon 2008; 454 Life Sciences Corporation 2009; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011). Technically, CaFIE errors cannot be distinguished from biological variation by inspecting individual reads, since Table 7 Spearman rank correlation between read frequency and estimate of biological diversity (see Fig. 2)

0.50

3

1

0.60

3

1

Threshold (%)

RPearson

0.70

3

1

50

0.781379

3.95947

0.0026

0.80

4

1

60

0.781379

3.95947

0.0026

0.90

11

2

70

0.781379

3.95947

0.0026

0.91

10

2

80

0.781379

3.95947

0.0026

0.92

13

2

90

0.856411

5.24548

0.0004

0.93

15

2

91

0.880705

5.87975

0.0002

0.94

24

3

92

0.907617

6.83693

0.0000

0.95

35

4

93

0.916397

7.23987

0.0000

0.96

65

5

94

0.947050

9.32717

0.0000

0.97

137

8

95

0.943685

9.01994

0.0000

0.98 0.99

315 980

19 116

96

0.943685

9.01994

0.0000

97

0.958042

10.56978

0.0000

1.00

5,945

3,425

98

0.958042

10.56978

0.0000

99

0.958042

10.56978

0.0000

100

0.958042

10.56978

0.0000

The program USearch (Edgar 2010, 2013) was used for the automated clustering analysis, with predefined threshhold ID levels ranging from 0.50 to 1.00

123

t (N–2)

p level

J Mol Evol (2014) 78:148–162

157

Fig. 2 Read frequency (randomly selected out of total of 16,665 reads, with ten repeats per frequency level) and resulting estimate of biological diversity. Colors indicate level of correctness compare to real biological diversity (a single species) (Color figure online)

both represent systematic positional variation, with the same insertion consistently appearing at the same position in different reads. CaFIE errors thus mimic indel variation found in Sanger sequences between closely related species (Dal-Forno et al. 2013). Automated denoising methods (Quince et al. 2009, 2011; Caporaso et al. 2010a; Reeder and Knight 2010) cannot adequately detect these errors, since they focus on signal flow variation of homopolymer segments, but do not take into account associated heteropolymer insertions. Another problem is that CaFIE errors might be masked as substitution errors if reads that contain both insertions and deletions are not properly aligned; this might explain why some studies on sequence errors report very low proportions of indels versus substitutions (Keegan et al. 2012). The difficulty of assessing CaFIE errors might be the reason why previous authors considered them of little significance compared to other potential error sources (Margulies et al. 2005; Rothberg and Leamon 2008; 454 Life Sciences Corporation 2009; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011). While, CaFIE errors can be theoretically predicted by comparing heteropolymer sequence patterns immediately following homopolymer segments, the only way of actually detecting such errors is the comparison of separate sequencing runs of the same community PCR samples, which permits assessing frequency patterns: if the variation is largely biological, one would expect comparable proportions of positional variation in separate runs, as was here observed for substitutions and a small proportion of insertions. In addition, it is necessary to analyze a larger number of reads from a given sample and species and compare reads with Sanger sequences from the same samples, to accurately recognize erroneous insertion sites. The previous studies using duplicated 454 pyrosequencing runs from the same samples found no significant differences in the allocation of sequences to taxonomic units, suggesting that at least part of the detected variation was biological and not artifactual (Kauserud et al. 2012;

Kumar et al. 2011). However, these authors attempted to replicate the same sequencing conditions and used a clustering approach to detect biological units, rather than analyzing and comparing individual reads by means of multiple fixed alignments. Sequencing of environmental samples (O’Brien et al. 2005; Sogin et al. 2006; Geml et al. 2008; Taylor et al. 2008; Bue´e et al. 2009; Amend et al. 2010; Lumini et al. 2010; Wallander et al. 2010; Hibbett et al. 2011; Dai et al. 2012; McGuire et al. 2012) usually results in a low number of redundant read copies for each taxon present in the sample, which makes the detection of CaFIE errors difficult. In addition, clustering techniques will interpret CaFIE errors as real variation, especially if they use the longest reads as seeds, such as USearch (Edgar 2010, 2013), since the longest reads will likely have the largest number of erroneous insertions. In the present case, automated clustering of reads representing a single species led to a 35-fold overestimation of biological diversity at the commonly applied threshold ID level of 0.95. This finding is alarming, since automated clustering will invariably interpret all variation present in a sample as biological. The problem is especially acute for studies of environmental samples, where species are not known and cannot be referenced against the existing Sanger sequences. Unfortunately, clustering techniques are indispensable to pre-digest large amounts of data (Li and Godzik 2006; Schloss et al. 2009; Caporaso et al. 2010a; Edgar 2010, 2013; Huang et al. 2010; Kumar et al. 2011), especially if the target gene is highly variable and unalignable across the range of analyzed organisms, such as ITS. The problem of overestimation of biological diversity using clustering methods has been discussed in several studies (Quince et al. 2009; Engelbrektson et al. 2010; Kunin et al. 2010; Powell et al. 2011; Unterseher et al. 2011; Zhou et al. 2011), but the error is usually believed not to exceed a factor of ten, and is often considered to be much lower, contrasting with the results presented here. The overestimation of biological diversity through

123

158

clustering increases with read frequency per taxon; hence, species represented by few reads in a sample are likely to be correctly recognized, whereas species with a large number of reads tend to be interpreted as more than one taxon. Our analysis suggests that up to 16 reads per taxon will predict biological diversity correctly up to a relatively high threshold level of 97 %, but accuracy rapidly declines in linear fashion with increasing read frequency (cluster size) per taxon. Hence, this problem will have low impact in datasets with a rather balanced distribution of (few) reads per taxon, but potentially high impact in datasets which contain few to many dominant taxa with large numbers of reads each. Without removal of CaFIE errors, the only way to accurately analyze such data is to employ phylogenetic approaches instead of clustering (Porter and Golding 2011; Powell et al. 2011), including alignment tools such as PyNAST and PaPaRa (Caporaso et al. 2010b; Berger and Stamatakis 2011). Our results also imply that, contrary to what has been suggested in other studies (Kunin et al. 2010; Tedersoo et al. 2010; Unterseher et al. 2011), singleton reads should not be excluded as potential artifacts but instead contribute substantially to accurate estimates of taxonomic diversity, since sequencing errors in singletons have no adverse effect in diversity estimates based on clustering methods. Thus, while the previous studies suggested to exclude singletons and focus on clusters to estimate diversity, we conclude that singletons should by default considered to represent real biological diversity, whereas clusters should be carefully examined for effects of sequencing errors by analyzing them with alignmentbased phylogenetic methods. The huge amount of NGS data to be dealt with poses a challenge to phylogenetic approaches, but this can be surmounted by using clustering and blasting techniques to pre-digest the data and then applying alignment-based phylogenetic analysis on portions of the data to determine biological units. Phylogenetic analysis using maximum likelihood can now be applied to alignments containing thousands of sequences by using RAxML on public multicore servers (Stamatakis 2006; Stamatakis et al. 2005, 2008; Miller et al. 2010). The alternative approach through removal of sequencing errors using PaPaRa prior to clustering considerably improves the results of clustering compared to original reads; yet, clustering will always only provide a proxy for real diversity, since a predefined threshold ID level has to be applied to make data comparable, whereas a phylogenetic approach is independent of preset ID levels. Phylogenetic methods that provide simultaneous search options for the best alignment-tree combination, such as POY, BaliPhy, and SATe´ (Suchard and Redelings 2006; Liu et al. 2009, 2012; Varo´n et al. 2010), are potentially valuable methods to analyze widely divergent sequences in a sample without the necessity of a

123

J Mol Evol (2014) 78:148–162

multiple fixed alignment (Wang et al. 2011), but these methods are computationally expensive and currently cannot handle very large datasets of tens to hundreds of thousands of reads as the result of typical 454 pyrosequencing runs. The interpretation of most of the variation found here as 454 pyrosequencing errors, with only a small proportion of substitutions apparently reflecting biological variation, is in accordance with studies that reported intragenomic ITS as comparatively uniform (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Ganley and Scott 2002; Ganley and Kobayashi 2007). Such uniformity is assumed to be due either to concerted evolution (Liao 1999, 2008) or to a birth-and-death mechanism with purifying selection (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Ganley and Scott 2002; Rooney and Ward 2005; Ganley and Kobayashi 2011). Other studies suggest intragenomic variation in fungal ITS and the nuclear rDNA (Hughes and Petersen 2001; Ko and Jung 2002; Gomes et al. 2002; Pawlowska 2005, 2007, 2010; Rooney and Ward 2005; Smith et al. 2007; Simon and Weiss 2008; Nilsson et al. 2008; James et al. 2009; Lindner and Banik 2011; Kova´cs et al. 2011a, b; Li et al. 2013; Lindner et al. 2013). In a large-scale analysis of Sanger ITS sequences of the lichenized fungal family Physciaceae, Lu¨cking et al. (2012) found evidence for rDNA (ITS) pseudogenes in a single clade, whereas all other clades were consistent with a single, functional rDNA cistron. Therefore, the lack of evidence for intragenomic variation in some studies and its presence in others suggests that pseudogene evolution is specific to individual lineages, even within the same, larger clades. Thus, the existence of pseudogenes will have to be analyzed for each clade separately, which poses a challenge for the use of ITS as fungal barcoding locus (Hebert et al. 2003; Schoch et al. 2012), since pseudogenes can form novel clades that mimic cryptic speciation. Our method would be a viable approach to resolve this problem for any given clade. Quantitative analysis of 454 pyrosequencing data for single clades or taxa can also shed light on the nature of potential variation and how such variation might affect phylogenetic studies using ITS generated from Sanger sequencing as barcoding locus. In the present case, assuming that substitutions (not indels) detected in the dataset reflect real biological variation, their very low proportion (0.06 % of all base calls) will cause them to disappear as background signal in Sanger sequencing. This is supported by the Sanger sequences obtained from the same samples used in this study, which do not contain any variation. Sanger sequencing thus acts like a filter, which is of advantage when defining phylogeny-based taxa, but suggests genomic uniformity when in fact some level of intragenomic variation exists. The low level of variation

J Mol Evol (2014) 78:148–162

found here would suggest that a small portion of the rDNA (ITS) copies in the genome of C. inversa ‘‘escaped’’ concerted evolution. However, under a model of concerted evolution or birth-and-death mechanism with purifying selection (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Liao 1999, 2008; Ganley and Scott 2002; Rooney and Ward 2005; Ganley and Kobayashi 2007; Ganley and Kobayashi 2011), one would expect that, at any given time, a small proportion of the gene copies to deviate from a uniform pattern. Hence, small amounts of variation do not contradict, but are consistent with, concerted evolution. Acknowledgments Financial support for this study was provided by grant DEB 0841405 from the National Science Foundation to George Mason University: ‘‘Phylogenetic Diversity of Mycobionts and Photobionts in the Cyanolichen Genus Dictyonema, with Empasis on the Neotropics and the Galapagos Islands’’ (PI J. Lawrey; Co-PIs: R. Lu¨cking, P. Gillevet). The biological material was collected as part of grant DEB-0715660 to The Field Museum: ‘‘Neotropical Epiphytic Microlichens—An Innovative Inventory of a Highly Diverse yet Little Known Group of Symbiotic Organisms’’ (PI R. Lu¨cking). Naga Betrapally assisted with the SRA and BioProject submission of the dataset.

References 454 Life Sciences Corporation (2009) Genome sequencer FLX system software manual, version 2.3. 454 Life Sciences Corp., A Roche Company, Brandford, CT Abe T, Ikemura T, Sugahara J, Kanai A, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi I (2011) tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Res 39:D210–D213 Ahmadian A, Ehn M, Hober S (2006) Pyrosequencing: history, biochemistry and future. Clin Chim Acta 363:83–94 Amend A, Samson R, Seifert K, Bruns T (2010) Deep sequencing reveals diverse and geographically structured assemblages of fungi in indoor dust. Proc Natl Acad Sci USA 107:13748–13753 Balzer S, Malde K, Jonassen I (2011) Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics 27:304–309 Begerow D, Nilsson H, Unterseher M, Maier W (2010) Current state and perspectives of fungal DNA barcoding and rapid identification procedures. Appl Microbiol Biotechnol 87:99–108 Berger SA, Stamatakis A (2011) Aligning short reads to reference alignments and trees. Bioinformatics 27:2068–2075 Berger SA, Krompass D, Stamatakis A (2011) Performance, accuracy and web-server for evolutionary placement of short sequence reads under maximum-likelihood. Syst Biol 60:291–302 Borsuk P, Gniadkowski M, Bartnik E, Stepien PP (1988) Unusual evolutionary conservation of 5S rRNA pseudogenes in Aspergillus nidulans: similarity of the DNA sequence associated with the pseudogenes with the mouse immunoglobulin switch region. J Mol Evol 28:125–130 Bue´e M, Reich M, Murat C, Nilsson RH, Uroz S, Martin F (2009) 454 Pyrosequencing analyzes of forest soils reveal an unexpectedly high fungal diversity. New Phytol 184:449–456 Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pen˜a AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE,

159 Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R (2010a) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–336 Caporaso JG, Bittinger K, Bushman FD, DeSantis TZ, Andersen GL, Knight R (2010b) PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26:266–267 Dai M, Hamel C, Arnaud M, He Y, Grant C, Lupwayi N, Janzen H, Malhi SS, Yang X, Zhou Z (2012) Arbuscular mycorrhizal fungi assemblages in Chernozem great groups revealed by massively parallel pyrosequencing. Can J Microbiol 58:81–92 Dal-Forno M, Lawrey JD, Sikaroodi M, Bhattarai S, Gillevet PM, Sulzbacher M, Lu¨cking R (2013) Starting from scratch: evolution of the lichen thallus in the basidiolichen Dictyonema (Agaricales: Hygrophoraceae). Fungal Biol 117(9):584–598 Del-Prado R, Cubas P, Lumbsch HT, Divakar PK, Blanco O, de Paz GA, Molina MC, Crespo A (2010) Genetic distances within and among species in monophyletic lineages of Parmeliaceae (Ascomycota) as a tool for taxon delimitation. Mol Phylogenet Evol 56:125–133 Deng W, Maust BS, Westfall DH, Chen L, Zhao H, Larsen BB, Iyer S, Liu Y, Mullins JI (2013) Indel and carryforward correction (ICC): a new analysis approach for processing 454 pyrosequencing data. Bioinformatics 29(19):2402–2409. doi:10.1093/ bioinformatics/btt434 Eberhardt U (2010) A constructive step towards selecting a DNA barcode for fungi. New Phytol 187:265–268 Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461 Edgar RC (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 10(10):996–998. doi:10. 1038/nmeth.2604 Engelbrektson A, Kunin V, Wrighton KC, Zvenigorodsky N, Chen F, Ochman H, Hugenholtz P (2010) Experimental factors affecting PCR-based estimates of microbial species richness and evenness. ISME J 4:642–647 Ganley ARD, Kobayashi T (2007) Highly efficient concerted evolution in the ribosomal DNA repeats: total rDNA repeat variation revealed by whole-genome shotgun sequence data. Genome Res 17:184–191 Ganley ARD, Kobayashi T (2011) Monitoring the rate and dynamics of concerted evolution in the ribosomal DNA repeats of Saccharomyces cerevisiae using experimental evolution. Mol Biol Evol 28:2883–2891 Ganley AR, Scott B (2002) Concerted evolution in the ribosomal RNA genes of an Epichloe¨ endophyte hybrid: comparison between tandemly arranged rDNA and dispersed 5S rrn genes. Fungal Genet Biol 35:39–51 Geml J, Laursen GA, Taylor DL (2008) Molecular diversity assessment of arctic and boreal Agaricus taxa. Mycologia 100:577–589 Gilles A, Megle´cz E, Pech N, Ferreira S, Malausa T, Martin JF (2011) Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12:245 Gomes EA, Kasuya MCM, de Barros EG, Borges AC, Araujo EF (2002) Polymorphism in the internal transcribed spacer (ITS) of the ribosomal DNA of 26 isolates of ectomycorrhizal fungi. Genet Mol Biol 25:477–483 Gomez-Alvarez V, Teal TK, Schmidt TM (2009) Systematic artifacts in metagenomes from complex microbial communities. ISME J 3:1314–1317 Gulyas G, Sramko G, Molnar A, Rudnoy S, Illyes Z, Balazs T, Zoltan B (2005) Nuclear ribosomal DNA ITS paralogs as evidence of recent hybridization in genus Ophrys (Orchidaceae). Acta Biol Crac Ser Bot 47:61–67

123

160 Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95–98 Hebert PD, Ratnasingham S, deWaard JR (2003) Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc Biol Sci 270(Suppl 1):S96–S99 Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, Zody MC, Erlich RL, Green LM, Berical A, Wang Y, Casali M, Steeck H, Bloom AK, Dudek T, Tully D, Newman R, Axten KL, Gladden AD, Battis L, Kemper M, Zeng Q, Shea TP, Gujja S, Zedlack C, Gasser O, Brander C, Hess C, Gunthard HF, Brumme ZL, Brumme CJ, Bazner S, Rychert J, Tinsley JP, Mayer KH, Rosenberg E, Pereya F, Levin JZ, Young SK, Jessen H, Altfeld M, Birren BW, Walker BD, Allen TM (2012) Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog 8(3):e1002529 Hibbett DS, Ohman A, Glotzer D, Nuhn M, Kirk PM, Nilsson RH (2011) Progress in molecular and morphological taxon discovery in fungi and options for formal classification of environmental sequences. Fungal Biol Rev 25:38–47 Hoff KJ (2009) The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 10:520 Huang Y, Niu B, Gao Y, Fu L, Li W (2010) CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680 Hughes KW, Petersen RH (2001) Apparent recombination or gene conversion in the ribosomal ITS region of a Flammulina (Fungi, Agaricales) hybrid. Mol Biol Evol 18:94–96 Hurst LD, Smith NGC (1998) The evolution of concerted evolution. Proc Biol Sci 265:121–127 Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM (2007) Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 8:R143 Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 12:1889–1898 James SA, O’Kelly MJ, Carter DM, Davey RP, van Oudenaarden A, Roberts IN (2009) Repetitive sequence variation and dynamics in the ribosomal DNA array of Saccharomyces cerevisiae as revealed by whole-genome resequencing. Genome Res 19:626–635 Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl Acids Res 30:3059–3066 Katoh K, Asimenos G, Toh H (2009) Multiple alignment of DNA sequences with MAFFT. Meth Mol Biol 537:39–64 Kauserud H, Kumar S, Brysting A, Norde´n J, Carlsen T (2012) High consistency between replicate 454 pyrosequencing analyses of ectomycorrhizal plant root samples. Mycorrhiza 22:309–315 Keegan KP, Trimble WL, Wilkening J, Wilke A, Harrison T, D’Souza M, Meyer F (2012) A platform-independent method for detecting errors in metagenomic sequencing data: dRISEE. PLoS Comput Biol 8(6):e1002541 Kiss L (2012) Limits of nuclear ribosomal DNA internal transcribed spacer (ITS) sequences as species barcodes for Fungi. Proc Natl Acad Sci USA 109:E1811 Ko KS, Jung HS (2002) Three nonorthologous ITS1 types are present in a polypore fungus Trichaptum abietinum. Mol Phylogenet Evol 23:112–122 Ko˜ljalg U, Larsson KH, Abarenkov K, Nilsson RH, Alexander IJ, Eberhardt U, Erland S, Høiland K, Kjøller R, Larsson E, Pennanen T, Sen R, Taylor AF, Tedersoo L, Vra˚lstad T, Ursing BM (2005) UNITE: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi. New Phytol 166:1063–1068

123

J Mol Evol (2014) 78:148–162 Ko˜ljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TD, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Duen˜as M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lu¨cking R, Martı´n MP, Matheny B, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Oldmaa KP, Saag L, Saar R, Schu¨ssler A, Scott JA, Sene´s C, Smith ME, Suija A, Taylor DL, Telleria MT, Weiss M, Larsson KH (2013) Towards a unified paradigm for sequence-based identification of fungi. Mol Ecol 22:5271–5277. doi:10.1111/mec.12481 Kova´cs GM, Bala´zs TK, Calonge FD, Martı´n MP (2011a) The diversity of Terfezia desert truffles: new species and a highly variable species complex with intrasporocarpic nrDNA ITS heterogeneity. Mycologia 103:841–853 Kova´cs GM, Jankovics T, Kiss L (2011b) Variation in the nrDNA ITS sequences of some powdery mildew species: do routine molecular identification procedures hide valuable information? Eur J Plant Pathol 131:135–141 Kumar S, Carlsen T, Mevik B, Enger P, Blaalid R, Shalchian-Tabrizi K, Kauserud H (2011) CLOTU: an online pipeline for processing and clustering of 454 amplicon reads into OTUs followed by taxonomic annotation. BMC Bioinformatics 12:82 Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 12:118–123 Lawrey JD, Lu¨cking R, Sipman HJM, Chaves JL, Redhead SA, Bungartz F, Sikaroodi M, Gillevet PM (2009) High concentration of basidiolichens in a single family of agaricoid mushrooms. Mycol Res 113:1154–1171 Li W, Godzik A (2006) Cd-Hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659 Li Y, Jiao L, Yao YJ (2013) Non-concerted ITS evolution in fungi, as revealed from the important medicinal fungus Ophiocordyceps sinensis. Mol Phylogen Evol 68(2):373–379 Liao D (1999) Concerted evolution: molecular mechanism and biological implications. Am J Hum Genet 64:24–30 Liao D (2008) Concerted evolution. University of Florida, Gainesville, FL. doi:10.1002/9780470015902.a0005132 Lindner DL, Banik MT (2011) Intragenomic variation in the ITS rDNA region obscures phylogenetic relationships and inflates estimates of operational taxonomic units in genus Laetiporus. Mycologia 103:731–740 Lindner DL, Carlsen T, Nilsson RH, Davey M, Schumacher T, Kauserud H (2013) Employing 454 amplicon pyrosequencing to reveal intragenomic divergence in the internal transcribed spacer (ITS) rDNA region in fungi. Ecol Evol 3:1751–1764 Liu K, Raghavan S, Nelesen S, Linder CR, Warrnow T (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324:1561–1564 Liu K, Warnow TJ, Holder MT, Nelesen S, Yu J, Stamatakis A, Linder CR (2012) SATe´-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic Trees. Syst Biol 61:90–106 Lu¨cking R, Kalb K, Essene A (2012) The power of ITS: using megaphylogenies of barcoding genes to reveal inconsistencies in taxonomic identifications of genbank submissions. The 7th IAL symposium ‘‘Lichens: from genome to ecosystems in a changing world’’, January 2012, Bangkok. Book of abstracts: 3B-1-O2 Lumini E, Orgiazzi A, Borriello R, Bonfante P, Bianciotto V (2010) Disclosing arbuscular mycorrhizal fungal biodiversity in soil through a land-use gradient using a pyrosequencing approach. Environ Microbiol 12:2165–2179 Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du

J Mol Evol (2014) 78:148–162 L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM (2005) Genome sequencing inmicrofabricated high-density picolitre reactors. Nature 437:376–380 McGuire KL, Fierer N, Bateman C, Treseder KK, Turner BL (2012) Fungal community composition in neotropical rain forests: the influence of tree diversity and precipitation. Microb Ecol 63:804–812 Miller MA, Pfeiffer W, Schwartz T (2010) Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In: Proceedings of the gateway computing environments workshop (GCE), New Orleans, pp 1–8 Moncada B, Lu¨cking R, Sua´rez A (2013) Molecular phylogeny of the genus Sticta (lichenized Ascomycota: Lobariaceae) in Colombia. Fungal Div. doi:10.1007/s13225-013-0230-0 Morzycka-Wroblewska E, Selker EU, Stevens JN, Metzenberg RL (1985) Concerted evolution of dispersed Neurospora crassa 5S RNA genes: pattern of sequence conservation between allelic and nonallelic genes. Mol Cell Biol 5:46–51 Nilsson RH, Kristiansson E, Ryberg M, Larsson KH (2005) Approaching the taxonomic affiliation of unidentified sequences in public databases—an example from the mycorrhizal fungi. BMC Bioinformatics 6:178 Nilsson RH, Ryberg M, Kristiansson E, Abarenkov K, Larsson KH (2006) Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective. PLoS ONE 1:e59 Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N, Larsson KH (2008) Intraspecific ITS variability in the kingdom Fungi as expressed in the international sequence databases and its implications for molecular species identification. Evolut Bioinformatics 4:193–201 Nilsson RH, Ryberg M, Abarenkov K, Sjo¨kvist E, Kristiansson E (2009) The ITS region as a target for characterization of fungal communities using emerging sequencing technologies. FEMS Microbiol Lett 296:97–101 Nilsson RH, Tedersoo L, Lindahl BD, Kjøller R, Carlsen T, Quince C, Abarenkov K, Pennanen T, Stenlid J, Bruns T, Larsson K-H, Ko˜ljalg U, Kauserud H (2011) Towards standardization of the description and publication of next-generation sequencing datasets of fungal communities. New Phytol 191:314–318 Niu B, Fu L, Sun S, Li W (2010) Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11:187 O’Brien HE, Parrent JL, Jackson JA, Moncalvo JM, Vilgalys R (2005) Fungal community analysis by large-scale sequencing of environmental samples. Appl Environ Microbiol 71:5544–5550 Pawlowska TE (2005) Genetic processes in arbuscular mycorrhizal fungi. FEMS Microbiol Lett 251:185–192 Pawlowska TE (2007) How the genome is organized in the Glomeromycota. In: Heitman J, Kronstad JW, Taylor JW, Casselton LA (eds) Sex in Fungi: molecular determination and evolutionary implications. ASM Press, Washington,DC, pp 419–430 Pawlowska TE (2010) Population genetics of fungal mutualists of plants. In: Xu J (ed) Microbial population genetics. Horizon Scientific Press, Norfolk Pinto PM, Resende MA, Koga-Ito CY, Ferreira JA, Tendler M (2004) rDNA-RFLP identification of Candida species in immunocompromised and seriously diseased patients. Can J Microbiol 50:504–520

161 Porter TM, Golding GB (2011) Are similarity- or phylogeny-based methods more appropriate for classifying internal transcribed spacer (ITS) metagenomic amplicons? New Phytol 192:775–782 Powell JR, Monaghan MT, Opik M, Rillig MC (2011) Evolutionary criteria outperform operational approaches in producing ecologically relevant fungal species inventories. Mol Ecol 20:655–666 Quince C, Curtis TP, Sloan WT (2008) The rational exploration of microbial diversity. ISME J 2:997–1006 Quince C, Lanze´n A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT (2009) Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods 6:639–641 Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ (2011) Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12:38 Reeder J, Knight R (2010) Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat Methods 7:668–669 Rooney AP, Ward TJ (2005) Evolution of a large ribosomal RNA multigene family in filamentous fungi: birth and death of a concerted evolution paradigm. Proc Natl Acad Sci USA 102:5084–5089 Rossman AY (2007) Report of the planning workshop for all fungi DNA Barcoding. Inoculum 58:1–5 Rothberg JM, Leamon JH (2008) The development and impact of 454 pyrosequencing. Nat Biotechnol 26:1117–1124 Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541 Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Levesque CA, Chen W, Fungal Barcoding Consortium (2012) Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci USA 109:6241–6246 Seifert KA (2008) The all-fungi barcoding campaign (FunBOL). Persoonia 20:106 Sikaroodi M, Gillevet PM (2012) Quality control in multi-tag pyrosequencing of microbial communities. Biotechniques 53:381–383 Simon UK, Weiss M (2008) Intragenomic variation of fungal ribosomal genes is higher than previously thought. Mol Biol Evol 25:2251–2254 Smith ME, Douhan GW, Rizzo DM (2007) Intra-specific and intrasporocarp ITS variation of ectomycorrhizal fungi as assessed by rDNA sequencing of sporocarps and pooled ectomycorrhizal roots from a Quercus woodland. Mycorrhiza 18:15–22 Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ (2006) Microbial diversity in the deep sea and the underexplored ‘rare biosphere’. Proc Natl Acad Sci USA 103:12115–12120 Stamatakis A (2006) RAxML-VI-HPC: maximum-Likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690 Stamatakis A, Ludwig T, Meier H (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21:456–463 Stamatakis A, Hoover P, Rougemont J (2008) A fast bootstrapping algorithm for the RAxML web-servers. Syst Biol 57:758–771 Suchard MA, Redelings BD (2006) BAli-Phy: simultaneous bayesian inference of alignment and phylogeny. Bioinformatics 22:2047–2048 Taylor DL, McCormick MK (2008) Internal transcribed spacer primers and sequences for improved characterization of basidiomycetous orchid mycorrhizas. New Phytol 177:1020–1033 Taylor DL, Booth MG, McFarland JW, Herriott IC, Lennon NJ, Nusbaum C, Marr TG (2008) Increasing ecological infrence

123

162 from high throughput sequencing of fungi in the environment through a tagging approach. Mol Ecol Resour 8:742–752 Tedersoo L, Nilsson RH, Abarenkov K, Jairus T, Sadam A, Saar I, Bahram M, Bechem E, Chuyong G, Ko˜ljalg U (2010) 454 Pyrosequencing and Sanger sequencing of tropical mycorrhizal fungi provide similar results but reveal substantial methodological biases. New Phytol 188:291–301 ¨ pik M, Tedersoo L, Moora M, Unterseher M, Jumpponen A, O Dormann C, Schnittler M (2011) Species abundance distributions and richness estimations in fungal metagenomics—lessons learned from community ecology. Mol Ecol 20:275–285 Vandenbroucke I, Van Marck H, Verhasselt P, Thys K, Mostmans W, Dumont S, Van Eygen V, Coen K, Tuefferd M, Aerssens J (2011) Minor variant detection in amplicons using 454 massive parallel pyrosequencing: experiences and considerations for successful applications. Biotechniques 51:167–177

123

J Mol Evol (2014) 78:148–162 Varo´n A, Vinh LS, Wheeler WC (2010) POY version 4: phylogenetic analysis using dynamic homologies. Cladistics 26:72–85 Wallander H, Johansson U, Sterkenburg E, Durling MB, Lindahl B (2010) Production of ectomycorrhizal mycelium peaks during canopy closure in Norway spruce. New Phytol 187:1124–1134 Wang Z, Nilsson RH, Lopez-Giraldez F, Zhuang WY, Dai YC et al (2011) Tasting soil fungal diversity with earth tongues: phylogenetic test of SATe0 alignments for environmental ITS data. PLoS ONE 6(4):e19039 Won H, Renner SS (2005) The internal transcribed spacer of nuclear ribosomal DNA in the gymnosperm Gnetum. Mol Phylogen Evol 36:581–597 Zhou J, Wu L, Deng Y, Zhi X, Jiang YH, Tu Q, Xie J, Van Nostrand JD, He Z, Yang Y (2011) Reproducibility and quantitation of amplicon sequencing-based detection. ISME J 1:11

Multiple ITS haplotypes in the genome of the lichenized basidiomycete Cora inversa (Hygrophoraceae): fact or artifact?

The internal transcribed spacer region (ITS) of the nuclear rDNA cistron represents the barcoding locus for Fungi. Intragenomic variation of this mult...
818KB Sizes 0 Downloads 0 Views