ARTICLE

Assessment of REPLI-g Multiple Displacement Whole Genome Amplification (WGA) Techniques for Metagenomic Applications

Sofia Ahsanuddin,1,2 Ebrahim Afshinnekoo,1,2,3 Jorge Gandara,1,2 Mustafa Hakyemezoğlu,1,2 Daniela Bezdan,1,2 Samuel Minot,4 Nick Greenfield,4 and Christopher E. Mason1,2,5,*

1Department of Physiology and Biophysics, 2The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, and 5Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, New York, New York, USA; 3School of Medicine, New York Medical College, Valhalla, New York, USA; and 4One Codex, San Francisco, California, USA

Amplification of minute quantities of DNA is a fundamental challenge in low-biomass metagenomic and microbiome studies because of potential biases in coverage, guanine-cytosine (GC) content, and altered species abundances. Whole genome amplification (WGA), although widely used, is notorious for introducing artifact sequences, either by amplifying laboratory contaminants or by nonrandom amplification of a sample’s DNA. In this study, we investigate the effect of REPLI-g multiple displacement amplification (MDA; Qiagen, Valencia, CA, USA) on sequencing data quality and species abundance detection in 8 paired metagenomic samples and 1 titrated, mixed control sample. We extracted and sequenced genomic DNA (gDNA) from 8 environmental samples and compared the quality of the sequencing data for the MDA and their corresponding non-MDA samples. The degree of REPLI-g MDA bias was evaluated by sequence metrics, species composition, and cross-validating observed species abundance and species diversity estimates using the One Codex and MetaPhlAn taxonomic classification tools. Here, we provide evidence of the overall efficacy of REPLI-g MDA on retaining sequencing data quality and species abundance measurements while providing increased yields of high-fidelity DNA. We find that species abundance estimates are largely consistent across samples, even with REPLI-g amplification, as demonstrated by the Spearman’s rank order coefficient (R² > 0.8). However, REPLI-g MDA often produced fewer classified reads at the species, genera, and family level, resulting in decreased species diversity. We also observed some areas with the PCR “jackpot effect,” with varying input DNA values for the Metagenomics Research Group (MGRG) controls at specific genomic loci. We visualize this effect in whole genome coverage plots and with sequence composition analyses and note these caveats of the MDA method. Despite overall concordance of species abundance between the amplified and unamplified samples, these results demonstrate that amplification of DNA using the REPLI-g method has some limitations. These concerns could be addressed by future improvements in the enzymes or methods for REPLI-g to be considered a >99% robust method for increasing the amount of high-fidelity DNA from low-biomass samples or, at the very least, accounted for during computational analysis of MDA samples.

KEY WORDS: jackpot effect, next-generation sequencing, REPLI-g, metagenomics, whole genome amplification

INTRODUCTION

Recent technological advances in the field of metagenomics have enabled rapid analysis and characterization of uncultivable environmental microbial communities.1–3 Metagenomic approaches that use next-generation sequencing technologies allow researchers to analyze the complex microbial and genetic dynamics of living microbial communities in their natural habitats, thereby expanding the field of microbial ecology.4 However, generating several nanograms of high-quality DNA for library construction and sequencing remains a significant bottleneck for such studies, although some methods exist.5,6 Although sequencing data quality can potentially be compromised,7 whole genome amplification (WGA) techniques have been increasingly used to multiply low-biomass samples and produce higher yields of high-fidelity DNA.8 Whereas several commercial DNA amplification kits are available to generate quantities of DNA suitable for downstream processing and analyses, amplification-induced biases, such as uneven sequencing coverage, sequencing errors and artifacts, contamination, and incomplete representation of the genome, have been reported. Previous studies, such as the one conducted by Pinard et al.,9 demonstrate that amplification strategies, such as MDA, primer extension preamplification PCR,10 or degenerate oligonucleotide primed-PCR,11,12 all induce “significant” levels of bias relative to the unamplified controls.12 However, isothermal MDA produces the least bias, while generating significantly higher yields of amplified DNA compared with the other 2 techniques. Despite its relative efficacy, the limitations of MDA are not completely known for complex metagenomics samples, and the method may require modifications to maximize yield while keeping sequencing data quality in line with the unamplified counterparts.

Here, we evaluate the efficacy of Qiagen’s REPLI-g MDA technique on amplifying high-quality environmental DNA for urban metagenomics studies. We collected environmental swab samples from high-traffic surfaces in a theme park and validated our results using different DNA concentrations of a known titrated mixture/mock community standard that served as our experimental control. We demonstrate that, at least in these samples, REPLI-g MDA mostly maintains the genomic composition of the samples and produces small differences in taxa profiles and in sequencing quality. We further demonstrate that the samples cluster by type, regardless of amplification, and that species-level quantification is consistent across the samples. To investigate the known PCR-induced jackpot effect on the MGRG controls, we also examined the genome-wide coverage of these samples, noting that sequencing coverage becomes more uneven as the DNA input value decreases. These findings support the application of REPLI-g amplification for metagenomics research, with the caveat that localized coverage variance and species absence can occur at lower input.

*ADDRESS CORRESPONDENCE TO: Christopher E. Mason, Dept. of Physiology and Biophysics, Weill Cornell Medicine, 1305 York Ave., New York, NY 10021, USA (Phone: 203-668-1448; E-mail: [email protected]). doi: 10.7171/jbt.17-2801-008

Journal of Biomolecular Techniques 28:46–55 © 2017 ABRF

AHSANUDDIN ET AL. / GENOME AMPLIFICATION FOR METAGENOMICS RESEARCH

MATERIALS AND METHODS

Sample Collection

Following the sample processing protocol from existing studies in urban metagenomics,13,14 2 samples were collected in tandem at each of 4 separate locations (carousel floor, Ferris wheel door, promenade railing, and the Thunderbolt rollercoaster) in Coney Island (Brooklyn, New York, USA). The samples were collected using the Copan Liquid Amies Elution Swab (Model #480C; Copan Diagnostics, Corona, CA, USA), which is a nylon-flocked swab that is preserved in 1 ml of transport medium. The transport medium contains calcium chloride, potassium chloride, sodium chloride, magnesium chloride, monopotassium phosphate, disodium phosphate, sodium thioglycollate, and distilled water. It preserves the genetic material on the swab by maintaining a pH of 7.0 ± 0.5. All surfaces were swabbed for 3 min, and the surface area covered by the swab was ensured to be at least 1 m². After each surface was sampled, the swab was placed in the collection tube and stored at −80°C. Microbial reference standard controls created by the MGRG of the Association of Biomolecular Resource Facilities were used to assess the effects of REPLI-g amplification compared with experimental samples.

JOURNAL OF BIOMOLECULAR TECHNIQUES, VOLUME 28, ISSUE 1, APRIL 2017

DNA Extraction and Purification

After thawing the samples to room temperature, DNA was isolated, purified, and enriched using the QIAamp DNA Microbiome Kit (Qiagen), according to the manufacturer’s instructions. Reagent DX (100 µl; Qiagen) was added to 15 ml Buffer ATL (Qiagen). The swab samples were swirled in 1 ml transport media for ~20 s, and the swabs were pressed against the sides of the tubes to drain off the excess fluid. Buffer AHL (500 µl; Qiagen) was added to 1 ml sample in a 2 ml tube and incubated for 30 min at room temperature in a thermomixer at 600 rpm. The tubes were centrifuged at 10,000 g for 10 min, and the supernatant was discarded without disturbing the pellet. Buffer RDD (190 µl; Qiagen) and Benzonase (2.5 µl; Qiagen) were added to the pellet and incubated at 37°C for 30 min at 600 rpm in a water bath. Proteinase K (20 µl; Qiagen) was added and incubated at 56°C for 30 min at 600 rpm in a water bath. The tubes were briefly vortexed to remove condensation, 200 µl Buffer ATL was added, and the bacterial cells were lysed as the sample material was transferred to a Pathogen Lysis Tube L (Qiagen). The Pathogen Lysis Tube L was placed on a vortexer with a microtube foam insert and vortexed for 10 min at maximum speed. The Pathogen Lysis Tube L was then centrifuged at 10,000 g for 1 min, and the supernatant was transferred to a new microcentrifuge tube. Proteinase K (40 µl) was added, vortexed, and incubated at 56°C for 30 min at 600 rpm in a water bath. After 200 µl Buffer APL2 (Qiagen) was added to the mixture and vortexed for 30 s, the mixture was incubated at 70°C for 10 min, and the tubes were briefly vortexed. Ethanol (200 µl) was added to the lysate, and the contents of the tubes were mixed thoroughly by vortexing for ~30 s. The mixture was transferred in two 700 µl portions to a QIAamp UCP Mini Column (Qiagen) and centrifuged at 6000 g for 1 min after each transfer; the flowthrough was discarded each time.

The QIAamp UCP Mini Column was placed in a new 2 ml collection tube, and 500 µl Buffer AW1 (Qiagen) was added to the column without wetting the rim. The tubes were centrifuged at 6000 g for 1 min. The QIAamp UCP Mini Column was placed in a fresh 2 ml collection tube, and the filtrate was discarded. Buffer AW2 (500 µl; Qiagen) was added to the column and centrifuged at 20,000 g for 3 min. The filtrate was discarded, and the QIAamp UCP Mini Column was placed in a fresh 2 ml collection tube and centrifuged again at 20,000 g for 1 min. The mini column was then transferred to a fresh 1.5 ml tube, 50 µl elution buffer was added to the center of the membrane, and the tubes were incubated at room temperature for 5 min. Finally, the tubes were centrifuged at 6000 g for 1 min to elute the DNA.

Library Preparation

The TruSeq Nano DNA Library Preparation Kit (FC-1214001; Illumina, San Diego, CA, USA) was used to prepare sequencing libraries. Fragmentation of the DNA to 500 nt was performed using the Covaris (v2.0; Covaris, Woburn, MA, USA), and small DNA fragments (<200 bp) were removed by AMPure bead cleanup (1.8×). In brief, the ensuing steps in library preparation involved A-tailing, adaptor ligation, PCR amplification, bead-based library size selection, and cleanup again. A Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA) was used to visualize the DNA fragments and to ensure that the libraries were within the range of 450–650 bp. Qubit quantification was performed again to ensure that excess nucleotides and primers were removed.

TABLE 1. DNA yields for non-MDA and MDA samples

Sample name | Original DNA concentration, ng | REPLI-g DNA concentration, ng
Carousel floor normal | 15.50 | 85.80
Ferris wheel door normal | 19.40 | 80.20
Railing normal | 21.00 | 71.00
Thunderbolt normal | 18.10 | 54.80
MGRG, normal | 7.10 |
MGRG, 10 ng | | 41.20
MGRG, 5 ng | | 36.4
MGRG, 1 ng | | 37.80
MGRG, 0.5 ng | | 43.00
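As a worked check on the Table 1 yields for the four environmental swabs, the short sketch below recomputes the group means and the amplified-to-original ratios. It assumes the two columns of Table 1 are directly comparable readings; the names `TABLE1_NG` and `fold_change` are illustrative and do not come from the study.

```python
# Concentrations (ng) transcribed from Table 1: (unamplified, REPLI-g-amplified).
# Only the four environmental swabs are used here.
TABLE1_NG = {
    "Carousel floor": (15.50, 85.80),
    "Ferris wheel door": (19.40, 80.20),
    "Railing": (21.00, 71.00),
    "Thunderbolt": (18.10, 54.80),
}


def fold_change(original_ng, amplified_ng):
    """Ratio of the amplified reading to the original reading.

    Assumption: both readings are comparable (same units and basis).
    """
    return amplified_ng / original_ng


# Per-sample ratios and the ratio of the two group means.
per_sample = {name: fold_change(o, a) for name, (o, a) in TABLE1_NG.items()}
mean_original = sum(o for o, _ in TABLE1_NG.values()) / len(TABLE1_NG)
mean_amplified = sum(a for _, a in TABLE1_NG.values()) / len(TABLE1_NG)
ratio_of_means = fold_change(mean_original, mean_amplified)

for name, fc in per_sample.items():
    print(f"{name}: {fc:.2f}-fold")
print(f"means: {mean_original:.2f} ng -> {mean_amplified:.2f} ng "
      f"({ratio_of_means:.2f}-fold)")
```

With these numbers the group means come out at 18.5 ng and 72.95 ng, a ratio of about 3.9, while the individual samples range from roughly 3.0-fold (Thunderbolt) to 5.5-fold (carousel floor).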

REPLI-g DNA Amplification

MDA was performed on 1 sample from each location using the Qiagen REPLI-g WGA kit. The φ29 DNA polymerase catalyzes the strand-displacing reactions in a 16 h reaction under isothermal conditions (30°C). Unlike PCR-based methods, REPLI-g MDA does not require repetitive cycles of denaturing and annealing temperatures. In brief, 2.5 µl of a 1:8 dilution of Solution A (0.4 M KOH, 10 mM EDTA) was added and incubated for 3 min at room temperature. Stop Solution (5 µl; 1:10 dilution of REPLI-g Solution B), 32.4 µl water, 15 µl REPLI-g 4× mix, and 0.6 µl REPLI-g polymerase were added to the reaction. This solution (40 µl) was added to 10 µl gDNA. The solutions were gently mixed, incubated at 30°C for 16 h, and heat inactivated at 65°C for 10 min, and the DNA was purified by sodium acetate precipitation. The DNA was resuspended in 25 µl of 10 mM Tris (pH 7.5).

Sequencing and Quality Control

Shotgun sequencing was performed using the Illumina MiSeq (v2) machine with 300 × 300 bp reads, generating at least 2 million reads per sample. Raw data were processed using the Illumina Real-Time Analysis (RTA) software and CASAVA 1.8.2, and all reads were checked for standard CASAVA quality control parameters. All reads were quality trimmed with the FASTX-Toolkit to ensure that all samples had high quality (>Q20) values, i.e., 99% base-level accuracy.

Computational Analysis
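Before classification, reads were trimmed to a Q20 threshold, which the text equates with 99% base-level accuracy. The sketch below shows where that figure comes from, using the standard Phred relation Q = −10·log10(P), and illustrates a simple 3′-end quality trim. It is a minimal illustration only: the function names are hypothetical, the Phred+33 quality encoding is an assumption about the FASTQ files, and the trimming loop is not the FASTX-Toolkit's actual implementation.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability: Q = -10*log10(P)."""
    return -10 * math.log10(p)


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality is below min_q.

    Assumption: qualities are ASCII-encoded with the given offset (Phred+33).
    This is a simplified stand-in for a real quality trimmer.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Q20 corresponds to a 1-in-100 error rate, i.e. 99% per-base accuracy.
accuracy_q20 = 1 - phred_to_error_prob(20)
print(f"Q20 accuracy: {accuracy_q20:.2%}")

# Toy read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
print(trim_3prime("ACGTAC", "IIII5#"))
```

In the toy read only the final base (Q2) is removed; the Q20 base is kept because the cutoff here discards only qualities strictly below the threshold.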

To generate comparative analyses of species abundance and to cross-validate our results, samples were analyzed with the One Codex platform15 and MetaPhlAn (v2.0).16 The former compares input nucleotide sequences against the One Codex reference database (31-mer), which comprises ~40,000 complete microbial genomes (bacteria, viruses, fungi, archaea, and protists). In addition to the number of reads assigned to each detected organism and the predicted genome-size abundances of those organisms, One Codex is used to perform whole genome alignment via Burrows-Wheeler Aligner. Those alignments are used to calculate the depth of sequencing, the amount of the genome covered by reads, and the mean percent identity of those reads. Additionally, the platform provides a visual representation of the distribution of raw reads across the genome. As a top-hit-only approach carries a risk of producing false-positive hits (read mapping to a relative of the organism present in the sample, rather than the actual organism), all hits (up to 3 mismatches) produced by the alignment program were kept and analyzed. Whereas the One Codex platform uses a k-mer-based computational approach to taxonomic classification, MetaPhlAn is a marker-based method that determines species abundance using unique clade-specific marker genes from 3000 reference genomes. MetaPhlAn was run with default parameters for taxa classification, and results were visualized with a heatmap. For exact code and scripts, please see Supplemental Methods.

TABLE 2. Overview of FastQC sequencing data quality for experimental samples

Location | Total number of sequences | GC content, % | Species-level reads (% classified) | Genera-level reads (% classified) | Family-level reads (% classified) | Duplicates, %
REPLI-g-amplified carousel floor | 2,558,428 | 44 | 670,894 (42) | 1,440,552 (88) | 1,445,359 (90) | ≥6.58
Unamplified carousel floor | 2,085,652 | 52 | 1,068,026 (68) | 1,646,600 (93) | 1,495,466 (95) | ≥4.16
REPLI-g-amplified Ferris wheel door | 2,073,920 | 40 | 642,068 (35) | 1,767,487 (97) | 1,772,649 (98) | ≥13.33
Unamplified Ferris wheel door | 1,904,052 | 41 | 717,343 (43) | 1,635,046 (98) | 1,640,799 (98) | ≥14.03
REPLI-g-amplified railing | 2,515,628 | 54 | 550,848 (20) | 2,473,773 (89) | 2,647,859 (95) | ≥26.92
Unamplified railing | 2,515,628 | 54 | 84,962 (4.1) | 1,912,697 (92) | 2,013,811 (97) | ≥19.63
REPLI-g-amplified Thunderbolt rollercoaster | 2,839,156 | 46 | 47,307 (6.3) | 185,733 (25) | 259,627 (35) | ≥17.09
Unamplified Thunderbolt rollercoaster | 2,462,662 | 48 | 66,924 (12) | 179,978 (32) | 259,829 (46) | ≥13.78

TABLE 3. Overview of FastQC sequencing data quality and classified reads for MGRG controls

Sample type | Total sequences | GC content, % | Species-level reads | % classified | Genera-level reads | % classified | Family-level reads | % classified | Duplicates, %
REPLI-g-amplified MGRG, 10 ng | 3,216,104 | 45 | 1,222,498 | 47 | 1,786,147 | 68 | 2,258,392 | 86 | ≥23.87
REPLI-g-amplified MGRG, 5 ng | 2,478,166 | 45 | 940,613 | 47 | 1,368,289 | 68 | 1,720,180 | 86 | ≥22.63
REPLI-g-amplified MGRG, 1 ng | 2,811,864 | 46 | 1,033,714 | 46 | 1,495,168 | 67 | 1,898,550 | 85 | ≥26.76
REPLI-g-amplified MGRG, 0.5 ng | 2,814,482 | 44 | 1,027,243 | 46 | 1,470,911 | 66 | 1,747,748 | 79 | ≥38.75
Unamplified MGRG, 5 ng | 2,946,836 | 68 | 1,895,246 | 75 | 2,148,482 | 85 | 2,212,014 | 87 | ≥29.81

RESULTS

We collected paired (adjacent swabs) environmental samples from 4 different locations in Coney Island: the Thunderbolt rollercoaster, promenade railing, carousel floor, and Ferris wheel door. We examined the results from the amplified versus unamplified replicates using several criteria, as discussed below.

Amplification Yields

DNA concentrations were determined using the Qubit 2.0 fluorometer after DNA extraction and after REPLI-g amplification. The value of input DNA was held constant at 5 ng for all swab samples before beginning REPLI-g amplification. According to Table 1, the average DNA concentration was 18.5 ng for the 4 non-MDA samples and 72.95 ng for the 4 REPLI-g-amplified samples. Thus, on average, REPLI-g produced an approximately 4-fold increase over the original DNA concentration (72.95/18.5 ≈ 3.9). We next examined the presence of amplification-induced biases, such as uneven sequencing coverage and alterations in taxa detection, to determine the overall robustness of REPLI-g in amplifying high-quality DNA.

FIGURE 1. FastQC quality plots and nucleotide composition with REPLI-g for Thunderbolt rollercoaster samples. (Top) Sequence quality scores (x-axis) were plotted as a function of the number of reads within each bin (y-axis). (Bottom) The percentage of bases (A, C, G, T) for each bin of sequence length (y-axis) was shown before (left) and after (right) REPLI-g reactions, with the composition only showing slight changes.

TABLE 4. One Codex species abundance and depth estimates for experimental samples

Sample type | Most abundant species | All reads, % | No. of reads | Estimated depth, × | Estimated abundance, %
REPLI-g-amplified Thunderbolt rollercoaster | S. epidermidis | 0.88 | 25,081 | 30.325 | 78.73
Unamplified Thunderbolt rollercoaster | S. epidermidis | 0.88 | 21,647 | 16.409 | 63.71
REPLI-g-amplified railing | E. cloacae | 21.09 | 611,918 | N/A | N/A
Unamplified railing | E. cloacae | 1.20 | 30,309 | 126.877 | 91.64
REPLI-g-amplified Ferris wheel door | A. baumannii | 16.93 | 351,204 | 125.543 | 90.04
Unamplified Ferris wheel door | A. baumannii | 16.92 | 322,072 | 105.636 | 82.89
REPLI-g-amplified carousel floor | A. baumannii | 3.55 | 90,925 | 78.463 | 53.90
Unamplified carousel floor | A. baumannii | 1.93 | 40,312 | 28.457 | 24.47

Sequencing Data Quality, GC Content, and Base Composition

We analyzed the nucleotide composition of the samples, with (+) and without (−) REPLI-g amplification, to determine the impact of the φ29 DNA polymerase enzyme’s MDA reaction on sequencing data quality. Whereas the sequence duplication level generally increased with amplification, the REPLI-g reaction often had minimal effects on the sequencing data quality, as indicated by comparisons of the FastQC data (Tables 2 and 3). However, the “spread” of the GC content occasionally shifted by 0.5% (Fig. 1). We next examined the duplicate rates in detail. Whereas the percent of duplicates shifted from 19.63 to 26.92% with amplification for the railing sample, the mean GC content remained the same at 54% (Tables 2 and 3). For the non-MDA and MDA Thunderbolt rollercoaster samples, the total numbers of sequences were 2,462,662 and 2,839,156, respectively. The percent of GC content and sequence duplication level were 48% and 13.78% for the unamplified sample and 46% and 17.09% for the amplified sample, respectively. For the carousel floor non-MDA and MDA samples, the total numbers of sequences were 2,085,652 and 2,558,428, respectively. The percent of GC content and sequence duplication level were 52% and 4.16% for the non-MDA sample and 44% and 6.58% for the MDA sample. For the Ferris wheel non-MDA and MDA samples, the total numbers of sequences were 1,904,052 and 2,073,920, respectively. The percent of GC content and the sequence duplication level were 41% and 14.03% for the non-MDA sample and 40% and 13.33% for the MDA sample.

We then examined the effect of varying the input amount of a known titrated mixture on sequencing data quality. We found that the percent of GC content decreased with amplification at the various input values, whereas the total number of sequences generally increased with amplification. For the unamplified MGRG sample, the total number of sequences was 2,946,836, and the percent of GC content was 68% (Table 3). For the 10 ng REPLI-g MGRG sample, the total number of sequences was 3,216,104, and the percent of GC content was 45%. For the 5 ng REPLI-g MGRG sample, the total number of sequences was 2,478,166, and the percent of GC content was 45%. For the 1 ng REPLI-g MGRG sample, the total number of sequences was 2,811,864, and the percent of GC content was 46%. Finally, the total number of sequences for the 0.5 ng REPLI-g MGRG sample was 2,814,482, and the percent of GC content was 44%. The sequence duplication level was 23.87% for the 10 ng MGRG sample, 22.63% for the 5 ng MGRG sample, 26.76% for the 1 ng MGRG sample, 38.75% for the 0.5 ng sample, and 29.81% for the unamplified sample. There were no significantly over-represented sequences reported in the FastQC files for any of the samples except for the MDA carousel floor sample. The over-represented sequences were attributed to the TruSeq Adapter Index 2 and comprised a total of 0.11% of the overall reads.

TABLE 5. One Codex species abundance and depth estimates for MGRG samples

Sample type | Most abundant species | All reads, % | No. of reads | Estimated depth, × | Estimated abundance, %
REPLI-g-amplified MGRG, 10 ng | S. epidermidis | 4.6 | 148,013 | 71.974 | 38.28
REPLI-g-amplified MGRG, 5 ng | H. halophilus | 29.09 | 720,803 | 54.129 | 37.84
REPLI-g-amplified MGRG, 1 ng | H. halophilus | 28.85 | 811,153 | 62.953 | 40.11
REPLI-g-amplified MGRG, 0.5 ng | H. halophilus | 29.58 | 832,622 | 65.939 | 47.18
Unamplified MGRG, 5 ng | M. luteus | 4.26 | 125,585 | 156.605 | 88.92

FIGURE 2. Heatmaps of MetaPhlAn analysis. (A) Top species hits across the Coney Island dataset. Note the differences and similarities between the REPLI-g-amplified and unamplified data. (B) Top species hits across the MGRG standard at the various DNA input levels.
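The two per-sample metrics compared throughout this section, percent GC content and sequence duplication level, can be illustrated with a small sketch. This is a deliberately simplified version written for illustration: the function names are hypothetical, the reads are made up, and FastQC's own duplication module uses a more involved estimate than the exact-match count shown here.

```python
from collections import Counter


def gc_percent(reads):
    """Percentage of G and C bases across all reads."""
    total_bases = sum(len(read) for read in reads)
    if total_bases == 0:
        return 0.0
    gc_bases = sum(read.upper().count("G") + read.upper().count("C")
                   for read in reads)
    return 100.0 * gc_bases / total_bases


def duplicate_percent(reads):
    """Percentage of reads that are exact copies of an earlier read.

    Simplification: every copy beyond the first occurrence of a sequence
    counts as a duplicate; no subsampling or length truncation is applied.
    """
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return 100.0 * duplicates / len(reads)


# Made-up toy reads, not data from the study.
toy_reads = ["GGCC", "ATAT", "GGCC", "ACGT"]
print(gc_percent(toy_reads))         # 10 of 16 bases are G or C
print(duplicate_percent(toy_reads))  # 1 of 4 reads repeats an earlier read
```

Comparing these two numbers for a paired amplified and unamplified sample mirrors the comparisons drawn from Tables 2 and 3, although the published values come from FastQC rather than from a calculation like this one.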

Species Abundance and Diversity Estimates

We used 2 different classification tools (MetaPhlAn and One Codex) to cross-validate species abundance and detection in all 8 samples and the positive controls, which we found to be largely consistent across both MDA and non-MDA samples. With the use of the One Codex platform, Staphylococcus epidermidis was the most abundant species present in both the REPLI-g-amplified and -unamplified Thunderbolt rollercoaster samples (Tables 4 and 5), which is expected from surfaces that are frequently exposed to human skin.13,17 MetaPhlAn also detected S. epidermidis in both samples but to a smaller degree than Enhydrobacter aerosaccus, which is a gram-negative, facultative, anaerobic organism that contains gas vacuoles (Fig. 2A).18 The most abundant species detected in the railing samples using MetaPhlAn and One Codex was Enterobacter cloacae, which is gram negative and often associated with the normal gut flora of humans. One Codex detected Acinetobacter baumannii as the most abundant species in both the Ferris wheel and carousel floor MDA and non-MDA samples. However, MetaPhlAn detected Acinetobacter nosocomialis, which is genetically related and phenotypically close to A. baumannii.19 Both species are described in the literature as potential nosocomial pathogens.

FIGURE 3. MetaPhlAn and One Codex species abundance overlap. (A–D) The MetaPhlAn results of the MDA and non-MDA carousel floor, Ferris wheel door, railing, and Thunderbolt rollercoaster samples. Orange represents the REPLI-g-amplified sample, whereas blue represents the unamplified sample. Note that in most cases, the REPLI-g hits fall within the unamplified hits, with 6 and 1 species being found only in the REPLI-g sample and not in the unamplified results (A and D, respectively). (E) The high concordance of the presence of species before and after amplification is shown, as well as some loss of species diversity after amplification.
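The overlap comparison summarized in Figure 3 amounts to set arithmetic on the species lists from a paired amplified and unamplified sample. The sketch below shows one way to tabulate shared, lost, and gained species. The helper name `compare_species` is hypothetical, and the example lists are made-up memberships built from species names mentioned in the text, not the study's actual classifier output.

```python
def compare_species(unamplified, amplified):
    """Summarize species overlap between a paired unamplified/amplified sample."""
    unamplified, amplified = set(unamplified), set(amplified)
    shared = unamplified & amplified
    return {
        "shared": shared,
        # Present before amplification but missing afterwards.
        "lost": unamplified - amplified,
        # Detected only after amplification.
        "gained": amplified - unamplified,
        # Fraction of the unamplified species still detected after amplification.
        "retained_fraction": len(shared) / len(unamplified) if unamplified else 0.0,
    }


# Illustrative, made-up species lists for one sample pair.
unamplified_hits = [
    "Staphylococcus epidermidis",
    "Enhydrobacter aerosaccus",
    "Micrococcus luteus",
    "Enterobacter cloacae",
]
amplified_hits = [
    "Staphylococcus epidermidis",
    "Enhydrobacter aerosaccus",
    "Acinetobacter baumannii",
]

summary = compare_species(unamplified_hits, amplified_hits)
print(sorted(summary["lost"]))
print(sorted(summary["gained"]))
print(summary["retained_fraction"])
```

The `lost` set corresponds to the loss of diversity after amplification, and the `gained` set to species seen only in the amplified sample.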

Species abundance estimates using MetaPhlAn and One Codex were mostly harmonized for the MGRG samples. The most abundant species present in the unamplified sample was Micrococcus luteus. Halobacillus halophilus was the most abundant species in the 0.5, 1, and 5 ng MGRG samples, whereas S. epidermidis was present at the highest concentration in the 10 ng MGRG sample. More specifically, MetaPhlAn identified a total of 137 taxa and 40 species across the Coney Island dataset (see Supplemental Methods for full MetaPhlAn results). Figure 2 depicts heatmaps showing the MetaPhlAn results across the Coney Island and MGRG standard datasets. We observed that the taxonomic profiles cluster according to the sample source. Furthermore, the Venn diagrams in Fig. 3 compare the total species identified by MetaPhlAn in the REPLI-g-amplified (orange/“B”) and -unamplified samples (blue/“A”) in the Coney Island dataset. In both the Ferris wheel door and promenade railing samples, the amplified samples did not have any new species compared with the unamplified samples. However, the carousel- and Thunderbolt-amplified samples had 2–3 additional species that were not detected in their unamplified counterparts. Moreover, despite relatively high concordance of species abundance in the samples, REPLI-g WGA often resulted in fewer species detected. Figure 3 illustrates the concordance of species detection using MetaPhlAn in the MDA and non-MDA samples. There is a loss of diversity in the amplified samples, which poses a significant concern.

More specifically, only a subset of organisms was detected by the One Codex platform following WGA. Of the 16 species detected in the unamplified Ferris wheel door sample, 15 were detected in the paired REPLI-g sample. No other unique species were detected in the paired REPLI-g sample. Of the 39 species detected in the unamplified Thunderbolt sample, only 12 were detected in the paired REPLI-g sample. Two additional species were detected in the REPLI-g sample that were not detected in the paired, unamplified sample. Of the 110 species detected in the unamplified carousel sample, 61 were detected in the paired REPLI-g sample. Six additional species were detected in the REPLI-g sample that were not detected in the paired unamplified sample. Whereas 15 species were detected in the unamplified railing sample, the sequencing coverage was not even enough in the paired REPLI-g sample to perform automated abundance estimation and false-positive filtering. With the use of the unfiltered read counts as an alternative measure of detection (that does not filter false positives as robustly), 14 of those species were detected with at least 0.01% of the raw read counts in the paired REPLI-g sample. Of the 9 species detected in the unamplified MGRG sample, 7 were detected in at least 1 of the paired REPLI-g samples. Two additional species were detected in at least 1 of the paired REPLI-g samples that were not detected in the unamplified sample. This may be because of the introduction of artifacts into the samples as a result of low input and/or the filters from the One Codex platform (Fig. 4).

FIGURE 4. Spearman rank R². The Spearman rank was consistently above 0.94 (R² range was 0.94–0.99), which indicates high concordance of species abundance between the amplified and unamplified pairwise samples. Note how the samples cluster by type.

FIGURE 5. One Codex whole genome coverage plots for MGRG DNA. These coverage plots illustrate the alignment of the samples’ reads against the M. luteus genome. (A) The unamplified MGRG sample has the most even coverage, which is indicated by its classification as an isolate/low-complexity sample comprised of 59.22% of reads (n = 1,745,213) specific to M. luteus. The depth is 210.5 ± 81.6× with 99.5% coverage and 97.5% of reads aligned to reference genomes. (B) The MGRG 10 ng sample has a depth of 24.6 ± 243.6× with 38.3% of the genome covered and 98.1% identity. (C) The MGRG 5 ng sample had 75.99% of 2,478,166 reads classified. The depth is 18.8 ± 188.9× with 30.5% coverage and 98.1% identity. (D) The MGRG 1 ng sample was classified as a mixed/metagenomic sample, where 75.53% of 2,811,864 reads are classified. The depth is 23.5 ± 1267.2× with 24.1% coverage and 98.1% identity. (E) The MGRG 0.5 ng sample had 68.9% of 2,814,482 reads classified and 7.8 ± 1106.8× depth, 13.3% coverage, and 97.9% identity. [Panels A–E: x-axis, genome position (0.5–2.5 Mb); y-axis, coverage depth (×).]
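The concordance statistic reported in Figure 4 is based on Spearman's rank correlation, which is the Pearson correlation computed on the ranks of the two abundance vectors. The sketch below is a dependency-free illustration with average ranks for ties. The function names are illustrative, the abundance values in the usage lines are made up, and reading the paper's "R²" as the square of Spearman's rho is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up relative abundances for the same four species in a sample pair.
unamplified_abundance = [0.40, 0.30, 0.20, 0.10]
amplified_abundance = [0.45, 0.15, 0.25, 0.15]
rho = spearman_rho(unamplified_abundance, amplified_abundance)
print(rho, rho ** 2)
```

Because only ranks enter the calculation, the statistic reflects whether species keep their relative ordering after amplification and is insensitive to the size of any shift in abundance.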

Deeper inspection of individual genomes via sensitive whole genome alignment revealed a highly localized pattern of genome region enrichment following REPLI-g WGA. Short reads were aligned against the M. luteus genome for the unamplified and REPLI-g-amplified MGRG samples, as M. luteus was found to be the most abundant species in the unamplified MGRG sample. Whereas the raw reads align evenly across the length of the genome in the unamplified sample, each of the amplified replicates shows a consistent pattern of enrichment in specific genomic regions. This pattern of enrichment is more pronounced with the lower levels of DNA input (1 ng) compared with the higher levels of DNA input (10 ng) (Fig. 5). The complete set of One Codex analysis and results can be accessed at https://app.onecodex.com/projects/ahsanuddin2016.

DISCUSSION

Whereas REPLI-g amplification produced negligible differences in species abundance estimates and overall sequencing data quality, the loss of diversity (i.e., missing species) in the amplified samples is a major concern that must be addressed in future work. This has implications not only for urban metagenomics projects, as described above, but also for low-input clinical samples, rare cells sorted from complex mixtures, or even potentially single-cell methods. Indeed, even for samples with abundant DNA, these protocols can help get more use from precious or rare samples, but they must always be compared against the known catalog of laboratory contaminants.20 However, these data and samples are not necessarily representative of all types of samples or variations of the MDA protocols, and other questions remain that can be addressed by additional studies. In particular, the use of long reads, instead of short reads, could ameliorate some of these biases in the data or drops in coverage, as well as improved genome assemblies.21 Furthermore, it is possible that nonspecific amplification may lead to increases in specificity for more informative markers (percentage of reads mapping to reference databases) and could still be completely unaffected by the MDA reaction. There have also been studies showing differential DNA methylation marks present in various strains of bacteria, which could impact extraction efficiency or amplification of specific loci, and modeling of differentially methylation regions in species can potentially help discover and correct for these effects.22, 23 Finally, there are as many samples to test as there are milieu in the world, and the general applicability of MDA measures for other metagenomic and microbiome samples should be examined. The generally high correlation of the species composition and their representation across the test and control samples indicate a potentially broad use of the technology. 
However, there are clear areas of the genome that cannot be amplified as a result of their being GC rich or because specific sequences become enriched or deleted during library preparation or amplification. Furthermore, there were often fewer species detected after amplification than before, indicating that a narrowing of the detection of species present is a risk with amplification. This suggests that the enzymes for MDA could be optimized or altered, the reaction volume could be changed, or other reagents could be altered. Nonetheless, these data indicate a ready use for low-biomass samples in a variety of contexts and will likely be of general use for metagenomics and other genomics research.
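The two comparisons discussed above, species that drop out after amplification and the rank agreement of abundance estimates, can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this study: the function names, the placeholder species labels, and the abundance values are all hypothetical, and the rank correlation is computed only over species detected in both samples, which is one of several reasonable choices.

```python
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def compare_profiles(
    unamplified: Dict[str, float], amplified: Dict[str, float]
) -> Tuple[List[str], List[str], float]:
    """Return (species lost, species gained, rho over shared species)."""
    lost = sorted(set(unamplified) - set(amplified))
    gained = sorted(set(amplified) - set(unamplified))
    shared = sorted(set(unamplified) & set(amplified))
    rho = spearman(
        [unamplified[s] for s in shared], [amplified[s] for s in shared]
    )
    return lost, gained, rho


# Hypothetical relative abundances, for illustration only (not study data).
example_unamplified = {
    "species_A": 0.40, "species_B": 0.30, "species_C": 0.20,
    "species_D": 0.07, "species_E": 0.03,
}
example_amplified = {
    "species_A": 0.50, "species_B": 0.25, "species_C": 0.20,
    "species_D": 0.05,
}

lost, gained, rho = compare_profiles(example_unamplified, example_amplified)
print("lost after amplification:", lost)
print("gained after amplification:", gained)
print("Spearman rho over shared species:", rho)
```

Restricting the correlation to shared species keeps the abundance comparison separate from the diversity loss. Filling absent species with zero abundance would instead fold both effects into a single coefficient.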

ACKNOWLEDGMENTS

The authors thank Christopher Mountanos, Caleb Gordan, Ike Lewis, Sophie Dornbaum, Alina Sheikh, Laolu Ogunnaike, Christian Kendall, and Jason Sperry for collecting the samples. The authors also thank the Epigenomics Core Facility at Weill Cornell Medicine and acknowledge support from the Starr Cancer Consortium (Grants I7-A765, I9-A9-071) and funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L. and N. Kuggie Vallee Foundation, WorldQuant Foundation, Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, NNX17AB26G), U.S. National Institutes of Health (R25EB020393, R01NS076465, R01AI125416, R01ES021006), Bill and Melinda Gates Foundation (OPP1151054), and Alfred P. Sloan Foundation (G-201513964).

DISCLOSURES

The authors herein declare that this research was conducted in the absence of any financial or commercial interests that could be potentially regarded as a conflict of interest.

REFERENCES

1. Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp 2012;2:3.
2. Hutchison III CA, Venter JC. Single-cell genomics. Nat Biotechnol 2006;24:657–658.
3. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol 2002;3:REVIEWS0003.
4. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev 2008;72:557–578.
5. Gonzalez JM, Portillo MC, Saiz-Jimenez C. Multiple displacement amplification as a pre-polymerase chain reaction (prePCR) to process difficult to amplify samples and low copy number sequences from natural environments. Environ Microbiol 2005;7:1024–1028.
6. Rhee M, Light YK, Meagher RJ, Singh AK. Digital droplet multiple displacement amplification (ddMDA) for whole genome sequencing of limited DNA samples. PLoS One 2016;11:e0153699.
7. Hammond M, Homa F, Andersson-Svahn H, Ettema TJ, Joensson HN. Picodrop partitioned whole genome amplification of low biomass samples preserves genomic diversity for metagenomic analysis. Microbiome 2016;4:52.
8. Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid amplification of plasmid and phage DNA using Phi29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res 2001;11:1095–1099.
9. Pinard R, de Winter A, Sarkis GJ, et al. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics 2006;7:216.
10. Zhang L, Cui X, Schmitt K, Hubert R, Navidi W, Arnheim N. Whole genome amplification from a single cell: implications for genetic analysis. Proc Natl Acad Sci USA 1992;89:5847–5851.
11. Telenius H, Carter NP, Bebb CE, Nordenskjöld M, Ponder BA, Tunnacliffe A. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics 1992;13:718–725.
12. Cheung VG, Nelson SF. Whole genome amplification using a degenerate oligonucleotide primer allows hundreds of genotypes to be performed on less than one nanogram of genomic DNA. Proc Natl Acad Sci USA 1996;93:14676–14679.
13. Afshinnekoo E, Meydan C, Choudhury S, et al. Geospatial resolution of human and bacterial diversity with city-scale metagenomics. Cell Syst 2015;1:72–87.
14. MetaSUB International Consortium. The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium inaugural meeting report. Microbiome 2016;4:24.
15. Minot S, Greenfield N. One Codex: a sensitive and accurate data platform for genomic microbial identification. bioRxiv beta Sept. 28, 2015. Available at: http://biorxiv.org/content/early/2015/09/28/027607. doi:http://dx.doi.org/10.1101/027607
16. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012;9:811–814.
17. Otto M. Staphylococcus epidermidis: the ‘accidental’ pathogen. Nat Rev Microbiol 2009;7:555–567.
18. Staley JT, Irgens RL, Brenner DJ. Enhydrobacter aerosaccus gen. nov., sp. nov., a gas-vacuolated, facultatively anaerobic, heterotrophic rod. Int J Syst Bacteriol 1987;37:289–291.
19. Chen TL, Lee YT, Kuo SC, Yang SP, Fung CP, Lee SD. Rapid identification of Acinetobacter baumannii, Acinetobacter nosocomialis and Acinetobacter pittii with a multiplex PCR assay. J Med Microbiol 2014;63:1154–1159.
20. Salter SJ, Cox MJ, Turek EM, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014;12:87.
21. Rosenfeld J, Mason CE, Smith T. Limitations of the human genome reference. PLoS One 2012;7:e40294.
22. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
23. Feng Z, Fang G, Korlach J, et al. Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic. PLOS Comput Biol 2013;9:e1002935.

JOURNAL OF BIOMOLECULAR TECHNIQUES, VOLUME 28, ISSUE 1, APRIL 2017
