Nature Reviews Genetics | AOP, published online 4 February 2014; doi:10.1038/nrg3625

REVIEWS

APPLICATIONS OF NEXT-GENERATION SEQUENCING

The impact of whole-genome sequencing on the reconstruction of human population history

Krishna R. Veeramah1,2 and Michael F. Hammer1

Abstract | Examining patterns of molecular genetic variation in both modern-day and ancient humans has proved to be a powerful approach to learn about our origins. Rapid advances in DNA sequencing technology have allowed us to characterize increasing amounts of genomic information. Although this clearly provides unprecedented power for inference, it also introduces more complexity into the way we use and interpret such data. Here, we review ongoing debates that have been influenced by improvements in our ability to sequence DNA and discuss some of the analytical challenges that need to be overcome in order to fully exploit the rich historical information that is contained in the entirety of the human genome.

Mitochondrial DNA (mtDNA). A circular piece of non-recombining DNA of ~16,000 bp that is found in the mitochondrion and that is inherited exclusively from the maternal parent.

1 Arizona Research Laboratories Division of Biotechnology, Room 231, Life Sciences South, 1007 East Lowell Street, University of Arizona, Tucson, Arizona 85721, USA.
2 Present address: 650 Life Sciences Building, Stony Brook University, Stony Brook, New York 11794–5245, USA.
Correspondence to M.F.H. e-mail: [email protected]. edu
doi:10.1038/nrg3625
Published online 4 February 2014

Studies that infer human population history have used various types of genetic information (TABLE 1). The first of these used classical markers such as ABO blood groups and protein allomorphs1. Although surveys of human polymorphism flourished in the latter half of the last century, the research field gained wide attention with the development of DNA-based analyses and the demonstration in 1987 of an African root to the human mitochondrial DNA (mtDNA) tree2. Many mtDNA studies that examined various historical and anthropological questions from the matrilineal viewpoint quickly followed, and complementary analyses of the non-recombining portion of the Y chromosome (NRY) soon provided a patrilineal perspective of the past. Although the study of these uniparentally inherited systems generated substantial enthusiasm and many tens of thousands of mtDNA and NRY have now been characterized to high resolution, it is important to note that they offer only a limited ‘snapshot’ of the genealogical information in the human genome. Indeed, researchers have questioned the power of inference that is based on a single locus because of the inherent stochasticity of the evolutionary process. Instead, it has been suggested that a robust testing of demographic hypotheses requires appropriate statistical methods that simultaneously consider multiple loci3. In the early 2000s, new insights into human genetic diversity appeared with the characterization of hundreds of short tandem repeat (STR) loci that are distributed across the genome4,5. With the subsequent development of DNA hybridization microarray technology, which was primarily used to map disease alleles in genome-wide association studies, focus shifted to the interrogation of hundreds of thousands of single-nucleotide polymorphisms (SNPs)6. Despite the ability to analyse human population structure at a much higher resolution at both local and regional scales, ascertainment biases that are inherent in the application of SNP-genotyping microarray data limit power for inferring the evolutionary processes that underlie the patterns of genetic diversity. These limitations primarily result from the design of the arrays, which only interrogate SNPs that have been discovered in small panels of individuals from a few populations and therefore miss much of the diversity of global human populations7. The much more recent application of massively parallel short-read (that is, second-generation) sequencing technology overcomes many of the limitations of earlier methodologies. The collection of contiguous DNA sequence data from multiple chromosomes at thousands of loci reduces ascertainment biases and substantially increases power to infer demographic processes, regardless of whether such data are obtained from the entire genome or from large portions of it. However, many challenges remain with regard to data quality and analyses. In particular, second-generation sequencing introduces errors at a fairly high rate relative to traditional Sanger sequencing, and methods that fully exploit the historical information in the genome are still underdeveloped8.

NATURE REVIEWS | GENETICS

ADVANCE ONLINE PUBLICATION | 1 © 2014 Macmillan Publishers Limited. All rights reserved

Table 1 | Types of genetic data that are used to infer population history

Data type: mtDNA
Uses and advantages:
• The absence of recombination allows reconstruction of a gene tree
• Smaller Ne than autosomal DNA, which allows better discrimination between populations
• Samples from many thousands of individuals can be characterized at low cost
• High copy number makes it amenable for ancient DNA extraction and analyses
Limitations:
• A single genealogy contains little information about the underlying population history
• Likely to be subjected to the effects of natural selection
• High uncertainty in mutation rates

Data type: NRY
Uses and advantages:
• The absence of recombination allows reconstruction of a gene tree
• Smaller Ne than autosomes, which allows better discrimination between populations
• Samples from many thousands of individuals can be characterized at low cost
Limitations:
• A single genealogy contains little information about the underlying population history
• Likely to be subjected to the effects of natural selection
• High uncertainty in mutation rates and mutation model
• Ascertainment bias results from the genotyping of specific SNPs

Data type: Autosomal STRs
Uses and advantages:
• Hundreds of independent STRs can be genotyped in many individuals, which reduces the effect of evolutionary stochasticity
• Their high mutation rate is useful for inferring recent demographic events and for distinguishing between closely related populations
Limitations:
• Limited inference of demographic events at deep timescales
• High uncertainty in mutation rates and mutation model

Data type: SNP microarrays
Uses and advantages:
• Hundreds of thousands of SNPs can be genotyped in a single experiment
• Unprecedented resolution of population structure
Limitations:
• Large ascertainment bias results from the haphazard way by which SNPs were discovered
• Less powerful for making inferences in populations that are diverged from those in which SNPs were discovered

Data type: Second-generation sequencing
Uses and advantages:
• Massive amounts of relatively unbiased sequence data can be obtained from targeted regions or entire genomes compared with Sanger sequencing
• High throughput and does not require a targeted pre-PCR step, which allows sequencing of ancient DNA
• Lowest per-base cost of any current sequencing methodology
Limitations:
• Relatively error prone compared with Sanger sequencing
• Biases may arise with regard to regions that are preferentially sequenced
• Sequencing is through short reads (100–150 bp), which restricts the use of methods that require haplotype-phased data

Data type: Third-generation sequencing
Uses and advantages:
• Can generate long sequence reads (>10 kb)
• Some methods can sequence DNA from single cells, which is particularly useful for very ancient samples
• Long reads may also allow de novo assembly and thus reduce reference biases
Limitations:
• Per-base cost is currently more expensive than second-generation sequencing
• Bioinformatic tools have not yet been fully developed to cope with the increased read length

mtDNA, mitochondrial DNA; Ne, effective population size; NRY, non-recombining portion of the Y chromosome; SNP, single-nucleotide polymorphism; STR, short tandem repeat.

Non-recombining portion of the Y chromosome (NRY). The middle ~95% of the Y chromosome that is passed from father to son and that does not undergo recombination during meiosis, thereby allowing inheritance of genetic ancestry to be traced exclusively down the paternal line.

Uniparentally inherited systems Genetic material in organisms with distinct sexes that is passed on to offspring through inheritance only from one sex; that is, mitochondrial DNA and the non-recombining portion of the Y chromosome.

Short tandem repeat (STR). A DNA sequence that contains a variable number (typically ≤50) of tandem repeated short sequence motifs of 2–6 bp, such as (GATA)n.

Population structure The distribution of individuals into partially isolated local subpopulations or demes that are interconnected by migration.

Phylogeographical approaches Methods that use the geographical distribution of genetic lineages, which are deduced from phylogenetic methods, to infer the demographic history of a set of individuals or populations.

Model-based inference methods Analyses that specify demographic models, investigate the model that best fits the genetic data and infer parameters of interest (such as population size changes, divergence times and migration events) for the best-fitting model.

Anatomically modern humans (AMHs). Individuals that are classified as Homo sapiens on the basis of the set of morphological characteristics that distinguish them from other, now extinct, members of the genus Homo (that is, archaic humans). According to the fossil record, AMHs emerged ~200,000–150,000 years ago.

As whole-genome sequencing (WGS) data have begun to emerge, some inferences that were based on earlier methodologies have been confirmed, whereas others have been overturned or must now be reinterpreted in the context of more complex models. Here, we review examples of important historical questions and examine the contribution that WGS data and other large-scale genomic data sets have made to these debates. Although studies of uniparentally inherited systems9 and SNP arrays6 have shaped the context of some of the debates discussed here, we do not provide a thorough review of these areas. We also do not review arguments for the relative merits of phylogeographical approaches compared with more quantitative model-based inference methods that have been fiercely debated10,11,12. Rather, we focus on methods that have proved useful or that have the potential to take full advantage of WGS data for making historical inferences.

Origins of anatomically modern humans
Perhaps the most controversial questions that have been addressed with genetic data are when, where and how members of the genus Homo underwent the transition from archaic humans to anatomically modern humans (AMHs). Before the advent of genetic data, there were two prevalent competing models based on fossil data: the multiregional evolution (MRE) model and the recent African origin (RAO) model13,14. In the MRE model, anatomically modern features arose in various parts of Eurasia where archaic humans lived, and these features were brought together in a single anatomically modern form as a result of gene flow and natural selection that occurred in the past ~1–2 million years14. By contrast, the RAO model posits that anatomically modern features arose specifically in Africa, which resulted in the emergence of AMHs ~200,000 years ago. Subsequently, some AMH individuals moved into Eurasia and completely replaced all other existing archaic Homo spp., such as Neanderthals, without any interbreeding.

Reciprocal monophyly The phenomenon whereby all lineages within a species are genealogically closer to each other (in this context, on the basis of the sharing of common genetic variants) than they are to any lineages in other species that are considered in a phylogeny.

Clades Groups of entities (such as genes or organisms) in a phylogenetic tree that have all arisen from a common ancestor.

Introgression Gene flow between populations or species whose individuals hybridize.

Range expansions Increases in the geographical distribution of a population through time from some region of origin.

Linkage disequilibrium (LD). The nonrandom association of alleles that are carried at different loci. LD can arise for various reasons (such as novel mutations, genetic drift, natural selection and admixture), but recombination is the main process that removes it.
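As a concrete illustration of the definition above, the following minimal Python sketch computes two common pairwise LD summaries from phased haplotypes: the coefficient D = p_AB − p_A·p_B and the squared correlation r². The function name and the toy haplotypes are hypothetical and are not taken from any study cited here; note also that this LD coefficient D is unrelated to the D statistic used to detect admixture.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    `haplotypes` is a list of (a, b) tuples holding the 0/1 alleles that
    one chromosome carries at locus A and at locus B (hypothetical input).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    # Frequency of chromosomes carrying allele 1 at both loci
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # zero when alleles associate at random
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared


# Toy data: complete association versus random association
perfect = [(0, 0)] * 5 + [(1, 1)] * 5
shuffled = [(0, 0), (0, 1), (1, 0), (1, 1)] * 3
print(linkage_disequilibrium(perfect))
print(linkage_disequilibrium(shuffled))
```

In this sketch, recombination would act by converting the `perfect` configuration towards the `shuffled` one over generations, which is why the decay of LD carries information about time.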

Founder events Scenarios in which a new population is founded by a small number of incoming individuals. Similarly to a bottleneck, the founder effect severely reduces genetic diversity and increases the effect of random drift.
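The loss of diversity under repeated founder events can be sketched with the classical Wright–Fisher expectation that heterozygosity decays by a factor of (1 − 1/(2N)) per generation in a population of N diploid individuals. The Python example below is an illustration under strong simplifying assumptions (no mutation, no migration, no regrowth between events); the function name and all parameter values are hypothetical.

```python
def expected_heterozygosity(h0, founder_size, generations_per_event, n_events):
    """Expected heterozygosity after each of `n_events` successive founder events.

    Assumes each event holds the population at `founder_size` diploid
    individuals for `generations_per_event` generations, with drift only.
    """
    decay = (1.0 - 1.0 / (2.0 * founder_size)) ** generations_per_event
    trajectory = [h0]
    for _ in range(n_events):
        trajectory.append(trajectory[-1] * decay)
    return trajectory


# Hypothetical values: starting heterozygosity 0.5, 50 founders, 1 generation each
for step, h in enumerate(expected_heterozygosity(0.5, 50, 1, 5)):
    print(step, round(h, 5))
```

The decline is geometric, but when the loss per event is small it is close to linear over a modest number of events, which is the kind of pattern described for heterozygosity against distance from Africa.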

Admixture Gene flow between two or more groups that have been separated for a long enough period of time to be genetically distinct.

Coalescence A process that describes the genealogy of chromosomes or genes under a particular demographic model. The genealogy is constructed backwards in time and starts with the present-day sample. Lineages coalesce until the most recent common ancestor of the sample is reached.
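The backwards-in-time construction described above can be sketched for the simplest case, the standard neutral coalescent in a constant-size population. With time measured in units of 2N generations, k lineages wait an exponentially distributed time with rate k(k − 1)/2 before the next coalescence, and the expected time to the most recent common ancestor of n lineages is 2(1 − 1/n). The code below is an illustrative sketch only; the function names, sample size and seed are assumptions, and real analyses use far richer demographic models.

```python
import random


def simulate_tmrca(n_lineages, rng):
    """Simulate the time to the most recent common ancestor (units of 2N generations)."""
    t = 0.0
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / 2.0      # number of lineage pairs that can coalesce
        t += rng.expovariate(rate)    # waiting time until the next coalescence
        k -= 1                        # two lineages merge into one
    return t


def expected_tmrca(n_lineages):
    """Theoretical expectation under the same constant-size model."""
    return 2.0 * (1.0 - 1.0 / n_lineages)


rng = random.Random(1)                # fixed seed so the sketch is repeatable
replicates = [simulate_tmrca(10, rng) for _ in range(20000)]
print(sum(replicates) / len(replicates), expected_tmrca(10))
```

The large spread among replicates in such simulations is one way to see why a single non-recombining locus gives only a noisy view of population history.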

Early single-locus studies. The landmark mtDNA study2 in 1987 had a large influence on the debate on human origins, which had remained unresolved given the fossil evidence that was available at the time. The report of a phylogenetic tree that relates all human mtDNA to a common African ancestor that lived ~200,000 years ago was interpreted to strongly support the RAO model and arguably drove the debate for the next two decades. A recent African origin of mtDNA that is in line with both the 200,000 years ago figure above and the greater diversity observed in African mtDNA than in non-African mtDNA was confirmed in various subsequent mtDNA studies using increased sequence resolution15,16. Meanwhile, results from the NRY17,18 and from individual autosomal loci were also consistent with an African origin of AMHs19. In addition, ancient mtDNA was isolated and sequenced from a range of geographically disparate Neanderthal fossils20–22 and was shown to be genealogically distinct from all known extant mtDNA sequences, which was initially interpreted to indicate a lack of interbreeding between AMHs and Neanderthals. However, it was also recognized that single-locus data were not inconsistent with alternative models of human origins, such as the MRE model and intermediate models that recognized Africa as the chief source of anatomically modern features while allowing some degree of gene flow with archaic humans23,24. The reciprocal monophyly of human and Neanderthal mtDNA clades did not necessarily rule out some archaic introgression into modern human genetic diversity25, and one study26 showed that 50–100 independent loci would be required to reliably reject a model of archaic introgression. However, it has also been argued that, under models with range expansions27, Neanderthal mtDNA should be detectable with low levels of interbreeding. Intriguingly, a small number of autosomal single-locus studies began to identify deep diversity within modern humans that, together with a lack of evidence for recombination between the most ancient branches, suggested low levels of archaic introgression28.

Multilocus genome-wide data. As more genome-wide data from globally distributed modern-day individuals were becoming available, there was an emerging pattern of decreased genetic diversity as a function of the distance from eastern or southern Africa, as shown by a fairly linear decrease in heterozygosity29,30 and increase in linkage disequilibrium (LD)31. A commonly asserted explanation for this pattern that is consistent with the RAO model is that AMH populations experienced a series of founder events as they expanded out of Africa, which is known as the serial bottleneck model. However, the seemingly linear decrease in genetic diversity at increasing distance from sub-Saharan Africa is not necessarily inconsistent with other models32, particularly those that include low levels of archaic admixture33. Indeed, studies using multilocus sequence data and coalescence-based simulations that are more sensitive to identifying potentially introgressed DNA segments detected a possible contribution of up to 5% in modern Eurasian and even African genomes from a divergent hominin group (or groups)34,35.

Whole-genome sequencing of extinct hominins. In 2010, a group published draft genomes of two extinct species that had their DNA successfully extracted from bone and/or tooth material: Neanderthals36 and a previously unknown form called Denisova, which was named after the cave where remains were discovered in the Altai Mountains of Russia37. Two major findings were that the Denisovan and Neanderthal genomes are more closely related to each other than either is to the genomes of AMHs, and that all non-Africans shared 1–4% of Neanderthal DNA, whereas Melanesians also shared 3.5% of their DNA with a Denisovan-like population. The latter inference was made primarily on the basis of the conceptually simple but powerful D statistic38, which showed that chromosomes from non-African individuals shared, on average, more derived alleles with those from Neanderthals and/or Denisovans than chromosomes from African individuals. This was interpreted as evidence for two interbreeding events: one between Neanderthals and the ancestors of all non-Africans in the Middle East, and the other between the ancestors of Melanesians and a Denisovan-like population. Another explanation for the excess sharing of SNPs between non-Africans and Neanderthals is a model of ancient population structure in Africa36,39. In this model, there were two or more subpopulations that experienced limited gene flow in Africa; both Neanderthals and present-day Eurasians were derived from one of these subpopulations, and present-day Africans were derived from the other subpopulation.
However, unlike ancient population structure, a model of recent interbreeding is expected to result in a proportion of the genome containing long haplotypes with low divergence between certain Eurasian and Neanderthal chromosomes. The extent of interbreeding and its timing are expected to influence the number and the length of these introgressed haplotypes, respectively. It has since been shown that observed patterns of LD are inconsistent with expectations under models that invoke only ancient population structure in Africa40. A similar claim has also been made on the basis of an allele frequency spectrum (AFS)-based method41. It is important to note that, unlike the Neanderthal case, the excess sharing of derived SNPs between Melanesians and the Denisovan genome cannot be reconciled by a simple model of population structure in Africa37. Using a further derivation of the D statistic and complementary LD‑based methods, one study 42 demonstrated that East Asians possess more genomic regions than Europeans that are likely to have introgressed from Neanderthals. Similarly, the process by which Denisovan DNA entered modern Asian populations may have been complex. Given the large geographical distance between the Denisova Cave and Southeast Asia, one possibility is that a Denisovan-like archaic species may have once extended over a much larger geographical area (BOX 1).
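The allele frequency spectrum (AFS) mentioned in the paragraph above is simple to tabulate once derived alleles have been polarized against an outgroup. The following Python sketch builds an unfolded, single-population spectrum from a small matrix of phased chromosomes; the function name and the toy data are hypothetical, and AFS-based inference methods fit demographic models to such counts rather than stopping at the tabulation shown here.

```python
def site_frequency_spectrum(haplotypes):
    """Tabulate the unfolded allele frequency spectrum.

    `haplotypes` is a list of equal-length sequences coded 0 (ancestral)
    or 1 (derived). Entry i of the result counts the sites at which
    exactly i of the sampled chromosomes carry the derived allele.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    sfs = [0] * (n + 1)
    for site in range(n_sites):
        derived_count = sum(h[site] for h in haplotypes)
        sfs[derived_count] += 1
    return sfs


# Toy sample: four chromosomes typed at four sites
sample = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
print(site_frequency_spectrum(sample))  # [1, 1, 1, 1, 0]
```

Entries 0 and n correspond to sites that are not polymorphic in the sample and are usually excluded before model fitting.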


Melanesians The putative indigenous inhabitants of the islands of Melanesia in the Pacific, which is a subregion of Oceania that includes the modern-day countries Papua New Guinea, Solomon Islands, Vanuatu and Fiji.

D statistic A statistic that detects admixture by examining patterns of allele sharing between single genomes of two sister populations and a more diverged third population that has putatively experienced gene flow to a greater degree with one of these two sister populations since they diverged; an outgroup is also used to determine the ancestral state of alleles.
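The allele-sharing logic in this definition can be sketched as a count of two site patterns. With sister genomes P1 and P2, a diverged genome P3 and an outgroup that defines the ancestral allele, 'ABBA' sites are those at which P2 and P3 share the derived allele, and 'BABA' sites are those at which P1 and P3 share it; D is the normalized difference between the two counts. The Python example below is a minimal illustration on toy sequences; the function name and data are hypothetical, and published analyses use genome-wide counts together with resampling (such as a block jackknife) to assess significance.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Compute D = (ABBA - BABA) / (ABBA + BABA) from four aligned sequences.

    p1 and p2 are the sister genomes, p3 is the diverged genome and
    `outgroup` supplies the ancestral allele at each site.
    """
    abba = baba = 0
    for a1, a2, a3, anc in zip(p1, p2, p3, outgroup):
        if a3 == anc:
            continue                      # P3 must carry the derived allele
        if a1 == anc and a2 == a3:
            abba += 1                     # derived allele shared by P2 and P3
        elif a1 == a3 and a2 == anc:
            baba += 1                     # derived allele shared by P1 and P3
    if abba + baba == 0:
        return float("nan")               # no informative sites
    return (abba - baba) / (abba + baba)


# Toy alignment: three ABBA sites and one BABA site
outgroup = "AAAAAA"
p3 = "GGGGAA"
p1 = "AGAAAG"
p2 = "GAGGAA"
print(d_statistic(p1, p2, p3, outgroup))  # 0.5
```

Without gene flow, incomplete lineage sorting is expected to produce ABBA and BABA sites in equal numbers, so D is expected to be close to zero; a positive value in this orientation indicates excess sharing between P2 and P3.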

An intriguing alternative model that is not mutually exclusive with models that involve interbreeding between AMHs and Denisovan-like populations comprises admixture between Denisovans and a currently unsampled ‘ghost’ archaic hominin37 that lived at a geographically intermediate location between the Denisova Cave and Melanesia. Examples of these ghost archaic hominins include Homo erectus, which is thought to have resided in Southeast Asia between 1.8 million years ago and 3 million years. The recent publication of a high-coverage (that is, 52×) Neanderthal genome from the Denisova Cave has provided more refined insights into many of the archaic admixture processes mentioned above45. First, analysis of this genome gave a revised estimate of 1.5–2.1% of Neanderthal-derived DNA in non-Africans. In addition, the effective population size (Ne) of Neanderthals was shown to be ~20% of that of AMHs. Neanderthals and AMHs were estimated to have diverged from a common ancestor 383,000–257,000 years ago, compared with a Neanderthal–Denisovan split time of 236,000–190,000 years ago. Second, the observation that East Asians have greater levels of Neanderthal admixture than Europeans42 was confirmed, and there was also evidence of low levels of Neanderthal gene flow (~0.5%) into Denisovans. Third, a model that involved introgression into Denisovans from a ghost archaic population44 was found to provide the best fit of all the WGS data that have so far been assembled. This ghost archaic population is estimated to have diverged from the ancestors of AMHs, Neanderthals and Denisovans 4–0.9 million years ago and to have contributed to 0.5–0.8% of the Denisovan genome. FIG. 1 shows an updated model46 of archaic introgression, including a possible contribution from H. erectus to Denisovans. Further support for these models and their parameters awaits discovery of ancient DNA from more divergent hominin forms.

Box 1 | Mechanisms of Denisovan introgression
Two of the major questions to arise from the finding of an excess number of shared single-nucleotide polymorphisms (SNPs) between the Denisovan genome and the genomes of anatomically modern human (AMH) individuals from Melanesia37 are where and when interbreeding may have occurred. One possibility is that the Denisova Cave specimen was a representative of a more widespread population and that a dispersing group of AMHs encountered and interbred with such a population somewhere in Southeast Asia152. This scenario fits with the observation that Denisovan genetic material is only present in individuals from Melanesia, Australia and some islands of Southeast Asia. The absence of Denisovan genetic material in other parts of Asia is attributed to a second wave of AMHs dispersing into Asia that did not involve interbreeding with a Denisovan-like population. By contrast, there is evidence45,153 for a very low level of Denisovan ancestry in southern continental East Asian populations, which suggests a model that involves two independent Denisovan introgression events153. Other possible models involve back migration from admixed Melanesians and continuous admixture throughout East and Southeast Asia46. However, it is important to consider the striking concordance between the cline of Denisovan ancestry152 and the steep Asian–Melanesian ancestry cline in eastern Indonesia that has been detected at both autosomal and X-linked SNPs154. The Asian–Melanesian ancestry cline matches the phenotypic gradient that was initially observed by Alfred R. Wallace between Malaysians and Papuans, and the cline was attributed to the mixing of two long-separated ancestral source populations: one descended from the initial Melanesian-like inhabitants of the region and the other related to Asian groups that immigrated more recently during the Paleolithic period and/or with the spread of agriculture155.
This suggests that any introgression signal that derives from interbreeding between AMHs and a Denisovan-like population west of the Indonesian ancestry cline (including the Asian mainland) may have been either highly diluted (which is consistent with the inference of a low level of Denisovan DNA in some East Asian populations153) or completely erased by recent dispersals of non-admixed AMH populations. This would make it difficult to determine where or when archaic admixture occurred before these agricultural expansions. Interestingly, a chromosome 21 sequence and a series of genome-wide SNPs that were obtained from a 40,000-year-old AMH specimen from Tianyuan Cave outside Beijing, China, lack any sign of Denisovan ancestry156.

Out of Africa
Given that most of our ancestry traces to Africa, an important area of research has focused on when and by which routes early AMH migrants dispersed from that continent. As the reconstructed mtDNA genealogy consisted of a single branch that led to all non-African mtDNA lineages, much of the early mtDNA work was interpreted to be consistent with a single dispersal out of Africa into Eurasia and Oceania, which was contrary to some interpretations of the archaeological record47. On the basis of the geographical distribution of putative ‘founder’ non-African mtDNA haplogroups, which were suggested to have originated in South Asia ~60,000 years ago48, this single migration out of Africa was posited to have taken a southern route, possibly through the horn of Africa because the Levantine corridor would have probably been impenetrable at this time. As described above, both early autosomal STR marker data and analyses of LD decay based on SNP array data were also interpreted to support a single-wave serial bottleneck model29,30.

An Australian Aboriginal genome. One of the more compelling pieces of evidence for multiple waves of migration came from a WGS study of a 100-year-old lock of hair from an Australian Aborigine49. The pattern of segregating sites between this genome and African, European and Asian genomes as assessed by the D4P statistic was shown to be consistent with an early branching of Australian Aborigines (that is, before a European–Asian split50), with the initial non-African divergence occurring ~75,000–62,000 years ago. It is important to note that these results did not necessarily provide evidence for separate dispersal waves out of Africa; rather, they only indicate two major waves of movement of early non-African populations.

Divergence time of African and non-African genomes. The initial timing of the dispersal from Africa, regardless of whether such dispersal was in a single wave or in


Derived alleles Alleles that arise in a population following the mutation of an ancestral allele. Ancestral alleles can be distinguished from derived alleles, as the ancestral allele will typically be present in an outgroup species (for example, the chimpanzee sequence when examining variants in humans).

Allele frequency spectrum (AFS). A distribution of the counts of single-nucleotide polymorphisms with a given frequency in a single population or in multiple populations.

‘Ghost’ archaic hominin Archaic hominin species for which there is no current available genetic sequence data, although there may be fossil evidence. Such species may improve the current fit, or at least provide an equally good fit, when considering demographic models that examine anatomically modern humans and archaic species for which there are sequence data.

[Figure 1 map labels: Denisova Cave; Wallace's phenotypic boundary; admixture between Denisovans and H. erectus; admixture with archaic African hominins; Neanderthal admixture; admixture with Denisovan-like population or H. erectus. Key to possible ranges of archaic forms: Neanderthals, Denisovans, H. erectus, archaic African hominins.]

Figure 1 | A possible model of archaic introgression based on the latest analysis using second-generation sequencing. Red arrows indicate initial colonization events across the Old World after the origination of anatomically modern humans (AMHs) in Africa, including two movements into Asia. Approximate positions of introgression events are represented by coloured circles and are not intended to be accurate. This model portrays the hypothesis that portions of the Denisovan genome entered the human gene pool through hybridization with more widespread populations of archaic hominins (such as Homo erectus), which also interbred with the Denisovan population. Models that involve interbreeding directly between Denisovans and AMHs can be found in REF. 46. The black arrow shows a more recent expansion of Asian farming populations (that is,

200,000 years ago until as recently as 13,000 years ago74,75. However, the warmer and more humid environments of much of sub-Saharan Africa result in poor DNA preservation in ancient samples76; thus, future inferences about archaic introgression in Africa will probably have to be made by using additional WGS data from extant populations and by testing robust models of introgression with computational methods34,35.

Europe as a genetic ‘palimpsest’
Archaeological evidence indicates that three major prehistoric demographic processes that involve AMH populations have probably affected Europe (FIG. 3a). The first is the initial colonization of the continent through the Middle East ~45,000 years ago77,78. The second involves the retraction of, followed by the re-expansion from, three or four southern refugia during the Last Glacial Maximum (LGM) ~26,000–15,000 years ago towards the end of the Paleolithic period79. The third involves the movement of farming culture into Europe (perhaps from Anatolia) during the Neolithic period 9,000–6,000 years ago. The relative genetic contribution of these three processes to the modern-day European gene pool is a topic of great debate80.

[Figure 3 labels. Part a, sample annotations: 5,000 years ago (65 Mb); 5,300–4,400 years ago (27–97 Mb); 5,300 years ago (7.5× WGS); 7,000 years ago (16–41 Mb). Part b, axes: PCA1 (0.6%) and PCA2 (0.36%); reference groups: Finns, CEU, Great Britons, Tuscans, Iberians.]

Figure 3 | Second-generation sequencing in ancient Europeans. a | The map of Europe shows the major prehistoric migration events that may have influenced modern-day European genetic diversity. These events, in chronological order, are the initial colonization of Europe ~40,000 years ago from the Middle East (blue oval and black arrows); the contraction of humans into four major refugia during the Last Glacial Maximum ~18,000 years ago, followed by the subsequent recolonization of more northern and central parts of Europe (grey ovals and arrows); and the movement of Neolithic farmers ~9,000 years ago from Anatolia (red oval and arrows). Also shown are the geographical locations of the Neolithic Scandinavian farmer108 (red circle) and the Late Neolithic or Early Copper Age Ötzi sample105 (red diamond), as well as two Mesolithic107 (blue triangles) and three Neolithic108 (blue circles) hunter-gatherer samples, from which second-generation sequencing data were obtained. The amount of sequence generated for each specimen in megabases (Mb) or the coverage for whole-genome sequencing (WGS) is shown in parentheses. b | A single-nucleotide polymorphism (SNP)-based principal component analysis (PCA) of ancient DNA samples (shown in part a) and 375 reference European samples with WGS data from the 1000 Genomes Project172 is shown. Details on how this plot was generated can be found in Supplementary information S2 (box). Despite being geographically disparate, samples from hunter-gatherers cluster together, as do those from putative farmers, but the two clusters are clearly separated from each other. Unlike the samples from putative farmers that show a similarity to southern Europeans, the five hunter-gatherer samples do not overlap with any samples in the reference data set, which suggests that they have at least some ancestry component that is not represented in modern-day populations and that might have been lost when Neolithic farmers moved into Europe. CEU, northern Europeans from Utah. Part a is modified, with permission, from REF. 170 © (2004) Annual Reviews and REF. 171 © (2010) Elsevier.

Evidence from contemporary Europeans. The debate arguably began with the observation of a southeast-to-northwest cline in the synthetic maps81 that integrated classical genetic and geographical data. These maps were interpreted by some to be consistent with a demic diffusion model of Neolithic farmers. However, other authors pointed out that different demographic processes occurring at different times could produce similar clines80 and that clines in such maps may not actually reflect population movements82. mtDNA83–86, NRY87–89 and autosomal STR studies90 of contemporary Europeans have also been inconclusive, and estimates of a Neolithic contribution to Europeans range from 20% to 70%. Rather than showing a southeast-to-northwest cline, autosomal SNP data in contemporary Europeans are primarily structured by geography91. This pattern, which is characteristic of an isolation-by-distance model, seems to reflect the aggregate signal of small amounts of recent genetic drift at tens of thousands of SNPs. However, haplotypes in the same SNP data show evidence of greater diversity in the south than in the north92, and such a pattern is equally consistent with the Paleolithic, Neolithic and post-LGM expansions discussed above, as well as with known migrations from North Africa93. A recent analysis of shared tracts of ancestry or identity by descent in a large sample of contemporary Europeans suggests that the Barbarian invasions following the collapse of the Roman Empire may have also had some influence94.

Evidence from ancient DNA. Ancient mtDNA has provided a more direct assessment of European population history95. There seem to be substantial mtDNA discontinuities between Paleolithic, Mesolithic and Neolithic hunter-gatherers compared with Neolithic farmers96,97. However, there has also been a noticeable change in the mtDNA composition of Early Neolithic farmers compared with that of farmers of the Middle and Late Neolithic period and beyond98–101. There are also differences between Neolithic sites from southern and Central Europe102,103 that are consistent with archaeological evidence that suggests that the spread of Neolithic farming consisted of various phases and routes. Interestingly, 84% (that is, 26 of 31) of Y chromosomes that have been identified from Neolithic individuals are members of a rare haplogroup known as G2a, which is currently only found at a low frequency in populations from the Caucasus, the Middle East and around the Mediterranean95,100,104. The apparently lower NRY diversity relative to mtDNA diversity during the Neolithic period may reflect sex-specific differences in the colonization process80.

WGS was recently carried out on the Tyrolean Iceman (also known as Ötzi), which is a remarkably well-preserved specimen dated to the Late Neolithic or Early Copper Age (that is, ~5,000 years ago) that was found in the eastern Alps near the Austrian–Italian border. Interestingly, Ötzi was shown to possess a G2a NRY105, whereas his autosomal genome was shown to be genetically most similar to modern-day humans from


Cline. In the context of genetic data, the exhibition of regular and directional variation in genotype or allele frequencies across a geographical region.

Demic diffusion model. A migration model in which populations diffuse into new geographical areas and displace or interbreed with indigenous populations.

Isolation-by-distance model. A model in which the amount of gene flow between two locations decreases as a function of distance. At equilibrium, this model predicts that genetic differentiation increases as a function of geographical distance.

Identity by descent. The phenomenon whereby two alleles are a copy of the same allele that was carried in an ancestral individual.

Principal component analysis (PCA). A statistical method that is used to simplify a complex data set by transforming a series of correlated variables into a smaller number of uncorrelated variables known as principal components.

Sardinia, an island that is ~800 km from where Ötzi was found. This suggests that there has been a significant change in the distribution of genetic variation on mainland Europe since at least the Late Neolithic period and that the genetically isolated Sardinians106 may have preserved some of this ancient genetic ancestry.

Two studies using a 'shotgun' sequencing approach successfully obtained 15–100 Mb of autosomal sequence data per individual; the first was obtained from two ~7,000-year-old Mesolithic hunter-gatherers in northwestern Spain107 and the second from three Neolithic (~5,300–4,000-year-old) hunter-gatherers and one Neolithic (~5,000-year-old) farmer who were separated by only 400 km in Scandinavia108 (FIG. 3a). Much like Ötzi, the farmer was most similar to contemporary southern Europeans such as Tuscans and Basques, although he was distinct from Middle Eastern populations. Although the five hunter-gatherer individuals were closest to modern-day northern Europeans on a principal component analysis (PCA) plot, they did not overlap with any sampled contemporary northern European population (FIG. 3b; see Supplementary information S2 (box)). It is also noteworthy that the ancient hunter-gatherer genomes do not seem to be similar to the genetically isolated Basque or Sardinian populations106, which brings into question previous admixture analyses that used these modern populations as representatives of Paleolithic populations88,90,109.

In summary, these results suggest that the Neolithic genomic signature decreases from south to north in contemporary Europeans, that southern Europeans are characterized mainly by Neolithic DNA from incoming farmers, and that northern Europeans still preserve some portion of their hunter-gatherer ancestry. More samples are needed from Central Europe to determine whether this is a continent-wide or region-specific pattern.
Next-generation sequencing of additional ancient samples should continue to provide important insights into the demographic processes that accompany the transition from a hunter-gatherer lifestyle to a farming way of life in Europe.
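
As a rough illustration of the kind of SNP-based PCA referred to above, the following Python sketch standardizes a matrix of genotype counts and extracts the leading principal components by singular value decomposition. It is not the procedure used to produce FIG. 3b (which is described in Supplementary information S2); the function name `genotype_pca`, the simulated allele frequencies and the sample sizes are all assumptions made for the example.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of an (individuals x SNPs) matrix of 0/1/2 allele counts.

    Hypothetical helper for illustration only.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (freq > 0) & (freq < 1)              # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    # Centre each SNP and scale by its expected binomial standard deviation.
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)       # fraction of variance per component
    return coords, explained[:n_components]

# Simulated data (assumption): two populations of 20 diploid individuals
# whose allele frequencies have drifted apart at 500 unlinked SNPs.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))

coords, explained = genotype_pca(np.vstack([pop_a, pop_b]))
pc1_a, pc1_b = coords[:20, 0], coords[20:, 0]
print("mean PC1, population A:", pc1_a.mean())
print("mean PC1, population B:", pc1_b.mean())
print("variance explained by PC1 and PC2:", explained)
```

In this toy setting the two simulated populations separate along the first component, which mirrors the logic of reading clusters (for example, hunter-gatherers versus farmers) off a PCA plot. Real analyses of low-coverage ancient genomes additionally have to handle missing data and sequencing error, which this sketch ignores.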

Into the Americas

There is a general consensus that the Americas, which are the last major landmasses to be populated by AMHs, were colonized from Asia through Beringia. However, there are remaining questions about the sources, timing and number of migration waves110. The Beringian land bridge has connected Siberia with North America at periods of low sea levels, the most recent time of which was 30,000–13,000 years ago. Abundant Clovis cultural material in the Americas as early as ~13,000 years ago is consistent with initial migrations over land that coincided with the parting of the Cordilleran and Laurentide ice sheets. However, there are older archaeological sites that predate Clovis culture, such as Monte Verde in southern Chile that has been dated to ~14,800 years ago111, that indicate earlier migrations, perhaps by a coastal route. On the basis of linguistic evidence, one study112 proposed a three-wave settlement of the Americas that involved the ancestors of all modern-day Amerind, Na-Dene and Eskimo–Aleut (that is, Inuit) speakers.

Although authors who have studied mtDNA, the NRY and autosomal STRs have argued for various migration scenarios, many agreed that the majority of Native American genetic diversity derives from a limited number of Asian ancestors that migrated through Beringia113–116. Nevertheless, explicit demographic modelling has provided support for recurrent bidirectional gene flow across Beringia117.

Sequencing of a 4,000-year-old Paleo-Eskimo. In 2010, one study118 presented WGS data from a Paleo-Eskimo individual from the now extinct Saqqaq culture that is dated to ~4,000 years ago. The Saqqaq, the origins of which were mainly unknown before the study, existed in southern Greenland ~4,500–2,800 years ago before it was superseded by the Thule culture, which represents the ancestors of today's Inuits. Comparison of the Saqqaq genome to autosomal SNP data from a panel of non-African populations showed that this Paleo-Eskimo individual is most similar to three populations in the Arctic region of Siberia (including the Chukchi, who reside just west of the Bering Strait) rather than to modern-day New World populations. Thus, these results were qualitatively consistent with a migration from Siberia into the Americas ~5,000 years ago that was distinct from the movements that founded modern-day Amerinds, Inuit and Na-Dene, although no formal statistical test was applied. A later analysis of a larger SNP array data set from the Americas119 supported the three-wave model112. The study also showed substantial Amerind admixture in modern-day Eskimo–Aleut and Na-Dene populations. In contrast to the WGS study of the Paleo-Eskimo individual118, formal admixture tests suggested that the Saqqaq shares deep ancestry with a Na-Dene population (that is, the Chipewyan) and may thus have descended either from the same migration event into North America or from the same ancestral population.

Mixed western Eurasian ancestry of Native Americans. Another potentially interesting aspect of the Paleo-Eskimo genome was the lack of an apparent western Eurasian component that is common to many modern-day Native American groups. This component has generally been attributed to post-Columbian gene flow from Europeans. However, one study120 recently sequenced the entire genome of a 24,000-year-old specimen from South-Central Siberia at 1× coverage. Somewhat surprisingly, various analyses of this genome suggested that this individual has genetic components of both contemporary western Eurasians and Native Americans (~14–38%), although it is distinct from modern-day East Asian populations. A similar pattern was revealed in the genome of a ~17,000-year-old post-LGM specimen from South-Central Siberia that was sequenced at 0.1× coverage. One explanation for this pattern of sharing is that Paleolithic Siberians once had genetic diversity that is retained only in contemporary western Eurasians and that Native Americans descended from a population with both East Asian and Paleolithic Siberian ancestry. In addition, application of the D statistic showed that all Amerind


groups are equally similar to the ~24,000-year-old specimen, which suggested that admixture between East Asian and Siberian ancestors occurred before the initial migration into the Americas, perhaps on the Beringia land bridge. The lack of an obvious western Eurasian component in the Paleo-Eskimo specimen118 is consistent with a multiple-wave model that involves different source populations, although the Paleo-Eskimo and South-Central Siberian genomes have not yet been analysed simultaneously.

Although WGS data from the ancient specimens discussed above are beginning to reveal a complex demographic history for the peopling of the Americas, we anticipate that much more information will be obtained from directly sequencing the genomes of ancient Native American specimens. Fortunately, there is already a rich body of ancient Native American mtDNA data121, including 14,000-year-old pre-Clovis specimens122.
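
The logic of the D statistic mentioned above can be sketched in a few lines of Python. The example counts 'ABBA' and 'BABA' site patterns across four aligned haploid sequences (two sister populations P1 and P2, a test genome P3 and an outgroup) and reports their normalized difference; a value near zero indicates that P1 and P2 are equally similar to P3. The function name `d_statistic` and the eight-site toy alignment are hypothetical, and published analyses use genome-wide data and typically assess significance with a block jackknife, which is omitted here.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Normalized ABBA-BABA difference from four aligned haploid sequences.

    Hypothetical helper for illustration; the outgroup allele is taken
    as ancestral ('A') and the other allele as derived ('B').
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        alleles = {a, b, c, o}
        if len(alleles) != 2 or "N" in alleles:
            continue                      # keep only clean biallelic sites
        if a == o and b == c and b != o:
            abba += 1                     # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1                     # P1 and P3 share the derived allele
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba), abba, baba

# Toy alignment (assumption): eight sites, one character per site.
p1       = "AGAACTCG"
p2       = "GAGACCTG"
p3       = "GGGATCCG"
outgroup = "AAAACTTA"

d, n_abba, n_baba = d_statistic(p1, p2, p3, outgroup)
print(f"ABBA={n_abba} BABA={n_baba} D={d:.2f}")
```

In the Native American application described above, P1 and P2 would be two Amerind groups and P3 the ~24,000-year-old Siberian genome; finding D close to zero for every pair of Amerind groups is what is meant by all groups being equally similar to the ancient specimen.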

Singleton. A genetic variant that is present in only a single chromosome from the sample analysed.

Sequentially Markovian coalescent model. A simplification of the standard model of coalescence with recombination, such that the addition of crossover events while moving along the genome has a Markovian structure; that is, the addition of recombination at a given position along the genome depends only on the previous genealogy in which recombination was considered, rather than on the whole ancestral recombination graph from the beginning of the chromosome.

Haplotype-phased. Pertaining to DNA sequence or genotyping data in an individual for which the combination of alleles contributed by each parental chromosome is resolved.

Current problems and future developments

Although the introduction of WGS by next-generation sequencing technologies has led to tremendous amounts of new data, it is well appreciated that the error rates are substantially higher than those of Sanger sequencing. Ignoring damage to the actual DNA (which is reviewed in the context of ancient DNA in BOX 3 and REFS 123,124), these errors can arise during the PCR amplification step, the process of sequencing itself or bioinformatic analyses125. Most errors are expected to be random and will thus appear as false singleton calls. In addition, most WGS studies that have so far been carried out are of low to medium coverage (that is, 5–20×), which provides only a limited ability to accurately call the genotype at heterozygous sites126.

The major challenge in processing second-generation sequencing data is to effectively eliminate false positives using bioinformatic approaches that do not undercall true positives. For applications in medical disease genetics, accurate calling of genotypes is important; therefore, strategies to deal with sequencing errors often involve setting conservative thresholds to balance false negatives and false positives. However, it is often the aggregate signal across the genome, particularly the number of singleton polymorphisms, that is important for inferring demography. Thus, most emphasis in population genetics has been placed on the development of algorithms to accurately estimate the AFS126–129 and other aspects of sequence diversity from low-coverage data130–133. These algorithms are designed to avoid biases that are inherent in standard SNP-calling methods that lead to poor estimates of the relative number of rare variants, especially singletons. The estimated AFS can also be used to facilitate individual calls for variants and genotypes, which can then be integrated into downstream population genetic analyses in a probabilistic framework126. However, such methods require either some error model or a known relationship between quality scores and the probability of sequencing errors, the knowledge of which is still somewhat uncertain for second-generation sequencing data125. One study53 attempted to circumvent this problem by using high-coverage whole-exome sequencing data to empirically correct the AFS from low-coverage WGS data.

The ability to correct for the higher error rate in low-coverage data allows finite sequencing resources to be applied to larger sample sizes, which are particularly useful for inferring more recent processes in population history (see Supplementary information S1 (box)). Unfortunately, the cost of even low-coverage WGS is still somewhat prohibitive for most investigators. Therefore, alternative methods such as reduced representation134 and pooled sample sequencing135 may remain useful in the near future, especially as technical hurdles such as unequal DNA concentrations are overcome136.

Coalescent theory informs us that large sample sizes are unlikely to contribute much extra information for inferring more ancient processes in human evolution and that considerable insights can be gained by analysing only one or a few whole genomes. For example, the D statistic38 has proved informative for inferring parameters of the admixture process, and coalescence-based Bayesian analyses of tens of thousands of independent 1-kb sequences from single representative genomes of different populations (that is, as implemented in the program G-PhoCS55) can provide estimates of ancient divergence times, as well as estimates of current and ancestral Ne, with narrow confidence intervals.

Despite this promise, none of the approaches discussed above (that is, the D statistic, G-PhoCS and AFS-based methods) exploit the extra historical information that is provided by incorporating patterns of linkage and recombination. It is notoriously difficult to make inference using the coalescence with recombination8, even for relatively short stretches of sequence in a single population. Fortunately, new methods that incorporate recombination in their analytical framework are being developed to take advantage of WGS data.
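
The earlier point that random sequencing errors appear mainly as false singletons, and therefore distort the AFS, can be illustrated with a small simulation. The Python sketch below builds a set of haplotypes in which most sites are invariant, computes the unfolded AFS, and then recomputes it after flipping alleles at random with a fixed per-base error rate. Everything here is an assumption chosen for the example: the helper names, the 20 chromosomes, the 20,000 sites, the 300 true segregating sites, the error rate of 0.001, and the use of the standard neutral constant-size expectation (weights proportional to 1/k) for the true derived allele counts.

```python
import random
from collections import Counter

def allele_frequency_spectrum(haplotypes):
    """Unfolded AFS: entry k-1 counts sites whose derived allele (1) is on k chromosomes."""
    n = len(haplotypes)
    counts = Counter(sum(site) for site in zip(*haplotypes))
    return [counts.get(k, 0) for k in range(1, n)]

def add_random_errors(haplotypes, error_rate, rng):
    """Flip each allele independently with probability error_rate."""
    return [[allele ^ 1 if rng.random() < error_rate else allele for allele in hap]
            for hap in haplotypes]

def simulate_haplotypes(n_chrom, n_sites, n_segregating, rng):
    """Toy data: derived allele counts drawn with weights 1/k; all other sites invariant."""
    weights = [1.0 / k for k in range(1, n_chrom)]
    columns = []
    for site in range(n_sites):
        column = [0] * n_chrom
        if site < n_segregating:
            k = rng.choices(range(1, n_chrom), weights=weights)[0]
            for i in rng.sample(range(n_chrom), k):
                column[i] = 1
        columns.append(column)
    return [list(row) for row in zip(*columns)]   # one row per chromosome

rng = random.Random(1)
clean = simulate_haplotypes(n_chrom=20, n_sites=20000, n_segregating=300, rng=rng)
noisy = add_random_errors(clean, error_rate=0.001, rng=rng)

afs_clean = allele_frequency_spectrum(clean)
afs_noisy = allele_frequency_spectrum(noisy)
print("true singletons:", afs_clean[0])
print("singletons after adding errors:", afs_noisy[0])
```

Because the vast majority of sites are invariant, almost every error creates a new apparent singleton, whereas the higher-frequency classes of the AFS are barely affected. This is why naive variant calls from error-prone data inflate the rare end of the spectrum, and why methods that model error when estimating the AFS are emphasized above.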
One innovative strategy makes use of hidden Markov models (HMMs) to infer how patterns of coalescence change along the genome as a result of recombination. This family of methods is informative for estimating changes in population sizes through time56, divergence times and levels of gene flow137. One disadvantage of these methods is that, with an increasing amount of input sequence data, the HMM state space quickly becomes intractable, and their application has therefore been limited to a single diploid genome or to a small number of haploid genomes. However, great advances are being made to extend the inference to more recent timeframes by incorporating larger sample sizes138. This has mostly been aided by the ability to approximate the coalescence with recombination by using the sequentially Markovian coalescent model139. Other areas of development that incorporate a model of recombination include analyses of the distribution of identity-by-state tracts54, identity-by-descent tracts140 and haploblock or migrant tracts141. Another complication is that these methods require haplotype-phased sequence data, which are typically obtained through the use of statistical methods142 that can both miss real recombination events and report false ones. One approach is to use library preparation


Box 3 | Using next-generation sequencing to analyse ancient DNA

Obtaining reliable sequence data from ancient DNA (aDNA) specimens is challenging because of DNA degradation after the death of an organism. The degradation process has three major consequences: low levels of DNA, degradation of DNA into small fragments (for example,
