Available online at www.sciencedirect.com

Impact and insights from ancient repetitive elements in plant genomes

Florian Maumus and Hadi Quesneville

Transposable elements and other repeated sequences are predominant contributors to most plant genomes. The vast majority of repeated elements accumulate mutations to the extent of becoming anonymous sequences, also known as ‘genomic dark matter’, which is also thought to contribute significantly to the composition of plant genomes. This review aims to highlight recent methods and analyses suggesting that ancient repeats have profound effects on plant genome biology.

Address
INRA, UR1164 URGI - Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles 78026, France

Corresponding author: Maumus, Florian ([email protected])

Current Opinion in Plant Biology 2016, 30:41–46

This review comes from a themed issue on Genome studies and molecular genetics

Edited by Yves van de Peer and J Chris Pires

http://dx.doi.org/10.1016/j.pbi.2016.01.003

1369-5266/© 2016 Elsevier Ltd. All rights reserved.

From burst to genomic dark matter

Junk DNA (the genomic fraction with no evident function) is paramount in determining the size of eukaryotic genomes [1]. It is composed of repeated sequences (collectively referred to as the repeatome), including transposable elements (TEs), endogenous viruses and tandem repeats, as well as not-yet-annotated DNA dubbed ‘genomic dark matter’ [2]. Because of their relatively high duplication rates compared to other genomic features, TEs are major components of the repeatome [3]. In large plant genomes, TEs are the primary contributors to total DNA content (or C-value); in maize, for example, 85% of the genome is composed of TEs [4]. In general, there is a strong positive correlation between TE content and genome size [5]. TEs follow an evolutionary model dubbed ‘burst and decay’: a given TE family can multiply rapidly in a genome, and this burst is followed by periods of relatively low duplication. Because most TE copies are not functional (meaning necessary to their host), they accumulate mutations independently of each other [6]. This decay is due to the cumulative effect of nested insertions, deletions, short insertions and deletions (indels), as well as transitions and transversions occurring in TE copies. As a result, TE copies become increasingly fragmented and mutated over time until they eventually ‘melt down’ completely into uncharacterized DNA, that is, genomic dark matter (Figure 1). Although recently addressed in plants [7,8], the evolution, putative function and means of detection of these aging repeats are still poorly established. This review summarizes recent findings in this area of research and stresses the importance of considering repeats in an evolutionary framework for a better understanding of plant biology.

Repeatome complexity: different history, different challenges

The complexity of the repeatome varies widely between plants, so different species pose different challenges for repeatome annotation. At one extreme, a hypothetical genome ‘A’ (maize-like) has undergone recent, massive amplification of a few repeat families (Figure 2a). As a result, it accumulates a high number of identical or highly similar elements that constitute templates for inter-element homologous recombination (Figure 2b). Such non-allelic recombination events cause the loss of the intervening DNA unless it contains biologically required functions (e.g. protein-coding genes). Consequently, a large part of the intergenic material that contains more ancient repeats can be eliminated from the genome. The repeatome of genome A is thus mainly composed of high copy numbers of a subset of TEs, while a substantial fraction of older copies has been eliminated owing to a significant turnover of intergenic and pericentromeric DNA. At the other extreme, a hypothetical genome ‘B’ has undergone limited recent TE activity, so that most of its TE fraction is rather ancient and thus fragmented and degenerated. Such copies have a low recombination potential and are present in limited numbers, so that the overall turnover is weak compared to A-type genomes. Comparatively, the complexity of the repeatome is therefore far greater in B-type than in A-type genomes (Figure 2a), and annotating the repeatome of B-type genomes is considerably more challenging.

Figure 1. Repeat decay. Metaphorically, repeats are composted over time towards becoming genomic humus.

Figure 2. Stratigraphy of the repeatome. (a) A-type and B-type genomes present contrasted composition of layers of repeated sequences (panel labels: Genome A: Turnover ++, Recent activity +++; Genome B: Turnover, Recent activity -). (b) Inter-element recombination mediates DNA loss (panel labels: Recent, Ancient, Recombinant DNA, Lost DNA). Young, highly similar TEs are represented as red arrows.

State of the art in repeatome annotation

The annotation of repeated elements in genomes typically follows two main steps. The first step consists in building a library of consensus sequences that are representative of the repeats contained in the assembly. Because TE copies evolve independently from each other, the alignment of several copies of a defined TE family enables the construction of a consensus sequence that resembles the ancestral (i.e. native) sequence. The number of copies and their conservation are evidently crucial for generating consensus sequences from multiple sequence alignments, which makes this step more problematic in B-type genomes. Different types of algorithms enable the construction of consensus sequences of repetitive elements.

The k-mer based tools such as RepeatScout [9] search for repeated k-mers in a genome assembly and extend an alignment on each side of these seeds. By contrast, the clustering-based tools such as those embedded in the REPET package [10] screen for repeated sequences using BLAST and sort them into groups sharing high similarity. We recently showed that combining RepeatScout and REPET performs better than either tool alone for comprehensive repeatome annotation in B-type genomes [8].
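The consensus-building step described above can be illustrated with a minimal sketch. It assumes the copies of a family have already been aligned (for example by a multiple sequence aligner) and applies a simple column-wise majority rule; the function name `majority_consensus` and the toy sequences are hypothetical, and this is not the procedure implemented in RepeatScout or REPET.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-"):
    """Column-wise majority-rule consensus of equal-length aligned copies."""
    lengths = {len(seq) for seq in aligned_copies}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty list of equal-length aligned sequences")
    consensus = []
    for column in zip(*aligned_copies):
        counts = Counter(symbol.upper() for symbol in column)
        # Most frequent symbol wins; ties are broken by character order
        # so that the result is deterministic.
        winner = min(counts, key=lambda s: (-counts[s], s))
        if winner != gap:  # columns dominated by gaps are dropped
            consensus.append(winner)
    return "".join(consensus)


# Five toy copies, each carrying one independent mutation or deletion.
copies = [
    "ACGTACGTAC",
    "ACGTTCGTAC",
    "ACCTACGTAC",
    "ACGTACG-AC",
    "ACGTACGTAT",
]
print(majority_consensus(copies))
```

Because each copy mutates independently, a private mutation is outvoted in its column, which is why the consensus approaches the ancestral sequence when enough well-conserved copies are available, and why it degrades when copies are few and old.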


The genome of the model Brassicaceae Arabidopsis thaliana is a good representative of such ‘sleepy’ genomes. It is small and shrinking, and it shows low recent TE activity [11,12]. In this evolutionary context, combining RepeatScout and REPET to construct a library of repeated elements annotated a greater portion of the A. thaliana genome than either program alone [8]. Qualitatively, this approach was also more efficient at retrieving a reference annotation established using a manually curated library of repeated elements [13]. It was likewise more successful in covering the regions that match 24-nt small RNAs (sRNAs), which are involved in the RNA-directed DNA methylation (RdDM) pathway that targets genomic repeats for epigenetic silencing [14,15]. Additionally, exploratory approaches have proven similarly and consistently efficient. For instance, using relaxed parameters when performing the all-vs-all genome BLAST for the construction of a consensus library with clustering-based tools captures an extended repertoire of repeated sequences [8]. Furthermore, individual repeat copies are by definition more representative of a repeatome than the library of consensus sequences that enabled their detection. As a consequence, an iterative annotation that uses the genomic copies detected by consensus sequences as the input library provided highly sensitive results in A. thaliana [8]. Together, these results illustrate the potential benefits of using combined and/or alternative approaches towards more sensitive annotation of complex repeatomes.

The second step in the annotation of repeated elements consists in using the library of consensus sequences as an in silico probe to search for similar sequences in the genome with dedicated alignment programs such as RepeatMasker [16] and BLASTER [17].
It is, however, challenging to measure the accuracy of repeatome annotations because standard benchmarks are lacking, as highlighted by a recent call to the community [18]. TEs typically have a low G+C content and commonly contain satellite repeats and low-complexity regions that are prone to generating false positives. At the same time, ancient elements have diverged from the consensus sequences in proportion to their age. The goal is therefore an optimal trade-off between sensitivity and specificity: annotating highly divergent elements while discarding unspecific alignments [18]. Because repeated and repeat-derived sequences become increasingly fragmented, alignment-based approaches have intrinsic limitations for annotating the most degraded elements, which can be very short. In addition, the alignment process can be computationally intensive.

The k-mer based approaches look directly for repeated short sequences (mers) of size ‘k’ in genome sequences and hence enable the detection of highly fragmented but highly duplicated sequences. Pollock and colleagues developed a program based on the generation of clusters, or ‘clouds’, comprising highly repeated k-mers together with highly similar though less frequent k-mers. The P-clouds program [19] revealed that about two-thirds of the human genome derives from TEs, although this high sensitivity came with low specificity [20].
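The idea of grouping frequent k-mers with similar, less frequent ones can be sketched as follows. This is a toy illustration only: it is not the P-clouds algorithm, and the function names, the single-mismatch neighbourhood and the two count thresholds (`core_min`, `outer_min`) are assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def one_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Yield all k-mers that differ from `kmer` at exactly one position."""
    for i, original in enumerate(kmer):
        for base in alphabet:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def build_clouds(counts, core_min, outer_min):
    """Toy clouds: each frequent 'core' k-mer plus its observed
    one-mismatch neighbours that occur at least `outer_min` times."""
    clouds = {}
    for kmer, n in counts.items():
        if n >= core_min:
            members = {kmer}
            members.update(
                nb for nb in one_mismatch_neighbors(kmer)
                if counts.get(nb, 0) >= outer_min
            )
            clouds[kmer] = members
    return clouds


# Hypothetical counts: one abundant k-mer, one related variant, two rare k-mers.
toy_counts = Counter({"ACGT": 5, "ACGA": 2, "TTTT": 1, "ACCT": 1})
print(build_clouds(toy_counts, core_min=5, outer_min=2))
```

Lowering the thresholds pulls more divergent k-mers into a cloud, which is the sensitivity side of the trade-off; it also admits k-mers that are frequent by chance, which is the specificity cost noted above.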

Deep repeatome annotation provides insight into genome evolution

We recently attempted to produce a deep repeatome annotation of the A. thaliana genome by combining alignment-based approaches with P-clouds. This exploratory study showed that the A. thaliana repeatome is significantly larger than observed using conventional approaches, suggesting that a significant amount of the A. thaliana dark matter is of repetitive origin (Figure 3). Remarkably, the deep annotation unveils regions of high repeat content besides those defined by the pericentromeres, especially one peak on chromosome 1 (referred to as the At1R2 region in Figure 3a). A. thaliana has five chromosomes, whereas its common ancestor with the closely related Brassicaceae Capsella rubella and Arabidopsis lyrata, from which it diverged about 5–15 mya, had eight. Ancestral genome reconstruction shows that the C. rubella karyotype [21] is similar to the ancestral one, while the A. thaliana genome is the result of several chromosome fusions. The fusion of two chromosomes generates chimeric structures with more than one cluster of centromeric repeats (centromeres), each accompanied by a significant amount of junk DNA (making up the pericentromeres). While supernumerary clusters of centromeric repeats are expected to be quickly eliminated, the fate of the pericentromeres after chromosome fusion remains to be investigated. Interestingly, by positioning the deep repeatome annotation in the context of karyotype evolution, we established that the peak found at At1R2 corresponds to the position of ancestral pericentromeric repeats. Furthermore, we showed that At1R2 presents a high deletion rate compared to the remainder of the chromosome arms. Together, these results suggest that a component of the At1R2 repeatome is composed of ancient pericentromeric repeats that are still disappearing [22].
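Repeat density landscapes of the kind discussed here are typically obtained by computing the percentage of each genomic window that is covered by annotations. The following is a minimal sketch under stated assumptions: annotations are given as 0-based, half-open `(start, end)` intervals on a single chromosome, windows do not overlap, and the function names and window size are hypothetical choices for illustration rather than the settings used in the cited studies.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def window_coverage(intervals, chrom_length, window):
    """Percent of each non-overlapping window covered by annotations."""
    n_windows = -(-chrom_length // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in merge_intervals(intervals):
        start, end = max(0, start), min(end, chrom_length)
        pos = start
        while pos < end:
            idx = pos // window
            stop = min(end, (idx + 1) * window)
            covered[idx] += stop - pos
            pos = stop
    percentages = []
    for idx, bases in enumerate(covered):
        # The last window may be shorter than the others.
        size = min(window, chrom_length - idx * window)
        percentages.append(100.0 * bases / size)
    return percentages


# Toy annotations on a 300-bp 'chromosome', summarized in 100-bp windows.
toy_annotations = [(0, 50), (25, 75), (150, 250)]
print(window_coverage(toy_annotations, chrom_length=300, window=100))
```

Merging intervals first matters when regular and deep annotations overlap: without it, bases annotated by both methods would be counted twice and coverage could exceed 100%.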
In addition, we reported that the repeated and repeat-derived elements detected by P-clouds but not by alignment-based methods are heterogeneously distributed along at least two A. thaliana chromosomes (Figure 4) [8]. Indeed, the additional annotations contributed by P-clouds appear to be enriched on one side only of the pericentromeric space. We hypothesize that this asymmetry reflects a dynamic evolution of the pericentromeric structure, span and composition.

Altogether, these recent works allow a better understanding of the origin of some genomic dark matter in plants, the nature and function of which remain largely cryptic. Indeed, part of this genetic material appears to form a continuum with repetitive DNA, suggesting that an additional fraction that remains beyond detection is probably of similar origin (albeit more ancient). This suggests that, beyond their detectable fraction, TEs have played a major role in the evolution of plant genome size and composition, and probably in the emergence of genes and regulatory elements as well, thereby renewing our perception of the possible impacts of repetitive elements on plant genome evolution. In addition, deep repeatome annotation can provide relevant clues regarding the evolutionary dynamics of genomes and chromosomes. In particular, it provides evidence that raises questions about the dynamics of pericentromeres in the context of karyotype and chromosome evolution.

Figure 3. Deep repeatome view of genome evolution. (a) Distribution of regular and deep repeatome annotations along A. thaliana chromosome 1 (x-axis: A. thaliana chromosome 1, 30.4 Mb; y-axis: genome coverage, 0–100%; labelled regions: At1L1, At1L2, At1cen, At1R1, At1R2, At1R3). The repeat density landscape can be arbitrarily sliced into six contrasted contiguous regions that are positioned as colored intervals. (b) Regular (gray) and deep (dark gray) repeatome annotations on the outmost track are positioned along A. thaliana chromosome 1 (At_1, right) and C. rubella chromosomes 1 (Crub_1, top left) and 2 (Crub_2, bottom left). The intervals and their color code defined in (a) are indicated on the inner band. The central ribbons connect orthologous genes identified by best reciprocal hits and synteny conservation. Ribbons are bold when deep repeatome density in C. rubella exceeds 50%.

Figure 4. Pericentromere asymmetry. Distribution of deep repeatome (blue) and P-cloud-specific annotations (gray) along A. thaliana chromosome 5 (x-axis: A. thaliana chromosome 5, 27 Mb; y-axis: genome coverage, 0–100%; series: P-clouds only, Deep annotation).

Determining the ‘age’ of repeats

How can the age of repeats be evaluated? Repeats that are highly fragmented and only detectable using non-conventional approaches are presumably very old. For the repeat annotations obtained with regular methods, we recently used the identity between copies and consensus sequences as a proxy for age: as an aging copy accumulates mutations, it increasingly diverges from the native sequence. This method offers a first step towards a comprehensive assessment of repeat age in plant genomes [7]. Such estimations have long been restricted to full-length long terminal repeat (LTR) retrotransposons, which have a pair of identical LTRs upon insertion, each accumulating mutations independently thereafter [23]. This widely applied approach, although sound and useful, is highly restrictive because it only deals with LTR retrotransposons. Furthermore, it can only be applied to full-length LTR elements, which are in theory evolutionarily young. We therefore argue for the relevance of the ‘divergence to consensus’ approach, which permits a wide and homogeneous assessment of insertion time, as applied and validated in mammals [24]. This approach still needs to be optimized in order to infer ages in units of time from divergence data. Efforts are needed to collect and analyze data on the mutational spectra of repeated elements so as to define substitution models that take into account, for instance, the mutational bias towards G:C→A:T transitions through deamination, which appears to be a predominant type of mutation in TEs [25]. The reconstruction and use of a consensus sequence that resembles native elements is another crucial aspect for the success and accuracy of this approach; recent studies suggest that this step can be improved and propose some solutions [26,27].
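To make the two dating strategies concrete, the sketch below converts an observed identity into an approximate age. It rests on several assumptions that are not taken from the studies cited here: the Jukes-Cantor (JC69) correction is used as a deliberately simple substitution model (it ignores the G:C→A:T bias discussed above), the substitution rate is a placeholder value, and the function names are hypothetical. For a copy compared to its consensus, divergence accumulates along one lineage, so T = K / r; for a pair of LTRs, both diverge from the same starting sequence, so T = K / (2r).

```python
import math


def jukes_cantor_distance(identity):
    """JC69-corrected substitutions per site from a fractional identity (0-1)."""
    p = 1.0 - identity  # observed proportion of differing sites
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is undefined for divergence >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def age_from_consensus(identity, rate_per_site_per_year):
    """Copy vs. consensus: divergence accumulates along a single lineage."""
    return jukes_cantor_distance(identity) / rate_per_site_per_year


def age_from_ltr_pair(identity, rate_per_site_per_year):
    """5' LTR vs. 3' LTR: both sequences diverge, hence the factor of 2."""
    return jukes_cantor_distance(identity) / (2.0 * rate_per_site_per_year)


# Placeholder rate (substitutions per site per year): an assumption for
# illustration, not a calibrated value for any particular species.
ASSUMED_RATE = 7e-9

for identity in (0.99, 0.90, 0.75):
    age = age_from_consensus(identity, ASSUMED_RATE)
    print(f"identity {identity:.2f} -> about {age / 1e6:.1f} million years")
```

The absolute ages depend entirely on the assumed rate and model, which is why the text argues for substitution models fitted to the mutational spectra of repeats; the relative ordering of copies by divergence is more robust than the dates themselves.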

The impact and function of old repeats

Using a series of alignment-based approaches, we recently determined that the majority of the A. thaliana repeatome is ancient, that is, it was acquired vertically from a Brassicaceae ancestor [7], reinforcing the idea that this species constitutes a good model for the search and analysis of very ancient repeats. Measuring the identity between copies and their respective consensus sequences enabled the estimation of relative insertion times and thus the discrimination of young versus old copies. Remarkably, we observed that ancient TEs of different families have a low G+C content compared to more recent copies. This result is consistent with prolonged periods of G:C→A:T mutational bias, probably sustained by the persistent DNA methylation of repetitive elements and the consequent, progressive deamination of methylcytosines. Indeed, we determined that the DNA methylation of repeated sequences can last over prolonged periods (i.e. millions of years) and that methylation of ancient elements appears to be supported by a rich pool of 24-nt sRNAs [7]. These silencing RNAs therefore diverge together with the repetitive elements, which are under genetic drift, and thus become increasingly diversified as well. Such a diversified pool of sRNAs could serve as sentinels protecting against the emergence of new invasive TEs.

At the chromosome level, recent insertions are virtually restricted to the A. thaliana pericentromeres, while chromosome arms contain mainly ancient repeats [7]. Remarkably, this heterogeneous distribution is mirrored by a heterogeneous G+C content of the repeatome along each of the five A. thaliana chromosomes: the repeatome is G+C poor in chromosome arms compared to the fraction distributed in pericentromeric regions. Altogether, this analysis established that the ancient proliferation of repeat families has long-term consequences on plant biology and genome composition [7].
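The comparison of G+C content between young and old copies can be sketched in a few lines. This is an illustrative example only: the identity threshold used to separate young from old copies, the function names and the toy sequences are assumptions, not the parameters or data of the cited analysis.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def mean_gc_by_age(copies, identity_threshold):
    """Mean G+C of copies split by identity to their consensus.

    `copies` is an iterable of (sequence, identity) pairs; copies at or
    above the threshold are labelled 'young', the others 'old'.
    """
    groups = {"young": [], "old": []}
    for sequence, identity in copies:
        label = "young" if identity >= identity_threshold else "old"
        groups[label].append(gc_content(sequence))
    return {
        label: (sum(values) / len(values) if values else None)
        for label, values in groups.items()
    }


# Toy copies: (sequence, identity to consensus). Values are made up.
toy_copies = [("GGCC", 0.98), ("GCAT", 0.95), ("ATAT", 0.70), ("ATGT", 0.72)]
print(mean_gc_by_age(toy_copies, identity_threshold=0.9))
```

In a real analysis the same summary could be computed per TE family, since families start from different base compositions and a pooled comparison could confound family identity with age.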
These results are highly significant in the context of the identification of epigenome-associated quantitative trait loci and translational research, and will help in addressing the epigenetic impact of TEs on plant adaptation and domestication from an evolutionary perspective. It was known that epigenetic states can be transmitted vertically, but this work suggests that this transmission is frequent and can last for millions of years. This step forward will foster the study of the conservation of DNA methylation in specific regions of plant genomes over evolutionary timescales, supported for instance by paleogenomic approaches that enable the identification of intergenic regions conserved across evolution. In conclusion, the analysis of repetitive elements in the context of genome and epigenome evolution promises to reveal unexpected facets of plant biology.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Ohno S: So much “junk” DNA in our genome. Brookhaven Symp Biol 1972, 23.

2. Britten RJ, Kohne DE: Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 1968, 161:529-540.

3. Orgel LE, Crick FHC: Selfish DNA: the ultimate parasite. Nature 1980, 284:604-607.

4. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA et al.: The B73 maize genome: complexity, diversity, and dynamics. Science 2009, 326:1112-1115.

5. Elliott TA, Gregory TR: What’s in a genome? The C-value enigma and the evolution of eukaryotic genome content. Philos Trans R Soc Lond B Biol Sci 2015, 370.
This article investigates the correlation between a collection of genomic features and genome size in several species and shows that TE content is positively correlated with C-value.

6. Charlesworth B, Charlesworth D: The population dynamics of transposable elements. Genet Res 1983, 42:1-27.

7. Maumus F, Quesneville H: Ancestral repeats have shaped epigenome and genome composition for millions of years in Arabidopsis thaliana. Nat Commun 2014, 5:4104.
This work shows that the ancient proliferation of repeats has profoundly impacted the A. thaliana genome and epigenome.

8. Maumus F, Quesneville H: Deep investigation of Arabidopsis thaliana junk DNA reveals a continuum between repetitive elements and genomic dark matter. PLoS One 2014, 9:e94101.
In this article, the authors show that a combination of approaches enables deep repeatome annotation and that the A. thaliana repeatome is significantly greater than previously estimated.

9. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(Suppl 1):i351-i358.

10. Flutre T, Duprat E, Feuillet C, Quesneville H: Considering transposable element diversification in de novo annotation approaches. PLoS One 2011, 6:e16526.

11. Oyama RK, Clauss MJ, Formanova N, Kroymann J, Schmid KJ, Vogel H, Weniger K, Windsor AJ, Mitchell-Olds T: The shrunken genome of Arabidopsis thaliana. Plant Syst Evol 2008, 273:257-271.


12. Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H et al.: The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet 2011, 43:476-481.

13. Buisine N, Quesneville H, Colot V: Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 2008, 91:467-475.

14. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 2008, 133:523-536.

15. Matzke M, Kanno T, Daxinger L, Huettel B, Matzke AJ: RNA-mediated chromatin-based silencing in plants. Curr Opin Cell Biol 2009, 21:367-376.

16. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. http://www.repeatmasker.org 1996–2010.

17. Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 2005, 1:166-175.

18. Hoen DR, Hickey G, Bourque G, Casacuberta J, Cordaux R, Feschotte C, Fiston-Lavier AS, Hua-Van A, Hubley R, Kapusta A et al.: A call for benchmarking transposable element annotation methods. Mob DNA 2015, 6:13.
In this paper, a consortium of experts argues for the establishment of benchmarks in order to allow the standard assessment of the sensitivity and specificity of annotations of repeated elements in genomes.

19. Gu W, Castoe TA, Hedges DJ, Batzer MA, Pollock DD: Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem 2008, 380:77-83.

20. de Koning AP, Gu W, Castoe TA, Batzer MA, Pollock DD: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7:e1002384.

21. Slotte T, Hazzouri KM, Agren JA, Koenig D, Maumus F, Guo YL, Steige K, Platts AE, Escobar JS, Newman LK et al.: The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat Genet 2013, 45:831-835.

22. Murat F, Louis A, Maumus F, Armero A, Cooke R, Quesneville H, Crollius HR, Salse J: Understanding Brassicaceae evolution through ancestral genome reconstruction. Genome Biol 2015, 16:1-17.
This work provides an enhanced resolution of the ancestral Brassicaceae genome and shows that following the fusion of two ancestral chromosomes, a region corresponding to an ancestral pericentromere overlaps with increased repeat density in a modern chromosome and presents high deletion and recombination rates in A. thaliana.

23. SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL: The paleontology of intergene retrotransposons of maize. Nat Genet 1998, 20:43-45.

24. Giordano J, Ge Y, Gelfand Y, Abrusan G, Benson G, Warburton PE: Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS Comput Biol 2007, 3:e137.

25. Ossowski S, Schneeberger K, Lucas-Lledo JI, Warthmann N, Clark RM, Shaw RG, Weigel D, Lynch M: The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 2010, 327:92-94.

26. Le Rouzic A, Payen T, Hua-Van A: Reconstructing the evolutionary history of transposable elements. Genome Biol Evol 2013, 5:77-86.

27. Wacholder AC, Cox C, Meyer TJ, Ruggiero RP, Vemulapalli V, Damert A, Carbone L, Pollock DD: Inference of transposable element ancestry. PLoS Genet 2014, 10:e1004482.
This work used a Bayesian method for inferring TE ancestry and indicates that the number of TE subfamilies is frequently underestimated, thereby introducing biases in the analysis of mutations in TE copies.
