Research

Protein domain evolution is associated with reproductive diversification and adaptive radiation in the genus Eucalyptus

Anna R. Kersting1,2*, Eshchar Mizrachi3*, Erich Bornberg-Bauer1 and Alexander A. Myburg3

1Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany; 2Bioinformatics Group, Institute for Computer Science, Heinrich-Heine-University, Duesseldorf, Germany; 3Department of Genetics, Forestry and Agricultural Biotechnology Institute (FABI), Genomics Research Institute, University of Pretoria, Private Bag X20, Pretoria 0028, South Africa

Author for correspondence: Alexander A. Myburg
Tel: +27 12 420 4945
Email: [email protected]

Received: 15 October 2014
Accepted: 4 November 2014

New Phytologist (2014) doi: 10.1111/nph.13211

Key words: Eucalyptus, expansion, flower, gene expression, protein domains, root, rosids.

Summary

• Eucalyptus is a pivotal genus within the rosid order Myrtales with distinct geographic history and adaptations. Comparative analysis of protein domain evolution in the newly sequenced Eucalyptus grandis genome and other rosid lineages sheds light on the adaptive mechanisms integral to the success of this genus of woody perennials.
• We reconstructed the ancestral domain content to elucidate the gain, loss and expansion of protein domains and domain arrangements in Eucalyptus in the context of rosid phylogeny. We used functional gene ontology (GO) annotation of genes to investigate the possible biological and evolutionary consequences of protein domain expansion.
• We found that protein modulation within the angiosperms occurred primarily on the level of expansion of certain domains and arrangements. Using RNA-Seq data from E. grandis, we showed that domain expansions have contributed to tissue-specific expression of tandemly duplicated genes.
• Our results indicate that tandem duplication of genes, a key feature of the Eucalyptus genome, has played an important role in the expansion of domains, particularly in proteins related to the specialization of reproduction and biotic and abiotic interactions affecting root and floral biology, and that tissue-specific expression of proteins with expanded domains has facilitated subfunctionalization in domain families.

Introduction

The genus Eucalyptus represents an extremely successful group of woody plants with a distinct evolutionary history in relation to other plants with sequenced genomes. Native to Australia and adjacent islands to the north, the 894 described Eucalyptus taxa (Pryor & Johnson, 1981; Slee et al., 2006) are primarily the result of rapid radiation during the Mid-Cenozoic (25–10 million yr ago (Ma)), a period that was associated with a shift to cooler, drier and more seasonal climates after the complete tectonic isolation of the Australian mainland (Crisp et al., 2004). In the process, Eucalyptus species diversified and adapted to a unique and wide range of climates and ecological niches, in particular poor, infertile soils and fire-dominated biomes. Evolutionary innovations such as post-fire (epicormic) resprouting probably coincided with the earlier Gondwanan origin of the genus > 50 Ma (Crisp et al., 2011) and probably predisposed it for successful adaptation to changing conditions on the Australian continent. Eucalyptus thus represents a group of plants that have been geographically isolated for > 25 million yr from most other plants for which genomes have been sequenced. This presents a valuable opportunity to study the

*These authors contributed equally to this work. © 2014 University of Pretoria. New Phytologist © 2014 New Phytologist Trust

molecular basis of specific evolutionary strategies that have made eucalypts such a successful group of woody plants. In the context of plant evolution, Eucalyptus represents a rosid order, Myrtales (Zhu et al., 2007), which is informative for comparative genomics studies in the rosids and Eudicots. The E. grandis genome contains a number of unique features including a very high rate of tandem duplication, evidence of an early, lineage-specific whole-genome duplication (WGD) event c. 110 Ma (Myburg et al., 2014) coinciding with the crown of the Myrtales lineage (reviewed in Grattapaglia et al., 2012) and subsequent expansion of gene families such as terpene synthases and those involved in phenylpropanoid metabolism linked to the complex secondary metabolism of eucalypts (Moore & Foley, 2005; Keszei et al., 2010; Kulheim et al., 2011; V. Carocha et al., unpublished; C. Kulheim et al., unpublished). Their great ecological success suggests potent pathogen defense and antiherbivory mechanisms, many of which may represent novel evolutionary innovations responding to diverse biotic interactions encountered on the Australian continent. Generally, analyses in newly sequenced genomes are carried out in terms of expansion or contraction of gene families and by determining Ka : Ks values (the ratio of nonsynonymous, presumably adaptive, to synonymous, silent, substitution rates) in gene families. However, protein-coding genes also evolve in a modular fashion and the


arrangement of domains, that is, the linear sequence of domains in proteins, can be substantially changed over time (Moore et al., 2008; Moore & Bornberg-Bauer, 2012; Kersting et al., 2012). This aspect of genome evolution is under-studied compared with sequence divergence-based comparisons. Protein domains are structural units, which, because of their function, are evolutionarily conserved. A core set of domains is universal and can be found in essentially all organisms (Levitt, 2009). Other domains evolved specifically within clades. New domains can arise by further evolution of existing domains, by shifting open reading frames, or transcription of former noncoding sequences (Karev et al., 2002; Bornberg-Bauer & Alba, 2013). Furthermore, albeit very rarely, domains can be gained in lineages by horizontal gene transfer from other species (Werren et al., 2010). The sequencing of the E. grandis genome provides an opportunity for testing the impact of evolution in proteome modularity (i.e. gain, loss or expansion of domains and domain arrangements) on specific biological processes within major rosid plant lineages. High-quality reference genomes are available for comparative studies, including those of two major lineages in the Eurosids (Arabidopsis thaliana (The Arabidopsis Initiative, 2000) and Populus trichocarpa (Tuskan et al., 2006)) and a basal rosid (Vitis vinifera (Jaillon et al., 2007)). Moreover, the asterid Solanum tuberosum (The Potato Genome Sequencing Consortium, 2011), as well as genomes in major outgroups representing the monocot, angiosperm and embryophyte lineages allow the reconstruction of proteome modularity for ancestral nodes leading to the rosid branch. In this study, we aimed to characterize the gain, loss and expansion of domains and domain arrangements in the rosid lineage leading to E.
grandis in the context of other sequenced rosid genomes, and evaluated the impact that proteome modulation has had on protein functional diversification in this lineage. In addition, we searched for common and unique biological processes related to this modulation, and evaluated the expression of genes encoding modulated proteins in different tissues and organs of E. grandis. We address three main questions. First, what are the relative frequency and types of proteome modulation events reflected in the genome of E. grandis? Secondly, given the high amount of tandem gene duplication in the E. grandis genome (Myburg et al., 2014), what role has tandem duplication played compared with genome-wide duplication in the modulation of the E. grandis proteome? Finally, which biological processes are associated with protein domain evolution in E. grandis? As evolution of the proteome at the domain level provides the most fundamental indication of functional adaptation, this study sets the stage for understanding the contribution of proteome evolution to the ecological and environmental adaptation of the genus and wider lineage of Myrtales that can be explored in future comparative genomics studies.

Materials and Methods

Domain annotation

Protein sequences of nine species (Chlamydomonas reinhardtii (Merchant et al., 2007), Physcomitrella patens (Rensing et al.,

2008), Oryza sativa (Yu et al., 2002), Brachypodium distachyon (Vogel et al., 2010), S. tuberosum, Arabidopsis thaliana, Populus trichocarpa, Eucalyptus grandis and Vitis vinifera) were downloaded from Phytozome v.8 (www.phytozome.net). For genes with several splice variants, the longest variant was retained for further analyses. The sequences were screened with HMMSCAN of the HMMER3 (Finn et al., 2010) package against the Pfam database (Punta et al., 2012) with the PfamA suggested gathering threshold. For PfamB domains, an e-value of 0.001 was used as threshold. Domain hits had to cover more than 0.3 of the HMM model length (Buljan et al., 2010). Overlapping domains were resolved by preferring the manually curated PfamA over the unannotated PfamB domains and retaining domains with lower e-values (annotation for E. grandis in Supporting Information Table S1).

Calculating gain, loss and expansion

For tree traversing, the ETE2 package (Huerta-Cepas et al., 2010) was included in a customized Python script. A maximum parsimony approach in a strictly bifurcating tree was used for reconstruction of the ancestral domain content (Kersting et al., 2012). The domain contents of two sister nodes were compared from the leaves towards the root. If the domain was present in both nodes, the domain status at the parent node was set to ‘present’. If a domain was absent in both sister nodes, the domain status at the parent node was set to ‘absent’. If a domain was present in one node, but absent in the sister node, the domain status at the parent node was set to ‘unknown’. The unknown domain status of that parent node was compared with the domain status of its sister node. The domain status of their parent node was set to ‘present’ if the domain was present in the sister, to ‘absent’ if the domain was absent in the sister, and to ‘unknown’ if the domain status was unknown in both sister nodes. This followed a strict majority rule. At the root, all unknown states were set to present.
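The reconstruction rules described here can be illustrated with a short sketch. It is not the authors' script (which used the ETE2 package); it is a minimal stand-alone version that assumes a strictly bifurcating tree written as nested tuples, and it also includes the root-to-leaves pass that the Methods go on to describe. The tree encoding, the node names and the function names `combine` and `reconstruct` are hypothetical choices made for this illustration.

```python
# Minimal sketch of the two-pass parsimony reconstruction for ONE domain.
# Assumption: internal nodes are tuples (name, left, right); leaves are strings.
PRESENT, ABSENT, UNKNOWN = "present", "absent", "unknown"


def combine(a, b):
    """State of a parent node, given the states of its two child nodes."""
    if a == b:
        return a  # both present, both absent, or both unknown
    if a == UNKNOWN:
        return b  # the sister with a known state decides
    if b == UNKNOWN:
        return a
    return UNKNOWN  # present in one child, absent in the other


def reconstruct(tree, leaf_has_domain):
    """Return {node name: state} for every leaf and internal node."""
    states = {}

    def name_of(node):
        return node if isinstance(node, str) else node[0]

    def up(node):  # first pass: leaves -> root
        if isinstance(node, str):
            states[node] = PRESENT if leaf_has_domain[node] else ABSENT
        else:
            _, left, right = node
            states[node[0]] = combine(up(left), up(right))
        return states[name_of(node)]

    def down(node, parent_state):  # second pass: root -> leaves
        if states[name_of(node)] == UNKNOWN:
            states[name_of(node)] = parent_state
        if not isinstance(node, str):
            down(node[1], states[node[0]])
            down(node[2], states[node[0]])

    if up(tree) == UNKNOWN:
        states[name_of(tree)] = PRESENT  # unknown at the root is set to present
    down(tree, states[name_of(tree)])
    return states


# Toy tree with four hypothetical species A-D.
toy_tree = ("root", ("n1", "A", "B"), ("n2", "C", "D"))
gain_in_a = reconstruct(toy_tree, {"A": True, "B": False, "C": False, "D": False})
loss_in_b = reconstruct(toy_tree, {"A": True, "B": False, "C": True, "D": True})
```

In the first toy case the domain is inferred as absent at `n1` and at the root, so it would be counted as a gain on the branch to A. In the second it is inferred as present at `n1`, so it would be counted as a loss on the branch to B.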
A subsequent traversal was performed from the root to the leaves: when a node had an unknown domain status, it adopted the present/absent state of its parent node. Domain arrangements were defined as the (N- to C-terminal) order of PfamA domains on the predicted amino acid sequence. Repeated domains which belong to the type ‘motif’ or ‘repeat’ were merged into one domain in an arrangement. The ancestral domain arrangements were calculated in the same way as for domains. Gain and loss of domains were analyzed by the comparison of the domain content of a child node with that of its parent node. Domain and arrangement counts at the ancestral nodes were calculated with the program COUNT (Csuros, 2010), which uses Wagner parsimony. Domains and arrangements found at two-fold higher frequency in a child node compared with its parent node were considered as expanded in the child node. BLAST2GO (Conesa & Gotz, 2008) was used to associate gene ontology (GO) terms (Ashburner & Lewis, 2002) with proteins, which were used for BLAST similarity searches against the nonredundant protein database NCBI-NR. Overrepresentation of GO terms in a data set was analyzed with TOPGO of the BIOCONDUCTOR package with false discovery rate (FDR) correction

(Alexa et al., 2006) as well as REVIGO (Supek et al., 2011) for data visualization. Overrepresentation of sequences in a sample set was evaluated with a hypergeometric test in R (R Development Core Team, 2008). Classification of mechanism of gene duplication (whole-genome, segmental or tandem) was based on supplementary table 19 in Myburg et al. (2014).

Expression of proteins containing expanded domains

RNA-Seq transcript abundance data for seven vegetative and reproductive tissues (shoot tip, young leaf, mature leaf, flower buds, root, phloem and immature xylem), collected as biological replicates from three E. grandis trees (c. 20 million PE50 reads per biological replicate; http://eucgenie.org), were normalized by percentage of average fragments per kilobase of exon per million fragments mapped (FPKM) of each gene per tissue (Trapnell et al., 2010). The normalized expression data were clustered for co-expressed genes by the K-means algorithm based on Euclidean distance with CLUSTER (Eisen et al., 1998). The gene expression clusters for all genes were compared with those of genes with expanded domains to evaluate whether these genes display different expression patterns. Statistical significance was tested with a hypergeometric distribution in R (R Development Core Team, 2008).
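The normalization and clustering steps can be sketched as follows. This is an illustration only, not the CLUSTER software used in the study. It assumes one reading of the normalization: replicate FPKM values are averaged per tissue, and each tissue mean is then expressed as a percentage of the gene's summed tissue means. The function names, the two-tissue toy data and the fixed starting centroids are all hypothetical.

```python
import math


def relative_profile(fpkm_replicates):
    """Convert replicate FPKM values per tissue into a percentage profile for one gene.

    Assumption: replicates are averaged per tissue, and each tissue mean is
    expressed as a percentage of the gene's summed tissue means.
    """
    means = {tissue: sum(vals) / len(vals) for tissue, vals in fpkm_replicates.items()}
    total = sum(means.values())
    if total == 0:
        return {tissue: 0.0 for tissue in means}  # gene not expressed anywhere
    return {tissue: 100.0 * m / total for tissue, m in means.items()}


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's K-means with Euclidean distance and fixed starting centroids."""

    def nearest(p):
        # Index of the centroid closest to point p.
        return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p)].append(p)
        # Move each centroid to the mean of its members; empty clusters stay in place.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return [nearest(p) for p in points], centroids


# Toy data: one gene with three replicates per tissue, then three (flower, root) profiles.
profile = relative_profile({"flower": [30.0, 30.0, 30.0], "root": [10.0, 10.0, 10.0]})
labels, centers = kmeans(
    [(90.0, 10.0), (80.0, 20.0), (10.0, 90.0)],
    [(100.0, 0.0), (0.0, 100.0)],
)
```

Expressing each gene as a percentage profile puts highly and weakly expressed genes on the same scale, so that Euclidean distance groups genes by where they are expressed rather than by how strongly.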

Results

Proteome modulation in Eucalyptus

We investigated gain, loss and expansion of domains and domain arrangements along the evolution of eight plant genomes with a focus on four rosid lineages. To improve the reconstruction of the ancestral domain content, we included one basal embryophyte (P. patens), two monocot species (O. sativa and B. distachyon) and an asterid (S. tuberosum) in addition to the four rosids. For a deep rooting of the tree, the genome of the alga C. reinhardtii was included as an outgroup (Fig. 1). In particular, we asked what the main drivers were of protein domain evolution in rosid lineages represented by E. grandis, P. trichocarpa, A. thaliana and V. vinifera compared with earlier land plant evolutionary nodes. Indeed, high gain-to-loss ratios of domains and domain arrangements were seen at the base of the tree (Fig. 1; P. patens and ‘Angiosperms’), and generally domain expansion was highly correlated (r = 0.99) with domain arrangement expansion throughout the tree. Some of this effect is attributable to the fact that expanded single-domain proteins were counted as expanded domains and as expanded arrangements (Kersting et al., 2012). We found 89 gained domains, 140 lost domains and 441 domains that have been expanded in E. grandis more than twofold compared with the ancestral node (Table S2). A similar number of domain arrangements (471) were expanded in the E. grandis genome. Domain arrangement gain events (392) were more frequent than domain gain events in E. grandis (89, Fig. 1). These results are consistent with previous findings, which show that gain of new domains in the Viridiplantae is positively correlated with branch length, and that speciation is more likely to be associated with the addition of new domain arrangements within specific lineages (Kersting et al., 2012). This is, however, the first reported overview of the extent of domain and arrangement expansions, illustrating the important role that this mechanism may have played in land plant evolution.

The role of tandem duplications in proteome modulation

The E. grandis genome sequence contains a large number (> 12 500) and proportion (over 34%; Myburg et al., 2014) of genes in tandem duplicate arrays. Comparison to other major rosid lineages suggests that the rate of tandem duplication in the Eucalyptus lineage has been constant since the divergence of the lineage (Myrtales), but three to five times higher than in other rosid lineages. Tandem duplication and selective loss or tandem

Fig. 1 Gain, loss and expansion of domains and domain arrangements across major land plant lineages. Nine sequenced plant genomes were analyzed to represent major lineages and divergence events. Gain is shown in green, loss in red and expansion numbers in blue. Expansions were defined as domains or arrangements that occur at two-fold higher frequency in the child node compared with its parent node. Values above the line represent domain events and those below the line arrangement events. Gain and loss were calculated with a maximum parsimony approach, expansion was calculated with Wagner parsimony (Csuros, 2010). In this study we focused (light blue shading) on the branch leading from Basal Rosids (represented by Vitis vinifera) to Eucalyptus grandis. Whole-genome duplication (WGD) events are marked by a star.
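The event definitions in this caption can be made concrete with a small sketch. It covers only the final comparison between a child node and its parent, and it assumes that copy numbers per domain (or arrangement) are already available for both nodes; in the study those ancestral values came from the parsimony reconstructions. The function name, the placeholder identifiers `domA` to `domD` and the choice to treat an exact doubling as not expanded are assumptions made for illustration.

```python
def classify_events(parent_counts, child_counts, fold=2.0):
    """Return (gained, lost, expanded) sets of domain or arrangement identifiers.

    Assumption: counts are copy numbers per node, and "expanded" means strictly
    more than `fold` times the parent count.
    """
    parent_present = {d for d, n in parent_counts.items() if n > 0}
    child_present = {d for d, n in child_counts.items() if n > 0}
    gained = child_present - parent_present  # absent in parent, present in child
    lost = parent_present - child_present    # present in parent, absent in child
    expanded = {
        d for d in parent_present & child_present
        if child_counts[d] > fold * parent_counts[d]
    }
    return gained, lost, expanded


# Hypothetical copy numbers at a parent node and at one of its child nodes.
parent = {"domA": 3, "domB": 2, "domC": 4}
child = {"domA": 7, "domB": 4, "domD": 1}
gained, lost, expanded = classify_events(parent, child)
```

Here `domD` is gained, `domC` is lost and only `domA` is expanded, because `domB` has exactly doubled and the sketch requires more than a two-fold increase.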


duplicate retention may therefore have played an important role in protein domain and arrangement expansion in eucalypts. To investigate this, we asked whether proteins containing an expanded domain or expanded domain arrangement were overrepresented in retained tandem duplicated loci. Indeed, we found that proteins harboring an expanded domain were significantly overrepresented in tandem duplicated loci (Fisher exact test, P < 2.2e-16), but not in genes retained from segmental or WGD events. For recent segmentally duplicated genes as well as for the genes retained after the hexaduplication event, no overrepresented GO terms could be found.

The impact of domain modulation on Eucalyptus biology

To better understand how protein domain modulation could be related to adaptation and diversification of the rosids and particularly Eucalyptus, we investigated the overrepresentation of ontology terms in proteins containing gained or expanded domains and arrangements, at three major transitions in dicot evolution (angiosperm to dicot, dicot to rosid and basal rosid to E. grandis; Fig. 1). In each case, predicted protein annotations were compared with a universe consisting of annotations from all preceding nodes. The results were compared to identify the most significant GO terms in the categories of ‘domain gain’,

‘domain arrangement gain’, ‘domain expansions’ and ‘domain arrangement expansions’ at these three nodes (Table S3). For comparison at similar nodes to Eucalyptus, an analogous analysis was carried out for proteins in the domain and arrangement expansion categories in three other rosid lineages, represented by V. vinifera (basal rosid), P. trichocarpa or A. thaliana (Table S4). We found that domain expansion in E. grandis mainly reflects innate immune responses (biotic, including microbial and herbivory responses), protein signaling and regulation (phosphorylation) mechanisms, and recognition of pollen (Fig. 2). All rosid representatives contained domain expansions related to response to biotic and/or abiotic stress. Although ‘recognition of pollen’ was enriched in domain expansions in the ‘dicot’ to ‘rosid’ transition, E. grandis was the only rosid species to display further enrichment in this GO category; V. vinifera, P. trichocarpa and A. thaliana did not. To better time the domain expansion events, we broke down the long branch leading to E. grandis by dividing the genes containing expanded domains into duplicated genes that arose from the hexaduplication event (114 genes; c. 130–150 Ma; Myburg et al., 2014), duplicated genes that arose from the WGD (1877 genes; c. 110 Ma), recent segmentally duplicated genes (62), and recent tandemly duplicated genes (3559). The most significant ontology overrepresentations for genes retained after the

Fig. 2 Biological functions associated with protein domain expansion in Eucalyptus grandis. Overrepresented ontologies (P < 0.05, FDR corrected) were visualized using the REVIGO tool (http://revigo.irb.hr/). Block size is scaled to log10 P-values, and specific P-values are shown for some highly significant terms. Block colors group related ontologies, and groups are named according to the most significant term in the group. The most overrepresented biological functions were innate immune responses, transmembrane signaling, protein phosphorylation, signal transduction and pollen recognition.


WGD included transcription, and response to osmotic, hormone and carbohydrate stimuli (Table S5). The tandemly duplicated genes (consistent duplications and retention over the past c. 50 million yr) were overrepresented in GO terms relating to detection and response to biotic stressors, as well as several flowering-related ontologies. Genes retained after the WGD are overrepresented in GO terms related to sugar and glucuronoxylan metabolism (Table S5). Importantly, we found that domain expansion is not an artifact of gene expansion, as we found 1663 single genes and 3793 duplicated genes containing expanded domains. One hundred and fifty-one expanded domains are only found in single genes, which were overrepresented in homeostasis-related GO terms such as cell growth, cell cycle phase and cell morphogenesis. Duplicated genes with expanded domains showed overrepresented ontologies relating to secondary metabolism and defense response (Table S5). Next, we investigated the expression status of genes harboring domains that were greater than two-fold expanded in the E. grandis genome using expression data from seven tissues of E. grandis (www.eucgenie.org). Overall, there were 5457 genes in the category ‘domain expansion in E. grandis’, and 4632 genes in the category ‘domain arrangement expansion in E. grandis’ (genes with more than two-fold expanded domains and arrangements, respectively; Table S6). For both of these groups, c. 91% of genes had evidence of expression (FPKM > 1 in at least one tissue, averaged over three biological replicates) in at least one of seven

Fig. 3 K-means clustering of expression in seven tissues/organs of Eucalyptus grandis. Genes were clustered based on relative expression in the different tissues in 30 bins (expression profiles). Bright red boxes represent high expression of the genes in a tissue. Black boxes mean very low or no expression in these tissues. The proportion of all genes (black bars) or genes with expanded domains (red bars) in each expression bin is shown on the right. Significant differences between the two gene groups are displayed (hypergeometric distribution: *, P < 0.005).
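The per-cluster comparison summarized in this caption can be sketched with a hypergeometric upper-tail probability. The study ran its tests in R; the version below is a pure-Python illustration that assumes a one-sided test for enrichment. The gene names, cluster assignments and function names are hypothetical, and exact binomial coefficients are practical only for small examples (a statistics library would be the usual choice at genome scale).

```python
from collections import Counter
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes, K of them marked."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def cluster_enrichment(cluster_of, marked):
    """Map each cluster id to the P-value for enrichment of marked genes.

    cluster_of: dict gene -> cluster id (all clustered genes in the genome)
    marked:     set of genes carrying an expanded domain
    """
    N = len(cluster_of)
    K = sum(1 for gene in cluster_of if gene in marked)
    sizes, hits = Counter(), Counter()
    for gene, cluster in cluster_of.items():
        sizes[cluster] += 1
        if gene in marked:
            hits[cluster] += 1
    return {c: hypergeom_upper_tail(hits[c], N, K, sizes[c]) for c in sizes}


# Toy genome of 10 genes: cluster 1 holds g0-g2, cluster 2 holds g3-g9.
toy_clusters = {f"g{i}": (1 if i < 3 else 2) for i in range(10)}
toy_marked = {"g0", "g1", "g7", "g8"}
p_values = cluster_enrichment(toy_clusters, toy_marked)
```

For cluster 1 the sketch asks how likely it is to see at least two marked genes among three when four of the ten genes are marked, which gives 40/120 = 1/3.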


tissues in E. grandis, a comparable proportion to that of all genes in the genome (92%). We carried out K-means clustering of expression values for all genes in the genome in the seven different tissues and organs (K = 30; Fig. 3). The frequency of genes harboring expanded protein domains was compared with that of all genes in the genome in the context of these expression clusters. Indeed, the main clusters where there was a relative increase of genes with expanded domains were clusters 8, 10, 12, 13, 14, 18 and 22 (P < 0.005), where gene expression was almost exclusively in primary tissues, and especially in flowers. Conversely, those clusters including the largest proportion of the genes in the genome (3, 5, 9, 17 and 21) included genes with fairly ubiquitous expression across the different tissues. Thus, a subset of the genes with expanded domains (c. 10% or 400 genes) exhibited higher specificity of expression, most commonly applicable to flower-specific expression, with some additional members specifically expressed in phloem (cluster 1; P < 0.005), shoot tips (clusters 15 and 18; P < 0.005) and roots (clusters 28 and 30; P < 0.005).

Biological functions of genes with expanded domains

Expanded domains were overrepresented in E. grandis genes related to stress response and reproduction (Fig. 2); moreover, they showed preferential expression in flowers (Fig. 3, genes in clusters 8, 10, 12–15, 18 and 22; P < 0.005). For example, genes


with fully sequenced genomes (e.g. two additional events in A. thaliana, and one additional event in P. trichocarpa; Fig. 1) (Vanneste et al., 2014). Duplication events are often followed by sub- and neofunctionalization, where the two paralogous genes share the ancestral function, or one of the genes adopts a new function (Walsh, 2003). The evolutionary flexibility afforded by duplication events may account for higher overall rates of domain rearrangement observed subsequent to such events. In the case of E. grandis, besides the last ancient WGD, which occurred close to the appearance of the Myrtales, soon after the split from the other rosid lineages (110 Ma; Myburg et al., 2014), tandem duplication has been the dominant source of retained duplicated genes (34% of genes) compared with other species (e.g. A. thaliana and rice, 10% of genes) (Rizzon et al., 2006). The tandemly duplicated genes are mostly recent and connected to the species radiation within the Eucalyptus genus (Myburg et al., 2014). We found that, in the E. grandis genome, expanded domains were significantly overrepresented in tandem duplicated genes, but not in genes remaining from WGD or in segmentally duplicated genes. As a large proportion of the domain expansions happened along the whole branch leading to E. grandis (they can also be found in single genes), the expansion events may have played a role in the recent history and evolution within the eucalypts. Retained tandem duplicated genes generally differ from genes retained from WGD events in function and in the size of networks in which they are involved (Kliebenstein, 2008). Segmentally duplicated genes that are retained are often involved in tightly linked interaction networks. Based on the gene balance hypothesis, after duplication of networks either all duplicated genes have to be retained or none are, otherwise the disrupted balance of the network can have deleterious effects (Birchler & Veitia, 2010).
By contrast, retained tandem duplicate genes are often not linked in larger networks (Freeling, 2009). Our finding of overrepresentation of expanded domains in stress response proteins is consistent with previous reports of tandemly duplicated genes in A. thaliana (Kliebenstein, 2008), but the large expansion in flower development and root-related genes appears to be a unique feature of the rosid lineage represented by E. grandis. Basing functional conclusions on GO terms alone requires caution, as GO annotation in plants is mostly based on sequence similarity and on gene functions described in A. thaliana or O. sativa. Therefore, we support these claims by demonstrating that many of these genes with expanded domains display flower- and/or root-specific expression. Whether this will apply across other species within Eucalyptus or is specific to E. grandis and/or its close relatives can be tested as more Eucalyptus species genomes are sequenced. In general, genes containing expanded domains in E. grandis were expressed in a more tissue/organ-specific manner, rather than ubiquitously. From annotation and expression profiling we conclude that many of the genes with expanded domains play a role in biotic and abiotic stress responses in flowers and roots, as well as flower biology/development. Many of these genes are located in tandem arrays, with genes having similar expression specificity (e.g. Auxin response factor (ARF) domain- and ABA-WDS


domain-containing genes). This suggests that some genes retained after tandem duplication maintain their expression specificity, which could lead to an increase of the gene product. It should be noted, however, that expression levels of tandem duplicates varied on a gene-by-gene basis, and further investigation involving additional expression conditions and phylogenetic analysis would be required to determine which of these genes are undergoing neo- or subfunctionalization, or are in the process of pseudogenization. The Jacalin domain (PF01419) could be an example that fits the subfunctionalization after duplication hypothesis, meaning the child genes have divergent expression patterns from the ancestral gene (Table S6). The functional significance of such expanded domains and domain arrangements identified in this study will have to be elucidated in future studies focusing on individual genes and particularly domains of unknown function that may have contributed to Eucalyptus adaptation and evolution.

Conclusions

In this investigation, we have highlighted the role of proteome modulation in E. grandis and other rosid species, in particular domain and domain arrangement expansions in these lineages, and the importance of tandem duplication as a mechanism for this process. In E. grandis, these mechanisms are reflective of its genome evolution as well as adaptation and extensive species radiation, pertaining mainly to pest, pathogen and abiotic stress responses in roots and flowers, and flower biology/development. A future genome-wide comparison of different species within the genus Eucalyptus and relatives in the Myrtales would shed light on the biological and adaptive importance of these protein domain modulations.

Acknowledgements

This work was supported through the Bioinformatics and Functional Genomics Programme of the National Research Foundation (NRF) and the Department of Science and Technology (DST) of South Africa. E.B.B. acknowledges funding to A.R.K. from DFG project BO1445/4-1.

References

Alexa A, Rahnenführer J, Lengauer T. 2006. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22: 1600–1607.
Ashburner M, Lewis S. 2002. On ontologies for biologists: the gene ontology – untangling the web. Novartis Foundation Symposium 247: 66–80 (discussion 80–83, 84–90, 244–252).
Birchler JA, Veitia RA. 2010. The gene balance hypothesis: implications for gene regulation, quantitative traits and evolution. New Phytologist 186: 54–62.
Bornberg-Bauer E, Alba MM. 2013. Dynamics and adaptive benefits of modular protein evolution. Current Opinion in Structural Biology 23: 459–466.
Buljan M, Frankish A, Bateman A. 2010. Quantifying the mechanisms of domain gain in animal proteins. Genome Biology 11: R74.
Chen W, Yu XH, Zhang K, Shi J, De Oliveira S, Schreiber L, Shanklin J, Zhang D. 2011. Male sterile2 encodes a plastid-localized fatty acyl carrier protein reductase required for pollen exine development in Arabidopsis. Plant Physiology 157: 842–853.

Conesa A, Gotz S. 2008. Blast2GO: a comprehensive suite for functional analysis in plant genomics. International Journal of Plant Genomics 2008: 619832.
Crisp M, Cook L, Steane D. 2004. Radiation of the Australian flora: what can comparisons of molecular phylogenies across multiple taxa tell us about the evolution of diversity in present-day communities? Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 359: 1551–1571.
Crisp MD, Burrows GE, Cook LG, Thornhill AH, Bowman DM. 2011. Flammable biomes dominated by eucalypts originated at the Cretaceous-Paleogene boundary. Nature Communications 2: 193.
Csuros M. 2010. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26: 1910–1912.
Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA 95: 14863–14868.
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al. 2010. The Pfam protein families database. Nucleic Acids Research 38: D211–D222.
Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual Review of Plant Biology 60: 433–453.
Grattapaglia D, Vaillancourt RE, Shepherd M, Thumma BR, Foley W, Külheim C, Potts BM, Myburg AA. 2012. Progress in Myrtaceae genetics and genomics: Eucalyptus as the pivotal genus. Tree Genetics & Genomes 8: 463–508.
Guelette BS, Benning UF, Hoffmann-Benning S. 2012. Identification of lipids and lipid-binding proteins in phloem exudates from Arabidopsis thaliana. Journal of Experimental Botany 63: 3603–3616.
Huerta-Cepas J, Dopazo J, Gabaldon T. 2010. ETE: a Python environment for tree exploration. BMC Bioinformatics 11: 24.
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubinet C et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463–467.
Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV. 2002. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evolutionary Biology 2: 18.
Kersting AR, Bornberg-Bauer E, Moore AD, Grath S. 2012. Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biology and Evolution 4: 316–329.
Keszei A, Brubaker CL, Carter R, Kollner T, Degenhardt J, Foley WJ. 2010. Functional and evolutionary relationships between terpene synthases from Australian Myrtaceae. Phytochemistry 71: 844–852.
Kliebenstein DJ. 2008. A role for gene duplication and natural variation of gene expression in the evolution of metabolism. PLoS ONE 3: e1838.
Kobe B, Deisenhofer J. 1994. The leucine-rich repeat: a versatile binding motif. Trends in Biochemical Sciences 19: 415–421.
Külheim C, Yeoh SH, Wallis IR, Laffan S, Moran GF, Foley WJ. 2011. The molecular basis of quantitative variation in foliar secondary metabolites in Eucalyptus globulus. New Phytologist 191: 1041–1053.
Levitt M. 2009. Nature of the protein universe. Proceedings of the National Academy of Sciences, USA 106: 11079–11084.
van Loon LC, van Strien EA. 1999. The families of pathogenesis-related proteins, their activities, and comparative analysis of PR-1 type proteins. Physiological and Molecular Plant Pathology 55: 85–97.
Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Marechal-Drouard L et al. 2007. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318: 245–250.
Michard E, Lima PT, Borges F, Silva AC, Portes MT, Carvalho JE, Gilliham M, Liu LH, Obermeyer G, Feijo JA. 2011. Glutamate receptor-like genes form Ca2+ channels in pollen tubes and are regulated by pistil D-serine. Science 332: 434–437.
Moore AD, Bjorklund AK, Ekman D, Bornberg-Bauer E, Elofsson A. 2008. Arrangements in the modular evolution of proteins. Trends in Biochemical Sciences 33: 444–451.

Moore BD, Foley WJ. 2005. Tree use by koalas in a chemically complex landscape. Nature 435: 488–490.
Moore AD, Bornberg-Bauer E. 2012. The dynamics and evolutionary potential of domain loss and emergence. Molecular Biology and Evolution 29: 787–796.
Myburg AA, Grattapaglia D, Tuskan GA, Hellsten U, Hayes RD, Grimwood J, Jenkins J, Lindquist E, Tice H, Bauer D et al. 2014. The genome of Eucalyptus grandis. Nature 510: 356–362.
Padmanabhan V, Dias DM, Newton RJ. 1997. Expression analysis of a gene family in loblolly pine (Pinus taeda L.) induced by water deficit stress. Plant Molecular Biology 35: 801–807.
Pryor LD, Johnson LAS. 1981. Eucalyptus, the universal Australian. In: Keast A, ed. Ecological biogeography of Australia. The Hague, the Netherlands: W. Junk, 499–536.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J et al. 2012. The Pfam protein families database. Nucleic Acids Research 40(Database issue): D290–D301.
R Development Core Team. 2008. R: a language and environment for statistical computing. Vienna, Austria: The R Foundation.
Radauer C, Lackner P, Breiteneder H. 2008. The Bet v 1 fold: an ancient, versatile scaffold for binding of large, hydrophobic ligands. BMC Evolutionary Biology 8: 286.
Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud P-F, Lindquist EA, Kamisugi Y et al. 2008. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319: 64–69.
Rizzon C, Ponger L, Gaut BS. 2006. Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice. PLoS Computational Biology 2: e115.
Slee AV, Connors J, Brooker MIH, Duffy SM, West JG. 2006. EUCLID eucalypts of Australia, 3rd edn, CD ROM. Melbourne, Vic., Australia: Centre for Plant Biodiversity Research, CSIRO Publishing.
Sudisha J, Sharathchandra RG, Amruthesh KN, Kumar A, Shetty HS. 2012. Pathogenesis related proteins in plant defense response. In: Merillon JM, Ramawat KG, eds. Plant defence: biological control. Dordrecht, the Netherlands: Springer, 379–403.
Supek F, Bosnjak M, Skunca N, Smuc T. 2011. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE 6: e21800.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.
The Potato Genome Sequencing Consortium. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475: 189–195.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28: 511–515.
Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604.
Van de Peer Y, Maere S, Meyer A. 2009. The evolutionary significance of ancient genome duplications. Nature Reviews Genetics 10: 725–732.
Vanneste K, Baele G, Maere S, Van de Peer Y. 2014. Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. Genome Research 24: 1334–1347.
Vogel JP, Garvin D, Mockler TC, Schmutz J, Rokhsar D, Bevan M. 2010. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463: 763–768.
Walsh B. 2003. Population-genetic models of the fates of duplicate genes. Genetica 118: 279–294.
Werren JH, Richards S, Desjardins CA, Niehuis O, Gadau J, Colbourne JK, The Nasonia Genome Working Group. 2010. Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science 327: 343–348.
Wilson ZA, Morroll SM, Dawson J, Swarup R, Tighe PJ. 2001. The Arabidopsis male sterility1 (MS1) gene is a transcriptional regulator of male gametogenesis, with homology to the PHD-finger family of transcription factors. Plant Journal 28: 27–39.

Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp.). Science 296: 92–100.
Zhu XY, Chase MW, Qiu YL, Kong HZ, Dilcher DL, Li JH, Chen ZD. 2007. Mitochondrial matR sequences help to resolve deep phylogenetic relationships in rosids. BMC Evolutionary Biology 7: 217.

Supporting Information

Additional supporting information may be found in the online version of this article.

Table S1 Domain annotation of Eucalyptus grandis
Table S2 List of gained, lost and expanded domains and arrangements
Table S3 Overrepresented GO terms for Eucalyptus proteins, which contain gained or expanded domains and arrangements
Table S4 Overrepresented GO terms of proteins with expanded domains in Vitis vinifera, Populus trichocarpa and Arabidopsis thaliana
Table S5 Overrepresented GO terms for duplicated and single Eucalyptus proteins, which contain expanded domains
Table S6 Proteins with expanded domains, their relative expression between the tissues and their clusters (Fig. 3)

Please note: Wiley Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.
