
Accepted Article

Pseudogenization in pathogenic fungi with different host plants and lifestyles might reflect their evolutionary past.

Ate van der Burgt1,2, Mansoor Karimi1, Ali H. Bahkali3 and Pierre J.G.M. de Wit1,§

1 Laboratory of Phytopathology, Wageningen University & Research Centre, P.O. Box 16, 6700 AA Wageningen, The Netherlands

2 Applied Bioinformatics, Plant Research International, Wageningen University & Research Centre, P.O. Box 16, 6700 AA Wageningen, The Netherlands

3 King Saud University, Riyadh, Saudi Arabia

§ Corresponding author

Running title: Pseudogenes in fungi

Keywords: pseudogene; sequence error; truncated gene; adaptation; fungal genome; Cladosporium fulvum.

Corresponding author: Pierre J.G.M. de Wit Laboratory of Phytopathology, Wageningen University, P.O. Box 8025, 6700 EE Wageningen, The Netherlands [email protected]

Word count breakdown:
Summary: 233
Experimental procedures: 646
Introduction: 652
Acknowledgements: 69
Results: 2,060
Table & Figure legends: 606

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1111/mpp.12072 This article is protected by copyright. All rights reserved.


Discussion: 2,653
Total: 6,919

Summary

Pseudogenes are genes with significant homology to functional genes that contain disruptive mutations (DMs) leading to the production of non- or partially functional proteins. Little is known about pseudogenization in pathogenic fungi with different lifestyles. Here we report on the identification of DMs causing pseudogenes in the genomes of the fungal plant pathogens Botrytis cinerea, Cladosporium fulvum, Dothistroma septosporum, Mycosphaerella fijiensis, Verticillium dahliae and Zymoseptoria tritici. In these fungi we identified 1740 gene models containing 2795 DMs using an alignment-based gene prediction method. The contribution of sequencing errors to DMs was minimized by analyses of resequenced genomes to obtain a refined data set of 924 gene models containing 1666 true DMs. The frequency of pseudogenes varied from 1 to 5% in the gene catalogues of these fungi, being highest in the asexually reproducing fungi C. fulvum (4.9%), followed by D. septosporum (2.4%) and V. dahliae (2.1%). The majority of pseudogenes do not represent recent gene duplications, but members of multi-gene families and unitary genes. In general, there was no bias for pseudogenization of specific genes in the six fungi. The only exceptions are genes encoding secreted proteins, including proteases, which appeared more frequently pseudogenized in C. fulvum than in D. septosporum. Most pseudogenes present in these two phylogenetically closely related fungi are not shared, suggesting that they are related to adaptation to a different host (tomato versus pine) and lifestyle (biotroph versus hemibiotroph).


Introduction

Pseudogenes show homology to functional genes but contain disruptive mutations (DMs) leading to non- or partially functional proteins (Yang et al., 2011). A pseudogenization event caused by a single DM can result in a premature stop codon, frameshift, defective splice junction or distortion of regulatory sequences required for transcription (Yang et al., 2011; Zhang et al., 2010). Similarly, a transposon insertion dramatically alters gene continuity, but also represents a single DM event leading to pseudogenization. Most eukaryotic pseudogenes are disabled copies of duplicated parental genes (Gerstein et al., 2007); the majority will eventually disappear, while some will evolve new functions and might become fixed in an organism (Lynch & Conery, 2000). Unitary pseudogenes are single copy genes that may become non-functional through loss-of-function (LOF) variation caused by various types of mutations (Balasubramanian et al., 2011; Zhang et al., 2010). A residual biological function might develop for genes encoding multi-domain proteins that have lost only one or a few of their functional domains. However, when a lost domain in a unitary pseudogene is essential and is not compensated for by another protein, the LOF variant will affect the performance of an organism (Zhang et al., 2010). LOF variants and unitary pseudogenes have been reported to cause several inheritable human diseases (Zhang et al., 2010). In some cases, however, an organism might also profit from pseudogenization, as is the case for pathogens and commensals that need to adapt to and co-evolve with their hosts.
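The two DM types that act directly on the reading frame, frameshifts and premature stop codons, can be illustrated with a minimal Python sketch. It is an illustration only and not the method used in this study: the function names and the toy sequences are hypothetical, the reference sequence is assumed to be an intact coding sequence, and splice-site defects, regulatory mutations and transposon insertions are not considered.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def first_inframe_stop(cds: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop, or None."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def classify_dm(reference_cds: str, observed_cds: str) -> str:
    """Classify a candidate coding sequence against an intact reference.

    Simplifying assumption: any net length change that is not a multiple
    of three is called a frameshift; compensating indels are ignored.
    """
    if (len(observed_cds) - len(reference_cds)) % 3 != 0:
        return "frameshift"
    stop = first_inframe_stop(observed_cds)
    last_codon = len(observed_cds) // 3 - 1
    if stop is not None and stop < last_codon:
        return "premature stop"
    return "none detected"


# Hypothetical toy sequences: ATG CAA GGT TGG TAA
reference = "ATGCAAGGTTGGTAA"
print(classify_dm(reference, "ATGTAAGGTTGGTAA"))  # CAA -> TAA substitution
print(classify_dm(reference, "ATGCAGGTTGGTAA"))   # single-base deletion
print(classify_dm(reference, "ATGCAATGGTAA"))     # in-frame 3-nt deletion
```

In this sketch an in-frame (3n) deletion is not flagged, which mirrors the distinction between frame-preserving and frame-disrupting indels.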

When only a few DMs are present, pseudogenes still bear all the hallmarks of a protein-encoding gene, and ab initio gene prediction software will likely predict gene models at these loci, even when splice sites are absent or a premature stop is present. Therefore, DMs will often cause erroneous gene model predictions. This is also true for sequence errors (SEs) in genomic sequences, which introduce in-frame stops through erroneous base calling or distort reading frames through insertions or deletions (indels). Thus, SEs can cause incorrect assignment of DMs and, consequently, of pseudogenes (Balasubramanian et al., 2011).

The extent of pseudogenization in (plant pathogenic) fungi has not yet been studied at a whole-genome scale, although numerous reports describe individual genes that were subjected to pseudogenization. Selection pressure imposed on plant pathogenic fungi by plant disease resistance genes has led to rapid development of pseudogenes of which the parental genes encode effectors that are recognized by matching resistance gene-encoded receptor-like proteins (Westerink et al., 2004; Stergiopoulos et al., 2007a). Repeat-induced point mutations (RIP) can cause pseudogenization by introducing premature stop codons through C to T and G to A transitions. RIP occurs in sexually active fungi mainly belonging to the Ascomycetes, where it was first discovered in Neurospora crassa (Galagan & Selker, 2004). Genes directly adjacent to repeats are at risk of being pseudogenized when RIP activity extends slightly beyond the repeat locus boundaries. This has been shown in the oilseed rape pathogen Leptosphaeria maculans, where pseudogenization of the AvrLm1 effector gene is caused by RIP (Gout et al., 2007).
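To make concrete how C to T and G to A transitions can introduce premature stop codons, the following Python sketch enumerates the sense codons of the standard genetic code that become a stop codon after a single such transition. It is a hedged illustration: the function name is hypothetical, and the sequence-context preferences of RIP are deliberately ignored.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}       # standard genetic code
RIP_TRANSITIONS = {"C": "T", "G": "A"}    # transitions described for RIP


def rip_stop_routes() -> dict:
    """Map each sense codon to the stop codons reachable by one C>T or G>A change."""
    routes = {}
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue  # only sense codons can gain a premature stop
        for pos, base in enumerate(codon):
            if base not in RIP_TRANSITIONS:
                continue
            mutated = codon[:pos] + RIP_TRANSITIONS[base] + codon[pos + 1:]
            if mutated in STOP_CODONS:
                routes.setdefault(codon, []).append(mutated)
    return routes


for codon, stops in sorted(rip_stop_routes().items()):
    print(codon, "->", ", ".join(stops))
```

Under these assumptions only four sense codons are at risk: CAA and CAG (glutamine), CGA (arginine) and TGG (tryptophan), the last of which can give either TAG or TGA.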

Here we report on the identification of DMs causing pseudogenes in the fungal plant pathogens Botrytis cinerea, Cladosporium fulvum, Dothistroma septosporum, Mycosphaerella fijiensis, Verticillium dahliae and Zymoseptoria tritici. From these six fungi we have identified many DMs using an alignment-based fungal gene prediction method (van der Burgt et al., submitted). The frequency of pseudogenes was highest in the gene catalogues of the phylogenetically related fungi C. fulvum (4.9%) and D. septosporum (2.4%). There was no clear bias for pseudogenization of specific genes in these two fungi, except for those encoding secreted proteins, including proteases, and genes involved in production of secondary metabolites like dothistromin. The biotrophic tomato pathogen C. fulvum shares many genes with the hemi-biotrophic pine pathogen D. septosporum, but the gene set affected by pseudogenization in the two fungi is not shared. A possible role of pseudogenization, and eventually gene loss, in adaptation to different hosts and lifestyles is discussed.

Results

The genomes of C. fulvum and D. septosporum have recently been released (de Wit et al., 2012). The Alignment-Based Fungal Gene Prediction (ABFGP) method (van der Burgt et al., submitted) was applied to six fungal genomes in order to identify disruptive mutations (DMs) that would cause pseudogenization. Gene models predicted by ABFGP represent exons which are chained by both introns and DMs. The ABFGP method recognized DMs in genes that resulted in frameshifts (non-3n indels) or would lead to an in-frame stop codon when compared with homologous informant genes from several different fungi lacking the DMs. In multiple protein sequence alignments, the DMs are recognized as an extension of conservation (i) throughout annotated introns, (ii) upstream of annotated start codons or (iii) downstream of annotated stop codons (Figure 1). In all cases, high sequence similarity is shared with corresponding exonic parts of informant genes. Predicted DMs coincided predominantly with incorrectly predicted introns (Figure 1a,1c) and truncated predicted proteins (Figure 1a,1b), and rarely with a single gene split into two gene models (Figure 1c).

Genes with predicted disruptive mutations


Around 8,000 predicted gene models for each of the six selected fungi were assessed by ABFGP (van der Burgt et al., submitted) using informant genes from up to 28 different fungal species (Table S1, see Supporting Information). Biological properties and genome statistics of the six fungi belonging to the class of Ascomycetes are shown in Table 1. From this data set, we retrieved the gene models with predicted DMs, resulting in a subset of 1713 genes (ranging from 68 to 567 affected genes per species) containing 2762 DMs in total for the six fungal species. The number of SEs occurring in sequenced genomes is expected to be inversely related to genome coverage. This renders the prediction of DMs in Z. tritici, V. dahliae, M. fijiensis and B. cinerea (in decreasing order of genome coverage) less reliable than in the genomes of C. fulvum and D. septosporum, which have been sequenced using next-generation sequencing techniques at 21-fold and 34-fold coverage, respectively (de Wit et al., 2012). For the genomes with low coverage, DMs that could not be confirmed by resequencing (or by sequencing related isolates) were scored as incorrect and removed from the DM data set (Method S1, see Supporting Information). This accounted for 39 (34%), 453 (54%), 105 (46%) and 363 (72%) SEs in the four fungi, which indeed correlates with sequence coverage. A 100-nt window surrounding a predicted DM in C. fulvum was inspected in the genome assembly for coverage, correct base calling and presence of polypyrimidine tracts. No indication of sequence errors was observed (data not shown). This refinement yielded a final set of 1662 presumed true DMs in 924 genes, which were used throughout this analysis (Table 2). The predicted ancestral protein products of the 924 genes are provided in Datafile S1 (see Supporting Information).

As DMs recognized by ABFGP are located in exons of their functional homologues, we conclude that DMs are present in mature mRNAs and not in the introns. For five of the six studied fungi we aligned available unigene data to their genomes to verify whether predicted DMs overlapped with exons or introns (Table S2, see Supporting Information). Many of the identified pseudogenes appeared to be expressed (72% and 74% of the genes from C. fulvum and D. septosporum, respectively). In total, 572 DMs were covered by ESTs, confirming that they occurred in exons. In all cases where a DM overlapped with a predicted intron (like the first deletion in Cf195670 shown in Figure 1), EST data indicated absence of splicing. Only eleven DMs (1.9%) matched introns and have therefore wrongly been predicted as DMs. The latter number reflects the false discovery rate of DMs by ABFGP. Interestingly, three of these eleven wrongly predicted DMs matched alternatively spliced transcripts with intron retention around the DM site.

Although examination of unigene data indicated at least 98% accuracy in assigning DMs by ABFGP, we decided to examine several of them more closely and confirm them experimentally. DMs were not chosen at random; instead, all predicted DMs in a particular class of genes in C. fulvum, namely secreted proteases, were selected. Five protease genes with predicted DMs (Figure 2) were resequenced in the type strain and in six additional isolates of C. fulvum originating from different parts of the world (Table S3 and S4, see Supporting Information). All DMs were confirmed and appeared identical in all seven isolates analyzed: the type strain and two additional isolates each collected in The Netherlands, Cuba and Japan. Seven out of eight DMs coincided with introns predicted by GeneMark-ES (Ter-Hovhannisyan et al., 2008), all of which were in conflict with observed expression data. This suggests that the predicted introns are incorrect and represent DMs. To validate this, cDNA libraries from the sequenced C. fulvum reference strain (CBS131901) grown under different conditions were analyzed (Table S3, see Supporting Information). The results confirmed that, except for Cf189824, all genes were clearly expressed, and under none of the tested growth conditions was support observed for splicing of any of the wrongly predicted introns (data not shown). For the second DM, leading to protein truncation of Cf186241, the cDNA covered the complete ancestral protein, suggesting that the parental gene locus once produced a functional transcript. All genes encode proteins with crucial functional domains interrupted by or downstream of the first encountered DM (Figure 2). Based on these results we conclude that none of them produces an mRNA that can be translated into a functional protease.

Analysis of 1662 disruptive mutations in 924 genes

The 1662 DMs identified in 924 genes could be subcategorized as nucleotide substitutions (46%) and indels (54%) (Figure 3a). Based on the DNA sequences of informant genes, indels were estimated to represent nucleotide deletions (30%) and nucleotide insertions (24%). The frequencies of these subcategories appeared fairly similar for the different species; they varied from 39 to 50% for substitutions (Figure 3b). The point mutations leading to the stop codons TAG, TGA and TAA accounted for 49, 27 and 23% of in-frame stops, respectively (Figure 3c). These frequencies are as expected based on the notion that transitions occur more frequently than transversions and on the observed codon usage in C. fulvum and D. septosporum (Method S2, see Supporting Information). We conclude that the observed types of mutations result from random DNA mutations. Remarkably, only fourteen pseudogene models contained long stretches of N-nucleotides, which might represent repetitive sequence due to inserted transposons, as will be discussed later.

Pseudogenes with DMs are evenly distributed over the genome

If transposon insertion or repeat-induced point mutation (RIP) played a significant role in the creation of pseudogenes, pseudogenes would occur more frequently in the direct vicinity of repeats that might have undergone RIP. Other biased genomic distributions of pseudogenes could point to a preference for specific chromosomes, specific parts of chromosomes or gene clusters.
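A proximity test of this kind can be sketched in a few lines of Python. The sketch below is only an assumption about how such a test might be set up, not the procedure used in this study: the function names, the coordinate layout (scaffold name with start and end positions) and the example coordinates are hypothetical, and scaffold ends are not taken into account.

```python
def gap(a, b):
    """Distance in nt between two (start, end) intervals; 0 if they overlap."""
    return max(0, b[0] - a[1], a[0] - b[1])


def fraction_near_repeats(pseudogenes, repeats, max_gap=1000):
    """Fraction of pseudogenes lying within max_gap nt of any repeat.

    pseudogenes: list of (scaffold, start, end) tuples
    repeats:     dict mapping scaffold -> list of (start, end) tuples
    """
    if not pseudogenes:
        return 0.0
    near = 0
    for scaffold, start, end in pseudogenes:
        on_scaffold = repeats.get(scaffold, [])
        if any(gap((start, end), rep) <= max_gap for rep in on_scaffold):
            near += 1
    return near / len(pseudogenes)


# Hypothetical coordinates, for illustration only
pseudogenes = [("s1", 5000, 6000), ("s1", 20000, 21000), ("s2", 100, 900)]
repeats = {"s1": [(6500, 7000)]}
print(fraction_near_repeats(pseudogenes, repeats))  # one of three is within 1 kb
```

A fraction close to that obtained for all genes would argue against a positional bias of pseudogenes towards repeats.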
Only 105 (11%) of the pseudogenes are located within 1 kb of a repeat or scaffold end (Figure S1, see Supporting Information), and only 32 pseudogenes (3.4%) are located close to repeat areas that have undergone RIP (Figure S2, see Supporting Information). Only 14 pseudogenes embedded a repeat within their coding sequence (Datafile S1, see Supporting Information); these most likely represent genes inactivated by transposon insertion. On average, pseudogenes were 26.3 kb away from repeats, and for the extremely repeat-dense C. fulvum genome (de Wit et al., 2012) the average distance was 14.5 kb. Therefore we conclude that the presence of repeats and RIP activity were of minor importance for the evolution of the pseudogenes that we have studied here.

The pseudogenes not only lacked a positional bias towards repeats; no general trend for chromosome enrichment or for positional enrichment towards other pseudogenes could be observed either. In general, pseudogenes are evenly distributed over the chromosomes of Z. tritici and D. septosporum (Table S5 and S6, see Supporting Information). No enrichment of pseudogenes on the dispensable chromosomes of Z. tritici was observed. We observed a median distance of one pseudogene per 147 kb, with the exception of C. fulvum, where this number was on average one pseudogene per 34.8 kb (Figure S3, see Supporting Information). The observed median and average inter-pseudogene distances indicate that pseudogenes do not tend to cluster together, although occasionally (nearly) adjacent gene pairs were pseudogenized. In C. fulvum and D. septosporum, the species with the most pseudogenes, in total 23 pairs of directly adjacent pseudogenes were observed (Table S7 and S8, see Supporting Information). This is slightly more than would be expected by chance alone (data not shown); therefore all pairs were inspected for membership of a gene cluster. In the pseudogene-rich C. fulvum some clear examples of functionally related, adjacent pseudogenes were found: a quartet of adjacent pseudogenes involved in carbohydrate metabolism (Cf186934-Cf186937) and a triplet encoding a putative chitinase, amino acid transporter and phosphodiesterase/alkaline phosphatase (Cf191135-Cf191137), respectively (Table S8, see Supporting Information).

A bias for pseudogenization of members of multi-gene families and secreted proteins

For each gene and pseudogene, the (global) amino acid similarity to its most similar protein-encoding homologue in the complete protein catalogue was determined (Figure 4a). Additionally, the total number of potential homologues was counted to express membership and size of a multi-gene family. In total 682 pseudogenes, representing 74% of all DM-containing pseudogenes, share 45% to 75% similarity with at least a single homologous, non-pseudogenized gene, which is more than the genomic average. The majority of this class of pseudogenes has more than one homologue (Figure 4b), suggesting that multi-gene families are more frequently affected by pseudogenization. When comparing the multi-gene family size of this class with all multi-gene families, no significant difference, neither increase nor decrease, in gene-family size could be observed. Genes encoding proteins that are less than 45% similar are less affected by pseudogenization (Figure 4a). Based on these findings, we have made an arbitrary distinction between recent gene duplicates (>75% similarity), single copy genes (
