Infection, Genetics and Evolution 32 (2015) 330–337

Contents lists available at ScienceDirect

Infection, Genetics and Evolution journal homepage: www.elsevier.com/locate/meegid

Elucidating evolutionary features and functional implications of orphan genes in Leishmania major Sumit Mukherjee a,b, Arup Panda a, Tapash Chandra Ghosh a,⇑ a b

Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme VII M, Kolkata 700 054, West Bengal, India Department of Physical Sciences, Indian Institute of Science Education and Research-Kolkata, Mohanpur 741246, Nadia, West Bengal, India

a r t i c l e

i n f o

Article history: Received 13 January 2015 Received in revised form 25 March 2015 Accepted 26 March 2015 Available online 2 April 2015 Keywords: Orphan genes Evolutionary rate Protein disorder Interaction and trafficking motifs Host–parasite interaction Lineage-specific adaptation

a b s t r a c t Orphan genes are protein coding genes that lack recognizable homologs in other organisms. These genes were reported to comprise a considerable fraction of coding regions in all sequenced genomes and thought to be allied with organism’s lineage-specific traits. However, their evolutionary persistence and functional significance still remain elusive. Due to lack of homologs with the host genome and for their probable lineage-specific functional roles, orphan gene product of pathogenic protozoan might be considered as the possible therapeutic targets. Leishmania major is an important parasitic protozoan of the genus Leishmania that is associated with the disease cutaneous leishmaniasis. Therefore, evolutionary and functional characterization of orphan genes in this organism may help in understanding the factors prevailing pathogen evolution and parasitic adaptation. In this study, we systematically identified orphan genes of L. major and employed several in silico analyses for understanding their evolutionary and functional attributes. To trace the signatures of molecular evolution, we compared their evolutionary rate with non-orphan genes. In agreement with prior observations, here we noticed that orphan genes evolve at a higher rate as compared to non-orphan genes. Lower sequence conservation of orphan genes was previously attributed solely due to their younger gene age. However, here we observed that together with gene age, a number of genomic (like expression level, GC content, variation in codon usage) and proteomic factors (like protein length, intrinsic disorder content, hydropathicity) could independently modulate their evolutionary rate. We considered the interplay of all these factors and analyzed their relative contribution on protein evolutionary rate by regression analysis. On the functional level, we observed that orphan genes are associated with regulatory, growth factor and transport related processes. Moreover, these genes were found to be enriched with various types of interaction and trafficking motifs, implying their possible involvement in host–parasite interactions. Thus, our comprehensive analysis of L. major orphan genes provided evidence for their extensive roles in host–pathogen interactions and virulence. Ó 2015 Elsevier B.V. All rights reserved.

1. Introduction Orphan genes are protein coding genes that do not share detectable sequence similarity with the genomes of other organisms (Tautz and Domazet-Loso, 2011). Due to their phylogenetic restriction these genes are also called as lineage-specific or taxonomically restricted genes (Wilson et al., 2005). Orphan genes comprise a considerable fraction of genes in all domains of life including Abbreviations: L. major, Leishmania major; BLAST, Basic Local Alignment Search Tool; GRAVY, grand average of hydropathy index; Nc, effective number of codon; CAI, Codon Adaptation Index; FPKM, Fragments Per Kilobase of exon per Million fragments mapped. ⇑ Corresponding author. Tel.: +91 33 2355 6626; fax: +91 33 2355 3886. E-mail address: [email protected] (T.C. Ghosh). http://dx.doi.org/10.1016/j.meegid.2015.03.031 1567-1348/Ó 2015 Elsevier B.V. All rights reserved.

viruses (Khalturin et al., 2009; Wilson et al., 2005; Yin and Fischer, 2008). These genes can be broadly classified into two categories (i) taxon-specific orphan genes (TSOGs) that lack homology outside of a focal taxonomic group and (ii) species-specific orphan genes (SSOGs), a subset of SSOGs sharing no homology with any gene in any other species (Wissler et al., 2013). Several hypotheses have been put forward to explain the origin of orphan genes. For instance, gene duplication and rearrangement processes followed by rapid divergence were considered to be an important pathway for the emergence of orphan genes in primates, Arabidopsis and zebrafish (Donoghue et al., 2011; Toll-Riera et al., 2009; Yang et al., 2013). In primates, it has been found that the majority of orphan genes arise from frequent recruitment of transposable elements (Toll-Riera et al., 2009). Orphan genes may also arise de

S. Mukherjee et al. / Infection, Genetics and Evolution 32 (2015) 330–337

novo from non-coding regions (Cai et al., 2008; Heinen et al., 2009; Knowles and McLysaght, 2009; Neme and Tautz, 2013; Wu et al., 2011; Xie et al., 2012; Yang and Huang, 2011). These genes were also found to emerge from overlapping of anti-sense reading frames and frameshift mutations in protein coding sequences (Wissler et al., 2013). Orphan genes are emerging to play critical roles in lineagespecific adaptation of different species to a broad range of ecological conditions (Khalturin et al., 2009). These genes were reported to play substantial roles in response to a variety of abiotic stresses in plant genomes (Donoghue et al., 2011). Imperative roles of orphan genes were also evidenced in several development processes. For instance, orphan gene products were found to be crucial for human early brain development (Zhang et al., 2011) and also for regulation of tentacle formation in Hydra species (Khalturin et al., 2008). Lineage-specific putative surface antigen of plasmodium were shown to be involved in host–parasite interactions (Kuo and Kissinger, 2008). In 2010, Zhang et al. ectopically expressed 14 Leishmania donovani-specific genes in Leishmania major and observed that two of these genes could increase L. major survival in visceral organs (Zhang and Matlashewski, 2010). Studies conducted on different eukaryotes demonstrated that orphan genes evolve faster than non-orphan genes (Cai et al., 2006; Domazet-Loso and Tautz, 2003; Donoghue et al., 2011; Kuo and Kissinger, 2008; Toll-Riera et al., 2009). An inverse relationship between gene age and protein evolutionary rate has been widely observed in a broad range of organisms including primates (Toll-Riera et al., 2009), mammals (Alba and Castresana, 2005), drosophila (Domazet-Loso and Tautz, 2003), Plasmodium (Kuo and Kissinger, 2008), fungi (Cai et al., 2006) and bacteria (Daubin and Ochman, 2004). Since, orphan genes are younger genes in a particular lineage it was hypothesized that these genes evolve faster mainly due to their recent evolutionary origin (Cai et al., 2006; Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009). Later it was found that protein evolutionary rate could not be determined by a single factor, rather protein’s intrinsic properties as well as their evolutionary age independently modulate the rates of protein evolution (Toll-Riera et al., 2012). Protein evolutionary rate was shown to correlate with a number of gene level and protein level attributes, such as expression level (Drummond et al., 2005; Drummond et al., 2006; Pal et al., 2001), number of protein–protein interactions (Fraser et al., 2002), protein complex number (Chakraborty and Ghosh, 2013), its centrality in the protein interaction network (Hahn and Kern, 2005), protein dispensability (Hirsh and Fraser, 2001), sequence length (Marais and Duret, 2001), Codon Adaptation Index (CAI), effective number of codons (Nc) (Pal et al., 2001; Wall et al., 2005), protein disorder content (Chen et al., 2011; Podder and Ghosh, 2010), etc. In spite of all these findings, factors determining the evolutionary rate of orphan genes are still under debate and the relative contribution of different genomic and proteomic attributes on the evolutionary rates of orphan genes remains elusive. With the availability of high-throughput genomic sequences together with expression data and bioinformatics prediction tools, it has now become easier to identify and characterize orphan genes in different species. L. major is one of the most important protozoan parasites of the genus Leishmania. It is associated with the disease cutaneous leishmaniasis, affecting more than 2 million people throughout the world every year (Ivens et al., 2005). In spite of multiple research endeavors, till date, there is no available vaccine for this disease. Because of their absence in the host genomes orphan gene products in pathogenic protozoan were considered to be possible therapeutic targets (Kuo and Kissinger, 2008). Therefore, profiling orphan genes of L. major from the perception of protein evolutionary rates and comparing them with non-orphan genes along with understanding their functional roles will

331

be helpful to recognize the molecular signature of parasitic adaptation. With this aim we carried out rigorous analysis to understand the functionality of orphan genes and investigated the evolutionary forces affecting orphan gene evolution. To evaluate the attributes of orphan genes in the evolutionary framework we performed a comprehensive analysis comparing orphan genes with the non-orphan genes. In this study our primary objective is to characterize all the possible determinants that may have shaped the evolutionary rate of orphan genes in L. major. One of the main obstacles to such a study is the limitation of required data on orphan genes. Therefore, in this study we consider several genomic and proteomic attributes that could be easily identified from coding sequences and analyzed their relative influence on the evolutionary rate heterogeneity between orphan and non-orphan genes. Confirming earlier observations our study revealed that orphan genes evolve faster than non-orphan genes (Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009). However, in contrary to the suggestions of those studies, here, we found that gene age could account for a fraction of variation of their evolutionary rate. Instead, together with gene age, a number of factors like gene expression, codon bias, genic GC content, protein hydropathicity, protein disorder content and protein length were found to have substantial contribution on the evolutionary rate difference between orphan and non-orphan genes. On functional level, we found that sequences of orphan genes are endowed with host targeting motifs, prenylation motifs, heparin-binding consensus sequences, signal peptides and transmembrane domains, implying their possible roles in host–parasite interactions. Thus, our study on orphan genes of L. major shed light on the factors governing pathogen evolution and reveals their contribution in parasitic adaptations. 2. Materials and methods 2.1. Collection of dataset and gene expression data We retrieved the protein coding sequences of L. major (strain Friedlin) from TriTrypDB version 7.0 (http://tritrypdb.org/tritrypdb/) (Aslett et al., 2010). CDS sequences containing internal stop codons and partial codons were removed using CodonW (http://codonw.sourceforge.net). Signal peptide, transmembrane domain, epitope, paralogs and pathway information’s of all L. major genes were downloaded from TriTrypDB version 7.0. To compute gene expression level, we retrieved high-throughput RNA-seq expression profile data of L. major promastigote stage from the dataset of Rastrojo et al. (2013). We searched for protein domains via InterProScan (Zdobnov and Apweiler, 2001). 2.2. Identification of orphan genes To identify orphan gene models which are restricted to the Leishmania genus, we used a systematic way based on homology search. First, BLASTP followed by TBLASTN filtering approach (E < 10 5 and use of low-complexity filters) was used against NCBI nr databases. Additionally, to further screen for similarity between sequences we employed Position-Specific Iterated BLAST (PSI-BLAST) (Altschul et al., 1997) that can detect weaker homologous relationships that would otherwise be missed by the standard BLAST algorithms. 2.3. Calculation of nucleotide substitution rate The ratio of the rate of non-synonymous substitutions (dN) to the rate of synonymous substitutions (dS) was widely used as an

332

S. Mukherjee et al. / Infection, Genetics and Evolution 32 (2015) 330–337

indicator of selective pressure acting on a protein-coding genes (Kryazhimskiy and Plotkin, 2008). dN/dS values of L. major genes were calculated with respect to their one-to-one orthologous sequences in four other Leishmania species: Leishmania infantum, Leishmania braziliensis, Leishmania mexicana and L. donovani. To calculate dN/dS values, each set of orthologous gene pair was aligned using ClustalW (Larkin et al., 2007). dN/dS values were calculated by Yang and Nielsen method using the PAML package v-4 (Yang and Nielsen, 2000). We averaged the dN/dS values of each gene and represented that as their evolutionary rate (Supplementary dataset). 2.4. Codon usage indices calculation Codon Adaptation Index (CAI) and effective number of codons (Nc) of L. major genes were computed using CodonW (Sharp et al., 1986). For calculating CAI values highly expressed gene set of L. major was prepared based on RNA-seq data of promastigote stage (Rastrojo et al., 2013). Overall genic GC content and an average protein hydropathy index were also computed using CodonW (Sharp et al., 1986).

and SVM) predict same localization of a gene we took that as its subcellular localization. We predicted pathogenic ability of the orphan gene using MP3 server (http://metagenomics.iiserb.ac.in/mp3/index.php) (Gupta et al., 2014). This server predicts pathogenic and virulent proteins from genomic and metagenomic datasets using an integrated SVM-HMM approach.

2.8. Identification of interaction and trafficking motifs We investigated for host-cell targeting motifs RXLXE/D/Q (where X is a neutral or a hydrophobic amino acid residue) that were previously reported for their activity to export Plasmodium falciparum proteins from the intracellular parasites (Bhattacharjee et al., 2012) to the surrounding erythrocytes. We also searched for the presence of consensus sequences XBBXBX, XBBBXXBX and XBBBXXBBBXXBBX (where X is a neutral or hydrophobic amino acid residue and B is a basic amino acid residue) which were implicated in hairpin binding (de Castro Cortes et al., 2012). We searched all of these sequence patterns using in-house Perl scripts.

2.5. Calculation of gene age 2.9. Identification of CAAX prenylation motifs To calculate phylogenetic age of L. major genes we used phylostratigraphic approach (Domazet-Loso and Tautz, 2010). Briefly, according to the significant BLAST hits found in most remote species (as documented in NCBI nr databases) L. major genes were classified into four taxonomic levels: genes shared by only Leishmanial species, genes shared by Trypanosomatidae, genes distributed among basal Eukaryota and genes distributed among all organisms. Genes for which BLAST hits were found only in Leishmania genus (genus restricted orphan genes) were considered as the youngest genes; whereas, genes distributed in prokaryotic species were considered as the oldest genes. 2.6. Prediction of protein intrinsic disorder Protein disorder content was predicted using IUPred algorithm (Dosztanyi et al., 2005). Based on pairwise interaction energy IUPred assigns a score to each amino acid (Dosztanyi et al., 2005). For each protein we calculated the proportion of amino acids with disorder score P0.5 and represented this as its disorder content. 2.7. Prediction of GO term, subcellular localization and pathogenic protein

We searched for the CaaX prenylation motifs (‘C’ is Cysteine, ‘a’ is an aliphatic amino acid, and ‘X’ is any amino acid) in orphan genes using PrePS webserver (http://mendel.imp.univie.ac.at/sat/ PrePS) (Maurer-Stroh and Eisenhaber, 2005). This server classified CaaX motifs as farnesyltransferase (FT), geranylgeranyltransferase I (GGT1) and geranylgeranyltransferase II (GGT2) motifs.

2.10. Statistical analysis For correlation analyses we used non-parametric Spearman’s rank correlation q. Significant differences of variables between orphan and non-orphan genes were evaluated using Mann–Whitney U test following their non-parametric distribution (Kolmogorov–Smirnov test, P < 0.05). Here, P < 0.05 denotes the measure of significance at 95% confidence level. All tests were done using the software SPSS (v-13.0).

3. Results and discussion 3.1. Searching for orphan genes in L. major

For Gene Ontology (GO) annotations of orphan genes we primarily focused on TriTrypDB v 7.0. However, we found only 43 orphan genes have annotated GO terms. Therefore, for the rest of orphan genes in our dataset we predicted GO categories using ProtFun 2.2 webserver (http://www.cbs.dtu.dk/services/ProtFun/) (Jensen et al., 2003). Protfun 2.2 is a homology independent method and predicts protein function based on their physicochemical properties. Therefore, this algorithm was considered to be useful for prediction of protein function even of orphan genes (Yang et al., 2013). Subcellular localization of orphan genes was predicted using two independent web servers: CELLO v.2.5 (http://cello.life.nctu. edu.tw/) (Yu et al., 2004) and SubCellProt (http://www.databases. niper.ac.in/SubCellProt) (Garg et al., 2009). CELLO predicts protein subcellular localization using two-level support vector machine (SVM). While, SubCellProt is based on two machine learning approaches, k Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN). When two of these three approaches (k-NN, PNN

Basic Local Alignment Search Tool (BLAST) is a standalone method to identify orphan genes in any sequenced genome (Tautz and Domazet-Loso, 2011). The number of orphan genes in an organism depends on different filtering procedure during identification steps. Previous studies have used BLASTP and TBLASTN methods to identify orphan genes in various species (Lin et al., 2010; Yang et al., 2013). Following these studies, we used rigorous BLAST searches against NCBI non-redundant (nr) databases to identify orphan genes in L. major genome. To identify matches missed by the BLASTP and tBLASTn searches we performed PSIBLAST search with a cut-off of E-value < 1  10 5. Finally, we identified 881 genus specific orphan genes, corresponding to 10.65% of all L. major protein coding sequences (Supplementary_dataset). According to high-throughput RNA-seq expression data of promastigote stage (Rastrojo et al., 2013), we found gene expression intensity (FPKM value) of 864 (out of 881) orphan genes which indicates that orphan genes are not artifact of genome annotations.

S. Mukherjee et al. / Infection, Genetics and Evolution 32 (2015) 330–337

3.2. Evolutionary rate heterogeneity of orphan and non-orphan genes: effect of gene age To examine whether orphan genes of L. major had come under selective constraints, we computed their evolutionary rate and compared with that of non-orphan genes. Consistent with the previous reports on other genomes (Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009; Wissler et al., 2013), here we found that evolutionary rate of orphan genes is significantly higher as compared to the rest of the genes in L. major (0.55 ± 0.06 vs. 0.24 ± 0.002 for orphan vs. non-orphan genes, P = 1  10 6, Mann–Whitney U test). Previously, it was attributed that phylogenetically conserved genes are associated with basic cellular processes and are functionally more constrained. Whereas, lineage-specific genes are functionally less constrained and could evolve many new functions (Kuo and Kissinger, 2008). Thus, higher sequence divergence of orphan genes was regarded to be due to lesser selection on functional requirement (Toll-Riera et al., 2009). Since, newly evolved genes tend to be under weaker purifying selection earlier studies reported that proteins with accelerated evolutionary rate are younger in gene age (Alba and Castresana, 2005; Cai et al., 2006). Thus, proteins evolutionary origin i.e. their gene age was considered to be a potential determinant of their evolutionary rate (Vishnoi et al., 2010). Orphan genes are evolutionarily younger genes that appeared at some time within a phylogenetic lineage towards an extant species (Toll-Riera et al., 2009). Therefore, their accelerated evolutionary rates were thought to be due to their recent evolutionary origin (Cai et al., 2006; Tautz and Domazet-Loso, 2011). To test this hypothesis we assigned a phyletic age to all the protein coding genes of L. major according to their phylogenetic distribution (i) genes restricted to Leishmania genus (orphan genes), (ii) trypanosomatidae restricted genes, (iii) extensively distributed eukaryotic genes and (iv) genes distributed among all organisms (including virus, bacteria and all other life forms). Thus, we found that evolutionary rate decrease with increasing gene age (Table 1). Consequently, we found a negative correlation between gene age and evolutionary rate (Spearman’s q = 0.519, P = 1  10 6) which suggests that gene age is an important correlate of protein evolutionary rate. To test its independent impact on the evolutionary rate of L. major proteins, we performed linear regression taking gene age as a predictor variable and protein evolutionary as a dependent variable. Results revealed that gene age could account for 12.4% variation of protein evolutionary rate in our dataset (R2 = 0.124, F = 1054.839, P < 1  10 6). Thus, it is apparent that although gene age has a major contribution on the variation of protein evolutionary rate between orphan and non-orphan genes, there could be other evolutionary force(s) behind this variation. 3.3. Protein evolutionary rate: impact of gene level variants Previously, it was shown that genes with higher GC content tend to evolve at a slower rate as compared to lower GC genes (Xia et al., 2009). Thus, GC content was regarded as a strong predictor of protein evolutionary rate. In many species orphan genes

Table 1 Evolutionary rate of L. major genes according to their phylogenetic distribution. Phylogenetic class of L. major genes

dN/dS (Mean ± SE)

P-values

Genes restricted to Leishmania genus (orphan genes) Trypanosomatidae restricted genes Extensively distributed eukaryotic genes Genes distributed among all organisms

0.55 ± 0.0068

Nc > CAI > protein hydrophilicity > gene expression level > gene age > protein length. 3.6. Functional attributes of orphan genes Vast amount of orphan genes detected in different taxa with their possible lineage-specific functions motivated us to investigate if orphan genes confer any advantage for lineage-specific adaptation of L. major. We searched for the Gene Ontology (GO) annotations of orphan genes to explore their functional significance. Gene ontology (GO) annotations are currently available only for a limited number of L. major orphan genes (43 genes with annotated GO terms out of 881 orphan genes in L. major) (Table 3). Therefore, to trace the potential functional roles of orphan genes we predicted GO annotations from ProtFun 2.2 server (Jensen

Table 2 Categorical regression to illustrate independent influence of different variables on protein evolutionary rate. Parameter Protein level properties Intrinsic disorder content Protein length Protein hydrophilicity Gene age Gene level properties Expression level (FPKM) CAI GC content Nc

b score

P-values

0.371 0.035 0.237 0.072

Elucidating evolutionary features and functional implications of orphan genes in Leishmania major.

Orphan genes are protein coding genes that lack recognizable homologs in other organisms. These genes were reported to comprise a considerable fractio...
367KB Sizes 1 Downloads 10 Views