Plant Cell Rep DOI 10.1007/s00299-014-1569-8

ORIGINAL PAPER

Large scale in-silico identification and characterization of simple sequence repeats (SSRs) from de novo assembled transcriptome of Catharanthus roseus (L.) G. Don

Santosh Kumar • Niraj Shah • Vanika Garg • Sabhyata Bhatia



Received: 25 September 2013 / Revised: 17 December 2013 / Accepted: 9 January 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract

Key message Transcriptomic data of C. roseus offer ample sequence resources for better insights into gene diversity, including a large resource of genic SSR markers to accelerate genomic studies and breeding in Catharanthus.

Abstract Next-generation sequencing is an efficient system for generating high-throughput complete transcripts/genes and developing molecular markers. We present here the transcriptome sequencing of 26-day-old Catharanthus roseus seedling tissue using the Illumina GAIIx platform, which resulted in a total of 3.37 Gb of nucleotide sequence data comprising 29,964,104 reads that were de novo assembled into 26,581 unigenes. Based on similarity searches, 58 % of the unigenes were annotated, of which 13,580 unique transcripts were assigned 5,016 gene ontology terms. Further, 7,687 of the unigenes were found to have Cluster of Orthologous Group classifications, and 4,006 were assigned to 289 Kyoto Encyclopedia of Genes and Genomes pathways. Also, 5,221 (19.64 %) of the transcripts were distributed to 81 known transcription factor (TF) families. In-silico analysis of the transcriptome resulted in identification of 11,004 SSRs in 26.62 % of the transcripts, from which 2,520 SSR markers were designed, and which exhibited a non-random pattern of distribution. The most abundant were the trinucleotide repeats (AAG/CTT), followed by the dinucleotide repeats (AG/CT). Location-specific analysis of SSRs revealed that SSRs were preferentially associated with the 5′-UTRs, with a predicted role in regulation of gene expression. A PCR validation of a set of 48 primers revealed 97.9 % successful amplification, and 76.6 % of them showed polymorphism across different Catharanthus species as well as accessions of C. roseus. In summary, this study provides insight into seedling development, and resources for novel gene discovery and SSR development for utilization in marker-assisted selective breeding in C. roseus.

Keywords C. roseus (Catharanthus roseus) • Transcriptome • Illumina short read sequence • de novo assembly • Genic SSRs • ESTscan

Communicated by M. Prasad.

S. Kumar, N. Shah contributed equally to this work.

NCBI Accession SRA: SRP022128.

Electronic supplementary material The online version of this article (doi:10.1007/s00299-014-1569-8) contains supplementary material, which is available to authorized users.

S. Kumar • N. Shah • V. Garg • S. Bhatia (✉)
National Institute of Plant Genome Research, Aruna Asaf Ali Marg, PO Box 10531, New Delhi 110067, India
e-mail: [email protected]

Introduction

Catharanthus roseus (Madagascar periwinkle) is a dicotyledonous tropical perennial plant belonging to the Apocynaceae family, an entirely self-pollinated species with high heritability. It is widely known for its pharmacologically important alkaloids in addition to being appreciated as an ornamental plant. The plant exhibits an unsurpassed spectrum of chemodiversity as it produces over 130 terpenoid indole alkaloids (TIAs) in different plant parts like root, shoot and stem (van der Heijden et al. 2004; Shukla et al. 2006). Several of these are used as pharmaceuticals, including the anticancer alkaloids vinblastine and vincristine (produced in leaves) and the antihypertensive alkaloids ajmalicine and serpentine (produced in roots). Moreover, it is also a source of other natural products such as phenolics and triterpenoids (Bayer et al. 2004; Caspi et al. 2010, 2012). Production of such a wide array of complex secondary metabolites is not reported in any other single plant species (van der Heijden et al. 2004; Blasko and Cordell 1990) and therefore, C. roseus serves as a model medicinal plant.

Research in C. roseus has focused on the development of biological resources which yield valuable TIAs in quantities high enough to be harvested in an economically and commercially viable manner. High alkaloid yielding cultivars of C. roseus may be obtained either by (1) metabolic engineering of the TIA pathway in C. roseus or (2) breeding of improved varieties of C. roseus. Efforts at metabolic engineering of the TIA pathway have included overexpression of specific genes and treatments with hormones and elicitors (Canel et al. 1998; Whitmer et al. 2002; Sevon and Oksman-Caldentey 2002). However, despite these efforts, much remains to be done to unravel the genetic and regulatory mechanisms controlling the TIA pathway. Meanwhile, alternative approaches such as molecular breeding may be used to complement the efforts of enhancing the in vivo TIA yields in C. roseus. Toward this, efforts have been initiated in the recent decade for the generation of genomic resources such as ESTs (Murata et al. 2006; Shukla et al. 2006), molecular markers, especially SSRs (Shokeen et al. 2005, 2007, 2011; Mishra et al. 2011), genetic linkage maps (Shokeen et al. 2011; Chaudhary et al. 2011) and QTLs related to TIA yield (Sharma et al. 2012).

Simple sequence repeats (SSRs) or microsatellites, comprising 1–6 bp long repeat motifs, occur ubiquitously in plant genomes (Gur-Arie et al. 2000; Toth et al. 2000) and serve as one of the most efficient classes of genetic markers (Morgante et al. 2002).
Microsatellites are classified according to the type of repeat sequence as perfect, imperfect, interrupted or composite, and exhibit length variation as a result of replication slippage and unequal recombination (Richards and Sutherland 1992; Schlotterer and Tautz 1992). Though they are known to occur throughout the genome, their distribution differs between various genomic fractions. Microsatellites are preferentially associated with non-repetitive DNA in plant genomes, i.e. they frequently occur within and near genes (Morgante et al. 2002), and certain classes of repeats may be predominant in certain genomic locations (Hancock 1999; Toth et al. 2000). In transcribed regions, according to available information in plant databases, UTRs harbor more SSRs than the coding regions (Metzgar et al. 2000; Morgante et al. 2002). This may be due to different selective pressures acting on 5′ and 3′ untranslated regions (UTRs) and open reading frames (ORFs) of transcription units. Microsatellite frequency at the 3′ UTR region is higher than that expected for the whole genome, with tri- and tetranucleotides contributing markedly to this increase. However, the 5′ UTR region shows a much higher microsatellite frequency than other genomic fractions, and this is due to the presence of di- and trinucleotides, principally AG/CT and AAG/CTT repeats.

Availability of expressed sequence tag (EST) databases and Next-Generation Sequencing (NGS) technologies for transcriptome sequencing has greatly facilitated development of genic SSR markers and functional annotation by in-silico analysis. Development of genic SSR markers by this method is now preferred (Temnykh et al. 2001), as they are cost-effective, need less time for development and are more informative in comparison to genomic SSRs (Zane et al. 2002). The genic SSR markers not only help in molecular mapping but also provide an opportunity for gene discovery since they show linkages with traits of interest (Thiel et al. 2003) and have higher rates of transferability across species (Portis et al. 2007; Varshney et al. 2005b). However, for any successful application in plant breeding, SSRs are required in large numbers. In C. roseus, generation of SSRs has been limited to 423 genomic SSRs (Shokeen et al. 2005, 2007, 2011) and 170 EST-based SSRs (Mishra et al. 2011).

Recently, SSR generation received a boost with the advent of NGS technologies that provide new strategies to analyze the functional complexity of transcriptomes (Cloonan et al. 2008; Haas and Zody 2010; Brenner et al. 2000). RNA-Seq is a recently developed high-throughput sequencing method that uses deep-sequencing technologies to produce millions of short DNA reads which are either aligned to a reference genome or reference transcripts, or assembled de novo without the genomic sequence to produce a genome-scale transcription map that consists of both the transcriptional structure and/or level of expression for each gene (Mortazavi et al. 2008).
Furthermore, the RNA-Seq method offers a holistic view of the transcriptome, revealing many novel transcribed regions, splice isoforms, single nucleotide polymorphisms (SNPs) and the precise location of transcription boundaries (Cloonan et al. 2008; Wilhelm et al. 2010). Moreover, in the absence of a reference genome, transcriptome sequencing is not only used to identify transcripts involved in specific biological processes but is also an effective way to obtain a large number of molecular markers such as SSRs from assembled non-redundant transcripts/unigenes that determine functional genetic variation. These unigene-based microsatellite markers (UGMS) are an accurate reflection of the density of SSRs in the transcribed regions of the genome (Parida et al. 2006).

In this study, we used NGS (RNA-Seq) technology to survey the poly(A)+ transcriptome of 26-day-old C. roseus seedlings. De novo assembly was performed using the short read sequence assembler. The assembled sequences were annotated in various ways to serve as a public information platform for quick gene discovery and gene expression analysis. In addition to this, de novo identification and distribution of SSRs in different sequence components [5′-UTRs (untranslated regions), 3′-UTRs and coding sequences (CDS)] of the C. roseus transcriptome (without using a reference genome) were performed, thereby providing a huge resource of functional markers in this medicinal plant.

Materials and methods

Plant material and RNA extraction

In this study, four Catharanthus species [C. trichophyllus, C. pusillus, Vinca minor, Thevetia peruviana] and eight accessions of C. roseus [Prabal, Nirmal, Vinca ice brocode, Pacifica light pink, Apricot eye, Pacifica blush, Pacifica coral, Pacifica burgundy] were utilized. Seeds of all the species/accessions were germinated in pots containing peat:vermiculite:perlite (1:1:1) and kept in a growth chamber at 26 °C under a 16-h photoperiod with light intensity of 380 ± 25 µmol m−2 s−1 generated by overhead fluorescent lamps. Total RNA was isolated from seedlings of C. roseus var. Nirmal (CIMAP-0865) (Kulkarni et al. 1999) as described in Sambrook et al. (1989) for transcriptome sequencing. Leaf tissues were sampled from all the species/accessions for primer validation.

cDNA library preparation and sequencing

The quality and quantity of the RNA sample were assessed on the 2100 Bioanalyzer using the Agilent Plant RNA 6000 nano kit. RNA samples with 260/280 ratios between 1.9 and 2.1, 260/230 ratios between 2.0 and 2.5 and RNA integrity number (RIN) more than 8.0 were used for preparation of cDNA. The cDNA library was generated using the Clontech kit according to the manufacturer's protocol. The C. roseus transcriptome was sequenced using the Illumina Genome Analyser II platform, for which the paired-end mRNA-Seq library was prepared according to the Illumina protocol using the TruSeq Technology reagents. The reads obtained were subjected to quality control checks; high-quality reads were retained based on phred score, and reads containing adaptor/vector sequence were removed using NGS QC Toolkit (Patel and Jain 2012). The sequence data generated in this study have been deposited at NCBI under Accession No. SRP022128.

Sequence assembly

The short read sequence data were de novo assembled using Velvet 1.2.07 (http://www.ebi.ac.uk/~zerbino/velvet). Various assembly parameters, especially k-mer length, were optimized to generate high-quality unique contigs. The resulting contigs were uploaded in Oases (Version 0.1.8; http://www.ebi.ac.uk/~zerbino/oases), which clusters the contigs into small groups and then uses the paired-end read information to generate transcript isoforms. The longest isoforms were selected to generate a set of non-redundant transcripts.

Sequence similarity with available C. roseus sequences and other model plants

Available EST sequences of C. roseus were downloaded from the NCBI EST database (as on Sept. 2012; 19,910 ESTs). Vector and adaptor sequences were trimmed off using Seqclean software, and the ESTs were assembled using the CAP3 program with default parameters to form unigenes. The non-redundant transcripts of C. roseus, generated from the Illumina short read data in this study, were analyzed by searching against the unigenes generated from the C. roseus ESTs using BLASTN with an E value of ≤ 1E-05. Further, proteomes of Arabidopsis (ftp://ftp.arabidopsis.org/home/tair/Proteins/) and rice (ftp://ftp.plantgdb.org/download/Genomes/OsGDB/previous_version/OSpepV6.1) were downloaded from the respective portals and were subjected to BLASTX search (≤ 1E-05) with the non-redundant set of C. roseus transcripts.

Sequence annotation

A comprehensive functional characterization of the contigs was done, including protein sequence similarity, Gene Ontology (GO), Cluster of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, using BLASTX with an E value of ≤ 1E-05. C. roseus transcripts were subjected to BLASTX against the non-redundant protein database of UniProtKB/Swiss-Prot to assess the quality of the de novo assembly and to deduce the putative function. The unigenes that showed significant BLAST hits with UniProtKB/Swiss-Prot were used for functional annotation and were assigned GO terms, based on which they were distributed into the three main functional categories of biological process, cellular component and molecular function. For KEGG pathway enrichment, the C. roseus transcripts were aligned with the KEGG database using BLASTX. The KEGG orthology (KO) assignment and the KEGG pathway reconstructions were performed in KAAS (KEGG Automatic Annotation Server Ver. 1.6; http://www.genome.jp/tools/kaas/) with default parameters. KAAS provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database.
The result contains KEGG Orthology (KO) assignments and automatically generated KEGG pathways. Alignment with the COG protein database (http://www.ncbi.nlm.nih.gov/COG) was done to predict and classify the transcripts into functional categories. To assign the putative transcription factor terms to the contigs, the C. roseus transcripts were aligned to the Plant Transcription Factor Database (plntfdb.bio.uni-potsdam.de/v3.0/downloads.php) version 3.0 using BLASTX.

GC content analysis

The GC counts of C. roseus transcripts as well as those of Arabidopsis (dicot reference) and rice (monocot reference) were determined using a custom-made Perl script. For this, the ESTs of Arabidopsis and rice were downloaded from the NCBI database (as on Sep. 2012), and were assembled into unigenes for estimation of GC count.

Gene expression analysis

For estimation of transcript abundance, all the reads were mapped onto the non-redundant transcripts using Maq software v 0.7.1 (http://maq.sourceforge.net/index.shtml). The numbers of mapped reads were normalized and gene expression was measured using the RPKM (reads per kilobase per million) method (Mortazavi et al. 2008). According to the RPKM method, contig expression levels are calculated as $\mathrm{RPKM}(A) = (10^{6} \times C \times 10^{3}) / (N \times L)$, where RPKM(A) is the expression of gene A, C is the number of reads that uniquely aligned to gene A, N is the total number of reads that uniquely aligned to all genes and L is the number of bases in gene A. The RPKM method removes the influence of differing gene lengths and sequencing discrepancies on the calculation of gene expression.
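The RPKM calculation defined above can be illustrated with a minimal Python sketch. This is not the script used in the study; the function names `rpkm` and `rpkm_table` and the toy read counts are hypothetical, and N is assumed to be the sum of uniquely aligned reads over all transcripts supplied.

```python
def rpkm(read_count, total_mapped_reads, transcript_length):
    """RPKM(A) = (10^6 * C * 10^3) / (N * L) for a single transcript.

    read_count         -- C, reads uniquely aligned to the transcript
    total_mapped_reads -- N, reads uniquely aligned to all transcripts
    transcript_length  -- L, number of bases in the transcript
    """
    if total_mapped_reads <= 0 or transcript_length <= 0:
        raise ValueError("N and L must be positive")
    return (1_000_000 * read_count * 1_000) / (total_mapped_reads * transcript_length)


def rpkm_table(read_counts, lengths):
    """Compute RPKM for every transcript in a {name: read count} mapping.

    Assumption: N is taken as the sum of the supplied read counts.
    """
    total = sum(read_counts.values())
    return {name: rpkm(count, total, lengths[name])
            for name, count in read_counts.items()}


if __name__ == "__main__":
    # Toy data only; these are not values from the C. roseus data set.
    counts = {"transcript_a": 10, "transcript_b": 30}
    sizes = {"transcript_a": 1000, "transcript_b": 2000}
    for name, value in rpkm_table(counts, sizes).items():
        print(name, value)
```

Dividing by both N and L is what makes values comparable between long and short transcripts and between libraries of different sizes.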

1 µl of each forward and reverse primer, 0.2 µl of Titanium Taq and Milli-Q water to make the final volume up to 10 µl. A touchdown PCR protocol was used as described earlier (Shokeen et al. 2011). The amplified products were resolved on either 8 % polyacrylamide gels or 3 % Metaphor agarose gels stained with ethidium bromide. Further, to assign the location of SSRs within the genic fractions, custom-made Perl scripts and the gene prediction program ESTScan (Lottaz et al. 2003; Iseli et al. 1999) were used to define the boundaries of UTR regions and coding regions of transcripts, followed by localization of SSRs within the transcripts.
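The SSR search thresholds given in the SSR detection section (dinucleotides ≥ 6 repeats, trinucleotides ≥ 4, tetra-, penta- and hexanucleotides ≥ 3) and the assignment of SSRs to genic fractions can be illustrated with a minimal Python sketch. It is a simplified regular-expression stand-in, not MISA, ESTScan or the in-house Perl scripts: it finds perfect repeats only, does not merge compound SSRs, and takes the coding-region boundaries as given 0-based coordinates. The names `find_ssrs`, `is_primitive` and `locate_ssr`, the "boundary" label and the toy sequence are all hypothetical.

```python
import re

# Minimum repeat counts per motif length, following the thresholds in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. AGAG, AAA)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs (end exclusive)."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_repeats in MIN_REPEATS.items():
        # One captured unit followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // unit_len
                hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


def locate_ssr(ssr_start, ssr_end, cds_start, cds_end):
    """Assign an SSR to a genic fraction, given predicted CDS boundaries (end exclusive)."""
    if ssr_end <= cds_start:
        return "5'-UTR"
    if ssr_start >= cds_end:
        return "3'-UTR"
    if ssr_start >= cds_start and ssr_end <= cds_end:
        return "CDS"
    return "boundary"  # SSR spans a UTR/CDS junction


if __name__ == "__main__":
    # Toy transcript: (AG)7 upstream of an assumed CDS starting at position 20.
    toy = "GGCC" + "AG" * 7 + "TTGACC" + "AAG" * 5 + "CGTA"
    for start, end, motif, repeats in find_ssrs(toy):
        print(motif, repeats, locate_ssr(start, end, cds_start=20, cds_end=len(toy)))
```

Filtering out non-primitive motifs keeps a dinucleotide repeat such as (AG)n from being reported a second time as a tetranucleotide (AGAG)n repeat.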

Results

Illumina paired-end sequencing and de novo assembly

A cDNA library was prepared from mRNA isolated from 26-day-old C. roseus seedlings and sequenced using the Illumina GAIIx platform. Paired-end sequencing generated 33,447,506 (~33.4 million) paired-end reads, each of length 101 bases, encompassing 3.37 Gb of nucleotides. To improve the quality of the sequence data, reads containing adaptor and/or vector sequence as well as reads having phred quality ≤ 30 % were discarded. After filtering, 29,964,104 (89.59 %) high-quality reads, in which the average quality score at each base position was above 30, were retained and used for assembly (Table 1). The reads were submitted to the NCBI database and assigned Accession No. SRP022128. The de novo assembly of the C. roseus transcriptome was optimized after assessing the effect of k-mer values. The high-quality sequence reads were assembled using the Velvet program at different k-mer lengths of 31, 35, 39, 43, 47, 51, 55, 59, 63 and 67 and various output parameters

SSR detection, primer designing and validation

The microsatellite search program MISA (http://pgrc.ipk-gatersleben.de/misa/) was used to identify microsatellite motifs. All types of simple sequence repeats (SSRs), ranging from dinucleotides to hexanucleotides, were searched for using the following parameters: dinucleotides ≥ 6, trinucleotides ≥ 4, tetranucleotides ≥ 3, pentanucleotides and hexanucleotides ≥ 3. Both perfect SSRs (containing a single repeat motif such as (ATC)n) and compound SSRs (composed of two or more kinds of motifs) were identified. Batchprimer3 software was used for primer designing, and only those SSRs which had a minimum of 100 bp flanking sequence on both sides were considered. For validation of SSR primers, PCR was performed in a 10-µl reaction. For each primer set, the following components were added: 25 ng DNA, 1 µl of 10× PCR buffer, 1 µl dNTPs,


Table 1 Statistics of assembly of C. roseus transcriptome assembled using Velvet and Oases

Statistics of assembly                    Value
Total no. of raw reads                    33,447,506
Total no. of filtered reads               29,964,104 (89.59 %)
Total count of assembled bases            26,495,402
Q30 (%)                                   90
The N50 value                             1,575
The N50 index value                       5,536
Total no. of non-redundant transcripts    26,581
The avg. size of transcripts              996.78
Length of largest transcript              15,529
Length of smallest transcript             102
GC (%)                                    41.76
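For readers unfamiliar with the N50 statistics reported in Table 1, a minimal Python sketch of how they can be derived from a list of transcript lengths follows. It is not the Velvet or Oases implementation; the function name `n50` is hypothetical, and the "N50 index" is assumed here to mean the number of longest transcripts needed to cover at least half of the assembled bases.

```python
def n50(lengths):
    """Return (N50 length, N50 index) for a collection of sequence lengths.

    The N50 length is the length of the shortest sequence in the smallest set
    of longest sequences that together hold at least half of all bases. The
    index is the size of that set (an assumed reading of "N50 index").
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for index, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, index
    raise ValueError("no sequence lengths supplied")


if __name__ == "__main__":
    # Toy lengths only: total = 32 bases, so half = 16 bases.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A larger N50 with a smaller index means that fewer, longer transcripts account for half of the assembly, which is why N50 was one of the criteria compared across k-mer lengths.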

Fig. 1 Size distribution of C. roseus transcripts (x-axis: range of transcript length; y-axis: no. of transcripts)

Table 2 Functional annotation of C. roseus transcripts with known sequence and/or proteins of various plant species

Annotated database               No. of significant hits                                              (%)
C. roseus ESTs (19,910; NCBI)    18,806                                                               71
Nr annotation (Arabidopsis)      19,584                                                               73.68
Nr annotation (Rice)             18,861                                                               70.96
UniProtKB/Swiss-Prot             15,426                                                               58
GO                               13,580                                                               88.03
  Biological process             (11,129)
  Cellular component             (11,317)
  Molecular function             (6,838)
COG                              7,687 (101 orthologous groups into 23 COG functional categories)     28.91
KEGG                             4,006 (289 predicted KEGG metabolic pathways)                        15.07
Transcription factors (TFs)      5,221 (belong to 81 known TF families)                               19.64

like number of used reads, nodes, total number of contigs, contigs longer than 100 bp, N50 length, longest contig length and average contig length as a function of k-mer length were analyzed (Table S1). The analysis showed that k-mer length was inversely proportional to the number of contigs. We found that the best assembly was obtained at k = 51. The contigs obtained from the Velvet assembly were further used as input data for Oases. The final assembly resulted in a total of 26,581 non-redundant transcripts with an N50 length of 1,575 bp, largest contig length of 15,529 bp and average contig length of 996.78 bp (Table 1). Transcript sizes ranged from 100 bp to more than 5,000 bp; the largest number of transcripts (10,552) fell in the range of 100–500 bp, followed by 5,942 transcripts in the range of 500–1,000 bp, and the number decreased as the transcript length increased (Fig. 1). The GC content of C. roseus transcripts as well as those of Arabidopsis (dicot reference) and rice (monocot reference) was determined. The average GC content of C. roseus transcripts was 41.76 %, which was approximately equal to that of Arabidopsis (42.5 %) and much lower than that of rice (55 %), in agreement with values reported earlier for monocots and dicots by Carels and Bernardi (2000).

Validation and functional annotation of C. roseus transcripts

To validate the assembled transcripts, 19,910 currently available (as of Sep. 2012) C. roseus EST sequences were downloaded from NCBI. These sequences were subjected to a BLASTN search against the 26,581 assembled C. roseus transcripts using an E value of ≤ 1E-05. Of the 19,910 C. roseus EST sequences in the NCBI database, 94 % matched with 18,806 (71 %) assembled unigenes at a cut-off E value of ≤ 1E-05, while 7,775 (29 %) transcripts were found to be unique in our transcriptome, thereby

suggesting that the unigene assembly was highly valid and had a wide coverage of known C. roseus genes. Furthermore, the non-redundant transcript set of C. roseus was analyzed for sequence conservation against the protein datasets of Arabidopsis and rice using BLASTX search at an E value cut-off threshold of ≤ 1E-05 to define a significant hit. A large number (19,584; 73.68 %) of C. roseus transcripts showed significant similarity with Arabidopsis, followed by rice (18,861; 70.96 %) (Table 2). C. roseus transcripts were searched against the non-redundant protein sequences available in the UniProtKB/Swiss-Prot database using BLASTX with an E value of ≤ 1E-05 in order to assign putative functions to the C. roseus transcripts. Out of 26,581 transcripts, 15,426 transcripts (~58 %) showed significant hits to the UniProtKB/Swiss-Prot dataset (Table 2), thereby showing overall gene conservation in C. roseus. In addition, many C. roseus transcripts showed homology to uncharacterized proteins annotated as unknown, hypothetical and expressed proteins as well. Further, GO terms were assigned to the 15,426 annotated transcripts. Out of these annotated transcripts, 13,580 unique transcripts were assigned 5,016 GO terms, which were classified into 53 functional groups belonging to the three main GO categories: molecular function, cellular component and biological process (Fig. 2). As one GO term can be assigned to multiple transcripts and a single transcript can have multiple GO terms, 11,129 were assigned at least one GO term in the biological process

Fig. 2 Gene Ontology (GO) classification of C. roseus transcripts. GO terms were assigned to 13,580 annotated transcripts which were grouped into three main categories: biological process (11,129), cellular component (11,317) and molecular function (6,838) (y-axis: no. of transcripts; x-axis: GO functional groups, listed below by panel)

Molecular Function: translation regulator activity; transcription regulator activity; enzyme regulator activity; ligase activity; isomerase activity; lyase activity; hydrolase activity; transferase activity; oxidoreductase activity; kinase activity; antioxidant activity; channel activity; ion transmembrane transporter activity; protein transporter activity; protein binding; binding; transporter activity; structural molecule activity; receptor activity; signal transducer activity; helicase activity; catalytic activity; motor activity; nucleic acid binding

Cellular Component: external encapsulating structure; membrane; cell surface; cytoplasm; chromosome; nucleus; cell; intracellular; extracellular space; proteinaceous extracellular matrix; extracellular region

Biological Processes: response to stimulus; regulation of biological process; secretion; macromolecule metabolic process; extracellular structure organization; cell differentiation; cellular process; biosynthetic process; catabolic process; cell death; metabolic process; behavior; multicellular organismal development; cell communication; cellular membrane fusion; cellular component movement; transport; nucleic acid metabolic process

category, 11,317 in the cellular component category and 6,838 in the molecular function category (Fig. 2; Table 2). In the biological process category, most of the transcripts were associated with "cellular process" followed by "metabolic process". Similarly, in the cellular component category the highest number of transcripts was associated with "cell" followed by "membrane", and in the molecular function category the largest number of transcripts was grouped in "catalytic activity" followed by "binding" and "transferase activity".

To further predict the function of the transcripts, all 26,581 unigenes were subjected to classification into different protein families based on the Clusters of Orthologous Groups (COG) protein database. Overall, 7,687 putative proteins were classified into 101 orthologous groups. The COG-annotated putative proteins were distributed functionally into at least 23 protein families (Fig. 3; Table 2), of which the cluster for 'general function prediction' represented the largest group (1,183), followed by 'post translational modification, protein turnover, chaperones' (678), 'translation, ribosomal structure and biogenesis' (561) and 'carbohydrate transport and metabolism' (398).

Pathway-based analysis can help us further understand the biological significance of genes. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database contains systematic analysis of inner-cell metabolic pathways and functions of gene products, which aids in studying the complex biological behaviors of genes. To identify active biological pathways in C. roseus, the unigenes were mapped to the reference canonical pathways in the KEGG database. In total, 4,006 assembled sequences were found to be involved in 289 predicted KEGG metabolic pathways. The number of sequences in each pathway ranged from 1 to 116 (Table S2). The top five metabolic pathways were: ribosome (116 genes), spliceosome (101 genes), RNA transport (90 genes), purine metabolism (82 genes) and oxidative phosphorylation (78 genes) (Table 3). The 'biosynthesis of secondary metabolites' (833 genes) was also represented in the KEGG classification. Pathways with many corresponding genes included 'purine metabolism' (82 genes), 'pyrimidine metabolism' (71 genes), 'arginine and proline metabolism' (33 genes), 'glycine, serine and threonine metabolism' (30 genes), 'flavonoid biosynthesis' (10 genes), 'Tropane, piperidine and pyridine alkaloid biosynthesis' (8 genes) and 'Isoquinoline alkaloid biosynthesis' (7 genes). Some of the genes related to the TIA pathways were represented in the KEGG analysis, which included 'terpenoid backbone biosynthesis' (29 genes), 'diterpenoid biosynthesis' (11 genes), 'monoterpenoid indole alkaloid' (4 genes), 'Sesquiterpenoid and

Fig. 3 Clusters of orthologous groups (COG) classification. In total, 7,687 putative proteins were assigned to 101 orthologous groups which were distributed functionally into at least 23 molecular families (x-axis: functional class; y-axis: no. of transcripts)

Functional classes: [A] RNA processing and modification; [B] Chromatin structure and dynamics; [C] Energy production and conversion; [D] Cell cycle control, cell division, chromosome partitioning; [E] Amino acid transport and metabolism; [F] Nucleotide transport and metabolism; [G] Carbohydrate transport and metabolism; [H] Coenzyme transport and metabolism; [I] Lipid transport and metabolism; [J] Translation, ribosomal structure and biogenesis; [K] Transcription; [L] Replication, recombination and repair; [M] Cell wall/membrane/envelope biogenesis; [N] Cell motility; [O] Posttranslational modification, protein turnover, chaperones; [P] Inorganic ion transport and metabolism; [Q] Secondary metabolites biosynthesis, transport and catabolism; [R] General function prediction only; [S] Function unknown; [T] Signal transduction mechanisms; [U] Intracellular trafficking, secretion, and vesicular transport; [V] Defense mechanisms; [W] Extracellular structures; [Y] Nuclear structure; [Z] Cytoskeleton

Table 3 KEGG pathway analysis: top 20 active biological pathways obtained in C. roseus transcriptome among a total of 289 KEGG-predicted pathways

Pathway                                        No. of hits
Ribosome                                       116
Spliceosome                                    101
RNA transport                                  90
Purine metabolism                              82
Oxidative phosphorylation                      78
Protein processing in endoplasmic reticulum    73
Pyrimidine metabolism                          71
Ribosome biogenesis in eukaryotes              56
Ubiquitin mediated proteolysis                 54
mRNA surveillance pathway                      48
RNA degradation                                48
Cell cycle                                     48
Cell cycle-yeast                               45
Photosynthesis                                 44
Plant hormone signal transduction              40
Amino sugar and nucleotide sugar metabolism    37
Nucleotide excision repair                     37
Arginine and proline metabolism                33
Glycine, serine and threonine metabolism       30
Glycolysis/gluconeogenesis                     29

triterpenoid biosynthesis' (5 genes) and 'indole alkaloid biosynthesis' (2 genes). Finally, some genes were also distributed into the plant hormone biosynthesis pathways including 'zeatin biosynthesis' (6 genes) and 'brassinosteroid biosynthesis' (8 genes).

Transcription factors (TFs) affect metabolic flux by regulating gene expression of related encoding enzymes, and their identification provides information which would be helpful in manipulating metabolic pathways in plants. Here, BLASTX with a threshold E value of ≤ 1E-05 was performed to search against the known Plant Transcription Factor Database (plntfdb.bio.uni-potsdam.de/v3.0/downloads.php). Out of 26,581 transcripts, 5,221 transcripts were identified to be TFs that belonged to 81 known TF families, representing 19.64 % of C. roseus transcripts (Fig. 4; Table 2). Among the most abundant families, 381, 378, 302, 285, 267, 241, 178, 154, 142 and 101 unigenes were annotated to the C3H, FAR1, bHLH, NAC, MADS, MYB-related, C2H2, AP2/EREBP, WRKY and MYB families, respectively.

Quantification of C. roseus transcripts

Digital expression profiling, also called RNA-Seq, is a powerful and efficient approach for gene expression analysis. To determine the level of expression of C. roseus genes, the coverage of each transcript was determined as a function of reads per kilobase per million. The mapping of all the reads onto the assembled, non-redundant set of C. roseus transcripts revealed that the number of reads corresponding to each transcript ranged from 1 (0.036 rpkm) to 603,839 (22,012.19 rpkm) with an average of 1,032 reads (37.62 rpkm) per transcript, indicating a very wide range of expression levels of C. roseus transcripts. It also indicated that even the very lowly expressed C. roseus transcripts were represented in our assembly. The minimum coverage (rpkm) of a C. roseus transcript was 0.22 and the maximum was 4,809.13, with an average of 25.41. Expression analysis revealed that the highly expressed genes in C. roseus were those involved in photosynthesis, such as Rubisco and photosystem I reaction center subunit.

[Fig. 4 bar chart. y-axis: No. of transcripts (0 to 400). x-axis (TF families, in the order shown): VOZ, ULT, SOH1, Rcd1-like, MED7, IWS1, MED6, SRS, RB, NOZZLE, MBF1, S1Fa-like, C2C2-YABBY, Coactivator p15, BBR/BPC, PLATZ, PBF-2-like, HRT, ARR-B, LIM, GeBP, TUB, Pseudo ARR-B, EIL, Alfin-like, E2F-DP, CPP, Sigma70-like, CSD, zf-HD, TAZ, DDT, C2C2-CO-like, TIG, CAMTA, SWI/SNF-BAF60b, GRF, OFP, ARID, SWI/SNF-SWI3, BSD, ARF, Jumonji, BES1, HSF, TCP, HMG, LUG, C2C2-GATA, Trihelix, C2C2-Dof, DBP, LOB, CCAAT, ABI3VP1, G2-like, SBP, GRAS, Tify, GNAT, MYB, AUX/IAA, mTERF, bZIP, RWP-RK, TRAF, HB, FHA, WRKY, AP2-EREBP, SET, Orphans, C2H2, SNF2, PHD, MYB-related, MADS, NAC, bHLH, FAR1, C3H]

Fig. 4 Putative transcription factors encoding unigenes in the C. roseus transcriptome
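The TF assignment summarized in Fig. 4 rests on keeping BLASTX hits at or below the stated E value cutoff and tallying them per family. The sketch below shows one way this could be done; it assumes BLAST tabular output (12 tab-separated columns, E value in column 11) and a hypothetical `family_of` lookup from database subject IDs to family names. It is not the authors' script.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def count_tf_families(blast_lines, family_of):
    """Count transcripts per TF family from BLAST tabular lines.

    family_of is a hypothetical dict mapping subject IDs to TF family names.
    Each query transcript is counted once, using its lowest-E-value accepted hit.
    """
    best = {}  # query id -> (evalue, subject id)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > E_VALUE_CUTOFF or subject not in family_of:
            continue
        if query not in best or evalue < best[query][0]:
            best[query] = (evalue, subject)
    return Counter(family_of[subject] for _, subject in best.values())


# Made-up demonstration data (IDs and values are illustrative only).
demo_lines = [
    "t1\tsA\t90.0\t100\t5\t0\t1\t300\t1\t100\t2e-30\t200",
    "t1\tsB\t70.0\t100\t30\t0\t1\t300\t1\t100\t1e-10\t80",
    "t2\tsB\t40.0\t60\t36\t0\t1\t180\t1\t60\t1e-3\t40",
    "t3\tsB\t65.0\t90\t31\t0\t1\t270\t1\t90\t1e-6\t60",
]
demo_families = {"sA": "bHLH", "sB": "WRKY"}
demo_counts = count_tf_families(demo_lines, demo_families)
```

Counting each transcript once by its best hit is a design choice of this sketch; a transcript matching several families would otherwise inflate the totals.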

Other abundantly present transcripts included genes encoding cytochrome P450, actin, water channel protein (aquaporin) and metallothionein, in addition to those encoding housekeeping enzymes (glyceraldehyde 3-phosphate dehydrogenase and fructose-bisphosphate aldolase). On the other hand, genes encoding serine threonine kinase, zinc finger protein, hydrolase family protein/HAD-superfamily protein isoform 1 and cytokinin dehydrogenase were amongst the lowly expressed genes.

SSR discovery and their location within genic regions

The C. roseus non-redundant transcripts (26,581) were mined for the presence of microsatellite motifs using the MIcroSAtellite (MISA) tool (http://pgrc.ipk-gatersleben.de/misa). A total of 11,004 SSRs were identified in 7,076 (26.62 %) transcripts of C. roseus (Table 4), a frequency of one SSR per 2.4 kb. Mononucleotide repeats were not considered for analysis. As expected in genic regions, the trinucleotide SSRs represented the largest fraction (52.92 %), followed by dinucleotides (20.63 %) (Fig. 5a). Among the dinucleotides, the most abundant repeat class was AG/CT, which accounted for 30.09 % of the repeats, and among the trinucleotides, AAG/CTT (66.67 %) was found to be the most abundant (Fig. 5b). The sequences flanking the SSRs were used to design primer pairs, and a total of 2,520 SSR primer pairs were designed. The details of all primers designed are available in Table S3. A set of 48 primer pairs was validated in the parental C. roseus genotype Nirmal. Of the 48 primers, 47 (97.9 %)


amplified fragments of the expected sizes. Thirty-six (76.59 %) of these were used to PCR amplify DNA from four Catharanthus species and eight C. roseus accessions as described in methods. All markers exhibited polymorphism between C. roseus and related species. However, within the eight C. roseus accessions only 22 were found to be polymorphic. Further, we predicted the position of SSRs within the specific genic fractions (5'-UTRs, 3'-UTRs and coding regions) of the 15,426 annotated C. roseus transcripts. Since the complete genome sequence of C. roseus was unavailable for use as a reference, it was necessary to first define the positions of the various genic fractions of the C. roseus transcripts. To do this, the ESTScan software (Lottaz et al. 2003; Iseli et al. 1999) was used, which can detect and extract coding regions from low-quality sequences with high selectivity and sensitivity, and is able to accurately correct frame-shift errors. It is generally used as a tool for quality control, for the assembly of contigs representing the coding regions of genes, and for gene prediction. Here, the 15,426 annotated C. roseus transcripts were subjected to ESTScan, which defined the boundaries of the coding regions. Next, the regions upstream of the translation start sites (5'-UTRs) and downstream of the translation stop sites (3'-UTRs) were defined using an in-house Perl script. Regions of ≥20 bp upstream of the translation start site and downstream of the translation stop site were identified as the 5'- and 3'-UTRs, respectively (Fig. 6). Hence the exact location of SSRs within the genic fractions (5'-UTRs, 3'-UTRs and coding) of C. roseus genes was

Table 4 Statistics of SSRs identified in C. roseus transcripts

SSRs mining                                       Total             5'-UTR            CDS               3'-UTR
Total number of sequences examined                26,581            9,696             15,426            12,833
Total size of examined sequences (bp)             26,495,402        1,995,470         2,098,528         3,036,628
Number of SSR containing sequences                7,076 (27.37 %)   2,103 (27.75 %)   3,475 (31.60 %)   1,808 (17.04 %)
Total number of identified SSRs                   11,004            2,691             4,867             2,188
Distribution of SSRs in different repeat types
  Di                                              2,270             1,061             56                751
  Tri                                             5,823             886               3,898             604
  Tetra                                           1,730             482               372               581
  Penta                                           588               180               146               180
  Hexa                                            593               82                395               72
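The overall SSR frequency quoted in the running text can be rechecked from the "Total" column of Table 4. The short calculation below is only a worked check using those totals; the variable names are ours.

```python
# Totals taken from Table 4 and the accompanying text.
total_bp = 26_495_402        # total size of examined sequences (bp)
total_ssrs = 11_004          # total number of identified SSRs
total_transcripts = 26_581   # total number of sequences examined
ssr_transcripts = 7_076      # number of SSR-containing sequences

kb_per_ssr = total_bp / total_ssrs / 1_000
pct_with_ssr = 100 * ssr_transcripts / total_transcripts

print(f"one SSR per {kb_per_ssr:.1f} kb")                         # one SSR per 2.4 kb
print(f"{pct_with_ssr:.2f} % of transcripts contain an SSR")      # 26.62 %
```

Both values agree with the figures given in the running text (one SSR per 2.4 kb; 26.62 % of transcripts).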

[Fig. 5 bar charts, panels a and b. y-axis: Frequency of SSRs (%) (0 to 90). Series: Total, CDS, 5'-UTR, 3'-UTR. x-axis categories in one panel: Di, Tri, Tetra, Penta, Hexa]

Fig. 5 Frequency distribution of SSRs based on motif types in different genic regions of the C. roseus transcriptome. a The frequency of the most abundant repeat motifs and b the frequency of the most abundant types among the di- and trinucleotide SSR motifs. AAG/CTT among the trinucleotide SSRs and AG/CT among the dinucleotide SSRs were found to be preferentially located in the 5'-UTRs

accurately predicted, and 4,867, 2,691 and 2,188 SSRs were identified in the coding, 5'-UTR and 3'-UTR regions, respectively, of transcripts (Fig. 6). Within the coding regions, as expected, the trinucleotide SSRs represented the largest fraction (75.32 %), followed by dinucleotides (5.44 %) (Fig. 5a), and among them the AAG/CTT (24.29 %) and AG/CT (65.69 %) types of SSRs were found to be the most abundant, respectively (Fig. 5b). However, in regulatory regions such as the 5'-UTR, the dinucleotides (36.05 %) and the trinucleotides (36.51 %) (Fig. 5a) were found to be almost equally abundant. A similar pattern was observed in the 3'-UTRs, wherein dinucleotide repeat motifs accounted for 34.32 %, followed by trinucleotides at 27.60 % (Fig. 5a). Furthermore, the nucleotide composition of SSR motifs differed between the 5'-UTR and the 3'-UTR. The AG/CT type of dinucleotide repeats, accounting for 84.26 % of the repeats, was more frequent than other repeats in the 5'-UTRs, and the AAG/CTT type of trinucleotide repeats accounted for 54.18 % (Fig. 5b). On the contrary, within

the 3'-UTRs the AT/TA type of dinucleotide repeats was most abundant, accounting for 59.25 % of the repeats, and among trinucleotide repeats, AAT/ATT accounted for 27.15 % (Fig. 5b). Functional annotation by GO of genes harboring SSRs in their 5'-UTRs and/or 3'-UTRs revealed that most of these were involved in cellular and metabolic processes under the biological process category, whereas under molecular function most of them were involved in catalytic and binding activity (Fig. S1). Further, their COG annotation revealed that a majority of them were involved in 'general function prediction', 'carbohydrate transport and metabolism', 'signal transduction mechanism' and 'post translational modification, protein turnover, chaperones' (Fig. S2). Further, the sequences harboring SSRs within 5'-UTRs and 3'-UTRs were subjected to a search for transcription factors, which showed that the largest number of unigenes belonged to the C3H TF family, followed by MADS, bHLH and AP2-EREBP (Fig. S3). These findings suggest that SSRs were present in the regulatory regions of a wide


[Fig. 6 flow chart: 15,426 annotated C. roseus unigenes -> ESTScan (translation start site and translation stop site) -> Perl script (5'-UTR: 20 nt upstream of the translation start site; 3'-UTR: 20 nt downstream of the translation stop site) -> Statistics: 5'-UTRs 9,696 (avg. length 216), CDS 15,426 (avg. length 1,293), 3'-UTRs 12,833 (avg. length 238) -> MISA on each fraction -> 5'-UTR-SSRs 2,691, CDS-SSRs 4,867, 3'-UTR-SSRs 2,188]

Fig. 6 Flow chart of de novo identification of UTRs (5'- and 3'-) and coding regions from the C. roseus transcripts followed by SSR analysis in these genic fractions

range of genes including housekeeping and TF genes and may therefore be involved in their regulation at various transcriptional and translational levels.
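The partitioning of transcripts into genic fractions described in this section (coding boundaries from ESTScan, UTRs taken upstream and downstream of them, with a 20 bp minimum) can be sketched as follows. The original step used an in-house Perl script that is not shown in the paper, so this Python version is only an illustration; the function name, the 0-based end-exclusive coordinate convention and the example sequence are assumptions.

```python
MIN_UTR_LEN = 20  # minimum UTR length (bp) stated in the text


def split_transcript(seq, cds_start, cds_end):
    """Split a transcript into 5'-UTR, CDS and 3'-UTR.

    cds_start and cds_end are 0-based, end-exclusive CDS coordinates,
    assumed here to come from an ESTScan-like coding-region predictor.
    UTRs shorter than MIN_UTR_LEN are reported as None.
    """
    if not 0 <= cds_start < cds_end <= len(seq):
        raise ValueError("CDS coordinates must lie within the transcript")
    five, cds, three = seq[:cds_start], seq[cds_start:cds_end], seq[cds_end:]
    return {
        "5'-UTR": five if len(five) >= MIN_UTR_LEN else None,
        "CDS": cds,
        "3'-UTR": three if len(three) >= MIN_UTR_LEN else None,
    }


# Made-up transcript: 25 bp upstream, a 36 bp CDS, 10 bp downstream.
toy = "A" * 25 + "ATG" + "C" * 30 + "TAA" + "G" * 10
fractions = split_transcript(toy, 25, 61)
```

In this toy case the 5'-UTR is kept (25 bp) while the 3'-UTR is dropped (10 bp), which mirrors why fewer 5'-UTR and 3'-UTR sequences than coding sequences are listed in Table 4. Each retained fraction could then be passed separately to an SSR miner, as in Fig. 6.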

Discussion

The transcriptome is the complete repertoire of expressed RNA transcripts in the cell, and its characterization is essential for understanding the functional complexity of the genome. Complete information about transcriptomes can now easily be generated via high-throughput mRNA sequencing technology, which is especially suitable for gene expression profiling in non-model organisms that lack genomic sequence data. To the best of our knowledge, genomes of C. roseus and other medicinal plants have not been sequenced. However, transcriptome/metabolome datasets for some plants such as Atropa belladonna and Digitalis purpurea (http://metnetdb.org/mpmr_public) and a few others are available. Moreover, limited genetic and genomic information is available for C. roseus (Shukla et al. 2006; Shokeen et al. 2005, 2007, 2011; Mishra et al. 2011; Moerkercke et al. 2013). Hence, we used RNA-Seq technology to profile the 26-day-old seedling transcriptome, obtained 29,964,104 clean sequencing reads and de novo assembled them into a total of 26,581 unigenes. Transcriptome assemblies can also be utilized to mine for SSRs, which serve as important genomic resources for


molecular breeding and crop improvement. Their isolation earlier presented many challenges (Zane et al. 2002; Squirrell et al. 2003) but is now greatly facilitated by in silico mining of transcripts. Hence in this study, the 26,581 unigenes were not only annotated to determine their functional significance but were also utilized to develop an SSR resource of 2,520 markers. Further, analysis of the genic location of the SSRs helped in predicting the putative role of 5'-UTR-associated SSRs in gene regulation, a newly developing concept generating interest and necessitating investigations (Lue et al. 1989; Epplen et al. 1996; Sangwan and O’Brian 2002; Sharopova 2008). Recently, many plant transcriptomes have been de novo assembled owing to improvements in assembly programs that can effectively assemble relatively short reads, coupled with the great advantage of paired-end sequencing. The short read sequence data generated for whole genomes or transcriptomes have been assembled de novo for plant species such as lentil (Verma et al. 2013), safflower (Lulin et al. 2012), rubber tree (Li et al. 2012), sweet potato (Xie et al. 2012), carrot (Iorizzo et al. 2011), maize (Li et al. 2011), soybean (Libault et al. 2010) etc. As no generally accepted protocol for evaluating such an assembly exists, it is a great challenge to make a reliable judgment or carry out the validation of an optimal assembly without a reference genomic sequence. In this scenario, ESTs and genomic survey sequences deposited in the NCBI database can serve as a reference. Fortunately in the case of C. roseus, 19,910 cDNA sequences were available in GenBank (as of Sep. 2012), of which 94 % could be matched with 71 % of the assembled unigenes through nucleotide BLAST, whereas 29 % of the unigenes found no matches and represented an additional resource of new genes.
These results clearly reflected a valid assembly having a wide coverage of the unigenes in which the major genes had been sequenced and assembled correctly. Further, assembly validation may also be done by comparison with other known protein databases. Hence, out of 26,581 transcripts, 15,426 (58.03 %) were successfully aligned with known proteins in the Nr database, suggesting their relatively conserved functions. Functional annotation and classification provide predicted information on intracellular metabolic pathways and the biological behaviors of genes. Different classification systems may provide distinct insights into gene function. GO is an international standardized gene functional classification system which offers a dynamically updated controlled vocabulary and strictly defined concepts to comprehensively describe properties of genes and their products in any organism. GO functional classification may help us to understand the distribution of gene functions at the macro level and to predict the physiological role of each unigene. COG is a database in which orthologous gene products are classified. Every protein in


COG is assumed to have evolved from an ancestor protein, and the whole database is built on coding proteins from complete genomes as well as system evolutionary relationships of bacteria, algae and eukaryotes. GO and COG classifications of the C. roseus dataset revealed that the assembled unigenes had diverse molecular functions and were involved in many metabolic pathways, indicating the diversity of the assembled unigenes while reflecting the global landscape of the transcriptome. C. roseus is a multipurpose plant, but it is widely known for its pharmaceutically important alkaloids, as it produces an unsurpassed chemodiversity of more than 130 terpenoid indole alkaloids (TIAs) including the anticancer alkaloids (vincristine and vinblastine) and antihypertensive alkaloids (ajmalicine and serpentine). In terms of medicinal and pharmaceutical use, the properties of C. roseus largely depend on its alkaloid profile and hence, gaining new insights into the biosynthesis and transcriptional regulation of alkaloids in C. roseus should accelerate the engineering of this pathway in the future. Although most genes of the TIA pathway are known to be expressed under stress (Aerts et al. 1994; Roepke et al. 2010), some of the genes related to the TIA pathway, including terpenoid backbone biosynthesis, indole alkaloid biosynthesis, diterpenoid biosynthesis and monoterpenoid indole alkaloid biosynthesis, found a place in the KEGG classification. In addition, genes involved in the biosynthesis of secondary metabolites, including flavonoid biosynthesis and carotenoid biosynthesis, also found a place in the KEGG pathway analysis. Further, some of the important transcription factor families such as AP2/EREBP, bHLH, MYB and MYB-related were also detected in this transcriptome. These TF families regulate secondary metabolism in plants and also play an important role in the control of indole alkaloid and tryptophan biosynthesis (Zhu et al. 2003; Zhang et al. 2011).
Some of the important TFs that were identified in our study included the octadecanoid-derivative responsive Catharanthus AP2-domain protein 3 (ORCA3), which activates the expression of several genes that encode enzymes involved in indole alkaloid biosynthesis and the MEP pathway, e.g., ASI, TDC, and 1-deoxyxylulose 5-phosphate synthase (DXS). Others, such as altered tryptophan regulation 1 (ATR1), a MYB factor, and altered tryptophan regulation 2 (ATR2), a bHLH factor, which activate genes that function in tryptophan biosynthesis and metabolism in Arabidopsis (Zhu et al. 2003; Zhang et al. 2011), were also identified. The plant-specific WRKY TF family is known to be involved in alkaloid biosynthesis (Kato et al. 2007). Recently, CrWRKY1 was characterized and reported to be involved in the transcriptional regulation of the C. roseus TIA pathway (Suttipanta et al. 2011), and it was identified in our dataset too.

Transcriptomes not only aid in the discovery of novel genes but also serve as reservoirs for mining SSRs, whose discovery earlier through cloning and sequencing presented various challenges (Zane et al. 2002; Squirrell et al. 2003). SSRs occur ubiquitously in eukaryotic genomes, and because of their high variability and wide distribution they have been used as genetic markers (Dib et al. 1996; Somers et al. 2004; Temnykh et al. 2001), which not only help in molecular mapping but also provide an opportunity for gene discovery when they show linkage with a trait of interest (Thiel et al. 2003). SSR markers designed from coding regions (ESTs/transcriptomes) are more conserved than markers generated from genomic sequences and therefore show more transferability between species (Portis et al. 2007; Varshney et al. 2005b). EST-SSRs have been extensively used for analysis of genetic diversity, development of molecular maps, QTL analysis and cross-species transferability in plant species like rice (Kantety et al. 2002), bread wheat (Gupta et al. 2003), Capsicum (Minamiyama et al. 2006; Portis et al. 2007), sugarcane (Cordeiro et al. 2001), cotton (Park et al. 2005) etc. In this study of the C. roseus transcriptome, one SSR per 2.4 kb was identified, a frequency similar to that reported earlier in plants (Feng et al. 2009). The most abundant repeat motifs found in the present study were trinucleotides, which constituted 53 % of the total SSRs, in agreement with previous studies (Varshney et al. 2005a; Xie et al. 2012). Among trinucleotides, AAG/CTT (~30 %) was found to be the most frequent motif, as has also been reported earlier in other plants (Li et al. 2004; Wang et al. 2012). In contrast, mining of SSRs in several other transcriptomes has reported a higher occurrence of dinucleotide repeats (Feng et al. 2009; Wang et al. 2010; Triwitayakorn et al. 2011).
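To make concrete how SSR counts depend on the mining criteria, and how complementary motifs such as AG/CT or AAG/CTT are grouped into one class, a minimal regex-based miner for perfect repeats is sketched below. It is not MISA and does not reproduce the settings used in this study; the minimum repeat numbers in `MIN_REPEATS` are assumed values chosen for illustration, and all function names are hypothetical.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative only;
# the thresholds actually used in the study are given in its methods).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement.

    Groups, for example, AG, GA, CT and TC into the single class AG.
    """
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    rotations = [m[i:] + m[:i] for m in (motif, revcomp) for i in range(len(m))]
    return min(rotations)


def find_ssrs(seq):
    """Yield (start, end, canonical motif, repeat count) for perfect SSRs."""
    seq = seq.upper()
    for size, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs built from a shorter unit (AAA, AGAG, ...) so that
            # mononucleotide runs and lower-order repeats are not re-counted.
            if motif in (motif + motif)[1:-1]:
                continue
            repeats = len(match.group(0)) // size
            yield match.start(), match.end(), canonical_motif(motif), repeats


# Made-up sequence with a (CT)7 and a (GAA)6 repeat.
toy_seq = "TTGC" + "CT" * 7 + "ACGTAC" + "GAA" * 6 + "TT"
toy_hits = list(find_ssrs(toy_seq))
```

Raising or lowering the values in `MIN_REPEATS` changes how many repeats are reported for the same sequences, which is one reason SSR frequencies differ between studies that use different search criteria.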
The difference in the frequencies of EST-SSRs could be attributed to the "search criteria" used, the type of SSR motif, the size of the sequence dataset analyzed and the mining tool used (Portis et al. 2007; Gupta and Prasad 2009; Toth et al. 2000). In this study, 2,520 (23 %) genic SSR markers were designed using stringent selection criteria. Primer pairs could be designed for only a small fraction (23 %) of the SSRs since primer designing is governed by several factors (Zane et al. 2002; Squirrell et al. 2003), the foremost being the stringent selection criteria used and the database mining tools (see materials and methods). Moreover, the number of primers was probably also limited because (1) primer pairs could not be designed from sequences that had SSR motifs at the terminal ends and (2) only one SSR per transcript was chosen for marker development. In spite of that, PCR validation of a set of 48 primers revealed 97.9 % successful amplification, and 76.6 % of them showed polymorphism across different Catharanthus species as well as accessions of C. roseus. The polymorphism rate


obtained in this study was comparable with that of Mishra et al. (2011). This constitutes a major genomic resource which will enrich the existing marker repertoire for utilization in C. roseus breeding in future. Numerous lines of evidence have demonstrated that 3–7 % of expressed genes contain SSR motifs, mainly within the untranslated regions (UTRs) of the mRNA (Thiel et al. 2003). Genomic distribution of SSRs is nonrandom, presumably as they have roles in chromatin organization, regulation of gene activity, recombination, DNA replication, cell cycle, mismatch repair (MMR) system, etc. (Li et al. 2002). Moreover, SSRs located within different genic regions may have different putative functions; for example, SSR variations in 5'-UTRs could regulate gene expression by affecting transcription and translation; SSR expansions in the 3'-UTRs cause transcription slippage and produce expanded mRNA; and intronic SSRs can affect gene transcription, mRNA splicing, or export to the cytoplasm (Li et al. 2004). Therefore, it was interesting to analyze the distribution of SSRs within various genic regions of C. roseus transcripts. For this, ESTScan and an in-house Perl script were utilized for defining the boundaries of coding regions and UTRs, respectively, since no reference genome was available. This analysis revealed that the distribution pattern of SSR motifs within the genic fractions was nonrandom: the majority of dinucleotide repeats, especially AG/CT, and trinucleotide repeats, especially AAG/CTT, were preferentially associated with the 5'-UTRs. However, microsatellite motifs such as the AT/TA dinucleotide repeats and ATT/AAT trinucleotide repeats were found to be associated with the 3'-UTRs. This has also been observed in other studies, which revealed that motifs such as AG/CT and AAG/CTT were abundant in the transcribed regions of Arabidopsis, soybean, Brassica, rice and Medicago and were preferentially located within the 5'-UTRs (Morgante et al.
2002; Fujimori et al. 2003; Zhang et al. 2004, 2006; Lawson and Zhang 2006). Microsatellites such as AG and AAG present in 5'-UTRs have been shown to have significant roles in the regulation of gene expression (Lue et al. 1989; Epplen et al. 1996; Martienssen and Colot 2001; Hulzink et al. 2002; Sangwan and O’Brian 2002; Sharopova 2008). Moreover, it has also been shown through computational prediction and gene expression analysis that the AG/CT and AAG/CTT repeats act as regulatory elements involved in light and hormonal responses, including the response to salicylic acid (Zhang et al. 2006). In our study, the analysis of AG and AAG repeats present in the 5'-UTR reiterated these observations, since they were found to be preferentially associated with a wide range of regulatory genes including TFs and housekeeping genes. This analysis may prove to be particularly useful for further experiments elucidating regulatory networks in plants.


In conclusion, we have determined the transcriptome of C. roseus through the use of high-throughput Illumina paired-end sequencing. Our study resulted in the de novo assembly of 26,581 unigenes and demonstrated some important features of the C. roseus transcriptome such as gene annotation, KEGG pathway analysis, quantification of transcripts and identification of important gene families such as TFs. It also served to generate a large genomic resource of SSR markers, the predicted locations of which, especially in the 5'-UTRs of important regulatory genes, will open up new avenues to explore regulatory pathways not only in members of the family Apocynaceae but also in many other medicinally important plants.

Acknowledgments We acknowledge the National Institute of Plant Genome Research (NIPGR), New Delhi, India for the funding support and the Council of Scientific and Industrial Research (CSIR), India for a fellowship grant to SK.

References Aerts RJ, Gisi D, De Carolis E, De Luca V, Baumann TW (1994) Methyl jasmonate vapor increases the developmentally controlled synthesis of alkaloids in Catharanthus and Cinchona seedlings. Plant J 5:635–643 Bayer A, Ma X, Stockigt J (2004) Acetyltransfer in natural product biosynthesis—functional cloning and molecular analysis of vinorine synthase. Bioorg Med Chem 12:2787–2795 Blasko G, Cordell GA (1990) Isolation, structure elucidation, and biosynthesis of the bisindole alkaloids of Catharanthus. In: Brossi A, Suffness M (eds) The alkaloids, vol 37. San Diego, CA Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, Roth R, George D, Eletr S, Albrecht G, Vermaas E, Williams SR, Moon K, Burcham T, Pallas M, DuBridge RB, Kirchner J, Fearon K, Mao J, Corcoran K (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18:630–634 Canel C, Lopes-Cardoso I, Whitmer S, van der Fits L, Pasquali G, van der Heijden R, Hoge JHC, Verpoorte R (1998) Effects of over expression of strictosidine synthase and tryptophan decarboxylase on alkaloid production by cell cultures of Catharanthus roseus. Planta 205:414–419 Carels N, Bernardi G (2000) Two classes of genes in plants. Genetics 154:1819–1825 Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F et al (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 38:D473–D479 Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM et al (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 40:D742–D753 Chaudhary S, Sharma V, Prasad M, Bhatia S, Tripathi BN, Yadav G, Kumar S (2011) Characterization and genetic linkage mapping of the horticulturally important mutation leafless inflorescence (lli) in periwinkle Catharanthus roseus. 
Scientia Hortic 129:142–153 Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ,

Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM (2008) Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5:613–619 Cordeiro GM, Casu R, McIntyre CL, Manners JM, Henry RJ (2001) Microsatellite markers from sugarcane (Saccharum spp.) ESTs cross transferable to Erianthus and Sorghum. Plant Sci 160:1115–1123 Dib C, Faure S, Fizames C, Samson D, Drouot N, Vignal A et al (1996) A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380:152–154 Epplen JT, Kyas A, Maueler W (1996) Genomic simple repetitive DNAs are targets for differential binding of nuclear proteins. FEBS Lett 389:92–95 Feng SP, Li WG, Huang HS, Wang JY, Wu YT (2009) Development, characterization and cross-species/genera transferability of EST-SSR markers for rubber tree (Hevea brasiliensis). Mol Breed 23:85–97 Fujimori S, Washio T, Higo K, Ohtomo Y, Murakami K, Matsubara K, Kawai J, Carninci P, Hayashizaki Y, Kikuchi S, Tomita M (2003) A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription. FEBS Lett 554:17–22 Gupta S, Prasad M (2009) Development and characterization of genic SSR markers in Medicago truncatula and their transferability in leguminous and non-leguminous species. Genome 52:761–771 Gupta PK, Rustgi S, Sharma S, Singh R, Kumar N, Balyan HS (2003) Transferable EST-SSR markers for the study of polymorphism and genetic diversity in bread wheat. Mol Genet Genome 270:315–323 Gur-Arie R, Cohen CJ, Eitan Y, Shelef L, Hallerman EM, Kashi Y (2000) Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res 10:62–71 Haas BJ, Zody MC (2010) Advancing RNA-Seq analysis. Nat Biotechnol 28:421–423 Hancock JM (1999) Microsatellites and other simple sequences: genomic context and mutational mechanisms. In: Goldstein DB, Schlötterer C (eds) Microsatellites: evolution and applications.
Oxford University Press, New York, pp 1–9 Hulzink RJ, de Groot PF, Croes AF, Quaedvlieg W, Twell D, Wullems GJ, Van Herpen MM (2002) The 5'-untranslated region of the ntp303 gene strongly enhances translation during pollen tube growth, but not during pollen maturation. Plant Physiol 129:342–353 Iorizzo M, Senalik DA, Grzebelus D, Bowman M, Cavagnaro PF, Matvienko M, Ashrafi H, Deynze AV, Simon PW (2011) De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genom 12:389 Iseli C, Jongeneel CV, Bucher P (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol, pp 138–148 Kantety RV, La Rota M, Matthews DE, Sorrells ME (2002) Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat. Plant Mol Biol 48:501–510 Kato N, Dubouzet E, Kokabu Y, Yoshida S, Taniguchi Y, Dubouzet JG, Yazaki K, Sato F (2007) Identification of a WRKY protein as a transcriptional regulator of benzylisoquinoline alkaloid biosynthesis in Coptis japonica. Plant Cell Physiol 48:8–18 Kulkarni RN, Baskaran K, Chandrashekara RS, Kumar S (1999) Inheritance of morphological traits of periwinkle mutants with modified contents and yields of leaf and root alkaloids. Plant Breed 118:71–74

Lawson MJ, Zhang L (2006) Distinct pattern of SSR distribution in the Arabidopsis thaliana and rice genomes. Genome Biol 7:R14 Li YC, Korol AB, Fahima T, Beiles A, Nevo E (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol 11:2453–2465 Li YC, Korol AB, Fahima T, Nevo E (2004) Microsatellites within genes: structure, function, and evolution. Mol Biol Evol 6:991–1007 Li YJ, Fu YR, Huang JG, Wu CA, Zheng CC (2011) Transcript profiling during the early development of the maize brace root via Solexa sequencing. FEBS J 278:156–166 Li D, Deng V, Qin B, Liu X, Men Z (2012) De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.). BMC Genom 13:192 Libault M, Farmer A, Joshi T, Takahashi K, Langley R, Franklin LD, He J, Xu D, May G, Stacey G (2010) An integrated transcriptome atlas of the crop model Glycine max, and its use in comparative analyses in plants. Plant J 63:86–99 Lottaz C, Iseli C, Jongeneel CV, Bucher P (2003) Modeling sequencing errors by combining Hidden Markov models. Bioinformatics 19:103–112 Lue NF, Buchman AR, Kornberg RD (1989) Activation of yeast RNA polymerase II transcription by a thymidine-rich upstream element in vitro. Proc Natl Acad Sci USA 86:486–490 Lulin H, Xiao Y, Pei S, Wen T, Shangqin H (2012) The First Illumina-Based De novo transcriptome sequencing and analysis of safflower flowers. PLoS ONE 7:e38653 Martienssen RA, Colot V (2001) DNA methylation and epigenetic inheritance in plants and filamentous fungi. Science 293:1070–1074 Metzgar M, Bytof J, Wills C (2000) Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res 10:72–80 Minamiyama Y, Tsuro M, Hirai M (2006) An SSR-based linkage map of Capsicum annuum. 
Mol Breed 18:157–169 Mishra RK, Gangadhar BH, Yu JW, Kim DH, Park SW (2011) Development and characterization of EST based SSR markers in Madagascar periwinkle (Catharanthus roseus) and their transferability in other medicinal plants. Plant Omics J 4:154–162 Moerkercke AV, Fabris M, Pollier J, Baart GJE, Rombauts GH, Rischer H, Oksman-Caldentey K-M, Goossens A (2013) CathaCyc, a metabolic pathway database built from Catharanthus roseus RNA-Seq Data. Plant Cell Physiol. doi:10.1093/pcp/pct039 Morgante M, Hanafey M, Powell W (2002) Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat Genet 30:194–200 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628 Murata J, Bienzle D, Brandle JE, Sensen CW, De Luca V (2006) Expressed sequence tags from Madagascar periwinkle (Catharanthus roseus). FEBS Lett 580:4501–4507 Parida SK, Kumar ARK, Dalal V, Singh NK, Mohapatra T (2006) Unigene derived microsatellite markers for the cereal genomes. Theor Appl Genet 112:808–817 Park YH, Alabady MS, Ulloa M, Sickler B, Wilkins TA, Yu J, Stelly DM, Kohel RJ, El-Shihy OM, Cantrell RG (2005) Genetic mapping of new cotton fiber loci using EST derived microsatellites in an interspecific recombinant inbred line cotton population. Mol Genet Genome 274:428–441 Patel RK, Jain M (2012) NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS ONE 7:e30619 Portis E, Nagy I, Sasva Z, Stagelri A, Barchi L, Lanteri S (2007) The design of Capsicum spp. SSR assays via analysis of In silico


DNA sequence, and their potential utility for genetic mapping. Plant Sci 172:640–648 Richards RI, Sutherland GR (1992) Dynamic mutations: a new class of mutations causing human disease. Cell 70:709–712 Roepke J, Salim V, Wu M, Thamm AM, Murata J, Ploss K, Boland W, De Luca V (2010) Vinca drug components accumulate exclusively in leaf exudates of Madagascar periwinkle. Proc Natl Acad Sci USA 107:15287–15292 Sambrook J, Fritsch EF, Maniatis Y (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor Sangwan I, O’Brian MR (2002) Identification of a soybean protein that interacts with GAGA element dinucleotide repeat DNA. Plant Physiol 129:1788–1794 Schlotterer C, Tautz D (1992) Slippage synthesis of simple sequence DNA. Nucleic Acids Res 20:211–215 Sevon N, Oksman-Caldentey K-M (2002) Agrobacterium rhizogenes-mediated transformation: root cultures as a source of alkaloids. Planta Med 68:859–868 Sharma V, Chaudhary S, Srivastava S, Pandey R, Kumar S (2012) Characterization of variation and quantitative trait loci related to terpenoid indole alkaloid yield in a recombinant inbred line mapping population of Catharanthus roseus. J Genet 91:49–69 Sharopova N (2008) Plant simple sequence repeats: distribution, variation, and effects on gene expression. Genome 51:79–90 Shokeen B, Sethy NK, Kumar S, Bhatia S (2007) Isolation and characterization of microsatellite markers for analysis of molecular variation in the medicinal plant Madagascar periwinkle (Catharanthus roseus (L.) G. Don). Plant Sci 172:441–451 Shokeen B, Choudhary S, Sethy NK, Bhatia S (2011) Development of SSR and gene targeted markers for construction of a framework linkage map of Catharanthus roseus. Ann Bot 108:321–336 Shokeen B, Sethy NK, Choudhary S, Bhatia S (2005) Development of STMS markers from the medicinal plant Madagascar periwinkle [Catharanthus roseus (L.) G. Don.].
Mol Ecol Notes 5:818–820 Shukla AK, Shasany AK, Gupta MM, Khanuja SPS (2006) Transcriptome analysis in Catharanthus roseus leaves and roots for comparative terpenoid indole alkaloid profiles. J Exp Bot 57:3921–3932 Somers DJ, Isaac P, Edwards K (2004) A high–density microsatellites consensus map for bread wheat (Tritcum astivum L.). Theor Appl Gent 109:1105–1114 Squirrell J, Hollingsworth PM, Woodhead M, Russell J, Lowe AJ, Gibby M, Powell W (2003) How much effort is required to isolate nuclear microsatellites from plants? Mol Ecol 12:1339–1348 Suttipanta N, Pattanaik S, Kulshrestha M, Patra B, Singh SK, Yuan L (2011) The transcription factor CrWRKY1 positively regulates the terpenoid indole alkaloids biosynthesis in Catharanthus roseus. Plant Physiol 157:2081–2093 Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch, S (2001) Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations and genetic marker potential. Genome Res 11:1441–1452 Thiel T, Michalek W, Varshney RK, Graner A (2003) Exploiting EST databases for the development and characterization of gene-

123

derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet 106:411–422 Toth G, Gaspari Z, Jurka J (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10:967–981 Triwitayakorn K, Chatkulkawin P, Kanjanawattanawong S, Sraphet S, Yoocha T et al (2011) Transcriptome sequencing of Hevea brasiliensis for development of microsatellite markers and construction of a genetic linkage map. DNA Res 18:471–482 van der Heijden R, Jacobs DI, Snoeijer W, Hallard D, Verpoorte V (2004) The Catharanthus alkaloids: pharmacognosy and biotechnology. Curr Med Chem 11:607–628 Varshney RK, Graner A, Sorerells ME (2005a) Genic microsatellite markers in plants: features and applications. Trends Biotechnol 23:48–55 Varshney RK, Sigmund R, Boerner A, Korzun V, Stein N et al (2005b) Interspecific transferability and comparative mapping of barley EST-SSR markers in wheat, rye, and rice. Plant Sci 168:195–202 Verma P, Shah N, Bhatia S (2013) Development of an expressed gene catalogue and molecular markers from the de novo assembly of short sequence reads of the lentil (Lens culinaris Medik.) transcriptome. Plant Biotechnol J 11:894–905 Wang XW, Luan JB, Li JM, Bao YY, Zhang CX, Liu SS (2010) De novo characterization of a whitefly transcriptome and analysis of its gene expression during development. BMC Genomics 11:400 Wang S, Wang X, He Q, Liu X, Xu W, Li L, Gao J, Wang F (2012) Transcriptome analysis of the roots at early and late seedling stages using Illumina paired-end sequencing and development of EST-SSR markers in radish. Plant Cell Rep 31:1437–1447 Whitmer S, van der Heijden R, Verpoorte R (2002) Effect of precursor feeding on alkaloid accumulation by a tryptophan decarboxylase over-expressing transgenic cell line T22 of Catharanthus roseus. J Biotechnol 96:193–203 Wilhelm BT, Marguerat S, Goodhead I, Bahler J (2010) Defining transcribed regions using RNA–seq. 
Nat Protoc 5:255–266 Xie F, Burklew CE, Yang Y, Liu M, Xiao P, Zhang B, Qiu D (2012) De novo sequencing and a comprehensive analysis of purple sweet potato (Ipomoea batatas L.) transcriptome. Planta 236:101–113 Zane L, Bargelloni L, Patarnello T (2002) Strategies for microsatellite isolation: a review. Mol Ecol 11:1–16 Zhang LD, Yuan DJ, Yu SW, Li ZG, Cao YF, Miao ZQ, Qian HM, Tang KX (2004) Preference of simple sequence repeats in coding and non-coding regions of Arabidopsis thaliana. Bioinformatics 20:1081–1086 Zhang L, Zuo K, Zhang F, Cao Y, Wang J, Zhang Y, Sun X, Tang K (2006) Conservation of noncoding microsatellites in plants: implication for gene regulation. BMC Genomics 7:323 Zhang H, Hedhili S, Montiel G, Zhang Y, Chatel G, Pre M, Gantet P, Memelink J (2011) The basic helix–loop–helix transcription factor CrMYC2 controls the jasmonate responsive expression of the ORCA genes that regulate alkaloid biosynthesis in Catharanthus roseus. Plant J 67:61–71 Zhu H, Wang Z, Ma C, Tian J, Fu F et al (2003) Neuroprotective effects of hydroxysafflor yellow A: in vivo and in vitro studies. Planta Med 69:429–433
