Methods 69 (2014) 326–334



StemSearch: RNA search tool based on stem identification and indexing

Nimrod Milo 1, Sivan Yogev 1, Michal Ziv-Ukelson ⇑

Department of Computer Science, Ben-Gurion University of the Negev, Be’er Sheva, Israel

Article info

Article history: Received 23 March 2014; Revised 11 June 2014; Accepted 15 June 2014; Available online 5 July 2014

Keywords: RNA, Search, BLAST, Index

Abstract

The discovery and functional analysis of noncoding RNA (ncRNA) systems in different organisms motivates the development of tools for aiding ncRNA research. Several tools exist that search for occurrences of a given RNA structural profile in genomic sequences. Yet, there is a need for an "RNA BLAST" tool, i.e., a tool that takes a putative functional RNA sequence as input, and efficiently searches for similar sequences in genomic databases, taking into consideration potential secondary structure features of the input query sequence. This work aims at providing such a tool. Our tool, denoted StemSearch, is based on a structural representation of an RNA sequence by its potential stems. Potential stems in genomic sequences are identified in a preprocessing stage, and indexed. A user-provided query sequence is likewise processed, and stems from the target genomes that are similar to the query stems are retrieved from the index. Then, relevant genomic regions are identified and ranked according to their similarity to the query stem-set while enforcing conservation of cross-stem topology. Experiments using RFAM families show significantly improved recall for StemSearch over BLAST, with small loss of precision. We further demonstrate our system’s capability to handle eukaryotic genomes by successfully searching for members of the 7SK family in chromosome 2 of the human genome. StemSearch is freely available on the web at: http://www.cs.bgu.ac.il/negevcb/StemSearch.

© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

⇑ Corresponding author. E-mail address: [email protected] (M. Ziv-Ukelson).
1 These authors contributed equally to this work.

http://dx.doi.org/10.1016/j.ymeth.2014.06.002
1046-2023/© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

1. Introduction and motivation

An exciting area of biological research in recent years has been the discovery and functional analysis of non-coding RNA (ncRNA) systems in different organisms. First described in plants, ncRNAs are providing new insights into gene regulation in all kingdoms, as recognized by the awarding of the Nobel Prize in Physiology or Medicine to Andrew Fire and Craig Mello in 2006. It is now becoming evident that post-transcriptional regulation of gene expression by ncRNAs is an important control level for protein synthesis, which influences many physiological and pathological processes, including localization, replication, translation, degradation, regulation, and stabilization of biological macromolecules [1,2] and involvement in tumor suppression [3].

Some ncRNAs are not independently transcribed but occur as part of the untranslated regions (henceforth: UTRs) of mRNA. For example, riboswitches are ncRNA elements that often occur in the 5′ UTRs and regulate the transcription of the downstream gene by directly binding to metabolites [4]. It is hypothesized that there is in fact an abundance of undiscovered, functional ncRNAs with various catalytic and regulatory functions [5], and the recent discovery of expressed long non-coding RNAs with unknown function in humans [6] supports this hypothesis.

Various computational approaches for detecting noncoding genes are under investigation. Within this context, a common problem is that of finding conserved homologues of a given RNA sequence. Existing search tools such as BLAST [7] and BLASTR [8], designed for sequence-homology search, often fail to report relevant results for ncRNA search, due to the divergence in RNA primary sequences, as in many cases RNA is more conserved in structure than in sequence. Several tools exist for searching large genomic sequences for a given RNA query in the form of a profile or descriptor combining sequence and structure [9–13]. However, in practice, the input query is often still a sequence, and the structure is not known in advance. In such cases, one would need to rely on RNA structure prediction tools to first determine the structure of the query sequence, and only then apply the above structure-based search tools. However, although single-strand RNA folding is a success story in the analysis of RNA [14], predicting a global secondary structure from a single sequence is still error-prone, and the best available approaches can correctly predict only up to 73% of the

base-pairs [15]. Thus, when given a query in the form of an RNA sequence, a "fold first, then search for a global structure" approach cannot be relied upon as an accurate means of homolog detection.

Recently, a new tool named LocARNAScan was published [16], marking an important step towards incorporating structural information into RNA search. The input of LocARNAScan is a probability matrix, generated from either a single RNA sequence or a multiple alignment of RNA sequences. This probability matrix is aligned to the genomic target using a scanning variant of LocARNA, a computationally lightweight variant of the Sankoff algorithm [17]. We identify two inherent limitations in this approach. First, there is a scale limitation caused by the need to compute, for each query, the probability matrix of the genome target, combined with the complexity of applying a (window-bounded) variant of Sankoff’s algorithm to long sequences. In addition, not all structural information can be represented in a probability matrix, in particular pseudoknots and alternative conformations.

In this paper we describe an inverted-index-based sequence-structure RNA search engine which, given an RNA sequence as an input query, efficiently searches for similar occurrences in the genome, based on a combination of sequence similarity and unrestricted structural similarity criteria (representing both pseudoknots and alternative conformations).

When indexing and searching genome-wide amounts of data, efficiency must be a main consideration in system design. To this end, this work is inspired by the approach taken by the BLAST [7] algorithm, which allows efficient search of sequence-only DNA queries in large nucleotide databases. BLAST consists of two phases. In the first phase, the query is separated into short DNA tuples (termed k-tuples), and occurrences of these k-tuples in the reference sequences are retrieved.
In the second phase, regions in the database that contain many occurrences of the query k-tuples are detected and ranked; the top ranked regions are fully aligned to the query, and the best aligned regions are then returned as search results. BLAST's runtime efficiency depends crucially on the pre-processing phase, which inserts all DNA k-tuples from the reference sequences into a designated data structure that allows fast retrieval and iteration of the query k-tuples.

In order to apply the BLAST approach to functional RNA, we need to define a building block of functional RNA, similar to k-tuples in DNA search. There are two major requirements this building block must comply with. First, it must play an important role in all functional RNA molecules. In addition, in order to find regions consisting of many occurrences of the query building blocks, similar functional RNA molecules must share similar building blocks.

Given the above requirements, the building blocks we chose for our search are energetically stable stems. For a stem-containing sub-structure to be active, the stem usually needs to be energetically stable, and this stability should last long enough to be recognized by an interacting molecule or to serve as structural support for another structural RNA unit. Since the number of stems to be evaluated for energetic stability on a genomic scale is huge, selecting the appropriate stability measure should take into consideration the accuracy of the measure as well as the computational effort in computing it. We chose minimum free energy (MFE) as an indicator of stem stability that is relatively accurate for short structures, and in addition is fast to compute [14,18,19]. In contrast to genomic sequences, the length of a user query sequence is expected to be of limited scale, allowing more robust stability calculations to be performed for these sequences.
Therefore, in addition to MFE calculation, query sequences are subjected to partition function [20] computation, and query stems that have an average base pair probability lower than a given threshold are filtered out. Note that currently there is no experimentally based knowledge of how to split the genome into putative functional RNA regions.


Thus, while MFE calculation can be done locally based on energetic parameters, genome scale calculation of partition function requires either calculating a global partition function, or heuristic selection of the regions for which the partition function is calculated. As a result, base pair probabilities are inferior to stems as building blocks of the search process, and cannot be used to filter stems during the pre-processing phase, only during the search phase. It has been shown that in vivo, folding of RNA is a dynamic process that occurs during the transcription of such molecules. In this co-transcriptional folding, local sub-structures including pseudoknots are sometimes favorable, and they were shown to influence the structure either as sub-optimal components of the final structure or as intermediate products [21]. As ncRNA molecules are bounded in length, potential stems with longer distance between the two arms of the stem are less probable to be a part of an actual ncRNA molecule. Combining this observation with the locality preference of dynamic folding motivates us to focus on the identification of stems within a bounded sequence window, i.e., we restrict the allowed sequential distance between the two arms of a stem. Our engine is therefore based on a ‘‘stem-set’’ representation of an RNA sequence, consisting of potential, highly probable stable stems spanning a bounded sequence length. This representation addresses the dynamic nature of folding by allowing capture and representation of several alternative configurations at once, as well as pseudoknots. A stem-set representation of RNA sequences was first presented by Ji et al. [22], where it was applied as a filtration stage to another application, that of finding conserved structures in a set of unaligned RNA sequences. We extend this representation by allowing bulges and internal loops in the stem building blocks. 
Several previous ncRNA search tools use indexing as a means to save information on genomes and reduce the computational effort for each query [9–12,23]. In this paper we present an RNA sequence search engine based on genome-wide indexing of all local, stable stems. This differentiates our engine from search engines that rely on sequence alone (e.g., BLAST, BLASTR, and PLAST-ncRNA [24]), and also from various existing RNA secondary-structure profile-search engines, which take as input a predefined structure or probability matrix and seek its homologs.

The rest of the paper proceeds as follows: In Section 2 we introduce our system, and in Section 3 we describe implementation details. Experimental results are presented in Section 4, and discussion of the work and future research in Section 6. Section 5 contains a detailed description of the methods and algorithms we applied. Supplementary material including figures and detailed examples is attached separately.
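The two BLAST phases recalled in this section can be illustrated with a minimal sketch. It is not BLAST itself: the function names, the toy sequences, and the use of a Python dictionary as the "designated data structure" are assumptions made for illustration, and the ranking step is reduced to counting seed hits that share the same query-to-reference offset.

```python
from collections import defaultdict


def build_ktuple_index(reference, k):
    """Pre-processing: map each k-tuple to all of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_hits(query, index, k):
    """Phase 1: (query_pos, reference_pos) for every shared k-tuple."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, rpos))
    return hits


def hits_per_diagonal(hits):
    """Phase 2 (simplified): count seed hits that share the same offset."""
    counts = defaultdict(int)
    for qpos, rpos in hits:
        counts[rpos - qpos] += 1
    return dict(counts)


reference = "ACGTACGTTAGC"  # toy reference sequence (hypothetical)
query = "TACGTT"            # toy query (hypothetical)
index = build_ktuple_index(reference, k=4)
hits = seed_hits(query, index, k=4)
diagonals = hits_per_diagonal(hits)
```

In this toy case the offset with the most seed hits marks the region that a real tool would go on to align in full. StemSearch follows the same index-then-rank pattern, with stems in place of k-tuples.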

2. System

After deciding to base the search system on local, energetically stable stems, there are three components to design and implement:

• A pre-processing algorithm that efficiently identifies stems below a certain free energy threshold and with a restricted distance between the hybridized sequences that form the two arms of the stem.
• A data structure in which the identified stems are to be stored. This data structure should allow efficient iteration over stems similar to the ones given in the query.
• A method for identifying candidate regions against the query sequence, followed by a scoring and ranking mechanism to retrieve the best matching regions.
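To make the first component concrete, the following is a deliberately simplified sketch of local stem identification. It is not the method of Section 5.1: it assumes perfectly stacked stems only (no bulges or internal loops), uses the number of stacked base pairs in place of a free energy threshold, and accepts canonical and G-U wobble pairs. The function name and all parameters are hypothetical.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def find_stems(seq, min_len=4, max_span=30, min_loop=3):
    """Return maximal perfectly stacked stems as (i, j, length).

    Bases i+t and j-t are paired for t in 0..length-1. The whole stem
    spans at most max_span bases (locality criterion), and at least
    min_loop unpaired bases separate the two arms.
    """
    stems = []
    n = len(seq)
    for i in range(n):
        for j in range(i + min_loop + 1, min(n, i + max_span)):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that merely continue a stem starting further out,
            # provided that outer pair still fits inside the window.
            outer_fits = i > 0 and j + 1 < n and (j - i + 3) <= max_span
            if outer_fits and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            while ((j - length) - (i + length) - 1 >= min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems


hairpin = "GGGGAAAACCCC"          # toy hairpin: 4 stacked pairs, 4-base loop
stems = find_stems(hairpin)
no_stems = find_stems(hairpin, max_span=8)
```

Shrinking `max_span` shows the effect of the locality restriction: with a window of 8 bases the toy hairpin no longer fits, so no stem is reported.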


Our tool, StemSearch, is based on a two-stage approach (see Fig. 1). The initial, pre-processing phase, utilizes the first two components above, while the second, search phase, utilizes all three components. In what follows, we give a high level overview of our tool, annotating the various stages with the corresponding edge numbers of Fig. 1. The pre-processing phase takes as input a set of genomic (target) sequences which consist of the target genomes (edge 1 in Fig. 1), and applies our tools to identify all potential stems defined by a thermodynamic free energy threshold and a locality criterion (i.e., a restriction on the genomic distance between the two arms of the stems). All stems that comply with the user-defined thresholds undergo a feature-extraction stage (edge 2), based on which they are stored in an inverted index in preparation for future queries (edge 3). The search phase takes as input an RNA query sequence for which we wish to find homologues in the target genomic sequences. Our method processes the query sequence by applying the same stem identification and stem feature-extraction procedures that were described for genomic target sequences in the pre-processing stage (edges 4,5). A component named ‘‘Stem Provider’’ fetches the identified stems and their extracted features (edge 6). As done in [22], in order to reduce the number of query stems we use a partition function [20] computation on the query sequence (edge 7, using RNAfold [25]) to filter out query stems that have an average base pair probability lower than a given threshold (edge 8). For each of the remaining query stems, the Stem Provider iterates through stems in the inverted index (edge 9) and creates a corresponding set of candidate occurrences of similar stems from the genomic sequences (edge 10). These occurrences contain for each candidate stem a similarity score to the corresponding query stem. 
The last component, ‘‘Region Scorer’’, uses the occurrence lists to find the genomic regions that are most similar to the query (edge 11) in terms of stem content, with restrictions on the topology of query stems and matching genomic stems. One of the main challenges that differentiates our RNA search from BLAST is in the computational cost of the pre-processing phase, as finding stems that comply with our stability criteria requires a much heavier computational effort than splitting the sequence into k-tuples. The most costly computational part in finding such stems is the calculation of hybridization free energy, which needs to be done for many pairs of short sequences. In Section 5.1 we describe a method that reduces the number of such calculations, thus making the entire task feasible for large genomes. Another major difference between sequence-based searches versus our approach is in the characterization of building blocks. While a k-tuple (as in BLAST seeds) is characterized by a single feature (i.e., the tuple sequence), stems are much more complex [26].

Fig. 1. Flow chart of StemSearch. Pre-processing phase components are shown in the dashed box, search phase components are shown in the dotted box. A detailed description is given in the text.

Therefore, the phase in the search in which relevant candidate regions are identified requires more than just iterating over the query building blocks; it must incorporate stem similarity criteria. However, as this is the most time-consuming part of the runtime search, its implementation needs to be very efficient. Our implementation of this task is based on flexible extraction of stem features and the use of mongoDB [27] for indexing stem features and later iterating over them in an efficient manner. The details of the indexing and iteration processes will be discussed in Section 5.2.

The Stem Provider is designed to efficiently find potential stems in the genomic sequences that match the query sequence stems (edges 6, 8, and 9 in Fig. 1). To demonstrate how it works, consider the stem $q_1$ in the example of Fig. 2. In this example, three features were extracted from $q_1$: one feature reflects the number of stacked base-pairs, the second feature corresponds to an MFE range, and the third feature defines a size-range of the inner-sequence (i.e., the length of the sequence connecting the two arms of the stem). The mongoDB index provides for each of the three features a list of all "hits", i.e., all stems in the genome that carry the same feature. In the example of Fig. 2, three stems were returned: stem $r_1$, which shares the first feature with stem $q_1$; stem $r_2$, which shares the first and second features with stem $q_1$; and stem $r_3$, which shares the second feature with stem $q_1$. None of the stems in the genome shares the inner-sequence length range feature with the query stem $q_1$. For each stem returned by the index, the stem’s Feature Agreement with the query is defined to be the number of features shared by this target stem and the query stem. For example, the Feature Agreement of stem $q_1$ and stem $r_1$ is 1, while the Feature Agreement of stem $q_1$ and stem $r_2$ is 2.
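The Feature Agreement computation of this example can be sketched with an in-memory inverted index. This is an illustration only: a Python dictionary stands in for the mongoDB index, and the feature names and bin labels are hypothetical values chosen to reproduce the sharing pattern of Fig. 2.

```python
from collections import defaultdict


def build_feature_index(stem_features):
    """Inverted index: feature -> set of stem ids carrying that feature."""
    index = defaultdict(set)
    for stem_id, features in stem_features.items():
        for feature in features:
            index[feature].add(stem_id)
    return index


def feature_agreement(query_features, index):
    """Count, for every indexed stem, how many query features it shares."""
    agreement = defaultdict(int)
    for feature in query_features:
        for stem_id in index.get(feature, ()):
            agreement[stem_id] += 1
    return dict(agreement)


# Hypothetical feature values mirroring the example of Fig. 2.
genome_stems = {
    "r1": {("stacked_pairs", 5), ("mfe_bin", "-6..-4"), ("inner_len_bin", "40..80")},
    "r2": {("stacked_pairs", 5), ("mfe_bin", "-10..-8"), ("inner_len_bin", "40..80")},
    "r3": {("stacked_pairs", 7), ("mfe_bin", "-10..-8"), ("inner_len_bin", "20..40")},
}
q1 = {("stacked_pairs", 5), ("mfe_bin", "-10..-8"), ("inner_len_bin", "10..20")}

index = build_feature_index(genome_stems)
agreement = feature_agreement(q1, index)
# Keep only stems that share at least two features with the query stem.
survivors = sorted(s for s, count in agreement.items() if count >= 2)
```

Only stems reachable through at least one shared feature are ever touched, which is what makes the iteration cheap compared with scoring every genomic stem against the query stem.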
In this example, the Stem Provider requires a minimal Feature Agreement of $\geq 2$ between the query stem and a surviving target stem; thus only stem $r_2$ (highlighted in the figure by a dotted line) survives the feature filtration and is passed as a relevant candidate stem-hit to the next stage. The Stem Provider’s output is therefore the set of stems extracted from the query sequence and, for each such stem, a list of the stems in the genome whose Feature Agreement with the query stem meets a pre-defined threshold. Each pair of query stem and genomic stem is assigned a similarity score, calculated by aligning the two stems via a stem-alignment algorithm that takes into account both the structure and the sequence of the stems. A detailed description of the Stem Provider is given in Section 5.3.

Fig. 2. An example of a stem features iteration. Detailed description is given in the text.

Fig. 3 depicts the arc-annotated graph representation of the query sequence by its potential stems, where each arc corresponds to a stem. Note that a stem can be mathematically formulated as a 2-interval [28], where a 2-interval is the disjoint union of two intervals on the line. Two 2-intervals (or stems) can have exactly one of the following topological relations (exemplified in Fig. 3):

nested ($\sqsubset$): Both intervals of one stem are contained between the two intervals of the other. For example, $q_1$ and $q_2$.
crossing ($\between$): Exactly one interval of one stem is contained between the two intervals of the other, e.g., $q_1$ and $q_3$.
adjacent ( […]

Fig. 3. Topology preserving mapping. The query comprises 4 stems $q_1$, $q_2$, $q_3$, and $q_4$ (black filled arcs), where $q_2$ is nested in $q_1$, $q_3$ crosses $q_1$, and $q_4$ overlaps with $q_3$. Bottom: a genomic region with stems (gray filled arcs) that are similar to the query stems, where $r_1$, $r_3$ map to $q_1$; $r_2$, $r_4$ map to $q_2$; and $r_5$ maps to $q_3$.

Given a query $Q$ and a region $R$, let $q$ denote the stem set identified in $Q$, and let $r$ denote the stem set identified in $R$. The Optimal Mapping Score is defined as:

$\mathrm{OptMapScore}(Q, R) =$ […]

[…] $w(u)$, where $w(v)$ denotes the weight of node $v \in V$; $(u, v) \notin E$ and $N(u) \subseteq N(v)$, where $N(u) = \{v \mid (u, v) \in E\}$. Node $u$ can be removed from $G$ when computing the maximum weighted clique of $G$.

Proof. We prove by contradiction. Assume that the maximum weighted clique in $G$ is $C$ and that it contains node $u$. Since $(u, v) \notin E$, it follows that $v \notin C$. We build a clique $C'$ by replacing $u$ with $v$ in $C$. As $N(u) \subseteq N(v)$, it follows that for every node $s \in C' \setminus \{v\}$, $(s, v) \in E$; thus $C'$ is a clique. The weight of clique $C'$ is $W(C') = W(C) - w(u) + w(v)$, and since $w(v) > w(u)$, clique $C'$ is heavier than $C$, contradicting the assumption that $C$ is the maximum weighted clique. □

By Observation 5.1 we can reduce the number of nodes in the dual graph $G = (V, E)$ by removing all nodes that satisfy its conditions. The identification of all redundant nodes can be achieved in $O(|V|^2)$.

A problem-specific dual graph sparsification heuristic: We define the distortion of a stem topology as follows.

Definition 4 (Topological Distortion Ratio). Given two pairs of stems $(q_1, q_2)$, $(r_1, r_2)$ such that $q_1$ is mapped to $r_1$, $q_2$ is mapped to $r_2$, and the two pairs of stems follow the same topological order within this mapping, the Topological Distortion Ratio of $(q_1, q_2)$ versus $(r_1, r_2)$ is $\max(d_q, d_r)/\min(d_q, d_r)$, where $d_q$ ($d_r$) is the distance between $q_1$ and $q_2$ ($r_1$ and $r_2$, respectively).

The distortion distance measurement depends on the type of topology formed by the stems. This is demonstrated in Fig. 7. We use the Topological Distortion value to remove edges connecting vertices in the Dual Graph whose corresponding distortion value is greater than the given threshold. Thus, we reduce the number of edges in the dual graph.

Finally, we describe how we further speed up the Region Scorer engine via two (admissible) heuristics exploiting sliding window redundancies.

Sliding window heuristic 1: The task of the Region Scorer is to compute maximum weight cliques for all genomic regions of a given width, and to report the regions containing the cliques with the top-k scores.
To perform this, we use a window of a pre-specified width, which slides over the genomic sequence, constantly adding and removing matches between query and target stems. During this process, the regions with top-k scores already encountered are maintained. Once we have k regions, a new region will be inserted to the list only if it improves it. Thus, when computing the optimal score for a new region, the initial value of W is set to the minimal score among the current top-k scores.
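The two dual-graph reductions described above, node removal by Observation 5.1 and edge removal by the Topological Distortion Ratio, can be sketched as follows. The graph encoding (a weight dictionary plus adjacency sets), the function names, and the toy graph are assumptions made for illustration; the distances $d_q$ and $d_r$ are taken as given inputs, since their measurement depends on the topology type.

```python
def dominated_nodes(weights, adj):
    """Nodes u removable by Observation 5.1.

    u is removable if some non-adjacent node v is strictly heavier and
    its neighbourhood contains the whole neighbourhood of u.
    """
    removable = set()
    for u in weights:
        for v in weights:
            if (v != u and v not in adj[u]
                    and weights[v] > weights[u]
                    and adj[u] <= adj[v]):
                removable.add(u)
                break
    return removable


def distortion_ratio(d_q, d_r):
    """Topological Distortion Ratio max(d_q, d_r) / min(d_q, d_r).

    Assumes both distances are positive.
    """
    return max(d_q, d_r) / min(d_q, d_r)


# Toy dual graph (hypothetical): edges a-c, b-c, b-d.
weights = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 2.0}
adj = {"a": {"c"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"b"}}

removable = dominated_nodes(weights, adj)
ratio = distortion_ratio(10, 25)
```

Here node a is dominated by node b: the two are not adjacent, b is heavier, and every neighbour of a is also a neighbour of b, so any clique through a can be improved by swapping in b. An edge whose `ratio` exceeds the chosen threshold would likewise be dropped before the clique search.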

Sliding window heuristic 2: Additional exploitation of sliding window redundancy is obtained as follows. When computing the maximum weighted clique in the current window, we only consider cliques that contain at least one of the stems that were newly added to the current region.

5.5. p-Value calculation

For an effective database search, we need p-values that bound the probability that a hit was obtained by chance. We extend the approach taken by FastR [40], and express the p-value by using the one-sided nonparametric Chebyshev inequality. The bound provided by this inequality is conservative and overestimates the probability of obtaining a similar score by chance. We use Beasley et al.’s [41] modification of the inequality to handle the cases where the population mean and variance are not known but are instead replaced by their sample estimates. During the pre-processing indexing phase, each sequence in the database is shuffled in a manner that preserves its GC-content and its di-nucleotide content. The shuffled sequences are added into an auxiliary index through the same process described in Fig. 1. Then, during search time, each query is run against the auxiliary index, but instead of collecting the best scores, the statistics required for calculation of Chebyshev’s inequality are collected, and used for calculating the p-value.

6. Conclusions

In this paper we presented a novel approach for searching for RNA homologues in a genomic sequence, using an inverted-index based sequence-structure RNA search engine. The proposed system is based on identifying potential stems in genomic sequences and storing them in an inverted index. Potential stems in the query sequences are identified similarly, and then used to scan the inverted index and locate regions in the genome most similar to the query in their stem content. In our approach, the query is given as a sequence and does not rely on knowing the structure in advance.
Offline experiments comparing StemSearch to BLAST show that adding structural considerations to RNA sequence search results in significantly improved recall, with a small decrease in precision. A website for issuing online searches is available at http://www.cs.bgu.ac.il/negevcb/StemSearch. The contribution of this work begins with the ability to efficiently identify local energetically stable stems in genomic sequences. Given this capability, we implemented a novel inverted-index-based system for storing stems and later iterating over them by different stem features. We defined a similarity score between stems, and defined a dual graph representation that reduces the optimization problem of finding the highest scoring pair of topologically compatible sets of stems between two arc graphs to a weighted max-clique problem. Based on these definitions, we adapted a previously proposed heuristic branch and bound algorithm for weighted max clique and harnessed it to find

regions in genomic sequences that best match the query. The maximum weighted clique solver is applied on graphs obtained through a sliding window approach, and we presented improvements that enhance its efficiency under these conditions.

Acknowledgments

The work of Nimrod Milo and Michal Ziv-Ukelson was partially supported by ISF grant 478/10.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ymeth.2014.06.002.

References

[1] M. Mandal, R. Breaker, Cell 6 (2004) 451–463.
[2] C. Gong, L.E. Maquat, Nature 470 (7333) (2011) 284–288.
[3] J.R. Prensner, A.M. Chinnaiyan, Cancer Discov. 1 (5) (2011) 391–407.
[4] A. Nahvi, N. Sudarsan, M. Ebert, X. Zou, K. Brown, R. Breaker, Chem. Biol. 9 (9) (2002) 1043–1049.
[5] S. Eddy, Non-coding RNA genes and the modern RNA world, Nat. Rev. Genet. 2 (12) (2001) 919–929.
[6] B. Bánfai, H. Jia, J. Khatun, E. Wood, B. Risk, W. Gundling, A. Kundaje, H. Gunawardena, Y. Yu, L. Xie, et al., Genome Res. 22 (9) (2012) 1646–1657.
[7] S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman, J. Mol. Biol. 215 (3) (1990) 403–410.
[8] G. Bussotti, E. Raineri, I. Erb, M. Zytnicki, A. Wilm, E. Beaudoing, P. Bucher, C. Notredame, Nucleic Acids Res. 39 (16) (2011) 6886–6895.
[9] Z. Yao, Z. Weinberg, W. Ruzzo, Bioinformatics 22 (4) (2005) 445.
[10] S. Zhang, I. Borovok, Y. Aharonowitz, R. Sharan, V. Bafna, Bioinformatics 22 (14) (2006) e557.
[11] F. Meyer, S. Kurtz, M. Beckstette, BMC Bioinform. 14 (1) (2013) 226.
[12] R. Klein, S. Eddy, BMC Bioinform. 4 (1) (2003) 44.
[13] T. Macke, D. Ecker, R. Gutell, D. Gautheret, D. Case, R. Sampath, Nucleic Acids Res. 29 (22) (2001) 4724.
[14] M. Zuker, P. Stiegler, Nucleic Acids Res. 9 (1) (1981) 133–148.
[15] C. Do, D. Woods, S. Batzoglou, Bioinformatics 22 (14) (2006) e90–8.
[16] S. Will, M.F. Siebauer, S. Heyne, J. Engelhardt, P.F. Stadler, K. Reiche, R. Backofen, Algorithms Mol. Biol. 8 (1) (2013) 14.
[17] D. Sankoff, SIAM J. Appl. Math. 45 (5) (1985) 810–825.
[18] S. Freier, R. Kierzek, J. Jaeger, N. Sugimoto, M. Caruthers, T. Neilson, D. Turner, Proc. Natl. Acad. Sci. 83 (24) (1986) 9373.
[19] D. Mathews, M. Burkard, S. Freier, J. Wyatt, D. Turner, RNA 5 (1999) 1458.
[20] J.S. McCaskill, Biopolymers 29 (6–7) (1990) 1105–1119.
[21] H. Isambert, Methods 49 (2) (2009) 189–196.
[22] Y. Ji, X. Xu, G. Stormo, Bioinformatics 20 (2004) 1591–1602.
[23] F. Meyer, S. Kurtz, R. Backofen, S. Will, M. Beckstette, BMC Bioinform. 12 (1) (2011) 214.
[24] S. Chikkagoudar, D.R. Livesay, U. Roshan, Nucleic Acids Res. 38 (Suppl. 2) (2010) W59–W63.
[25] I. Hofacker, Nucleic Acids Res. 31 (13) (2003) 3429.
[26] S. Eddy, R. Durbin, Nucleic Acids Res. 22 (11) (1994) 2079.
[27] mongoDB, 2009. URL: http://www.mongodb.org.
[28] S. Vialette, Theor. Comput. Sci. 312 (2) (2004) 223–249.
[29] D. Goldman, S. Istrail, C.H. Papadimitriou, in: 40th Annual Symposium on Foundations of Computer Science, 1999, IEEE, 1999, pp. 512–521.
[30] D. Karolchik, R. Baertsch, M. Diekhans, T.S. Furey, A. Hinrichs, Y. Lu, K.M. Roskin, M. Schwartz, C.W. Sugnet, D.J. Thomas, et al., Nucleic Acids Res. 31 (1) (2003) 51–54.
[31] K. Darty, A. Denise, Y. Ponty, Bioinformatics 25 (15) (2009) 1974.
[32] J. Davis, M. Goadrich, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233–240.
[33] S. Will, K. Reiche, I. Hofacker, P. Stadler, R. Backofen, PLOS Comput. Biol. 3 (4) (2007) e65.
[34] I. Meyer, I. Miklós, BMC Mol. Biol. 5 (1) (2004) 10.
[35] J. Kruger, M. Rehmsmeier, Nucleic Acids Res. 34 (Web Server issue) (2006) W451.
[36] S. Griffiths-Jones, Nucleic Acids Res. 32 (2003) D109–D111.
[37] V. Guignon, C. Chauve, S. Hamel, in: String Processing and Information Retrieval, Springer, 2005, pp. 335–347.
[38] P. Östergård, Nordic J. Comput. 8 (4) (2001) 424–436.
[39] D. Kumlander, in: Proceedings of the 10th WSEAS International Conference on COMPUTERS, 2006, pp. 938–943.
[40] S. Zhang, B. Haas, E. Eskin, V. Bafna, IEEE/ACM Trans. Comput. Biol. Bioinform. (2005) 366–379.
[41] T. Beasley, G. Page, J. Brand, G. Gadbury, J. Mountz, D. Allison, J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 53 (1) (2004) 95–108.
