Accepted Manuscript

Title: A semi-supervised learning approach for RNA secondary structure prediction
Authors: Haruka Yonemoto, Kiyoshi Asai, Michiaki Hamada
PII: S1476-9271(15)00019-5
DOI: http://dx.doi.org/doi:10.1016/j.compbiolchem.2015.02.002
Reference: CBAC 6394
To appear in: Computational Biology and Chemistry
Received date: 2-1-2015
Accepted date: 3-2-2015

Please cite this article as: Haruka Yonemoto, Kiyoshi Asai, Michiaki Hamada, A semi-supervised learning approach for RNA secondary structure prediction, Computational Biology and Chemistry (2015), http://dx.doi.org/10.1016/j.compbiolchem.2015.02.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A semi-supervised learning approach for RNA secondary structure prediction

Haruka Yonemoto^a, Kiyoshi Asai^{a,b}, Michiaki Hamada^{c,b,*}

a Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa 277–8562, Japan.
b Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2–4–7 Aomi, Koto-ku, Tokyo 135–0064, Japan.
c Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N–06–10, 3–4–1 Okubo, Shinjuku-ku, Tokyo 169–8555, Japan.

Abstract

RNA secondary structure prediction is a key technology in RNA bioinformatics. Most algorithms for RNA secondary structure prediction use probabilistic models, in which the model parameters are trained with reliable RNA secondary structures. Because of the difficulty of determining RNA secondary structures by experimental procedures, such as NMR or X-ray crystal structure analyses, there are still many RNA sequences whose secondary structures have not been experimentally determined but that could be useful for training.

In this paper, we introduce a novel semi-supervised learning approach for training the parameters of a probabilistic model of RNA secondary structures, in which we employ not only RNA sequences with annotated secondary structures but also ones with unknown secondary structures. Our model is based on a hybrid of generative (stochastic context-free grammars) and discriminative (conditional random fields) models that has been successfully applied to natural language processing. Computational experiments indicate that the accuracy of secondary structure prediction is improved by incorporating RNA sequences with unknown secondary structures into training. To our knowledge, this is the first study of a semi-supervised learning approach for RNA secondary structure prediction. This technique will be useful when the number of reliable structures is limited.

* Corresponding author. Email address: [email protected] (Michiaki Hamada)

Preprint submitted to Computational Biology and Chemistry. October 28, 2014

Page 1 of 24

1. Introduction

RNA secondary structure prediction is one of the most fundamental tasks in RNA informatics. It is frequently used by biologists and is incorporated into many pipelines for analyzing functional non-coding RNAs (e.g., miRNA gene finding) [1]. A number of algorithms and tools for RNA secondary structure prediction have, therefore, been proposed [2, 3, 4, 5, 6, 7, 8].

Conceptually, the development of algorithms for RNA secondary structure prediction can be divided into two parts. The first is the development of a probabilistic model for the distribution of RNA secondary structures: p(y|x; w), where x is an RNA sequence, y is a secondary structure of x, and w indicates the parameter(s) of the model. These probabilistic models include the McCaskill model [9], stochastic context-free grammars (SCFGs) [10], conditional random fields (CRFs) [11], hierarchical Dirichlet processes for SCFGs (HDP-SCFG) [12], and the Boltzmann likelihood (BL) model [13, 14]. The second part is the development of a decoding method by which RNA secondary structures can be predicted from the probabilistic model p(y|x; w) developed in the first part. Maximum expected accuracy (MEA) estimators [4], including the γ-centroid estimators [5], provide RNA secondary structures that are theoretically optimal with respect to the accuracy of the base pairs in the predicted structure [5]. Decoding methods have been extensively studied from the viewpoint of maximizing expected accuracy [15, 16].

In this study, we focus on the first part of the algorithm development process; that is, we aim to develop a novel probabilistic model of RNA secondary structures and a learning algorithm for that model. Most learning algorithms use reliably known RNA secondary structures to estimate model parameters.

However, there are substantial numbers of RNA sequences whose secondary structure is unknown because their experimental determination by X-ray structure analysis or NMR is difficult. These RNA sequences may be useful for training the parameters of the probabilistic models. Motivated by this background, we have devised a semi-supervised learning algorithm by extending a model that has been successfully applied in natural language processing (NLP) [17]. The model is a hybrid of generative (SCFG) and discriminative (CRF) models. Extensive computational experiments (described in Section 5) indicate that including data from RNA sequences with unknown secondary structures in the training set improves the usefulness of the probabilistic model for RNA secondary structure prediction.

Table 1 gives a comparison of various tools for RNA secondary structure prediction that are based on (i) energy parameters, (ii) discriminative models, or (iii) generative models. In general, discriminative models tend to perform better than generative models because discriminative models directly provide the posterior probabilities of the secondary structures of an RNA sequence [4]. (In other words, we do not need a joint distribution of secondary structures and RNA sequences to predict the secondary structure of a given RNA sequence.) Generative models for RNA secondary structures, such as SCFGs, enable us to train model parameters without using information about the secondary structures of RNA sequences (i.e., unsupervised learning) by employing an expectation-maximization (EM) type of algorithm [18]. However, it is well known that such purely unsupervised learning does not work in practice [12], and all previous studies of parameter learning in SCFGs employed information about secondary structures (i.e., supervised learning) [10, 12]. In summary, Table 1 shows that no existing tool (other than the method proposed in this study) can employ RNA sequences with both known and unknown secondary structures in training probabilistic models, which is a novel point of this study.

The rest of this paper is organized as follows. In Section 2, we briefly describe our approach for learning model parameters for RNA secondary structure prediction. In Section 3, the methods are explained in detail and the

Table 1: Comparison of tools for RNA secondary structure prediction

Software         Energy  w/ sec  w/o sec  Reference
Mfold (UNAfold)    Y       N       N      [2]
ViennaRNA          Y       N       N      [19]
RNAstructure       Y       N       N      [7]
Simfold            Y       Y       N      [14]
CONTRAfold         N       Y       N      [4]
CONTEXTfold        N       Y       N      [20]
Pfold              N       Y       N*     [21]
CONUS              N       Y       N*     [10]
npbfold            N       Y       N*     [12]
Our Method         N       Y       Y      This work

The column "Energy" shows whether the tool employs Turner's energy parameters (Y) [22] or not (N); when the energy parameters are used, training procedures are not needed. The column "w/ sec" indicates whether the probabilistic model can handle training data with secondary structures (Y) or not (N), and the column "w/o sec" indicates whether the model can handle training data without secondary structures. Because Pfold, CONUS, and npbfold (denoted by *) employ generative models based on SCFGs, they can, in theory, use unannotated RNA sequences in training. However, this training is known to fail in practice.

implementation is described in Section 4. In Section 5, we report the results of computational experiments, which confirm that our model outperforms both the discriminative and the generative model used alone.


2. Approach

In this study, we extend a semi-supervised learning approach based on hidden Markov models, proposed in [17] in the context of NLP, to SCFGs that model RNA secondary structures [23]. The model is a hybrid of generative and discriminative models. A generative model defines the joint distribution


p(x, y) of an RNA sequence x and its secondary structure y, and a discriminative

Table 2: Grammars for RNA secondary structures

Grammar          Parameters  Ambiguity  Description
G6                      21      no      Pfold
G6s                    261      no      Pfold + stacking
basic grammar        1,022      no      Pfold + various loops
(CONTRAfoldG)        5,448      yes     CONTRAfold
(ViennaRNAG)        90,947      no      ViennaRNA

The second column "Parameters" shows the number of parameters remaining after sharing parameters in the TORNADO program [25]. The third column "Ambiguity" indicates whether the grammar is ambiguous; a grammar is called unambiguous if each possible secondary structure corresponds to exactly one parse tree derived from the grammar. In this study, we tried G6, G6s, and the basic grammar, but we did not use CONTRAfoldG (the grammar employed in the CONTRAfold software [4]) or ViennaRNAG (the grammar employed in the Vienna RNA package [26]). See [25] and [10] for the detailed definition of each grammar.

model defines the conditional distribution p(y|x). In this study, we utilized an SCFG model [24, 10, 25] as the generative model and the corresponding CRF model as the discriminative model [4]. Both models require a grammar for RNA secondary structures, and several such grammars exist, ranging from the simple to the complex (see Table 2). In this study, we tried three grammars: G6, G6s, and the basic grammar. The discriminative model is constructed from the same grammar that defines the generative (SCFG) model (see the second paragraph in Section 3.1 for the details).

3. Methods

3.1. Models

Inspired by a successful NLP study [17], we introduce the following probabilistic model of RNA secondary structures:

$$p^{H}(y\mid x;w,\theta,\lambda_{D},\lambda_{G})=\frac{p^{D}(y\mid x;w)^{\lambda_{D}}\,p^{G}(x,y;\theta)^{\lambda_{G}}}{\sum_{y'\in S(x)}p^{D}(y'\mid x;w)^{\lambda_{D}}\,p^{G}(x,y';\theta)^{\lambda_{G}}}.\qquad(1)$$

Here, S(x) is the set of possible secondary structures of an RNA sequence x, and the two probability distributions pD and pG, provided by a discriminative model (with parameters w) and a generative model (with parameters θ), are combined by using weights λD, λG ∈ [0, 1].
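In practice, S(x) is exponentially large and the normalization in Eq. (1) is computed by inside–outside recursions; for intuition only, the combination can be sketched over a toy, explicitly enumerated S(x). The structures and probability values below are made up for illustration:

```python
import math

def hybrid_posterior(pD, pG, lam_D, lam_G):
    """Combine discriminative (pD) and generative (pG) scores as in Eq. (1).

    pD: dict mapping each candidate structure y to p^D(y|x)
    pG: dict mapping each candidate structure y to p^G(x, y)
    Returns p^H(y|x) for every y in the enumerated S(x).
    """
    # Work in log space for numerical stability.
    logs = {y: lam_D * math.log(pD[y]) + lam_G * math.log(pG[y]) for y in pD}
    m = max(logs.values())
    Z = sum(math.exp(s - m) for s in logs.values())
    return {y: math.exp(s - m) / Z for y, s in logs.items()}

# Toy S(x) with three candidate structures in dot-bracket notation.
pD = {"((..))": 0.6, "(....)": 0.3, "......": 0.1}
pG = {"((..))": 0.02, "(....)": 0.01, "......": 0.001}
post = hybrid_posterior(pD, pG, lam_D=1.0, lam_G=0.5)
assert abs(sum(post.values()) - 1.0) < 1e-9
```

Note that with λG = 0 the hybrid posterior reduces to the discriminative distribution, and with λD = 0 to the (conditionalized) generative one.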

In the above, we exclude arbitrary combinations of generative and discriminative models for algorithmic reasons, and pD and pG are introduced in the following way. For a given grammar for RNA secondary structures (cf. Table 2), a generative model can be defined by an SCFG [10, 25]. Moreover, a discriminative model can be derived from the SCFG model (with the same grammar) as follows:

$$p^{D}(y\mid x)=\frac{p^{G}(x,y)}{p(x)}=\frac{p^{G}(x,y)}{\sum_{y'\in S(x)}p^{G}(x,y')}=\frac{\prod_{i}\theta_{i}^{f_{i}(x,y)}}{\sum_{y'\in S(x)}\prod_{i}\theta_{i}^{f_{i}(x,y')}}=\frac{\exp\left(\sum_{i}f_{i}(x,y)\log\theta_{i}\right)}{\sum_{y'\in S(x)}\exp\left(\sum_{i}f_{i}(x,y')\log\theta_{i}\right)}=\frac{\exp\left(\sum_{i}f_{i}(x,y)w_{i}\right)}{\sum_{y'\in S(x)}\exp\left(\sum_{i}f_{i}(x,y')w_{i}\right)}=\frac{\exp\left(w^{T}F(x,y)\right)}{\sum_{y'\in S(x)}\exp\left(w^{T}F(x,y')\right)},$$

where we set w = {wi}_i with wi = log(θi); F(x, y) = (f1(x, y), f2(x, y), ..., f#(x, y)) denotes the vector of the numbers of appearances of the parameters (transition and emission probabilities) in the SCFG model, where fi(x, y) denotes the count of parameter θi for an RNA secondary structure y. Notice that each element of w is real-valued in the discriminative model (given by the above), and the parameter w in the discriminative model is not related to the parameter θ in the generative model.
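The identity wi = log(θi) can be checked numerically on a toy model with an explicitly enumerated S(x). The grammar parameters and feature counts below are hypothetical, used only to verify that the CRF form recovers the SCFG conditional:

```python
import math

def crf_conditional(w, features):
    """CRF-style conditional p(y|x) ∝ exp(w·F(x, y)) over an enumerated S(x).

    features: dict mapping structure y -> feature count vector F(x, y)
    """
    scores = {y: sum(wi * fi for wi, fi in zip(w, F)) for y, F in features.items()}
    m = max(scores.values())
    Z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(s - m) / Z for y, s in scores.items()}

# Hypothetical toy grammar with two parameters and two candidate structures.
theta = [0.7, 0.3]                       # SCFG rule probabilities
features = {"y1": [2, 1], "y2": [0, 3]}  # usage counts f_i(x, y)

# With w_i = log(theta_i), the CRF reproduces the SCFG conditional p^G(x,y)/p(x).
w = [math.log(t) for t in theta]
crf = crf_conditional(w, features)
joint = {y: math.prod(t ** f for t, f in zip(theta, F)) for y, F in features.items()}
Z = sum(joint.values())
assert all(abs(crf[y] - joint[y] / Z) < 1e-12 for y in joint)
```

Replacing w with arbitrary real values keeps the same functional form but decouples the discriminative model from the probability simplex of the SCFG.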

The following implication of the hybrid model is useful. It can be easily seen that

$$p^{H}(y\mid x;w,\theta,\lambda_{D},\lambda_{G})\propto\exp\left(\lambda_{D}w^{T}F(x,y)+\lambda_{G}\log p^{G}(x,y;\theta)\right)\qquad(2)$$

holds; this indicates that the hybrid model can be considered to be a discriminative model in which a new feature, pG(x, y; θ) (the joint probability of x and y under the generative model), is added to the original discriminative model. Roughly speaking, hybrid models improve on the base models by incorporating the joint probability pG(x, y; θ) of a generative model into a discriminative model.

3.2. Overview of training procedure

In training the parameters (w, θ, λD, λG) of a hybrid model (Eq. 1), unlabeled data Du := {x^m}_{m=1}^{M} (i.e., RNA sequences without secondary structures) and labeled data Dl := {Dl′ = {x^k, y^k}_{k=1}^{K}, Dl′′ = {x^n, y^n}_{n=1}^{N}} (i.e., RNA sequences with secondary structures) are required. (The complete set of training data is denoted by D := {Du, Dl}.) Then, training of the parameters in a hybrid model is

done in the following manner.

1. Train w in the discriminative model by using RNAs with secondary structures (data: Dl′).
2. Initialize θ^(0), λD^(0), λG^(0) in the generative model.
3. Repeat (a)–(c) until |θ^(t+1) − θ^(t)| / |θ^(t)| < ε.
   (a) Train θ^(t+1) by using RNA sequences without secondary structures (data: Du).
   (b) Find λD^(t+1) by using RNA sequences with secondary structures (data: Dl′′).
   (c) Find λG^(t+1) by using RNA sequences with secondary structures (data: Dl′′).
4. Output the trained parameters w, θ^(t+1), λD^(t+1), λG^(t+1).

ip t

110

(t+1)

, λG

1 and, intuitively, we obtain the parameters that improve this discriminative model by incorporating information from the generative model (cf. Eq. (2)).

cr

In the following sections, we give more details about the training of each parameter in a hybrid model. 3.3. Training parameters in the hybrid model

In the hybrid model, the parameters w, θ, λD, and λG need to be trained.

3.3.1. Learning parameter w in the discriminative model (step 1)

To train w given labeled training data Dl′ = {x^k, y^k}_{k=1}^{K}, where x^k is an RNA sequence and y^k is its secondary structure, the following objective function is maximized:

$$L(w)=\sum_{k=1}^{K}\log p^{D}(y^{k}\mid x^{k};w)-\frac{C}{2}\|w\|_{2}^{2}.\qquad(3)$$

Here, the first and second terms are the likelihood and a regularization term, respectively. The gradient of L is

$$\frac{\partial L(w)}{\partial w}=\sum_{k=1}^{K}\left(F(x^{k},y^{k})-\frac{\sum_{y'\in S(x^{k})}F(x^{k},y')\exp\left(w^{T}F(x^{k},y')\right)}{Z^{D}(x^{k})}\right)-Cw=\sum_{k=1}^{K}\left(F(x^{k},y^{k})-\mathbb{E}_{p^{D}(\cdot\mid x^{k})}\!\left[F(x^{k},y)\right]\right)-Cw,\qquad(4)$$

where Z^D(x) = Σ_{y′∈S(x)} exp(w^T F(x, y′)). This can be efficiently computed by using conventional inside–outside algorithms [24, 4]. Then, the objective function is optimized with the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method.
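For intuition, the gradient in Eq. (4) (observed feature counts minus their model expectation, minus the regularizer) can be verified by finite differences in a toy setting where S(x) is enumerated explicitly. All feature vectors below are hypothetical:

```python
import math

def log_pD(w, F_true, candidates):
    """log p^D(y|x; w) over an explicitly enumerated S(x) (toy setting)."""
    scores = [sum(wi * fi for wi, fi in zip(w, F)) for F in candidates]
    m = max(scores)
    logZ = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(wi * fi for wi, fi in zip(w, F_true)) - logZ

def grad_L(w, F_true, candidates, C):
    """Gradient of Eq. (3): F(x, y) minus E_{p^D}[F(x, .)] minus C*w."""
    scores = [sum(wi * fi for wi, fi in zip(w, F)) for F in candidates]
    m = max(scores)
    ps = [math.exp(s - m) for s in scores]
    Z = sum(ps)
    expF = [sum(p * F[i] for p, F in zip(ps, candidates)) / Z for i in range(len(w))]
    return [ft - ef - C * wi for ft, ef, wi in zip(F_true, expF, w)]

# Finite-difference check of the analytic gradient (single training pair).
cands = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]  # F(x, y') for each y' in S(x)
F_true, w, C, h = cands[0], [0.1, -0.2], 0.5, 1e-6
L = lambda v: log_pD(v, F_true, cands) - 0.5 * C * sum(vi * vi for vi in v)
g = grad_L(w, F_true, cands, C)
for i in range(2):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    assert abs((L(wp) - L(wm)) / (2 * h) - g[i]) < 1e-5
```

In the full model, the expectation term is supplied by inside–outside recursions rather than explicit enumeration, and the resulting gradient is fed to L-BFGS.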

3.3.2. Learning parameters θ in the generative model (step 3a)

The parameter θ is learned from the unlabeled training data Du, with the other parameters fixed. The objective function is

$$G(\theta\mid\lambda_{D},\lambda_{G},w)=\sum_{m=1}^{|D_{u}|}\log\sum_{y}g(x^{m},y;\theta),$$

where

$$g(x,y;\theta):=p^{D}(y\mid x;w)^{\lambda_{D}}\,p^{G}(x,y;\theta)^{\lambda_{G}},$$

which can be optimized by using an EM type of algorithm. In the algorithm, the following Q-function is maximized with respect to θ to obtain θ^(t+1) (i.e., θ^(t+1) = arg max_θ Q(θ, θ^(t))):

$$Q(\theta,\theta^{(t)};\lambda_{D},\lambda_{G})=\sum_{m}\sum_{y}p^{H}(y\mid x^{m};w,\theta^{(t)},\lambda_{D},\lambda_{G})\log p^{G}(x^{m},y;\theta)^{\lambda_{G}}=\sum_{m}\mathbb{E}_{p^{H}(\cdot\mid x^{m};w,\theta^{(t)},\lambda_{D},\lambda_{G})}\!\left[\log p^{G}(x^{m},y;\theta)^{\lambda_{G}}\right].\qquad(5)$$

Here, θ^(t) is the parameter vector from the previous iteration. The maximization is realized in a manner similar to the usual EM algorithm [18], leading to constrained optimization problems; each parameter is updated by using various expected counts evaluated with respect to the distribution pH. It can easily be seen that maximizing the Q-function with respect to θ leads to a monotonic increase of the likelihood [17]. See Supplementary Section S1 for the detailed training method.

likelihood [17]. See Supplementary Section S1 for the detailed training method. 3.3.3. Determine λD (Step 3b)

Ac ce p

The objective function is L(λD |θ, w, λG ) =



log pH (y n |xn ; w, θ, λD , λG ) −

n

C 2 λ , 2 D

and its derivative with respect to λD is ∂L(λD |λG , θ, w) ∂λD ∑ ∑ = wT F (xn , y n ) − wT EpH (·|xn ;λD λG ,θ,w) [F (xn , ·)] − CλD , n

n

where where EpH is the expectation with respect to the distribution pH (·|xn ; λD λG , θ, w), which can can be efficiently computed by using an inside–outside algorithm (see Algorithms S1 & S2 in the Supplementary material) because we assume that pD and pG are defined based on the same grammar (cf. Section 3.1). Finally,

135

λD is optimized by using the L-BFGS method. 9

Page 9 of 24

3.3.4. Determine λG (Step 3c)

L(λG |λD , θ, w) =



log pH (y n |xn ; w, θ, λD , λG ) −

n

ip t

In order to find λG , the following objective function is maximized. C |λG |2 2

cr

Here, the second term is a regularization term. The derivative of this objective function with respect to λG is

n

=



n

F (x , y ) (log θ) − n

n T



[ ] EpH F (xn , y n )T (log θ) − C|λG |

an

=



n

∑ F (xn , y n )T (log θ) − (log θ)T EpH [(F (xn , y n )] − C|λG |,

n

n

M

n

us

∂L(λG |λD , θ, w) = ∂λG ∑ ∑ [ ] log pG (xn , y n ) − EpH log pG (xn , y n ) − C|λG |

where EpH (·|xn ;λD λG ,θ,w) [F (x , ·)] can be computed by using an inside–outside n

algorithm (see Algorithms S1 & S2 in the Supplementary material). Then, λG

140

te

d

is optimized by using the L-BFGS method.

4. Implementation

Our implementation for training the parameters of a hybrid model is based on the TORNADO package developed by Dr. Rivas and co-workers [25], because TORNADO includes general routines, such as an inside–outside algorithm and methods to compute the expected counts, for many pre-defined grammars of RNA secondary structures. The software will be available on request.


5. Computational Experiments


5.1. Evaluation settings

5.1.1. Evaluation measures

We used the following typical measures to evaluate the performance of RNA secondary structure prediction:

SEN = (number of correctly predicted base pairs) / (number of true base pairs),
PPV = (number of correctly predicted base pairs) / (number of predicted base pairs),
F-score = 2 / (1/SEN + 1/PPV).

Notice that there is a trade-off between the sensitivity (SEN) and the positive predictive value (PPV), and the F-score takes a balance between them.
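These measures can be computed directly from the sets of predicted and reference base pairs; a minimal sketch (the index pairs below are illustrative):

```python
def evaluate(pred_pairs, true_pairs):
    """Base-pair-level SEN, PPV, and F-score for one predicted structure."""
    pred, true = set(pred_pairs), set(true_pairs)
    correct = len(pred & true)
    sen = correct / len(true) if true else 0.0
    ppv = correct / len(pred) if pred else 0.0
    f = 2 * sen * ppv / (sen + ppv) if sen + ppv > 0 else 0.0
    return sen, ppv, f

# Toy example: base pairs represented as (i, j) position tuples.
sen, ppv, f = evaluate({(0, 9), (1, 8), (3, 6)}, {(0, 9), (1, 8), (2, 7), (3, 6)})
assert (sen, ppv) == (0.75, 1.0)
```

Note that 2/(1/SEN + 1/PPV) is the harmonic mean, equivalently 2·SEN·PPV/(SEN + PPV) as used above.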


5.1.2. Decoding method

As described in Section 1, a decoding method should be specified to predict a single secondary structure from a hybrid model trained by using the procedure described above. We used the CMEA estimator [4, 25], a kind of MEA estimator [16], for decoding the RNA secondary structure from a trained probabilistic model. CMEA includes a parameter γ that adjusts the sensitivity and PPV of a predicted secondary structure.

5.1.3. Dataset

In our computational experiments, we utilized the datasets from a recent study [25], in which the training and test data were carefully created so as to avoid over-fitting issues in training probabilistic models. Specifically, [25] provided two datasets (DatasetA and DatasetB), both of which include RNA sequences with validated secondary structures. DatasetA consists of TrainSetA and TestSetA; DatasetB consists of TrainSetB and TestSetB. TrainSetA (resp., TrainSetB) includes secondary structures (not sequences) similar to those included in TestSetA (resp., TestSetB), because TestSetA includes the same families as TrainSetA, while there is no overlap of families between DatasetA and DatasetB. More specifically, DatasetA includes the following families: tRNA, SRP RNA, tmRNA, RNase P RNA, 5S rRNA, telomerase RNA, group I introns, and group II introns. The total numbers of sequences in TrainSetA and TestSetA are 3,166 and 697, respectively. In contrast, DatasetB includes 5.8S rRNA, U1, U4, riboswitches, cis-regulatory RNA, ribozymes, and bacteriophage pRNA. The total numbers of sequences in TrainSetB and TestSetB are 1,094 and 430, respectively. See [25] for the details of the datasets.

5.1.4. Settings for training the hybrid model

In training a hybrid model with the dataset described in Section 5.1.3, the sequence information of only TrainSetB is used as Du := {x^m}_{m=1}^{M} (i.e., as the training data for learning the parameters of the generative model). For the training data Dl := {Dl′ = {x^k, y^k}_{k=1}^{K}, Dl′′ = {x^n, y^n}_{n=1}^{N}} used in supervised learning, we used an equipartition of TrainSetA (as Dl′ and Dl′′).


5.2. Hybrid models improve the accuracy of discriminative and generative models

Figure 1 shows a comparison between hybrid models and discriminative models with respect to three grammars: G6, G6s, and the basic grammar (cf. Table 2). Table 3 gives the corresponding F-scores. In training both types of models, we used the same known RNA secondary structures (see Table 4 for the details of the training data) so as to make our comparison fair. The results indicate that, on TestSetB, the hybrid models achieved significantly better accuracy (with respect to all of the grammars) than the discriminative models, while the models achieved similar accuracies on TestSetA. Because TestSetB, unlike TestSetA, does not include any RNAs that are structurally similar to those of TrainSetA, the results indicate that hybrid models successfully predict secondary structures in TestSetB without using the structural information in DatasetB. Note that the hybrid model uses only the sequence information of TrainSetB and not the structural information.

structural information.

12

Page 12 of 24

In Table 3, the F-scores (cf. Section 5.1.1) for hybrid, generative and, discriminative models are shown (the values for generative models were taken from

ip t

a previous study [25]). This indicates that the hybrid models are better than both generative and discriminative models with respect to all the tested grammars. Moreover, the table indicates that discriminative models are better than

cr

200

generative models (although this is already known from previous studies, e.g.,

us

[4]; the third paragraph in Section 1).

In summary, our hybrid models performed better than discriminative models on TestSetB (which does not include the same families as the training dataset), while the models achieved similar performances on TestSetA (which does include

an

205

the same families as the training dataset). These results indicate that hybrid models will be useful for predicting secondary structures for RNA sequences for

M

which there are no training data, which might be promising results (see the fifth paragraph in Section 6). TestSetA

TestSetB

0.7

0.7 hybrid(G6) disc(G6) hybrid(G6s) disc(G6s) hybrid(bg) disc(bg)

0.6

te

0.5

0.3

Ac ce p

0.2

0.3

0.2

0.1

0

0.4

0.5

0.6

0.7

0.5

0.4

0.1

0.3

0.6

SEN

SEN

0.4

hybrid(G6) disc(G6) hybrid(G6s) disc(G6s) hybrid(bg) disc(bg)

B

d

A

0 0.8

0.9

0.3

0.4

PPV

0.5

0.6

0.7

0.8

0.9

PPV

Figure 1: A comparison between hybrid models (solid lines) and discriminative models (dashed

lines) for TestSetA (left) and TestSetB (right). In both figures, the horizontal axis and vertical axis show PPV and SEN (with respect to base-pairs in RNA secondary structures; Section 5.1.1), respectively. TestSetA includes RNAs that are structurally similar to the training data while TestSetB does not (see the main text and Table 4 for details of the training and test sets). In these experiments, we used three grammars, G6, G6s, and the basic grammar (denoted by “bg”), for training the hybrid (denoted by “hybrid”) and discriminative (denoted by “disc”) models (cf. Table 2). The CMEA estimator [4, 25] was used to predict a single RNA secondary structure from the trained probability distributions.


Table 3: Comparison of hybrid, discriminative, and generative models

                       TestSetA                  TestSetB
Grammar           hybrid   disc    gen      hybrid   disc    gen
G6                0.500    0.488   0.478    0.499    0.476   0.462
G6s               0.502    0.507   0.489    0.504    0.496   0.475
basic grammar     0.573    0.568   0.567    0.574    0.564   0.536

Each value indicates the maximum F-score in the SEN–PPV curves shown in Figure 1. The labels "hybrid", "disc", and "gen" indicate the hybrid, discriminative, and generative models, respectively. TestSetA includes RNAs that are structurally similar to those of TrainSetA, while TestSetB does not (see Section 5.1.3 for the details). The F-scores for the generative models were taken from [25]. In training all three models (hybrid, disc, gen), we used identical information about RNA secondary structures (i.e., we used TrainSetA, which includes RNAs that are structurally similar to those of TestSetA). See the last paragraph of Section 5.1.3 and Section 5.2 for the details. See also Table 4 for a detailed summary of the training data.

5.3. Effect of RNA families on training in hybrid models

To evaluate the effect of the RNA families used in training the generative models contained in hybrid models, we constructed the following trained hybrid models and evaluated their performances. Du^{-ribo} (resp., Du^{-cis}) denotes the training dataset (without secondary structures) in which we excluded all RNA sequences belonging to the riboswitch (resp., cis-regulatory RNA) family from Du (cf. Section 3.2). The "hybrid-ribo" (resp., "hybrid-cis") model is a hybrid model in which the generative model was trained with the data Du^{-ribo} (resp., Du^{-cis}); "hybrid" on its own denotes the hybrid model trained using the entire dataset. See Table 5 (resp., Supplementary Table S1) for the details of the training sets in the comparison between the "hybrid" and "hybrid-ribo" models (resp., between the "hybrid" and "hybrid-cis" models). Note that only RNA sequences are used in Du, Du^{-ribo}, and Du^{-cis}. In this experiment, we used two grammars: G6 and G6s (cf. Table 2).

A comparison of "hybrid-ribo" and "hybrid" is shown in Figure 2. It can be seen that the "hybrid-ribo" model performed worse than the "hybrid" model

Table 4: The details of training data for the hybrid, discriminative, and generative models (denoted by "hybrid", "disc", and "gen", respectively).

model    Dl′: w (discriminative model)   Du: θ (generative model)   Dl′′: λD, λG
hybrid   TrainSetA (1/2)                 TrainSetB (w/o sec)        TrainSetA (1/2)
disc     TrainSetA                       —                          —
gen      —                               TrainSetA                  —

w and θ are the parameter vectors in the discriminative and generative models, respectively (Eq. 1). A hybrid model also includes the parameters λD and λG. Dl′ and Dl′′ are labeled training data (i.e., RNA sequences with secondary structures), while Du is unlabeled training data (i.e., RNA sequences without secondary structures). For the hybrid model, TrainSetA (1/2) means half of TrainSetA; hence, the entirety of TrainSetA (with secondary structures) was used in learning the parameters w, λD, and λG. Only the sequence information of the RNA sequences is utilized in learning the parameter θ of the hybrid model (the dataset marked "w/o sec"). Notice that the secondary structure information in TrainSetA is used when training both the discriminative and the generative models. In summary, common information on secondary structures is employed to train the parameters of the hybrid, discriminative, and generative models.

with respect to TestSetB (Figure 2B), although both models achieved similar performances with respect to TestSetA (Figure 2A). As expected, the difference in performance between "hybrid" and "hybrid-ribo" for TestSetB is due to the performance difference for the riboswitch family (Figure 2, panels B-1 and B-2).

The comparison between "hybrid" and "hybrid-cis" is shown in Supplementary Figure S1. Results similar to those in the case of "hybrid-ribo" were obtained.

In summary, these results indicate that the performance of a hybrid model should improve as more unannotated training data (RNA sequences without secondary structures) are used.


Figure 2: Panels (A) and (B) in the left column show the SEN–PPV curves for TestSetA (including RNAs that are structurally similar to those in the training set) and TestSetB (not including RNAs that are structurally similar to those in the training set), respectively. Panel (B-1) shows the SEN–PPV curves for the riboswitch family in TestSetB, while panel (B-2) shows the SEN–PPV curves for the other families in TestSetB. The "hybrid" curves (solid lines) show the results of training using all data in Du, while the "hybrid-ribo" curves (dashed lines) show the results of training using all data in Du^{-ribo}. See also the caption of Figure 1. See Table 5 for details of the training and test data.

6. Discussion

In this study, we implemented hybrid models for three grammars representing RNA secondary structures: G6, G6s, and the basic grammar (cf. Table 2). According to previous studies (e.g., [25]), more complex grammars, such as the grammars derived from the CONTRAfold [4] and ViennaRNA packages (denoted by CONTRAfoldG and ViennaRNAG, respectively, in Table 2), achieved better performance than G6, G6s, and the basic grammar. In the future, we will incorporate these more complex grammars into the hybrid model. More-


Table 5: The details of the training data for the "hybrid-ribo" model in comparison with the "hybrid" model (in Figure 2)

model        Dl′: w (discriminative model)   Du: θ (generative model)          Dl′′: λD, λG
hybrid       TrainSetA (1/2)                 TrainSetB (w/o sec)               TrainSetA (1/2)
hybrid-ribo  TrainSetA (1/2)                 TrainSetB\riboswitch (w/o sec)    TrainSetA (1/2)

"TrainSetB\riboswitch" is the subset of TrainSetB obtained by excluding all RNAs belonging to the riboswitch family from TrainSetB. In the sets marked "w/o sec" we use only the sequence information, while in the other training sets we use both sequence and structural information. The parameters w and θ are those of the discriminative and generative model components, respectively. "TrainSetA (1/2)" indicates that TrainSetA is partitioned into two sets to train w and {λD, λG}. See Section 5.3 for the details. See also the caption of Table 4.

over, we will perform a comprehensive comparison with other programs using commonly utilized benchmark datasets such as CompaRNA [27]. The comparison will include not only programs that predict RNA secondary structure from a single RNA sequence but also those that use multiple sequences: CentroidAlifold [28], CentroidHomfold [29, 30], CMfinder [31], FoldalignM [32], LaRA [33], MASTR [34], RNAalifold [35], RNASampler [36], and WAR [37].

LaRA [33], MASTR [34], RNAalifold [35], RNASampler [36] and WAR [37]. It should be noted that we do not use arbitrary combinations of generative

Ac ce p

and discriminative models in our hybrid model. For example, a hybrid model of an SCFG using the basic grammar and the CRF in CONTRAfold is not appropriate, because it is necessary to compute the partition function for the product of the generative and discriminative models (cf. Eq. (1)). Hence, in

255

general, we first define an SCFG model with a certain grammar and derive the corresponding CRF model from the SCFG model (cf. Section 3.1). Fortunately, [25] provides a grammar that is compatible with the CONTRAfold model and can, therefore, be used in our hybrid model to construct more accurate models for RNA secondary structures.
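The requirement that the CRF be derived from the SCFG can be illustrated with a toy sketch (hypothetical rules and weights, not the grammars used in this paper): both models score a derivation through the same rule-count features; the SCFG constrains the weights to be log probabilities that are locally normalized per nonterminal, while the CRF leaves them unconstrained and normalizes globally over derivations.

```python
import math
from collections import Counter


def rule_counts(derivation):
    """Feature vector of a derivation: how often each grammar rule is used."""
    return Counter(derivation)


def scfg_log_prob(derivation, log_theta):
    """Generative score: sum of log rule probabilities (locally normalized,
    i.e. the probabilities of each nonterminal's rules sum to one)."""
    return sum(n * log_theta[r] for r, n in rule_counts(derivation).items())


def crf_posterior(derivations, w):
    """Discriminative model over a fixed candidate set: the same rule-count
    features, unconstrained weights w, and a global partition function."""
    scores = [sum(n * w[r] for r, n in rule_counts(d).items())
              for d in derivations]
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    return [math.exp(s - m) / z for s in scores]
```

Setting w[r] = log θ[r] makes the CRF posterior over all derivations of a sequence coincide with the SCFG's conditional distribution; this is the sense in which the CRF is "derived from" the SCFG, and why the two components of a hybrid model must share a grammar.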

In our experiments, hybrid models performed better than both generative and discriminative models. This is because the hybrid model is based on a discriminative model to which a new feature (the joint probability of an RNA sequence and a secondary structure) is added (see the last paragraph of Section 3.1). Hence, we expect that hybrid models will improve conventional discriminative models (e.g., CRFs) by incorporating the joint-probability information computed by a generative model (such as an SCFG). As noted throughout this paper, the hybrid model utilizes additional sequence information (not structure information) when training parameters, showing that sequence information is useful for constructing a model of RNA secondary structures.
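The construction described above, a discriminative model augmented with the generative joint log-probability as one extra feature, can be sketched as follows (a minimal illustration with hypothetical feature and probability functions, not the paper's implementation):

```python
import math


def hybrid_log_score(feats, w, log_p_joint, lam_d=1.0, lam_g=1.0):
    """Unnormalized hybrid score: discriminative features plus the generative
    joint log-probability log P(x, y) treated as one additional feature,
    weighted by lam_g (cf. the lambda_D / lambda_G weights in the text)."""
    return lam_d * sum(wi * fi for wi, fi in zip(w, feats)) + lam_g * log_p_joint


def hybrid_posterior(candidates, w, feat_fn, logp_fn, lam_d=1.0, lam_g=1.0):
    """Posterior over an enumerated candidate set of structures y for a fixed
    sequence x, normalized by summing over the candidates."""
    scores = [hybrid_log_score(feat_fn(y), w, logp_fn(y), lam_d, lam_g)
              for y in candidates]
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    return [math.exp(s - m) / z for s in scores]
```

Setting lam_g = 0 recovers the purely discriminative model, while increasing lam_g shifts probability mass toward structures favored by the generative component; it is through this extra feature that unlabeled sequence data can influence the trained model.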

Our method will be useful when we have limited training data (i.e., when only a limited number of reliable RNA secondary structures is available). For example, few secondary structures have been determined for RNA sequences that include non-standard (modified) nucleotides such as 2′-O-methyl [38] and pseudouridine [39], although secondary structure prediction for such sequences is important. It will be possible to construct probabilistic models for the secondary structures of RNAs with modified nucleotides if a hybrid model is utilized.

It would also be interesting to train our hybrid model according to a given accuracy measure for RNA secondary structure predictions. (Here, the decoding of an RNA secondary structure from the trained probability distribution of RNA secondary structures was carried out using the principle of MEA; see Section 1.) In RNA secondary structure prediction, accurate prediction of base pairs is important, and training with respect to relevant accuracy measures will lead to probabilistic models that are better suited to accurately predicting base pairs in RNA secondary structures [40].

The results described in Section 5.3 seem promising because they suggest that more accurate structure predictions will be realized if we use more RNA sequences without secondary structures when training the hybrid model. Because a huge number of RNA sequences is currently available in several databases [41, 42, 43, 44], we plan to train the hybrid model using those sequences. To do this, a more efficient implementation for training the hybrid model will be needed (e.g., parallelization of the training algorithms).
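MEA-style decoding of a secondary structure from base-pair probabilities (see Section 1) can be sketched with a gamma-centroid-type Nussinov dynamic program in the spirit of [5, 15]; the input matrix and the exact estimator here are simplified assumptions, not the paper's decoder:

```python
def gamma_centroid_pairs(P, gamma=1.0):
    """Gamma-centroid decoding: choose a nested set of base pairs maximizing
    the sum over predicted pairs (i, j) of (gamma + 1) * P[i][j] - 1, where P
    is an n x n base-pair probability matrix (e.g. from McCaskill's algorithm
    [9]).  A pair is only worth taking when P[i][j] > 1 / (gamma + 1)."""
    n = len(P)
    eq = lambda a, b: abs(a - b) < 1e-9
    M = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(M[i + 1][j], M[i][j - 1])      # i or j unpaired
            gain = (gamma + 1) * P[i][j] - 1.0        # score of pairing (i, j)
            if gain > 0:
                inner = M[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
                best = max(best, inner + gain)
            for k in range(i + 1, j):                 # bifurcation
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    # Traceback to recover one optimal set of pairs.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if eq(M[i][j], M[i + 1][j]):
            stack.append((i + 1, j))
        elif eq(M[i][j], M[i][j - 1]):
            stack.append((i, j - 1))
        else:
            gain = (gamma + 1) * P[i][j] - 1.0
            inner = M[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            if gain > 0 and eq(M[i][j], inner + gain):
                pairs.append((i, j))
                stack.append((i + 1, j - 1))
            else:
                for k in range(i + 1, j):
                    if eq(M[i][j], M[i][k] + M[k + 1][j]):
                        stack.extend([(i, k), (k + 1, j)])
                        break
    return sorted(pairs)
```

Larger gamma lowers the pairing threshold and thus favors sensitivity (more predicted pairs), while smaller gamma favors specificity; training the model with respect to such an accuracy trade-off is exactly the direction discussed above.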


7. Conclusion

In this paper, we have proposed a novel semi-supervised learning approach to train the parameters of probabilistic models of RNA secondary structures, which enables us to incorporate not only RNA sequences with known secondary structures (labeled training data) but also sequences with unknown secondary structures (unlabeled training data). Computational experiments consistently indicate that the hybrid model is better than conventional generative and discriminative models. To our knowledge, this is the first study of a semi-supervised learning approach for RNA secondary structure prediction. The method will be useful when the number of reliable structures is limited, as in secondary structure prediction for RNA sequences that include modified nucleotides.

Acknowledgements

We are grateful to Dr. Elena Rivas for technical support regarding the TORNADO program. We also thank members of the Asai Laboratory at the University of Tokyo and members of CBRC/AIST for useful discussions. This work was conducted while HY was a master's course student at the University of Tokyo. A part of this work was also conducted while MH belonged to the University of Tokyo; he joined Waseda University on April 1, 2014.

This work was supported in part by MEXT KAKENHI (Grant-in-Aid for Young Scientists (A), Grant Number 24680031, to MH; Grant-in-Aid for Scientific Research (A), Grant Number 25240044, to MH and KA).

References

[1] J. Gorodkin, W. L. Ruzzo (Eds.), RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, Humana Press, 2014.

[2] N. Markham, M. Zuker, UNAFold: software for nucleic acid folding and hybridization, Methods Mol. Biol. 453 (2008) 3–31.

[3] M. Zuker, Mfold web server for nucleic acid folding and hybridization prediction, Nucleic Acids Res. 31 (13) (2003) 3406–3415.

[4] C. Do, D. Woods, S. Batzoglou, CONTRAfold: RNA secondary structure prediction without physics-based models, Bioinformatics 22 (2006) e90–98.

[5] M. Hamada, H. Kiryu, K. Sato, T. Mituyama, K. Asai, Prediction of RNA secondary structure using generalized centroid estimators, Bioinformatics 25 (4) (2009) 465–473.

[6] Z. J. Lu, J. W. Gloor, D. H. Mathews, Improved RNA secondary structure prediction by maximizing expected pair accuracy, RNA 15 (2009) 1805–1813.

[7] D. Mathews, M. Disney, J. Childs, S. Schroeder, M. Zuker, D. Turner, Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure, Proc. Natl. Acad. Sci. U.S.A. 101 (2004) 7287–7292.

[8] K. Sato, Y. Kato, M. Hamada, T. Akutsu, K. Asai, IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming, Bioinformatics 27 (13) (2011) 85–93.

[9] J. S. McCaskill, The equilibrium partition function and base pair binding probabilities for RNA secondary structure, Biopolymers 29 (6–7) (1990) 1105–1119.

[10] R. Dowell, S. Eddy, Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction, BMC Bioinformatics 5 (2004) 71.

[11] C. Do, C. Foo, S. Batzoglou, A max-margin model for efficient simultaneous alignment and folding of RNA sequences, Bioinformatics 24 (2008) i68–76.

[12] K. Sato, M. Hamada, T. Mituyama, K. Asai, Y. Sakakibara, A non-parametric Bayesian approach for predicting RNA secondary structures, J. Bioinformatics and Computational Biology 8 (4) (2010) 727–742.

[13] M. Andronescu, A. Condon, H. Hoos, D. Mathews, K. Murphy, Efficient parameter estimation for RNA secondary structure prediction, Bioinformatics 23 (2007) 19–28.

[14] M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, K. P. Murphy, Computational approaches for RNA energy parameter estimation, RNA 16 (2010) 2304–2318.

[15] M. Hamada, H. Kiryu, W. Iwasaki, K. Asai, Generalized centroid estimators in bioinformatics, PLoS ONE 6 (2) (2011) e16450.

[16] M. Hamada, K. Asai, A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA), J. Comput. Biol. 19 (5) (2012) 532–549.

[17] J. Suzuki, A. Fujino, H. Isozaki, Semi-supervised structured output learning based on a hybrid generative and discriminative approach, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 791–800. URL http://www.aclweb.org/anthology/D/D07/D07-1083

[18] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. 39 (1977) 1–38.

[19] I. Hofacker, W. Fontana, P. Stadler, S. Bonhoeffer, M. Tacker, P. Schuster, Fast folding and comparison of RNA secondary structures, Monatsh. Chem. 125 (1994) 167–188.

[20] S. Zakov, Y. Goldberg, M. Elhadad, M. Ziv-Ukelson, Rich parameterization improves RNA structure prediction, in: V. Bafna, S. C. Sahinalp (Eds.), RECOMB, Vol. 6577 of Lecture Notes in Computer Science, Springer, 2011, pp. 546–562.

[21] B. Knudsen, J. Hein, RNA secondary structure prediction using stochastic context-free grammars and evolutionary history, Bioinformatics 15 (1999) 446–454.

[22] D. H. Mathews, J. Sabina, M. Zuker, D. H. Turner, Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, J. Mol. Biol. 288 (1999) 911–940.

[23] R. Dowell, S. Eddy, Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints, BMC Bioinformatics 7 (2006) 400.

[24] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK, 1998.

[25] E. Rivas, R. Lang, S. R. Eddy, A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more, RNA 18 (2) (2012) 193–212.

[26] I. L. Hofacker, Vienna RNA secondary structure server, Nucleic Acids Res. 31 (13) (2003) 3429–3431.

[27] T. Puton, L. P. Kozlowski, K. M. Rother, J. M. Bujnicki, CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction, Nucleic Acids Res. 41 (7) (2013) 4307–4323.

[28] M. Hamada, K. Sato, K. Asai, Improving the accuracy of predicting secondary structure for aligned RNA sequences, Nucleic Acids Res. 39 (2) (2011) 393–402.

[29] M. Hamada, K. Yamada, K. Sato, M. C. Frith, K. Asai, CentroidHomfold-LAST: accurate prediction of RNA secondary structure using automatically collected homologous sequences, Nucleic Acids Res. 39 (Web Server issue) (2011) W100–106.

[30] M. Hamada, K. Sato, H. Kiryu, T. Mituyama, K. Asai, Predictions of RNA secondary structure by combining homologous sequence information, Bioinformatics 25 (12) (2009) i330–338.

[31] Z. Yao, Z. Weinberg, W. L. Ruzzo, CMfinder: a covariance model based RNA motif finding algorithm, Bioinformatics 22 (4) (2006) 445–452.

[32] J. Havgaard, S. Kaur, J. Gorodkin, Comparative ncRNA gene and structure prediction using Foldalign and FoldalignM, Curr. Protoc. Bioinformatics Chapter 12 (2012) Unit 12.11.

[33] M. Bauer, G. Klau, K. Reinert, Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization, BMC Bioinformatics 8 (2007) 271.

[34] S. Lindgreen, P. P. Gardner, A. Krogh, MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing, Bioinformatics 23 (2007) 3304–3311.

[35] S. Bernhart, I. Hofacker, S. Will, A. Gruber, P. Stadler, RNAalifold: improved consensus structure prediction for RNA alignments, BMC Bioinformatics 9 (2008) 474.

[36] R. Achawanantakun, Y. Sun, S. S. Takyar, ncRNA consensus secondary structure derivation using grammar strings, J. Bioinform. Comput. Biol. 9 (2) (2011) 317–337.

[37] E. Torarinsson, S. Lindgreen, WAR: Webserver for aligning structural RNAs, Nucleic Acids Res. 36 (Web Server issue) (2008) 79–84.

[38] T. Haraguchi, H. Nakano, T. Tagawa, T. Ohki, Y. Ueno, T. Yoshida, H. Iba, A potent 2'-O-methylated RNA-based microRNA inhibitor with unique secondary structures, Nucleic Acids Res. 40 (8) (2012) e58.

[39] E. Kierzek, M. Malgowska, J. Lisowiec, D. H. Turner, Z. Gdaniec, R. Kierzek, The contribution of pseudouridine to stabilities and structure of RNAs, Nucleic Acids Res.

[40] J. Suzuki, E. McDermott, H. Isozaki, Training conditional random fields with multivariate evaluation measures, in: Proc. of ACL, 2006, pp. 217–224.

[41] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, A. Bateman, Rfam: annotating non-coding RNAs in complete genomes, Nucleic Acids Res. 33 (Database issue) (2005) 121–124.

[42] D. Bhartiya, K. Pal, S. Ghosh, S. Kapoor, S. Jalali, B. Panwar, S. Jain, S. Sati, S. Sengupta, C. Sachidanandan, G. P. Raghava, S. Sivasubbu, V. Scaria, lncRNome: a comprehensive knowledgebase of human long noncoding RNAs, Database (Oxford) 2013 (2013) bat034.

[43] C. Xie, J. Yuan, H. Li, M. Li, G. Zhao, D. Bu, W. Zhu, W. Wu, R. Chen, Y. Zhao, NONCODEv4: exploring the world of long non-coding RNA genes, Nucleic Acids Res. 42 (Database issue) (2014) 98–103.

[44] C. Park, N. Yu, I. Choi, W. Kim, S. Lee, lncRNAtor: a comprehensive resource for functional investigation of long noncoding RNAs, Bioinformatics.
