G Model

BIO 3510 1–9 BioSystems xxx (2014) exxx–exxx

Contents lists available at ScienceDirect

BioSystems journal homepage: www.elsevier.com/locate/biosystems

Reconstruction of phylogenetic trees of prokaryotes using maximal common intervals


Mahdi Heydari a , Sayed-Amir Marashi b,c, *, Ruzbeh Tusserkani d , Mehdi Sadeghi e

a Department of Algorithms and Computation, College of Engineering, University of Tehran, Tehran, Iran
b Department of Biotechnology, College of Science, University of Tehran, Tehran, Iran
c School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
d School of Mathematics, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
e National Institute of Genetic Engineering and Biotechnology, Tehran, Iran

ARTICLE INFO

Article history: Received 28 December 2012; received in revised form 13 August 2014; accepted 1 September 2014; available online xxx

Keywords: Gene order; Breakpoint distance; Common interval; Genomic rearrangement; Comparative genomics; DCJ

ABSTRACT

One of the fundamental problems in bioinformatics is phylogenetic tree reconstruction, which can be used for classifying living organisms into different taxonomic clades. The classical approach to this problem is based on a marker such as 16S ribosomal RNA. Since evolutionary events like genomic rearrangements are not included in reconstructions of phylogenetic trees based on single genes, much effort has been made to find other characteristics for phylogenetic reconstruction in recent years. With the increasing availability of completely sequenced genomes, gene order can be considered as a new solution for this problem. In the present work, we applied maximal common intervals (MCIs) in two or more genomes to infer their distance and to reconstruct their evolutionary relationship. Additionally, measures based on uncommon segments (UCS’s), i.e., those genomic segments which are not detected as part of any of the MCIs, are also used for phylogenetic tree reconstruction. We applied these two types of measures for reconstructing the phylogenetic tree of 63 prokaryotes with known COG (clusters of orthologous groups) families. Similarity between the MCI-based (resp. UCS-based) reconstructed phylogenetic trees and the phylogenetic tree obtained from NCBI taxonomy browser is as high as 93.1% (resp. 94.9%). We show that in the case of this diverse dataset of prokaryotes, tree reconstruction based on MCI and UCS outperforms most of the currently available methods based on gene orders, including breakpoint distance and DCJ. We additionally tested our new measures on a dataset of 13 closely-related bacteria from the genus Prochlorococcus. In this case, distances like rearrangement distance, breakpoint distance and DCJ proved to be useful, while our new measures are still appropriate for phylogenetic reconstruction. © 2014 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

The usual way to reconstruct the phylogenetic tree of prokaryotes is to use the sequences of their 16S rRNA gene (Hao and Gao, 2008). However, it has been suggested that tree reconstruction based merely on a single gene is not sufficient to explain many evolutionary events, like insertions, deletions, or horizontal gene transfer (Suyama and Bork, 2001). Different strategies, like phylogenomics and supertree reconstruction, have been proposed to address this issue in phylogenetic reconstruction (Sanderson et al., 1998; Sicheritz-Pontén and Andersson, 2001; Soltis and Soltis, 2001). As a result, some studies suggested using

* Corresponding author. E-mail address: [email protected] (S.-A. Marashi).

gene order of the genomes as an alternative source of information for reconstructing phylogenetic trees (Belda et al., 2005; Blin et al., 2005; Luo et al., 2008; Moret et al., 2001). Using these methods, one can obtain phylogenetic trees which take into account the evolutionary history of a genomic sequence. These trees are usually consistent with our knowledge about the phylogenetic relationships of different species (Markov and Zakharov, 2009). Therefore, such trees can be used, in combination with standard methods like 16S rRNA-based trees, to provide a more comprehensive picture of the phylogenetic relations. Chromosome (genome) rearrangements, as the evolutionary events which shape the genomic structure, were first described more than seventy years ago, when the concept of “breakpoints” (disruptions of gene order) was originally introduced (Dobzhansky and Sturtevant, 1938; Sturtevant and Dobzhansky, 1936). Following the ideas presented in those classical papers, it was suggested that

http://dx.doi.org/10.1016/j.biosystems.2014.09.002 0303-2647/© 2014 Elsevier Ireland Ltd. All rights reserved.

Please cite this article in press as: Heydari, M., et al., Reconstruction of phylogenetic trees of prokaryotes using maximal common intervals. BioSystems (2014), http://dx.doi.org/10.1016/j.biosystems.2014.09.002


M. Heydari et al. / BioSystems xxx (2014) xxx–xxx


the evolutionary distance of two genomes can be estimated by inferring the rearrangement events (Hannenhalli and Pevzner, 1995a; Hannenhalli and Pevzner, 1995b). Typically, in such studies and other similar works, only “common genes” in all genomes are taken into account (Belda et al., 2005; Luo et al., 2008; Markov and Zakharov, 2009). Mathematically speaking, these methods analyze gene permutations (GP) rather than gene order sequences (GOS) (El-Mabrouk and Sankoff, 2012). We will define these terms in the next section. Consequently, these methods neglect gene gain and gene loss events. In this paper, based on the concept of common intervals (Schmidt and Stoye, 2004; Uno and Yagiura, 2000), we present new measures of pairwise genome distance, which can be used for phylogenetic tree reconstruction. These measures are suitable for the analysis of distant genomes, as they do not require removal of uncommon genes. We show that the phylogenetic trees based on these measures are consistent with reference trees obtained from 16S rRNA or NCBI Taxonomy Browser.

2. Basic definitions

2.1. Gene order sequence (GOS)

Let $\Sigma = \{1, \ldots, n\}$ be a finite set of genes. A gene order sequence (GOS), $G = (g_1, g_2, \ldots, g_l)$, is defined as an ordered list of genes, i.e., $g_i \in \Sigma$ for all $1 \le i \le l$. The position of each gene $g_i$ is simply its index $i$. In general, $l$ can be greater than $n$, since a GOS can contain repeated elements. Please note that each $g_i$ can be labeled as “+” or “−” depending on its orientation on the genome. If this is the case, then the GOS is a signed GOS; otherwise it is unsigned. Let the interval $[x,y]$ denote the set $\{x, x+1, \ldots, y-1, y\}$, with $x < y$. We define $\Gamma_{[x,y]}(G)$ as the set of genes appearing between the positions $x$ and $y$ of $G$. In general, it is possible to have two GOS’s, $G_1$ and $G_2$, with $\Gamma_{[1,n]}(G_1) \neq \Gamma_{[1,n]}(G_2)$.
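As a minimal illustration of these definitions, the following Python sketch stores a GOS as a list of integers and evaluates the gene set of an interval. Encoding the orientation as the sign of the integer, the function name `gene_set` and the example sequence are assumptions made here for illustration only; they are not part of the formal definition.

```python
from typing import List, Set


def gene_set(gos: List[int], x: int, y: int) -> Set[int]:
    """Set of genes at positions x..y (1-based, inclusive) of a GOS.

    Assumption: a signed gene is stored as +g or -g, and the sign
    (orientation) is ignored when collecting the gene set.
    """
    if not (1 <= x < y <= len(gos)):
        raise ValueError("positions must satisfy 1 <= x < y <= len(gos)")
    return {abs(g) for g in gos[x - 1:y]}


# Hypothetical signed GOS over the genes {1, ..., 5}; gene 2 occurs twice,
# so the length l = 6 exceeds the number of distinct genes n = 5.
G = [1, -2, 3, 2, 5, 4]
print(gene_set(G, 2, 4))  # genes found at positions 2..4
```

Because the result is a set, a repeated gene inside the interval is counted once, which is the property that the definition of common intervals in the next subsection relies on.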

2.2. GOS common intervals

Informally speaking, two genomic segments are GOS common intervals if, regardless of repeated genes in each segment, they contain the same set of genes. Suppose that two GOS’s $A$ and $B$ over the gene set $\{1, \ldots, n\}$ are given as input. A pair of intervals $([x_A,y_A], [x_B,y_B])$ with $1 \le x_A < y_A \le n$ and $1 \le x_B < y_B \le n$ is called a common interval if it satisfies $\Gamma_{[x_A,y_A]}(A) = \Gamma_{[x_B,y_B]}(B) = M$. The common interval size, $|M|$, is equal to the number of different genes in each interval. Let $A$ and $B$ be two GOS’s. The ordered pair $([i,j], [i',j'])$ of two gene order sequences $A$ and $B$ is a maximal common interval (MCI) if there exists no different common interval $([x_A,y_A], [x_B,y_B])$ such

2.3. Gene permutation (GP)

A gene permutation $P$ on $n$ different genes is a rearrangement of these genes into a particular order. Therefore, there is a one-to-one correspondence between the elements of any two GPs $P_1$ and $P_2$, i.e., $\Gamma_{[1,n]}(P_1) = \Gamma_{[1,n]}(P_2)$. In other words, if two genomic regions with $n$ different genes have the same gene content, then each region can be considered as a permutation of the other. Please note that in this definition gene duplications are not allowed.

3. Measures of evolutionary distance of two genomes

Evolutionary rearrangement events do not occur very often (Fertin et al., 2009). Therefore, whole genomic structures usually evolve more slowly than DNA sequences (Morozov et al., 2013). The measures introduced in this study reflect the influence of these evolutionary events, including insertion and deletion, in a similarity or discrepancy score. A measure of evolutionary distance between a pair of genomes estimates how different the two genomes are. Given such a measure, the distance matrix is an $n \times n$ symmetric matrix $D$ in which $D_{ij}$ represents the evolutionary distance between genomes $i$ and $j$. In this manuscript, we focus only on those measures in the literature which are based on gene orders in genomes.

3.1. Rearrangement distance measures

The strategies to estimate distance measures can be categorized into two main groups. In the first type of strategy, a certain number of predefined genomic “rearrangement events” are considered. The distance measure can be computed by solving the “rearrangement problem”, i.e., the problem of finding the minimum number of rearrangement events necessary to transform an original GOS into a target GOS (Delgado et al., 2010). Strategies in this category may differ in the allowed “rearrangement events”, or in the penalty schemes for rearrangement events. If all of the genomes can be written as GPs of the same set of genes, then the rearrangement problem is solvable in polynomial time (El-Mabrouk and Sankoff, 2012). However, when other evolutionary events like duplication are taken into account, most variants of this problem become NP-hard (Blin and Rizzi, 2005; Delgado et al., 2010). Some authors have suggested


that $[i,j] \subset [x_A,y_A]$ or $[i',j'] \subset [x_B,y_B]$. See Fig. 1 for an illustrative example.

Fig. 1. Schematic representation of maximal common intervals and uncommon segments between two GOS’s. In our analysis, the two GOS’s are not necessarily of the same length and may not have the same gene content. The minimum size of maximal common intervals and of uncommon segments was assumed to be 2. In the first GOS there exists one uncommon segment with size $u_1$ (= 3), while in the second GOS there are two uncommon segments with sizes $u'_1$ (= 4) and $u'_2$ (= 2). Additionally, there are four maximal common intervals between the two GOS’s, with sizes $l_1, \ldots, l_4$ (2, 3, 2 and 2, respectively). Note that in the second GOS, two common intervals overlap.
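The definitions of common intervals and MCIs in Section 2.2 can be made concrete with a brute-force sketch. The code below is not the CI algorithm of Schmidt and Stoye (2004); it simply enumerates all intervals, which is only practical for toy inputs. The function names and the two example sequences are hypothetical and chosen for illustration.

```python
def intervals_by_gene_set(gos):
    """Map each gene set to the intervals (1-based, x < y) that realise it."""
    table = {}
    n = len(gos)
    for x in range(n):
        for y in range(x + 1, n):
            table.setdefault(frozenset(gos[x:y + 1]), []).append((x + 1, y + 1))
    return table


def common_intervals(a, b):
    """All interval pairs of a and b that contain the same set of genes."""
    ta, tb = intervals_by_gene_set(a), intervals_by_gene_set(b)
    return [(ia, ib) for m in ta.keys() & tb.keys() for ia in ta[m] for ib in tb[m]]


def strictly_inside(inner, outer):
    """True if interval `inner` is a proper sub-interval of `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer


def maximal_common_intervals(a, b):
    """Common intervals not properly contained, on either side, in another one."""
    pairs = common_intervals(a, b)
    return sorted(
        p for p in pairs
        if not any(strictly_inside(p[0], q[0]) or strictly_inside(p[1], q[1])
                   for q in pairs)
    )


# Hypothetical unsigned GOS's with different gene content and one repeat.
A = [1, 2, 3, 7, 4, 5]
B = [3, 1, 2, 8, 5, 4, 4]
print(maximal_common_intervals(A, B))
# [((1, 3), (1, 3)), ((5, 6), (5, 7))]
```

In this toy case the two MCIs have sizes 3 and 2, gene 7 of the first sequence and gene 8 of the second are not covered by any MCI, and the repeated gene 4 is absorbed into the second MCI because only the gene set matters.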


approximation algorithms to solve this problem in the presence of duplication and loss events (Swenson et al., 2008). In practice, when gene gain and loss events are neglected, we should ignore the uncommon genes between two genomes. Therefore, the above-mentioned strategy is suitable when closely-related genomes are studied. In more distant genomes, however, we may find a very limited number of genes which appear in all of them.

3.2. Examples of rearrangement distance measures

The reversal distance problem (Hannenhalli and Pevzner, 1995a) is the problem of finding the genomic distance when one GOS is to be transformed into a target GOS merely by using reversal events (Fig. 2(A)). Starting from the first GOS, in each step a subsequence of the current GOS is selected and reversed. The process is continued until the target sequence is obtained. This procedure is also called “sorting by reversal”. The minimum number of steps required to obtain the target GOS from the original GOS is the reversal distance between the two sequences. Hannenhalli and Pevzner (1995a) presented a polynomial-time algorithm for solving this problem and computing the reversal distance between two GOS’s. Shortly afterwards, they extended this problem to finding the pairwise genomic distance when an original GOS is to be transformed into a target GOS using reversal, translocation, fusion and fission events (Hannenhalli and Pevzner, 1995b). This procedure is also called “sorting by rearrangement”. Using these four events, one can model genomic rearrangements not only in monochromosomal cases, but also in multi-chromosomal systems (Fig. 2(B)). The minimum number of steps required to obtain the target GOS from the original GOS is the rearrangement distance (Re) between the two sequences. A related strategy, called “double cut and join” or DCJ, was later presented (Yancopoulos et al., 2005). In the DCJ model, each genome consists of a set of genes $g$, each of which has two “extremities”, $g^-$

Fig. 2. (A) Example of sorting by reversal. Here, an original GOS (1,6,3,7,2,4,5,8) should be transformed into a target GOS (1,2,3,4,5,6,7,8) merely by subsequence reversal events. In this example, the reversal distance of the two sequences is four. (B) Example of sorting by rearrangements. Here, a pair of GOS’s (3,7) | (1,4,5,6,2) should be transformed into a pair of target GOS’s (1,2,3,4) | (5,6,7) by reversal, translocation, fusion and fission events. In this example, the rearrangement distance of the two sequences is four. This figure is adapted from the slides by Guillaume Bourque (available from: http://www.math.nus.edu.sg/matzlx/ma3259/GB_Lecture21.pdf).


and $g^+$. Each extremity is adjacent either to another extremity or to a telomere, denoted by “o”. Consider a pair of adjacencies $pq$ and $rs$, which can be telomeric adjacencies (like “$po$” and “$oq$”) or even an empty chromosome (“$oo$”). A DCJ operation $\rho$ acting on adjacencies $pq$ and $rs$ cuts these adjacencies and replaces them by either $pr$ and $qs$, or $ps$ and $qr$ (Kováč et al., 2011). It can be easily shown that the four previously mentioned rearrangement events can be modeled by DCJ operations (Kováč et al., 2011; Yancopoulos et al., 2005). A sequence of $k$ operations $\rho_1, \rho_2, \ldots, \rho_k$ transforming one genome $P$ into another genome $G$ is called a DCJ scenario of length $k$. The DCJ distance between genomes $P$ and $G$ is defined as the minimum possible $k$.
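To make the “sorting by reversal” procedure of this section concrete, the following sketch replays four signed reversals on the sequence of Fig. 2(A). The signs of the genes and the reversal end points are read from the drawing of the figure and should be treated as an assumption, since the caption lists the sequence without signs. The sketch only verifies that four reversals are sufficient; it does not prove that four is the minimum, which is what the Hannenhalli and Pevzner algorithm establishes.

```python
def reversal(gos, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip each gene's sign."""
    return gos[:i - 1] + [-g for g in reversed(gos[i - 1:j])] + gos[j:]


# Signed start sequence and reversal end points (assumed from Fig. 2(A)).
start = [1, -6, -3, -7, 2, -4, -5, 8]
steps = [(4, 5), (2, 4), (4, 6), (5, 7)]

current = start
for i, j in steps:
    current = reversal(current, i, j)
    print(current)
# The last line printed is [1, 2, 3, 4, 5, 6, 7, 8].
```

A reversal is also a special case of a DCJ operation on a single chromosome: the two adjacencies at the ends of the reversed segment are cut and rejoined crosswise.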

3.3. “Statistical” distance measures

In the second type of strategies, instead of solving the rearrangement problem, one computes some statistics to estimate the distance of the two genomes. Such measures, typically, can be computed in polynomial time even in case of GOS’s.

3.4. Examples of “statistical” distance measures

In order to estimate the similarity or distance of two genomes, a number of studies have considered statistics on the “clusters” of neighboring genes (Jahn, 2010; Wittler, 2010). Common intervals and gene teams are examples of such clusters. Finding common intervals of two genomes was pioneered by Uno and Yagiura (2000). They presented a polynomial-time algorithm for finding all common intervals in a pair of genomes. Later, Heber and Stoye (2001) extended the algorithm to support more than two genomes. Both of these algorithms, however, work on GPs rather than GOS’s. Therefore, Schmidt and Stoye (2004) presented a quadratic-time algorithm, called CI, to find common intervals in two or more GOS’s. Interestingly, this algorithm was much simpler than the previous methods. In recent years, a number of tools for finding common intervals have become available, e.g., CREx (Bernt et al., 2007), which have been used for finding genomic distances and for phylogenetic reconstruction (Perseke et al., 2008). In a number of studies, gaps are allowed to appear in gene clusters. A conserved gene cluster in which the maximum distance between adjacent genes is constrained is called a gene team (Bergeron et al., 2002; Luc et al., 2003; Zhang and Leong, 2009). Gene teams are suitable for comparative genome analysis and, theoretically, can be applied for finding genomic distances. A major drawback of the algorithms for finding gene teams is that we should assume a unique genomic position for each gene, i.e., gene duplication events are not allowed (Béal et al., 2004). Although these algorithms can accept genomes with different gene contents, in practice uncommon genes are ignored in the computations. It should be noted that extensions to the previous algorithms have been presented which are able to deal with sequences with gene duplicates (He and Goldwasser, 2005; Ling et al., 2009).
Other examples of “statistical” distance measures are the breakpoint distance, Br, and the relative number of common pairs, S. The breakpoint distance between two genomes (Sankoff and Blanchette, 1998; Sankoff et al., 1992; Watterson et al., 1982) is the number of gene pairs which are adjacent in the first genome but not in the second one. Therefore, Br is a discrepancy measure (see below). The relative number of common pairs of neighboring genes in two genomes (Markov and Zakharov, 2009) is computed by dividing the number of identical pairs of neighboring genes by the number of homologous genes used in the comparison of the two genomes. Since S measures similarity, it must be converted to a dissimilarity measure to be usable in tree reconstruction.
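The two adjacency-based statistics described above can be sketched as follows for the simplest case. The code assumes unsigned, linear GPs over the same gene set and does not add artificial end markers, so it may differ in detail from the definitions used by tools such as SPRING; the function names are hypothetical.

```python
def adjacencies(gp):
    """Unordered pairs of neighbouring genes in an unsigned, linear GP."""
    return {frozenset(pair) for pair in zip(gp, gp[1:])}


def breakpoint_distance(gp1, gp2):
    """Number of adjacencies of gp1 that are not adjacencies of gp2 (Br)."""
    return len(adjacencies(gp1) - adjacencies(gp2))


def common_pair_fraction(gp1, gp2):
    """Shared adjacencies divided by the number of compared genes (S)."""
    return len(adjacencies(gp1) & adjacencies(gp2)) / len(gp1)


# The unsigned sequences from the caption of Fig. 2(A).
original = [1, 6, 3, 7, 2, 4, 5, 8]
target = [1, 2, 3, 4, 5, 6, 7, 8]
print(breakpoint_distance(original, target))   # 6
print(common_pair_fraction(original, target))  # 0.125
```

Only the adjacency of genes 4 and 5 is shared by the two sequences, which is why six of the seven adjacencies of the original GP are counted as breakpoints.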


3.5. Discrepancy vs. dissimilarity measures


The “statistical” distance between two genomes can be computed either directly or indirectly. If we directly compute the distance of two genomes, the resulting measure is called a discrepancy measure. On the other hand, if we first compute the similarity of two genomes and then convert this similarity value to a distance, the result is called a dissimilarity measure. A dissimilarity measure can be computed simply by subtracting each similarity value from the maximum element of the similarity matrix.
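The conversion from similarity to dissimilarity described above amounts to one subtraction per matrix entry. The sketch below uses plain nested lists and a made-up 3 × 3 similarity matrix; the function name is hypothetical.

```python
def to_dissimilarity(similarity):
    """Subtract every entry from the largest entry of the similarity matrix."""
    largest = max(max(row) for row in similarity)
    return [[largest - value for value in row] for row in similarity]


# Made-up symmetric similarity matrix for three genomes.
similarity = [
    [9, 4, 2],
    [4, 9, 7],
    [2, 7, 9],
]
for row in to_dissimilarity(similarity):
    print(row)
# [0, 5, 7]
# [5, 0, 2]
# [7, 2, 0]
```

Note that the diagonal becomes zero only if the self-similarity of every genome equals the global maximum, as it does in this made-up example; how the diagonal should be treated otherwise is not specified in the text.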

4. Materials and methods

4.1. Genome datasets

Two datasets were used in this study (see Additional file 1):

• COG dataset: this dataset contains the 63 prokaryotic genomes (13 archaea and 50 bacteria) which were used in the COG (clusters of orthologous groups) project (Tatusov et al., 2003) (http://www.ncbi.nlm.nih.gov/COG/). Using the COG IDs of the existing genes, the gene order sequences were found in each of the 63 genomes (see Additional file 2). The 16S rRNA sequences of these genomes were obtained from GenBank. These sequences were aligned using ClustalW in the Mobyle server (Néron et al., 2009) (http://mobyle.pasteur.fr/). The alignment results in a distance matrix, which was used to reconstruct the phylogenetic tree of these 63 species.
• Prochlorococcus dataset: this dataset contains 13 closely-related bacteria from the genus Prochlorococcus (Luo et al., 2008) with known gene orders. Gene order information of these genomes was kindly provided by Haiwei Luo (see Additional file 3). The phylogenetic tree based on 16S rRNA sequences was reconstructed as described below.

Supplementary material related to this article can be found, in the online version, at http://dx.doi.org/10.1016/j.biosystems.2014.09.002.

4.2. Computing evolutionary distances

SPRING (Lin et al., 2006) (http://algorithm.cs.nthu.edu.tw/tools/SPRING/) was used to compute the breakpoint distance. SoRT2 (Huang and Lu, 2010; Huang et al., 2010) (http://genome.cs.nthu.edu.tw/SORT2/) was used to calculate the rearrangement distance (Re), by selecting the “sorting by reversal, generalized transpositions and translocations” option with its default parameters. Finally, UniMoG (Bergeron et al., 2006), which includes implementations of the DCJ (Yancopoulos et al., 2005) and HP (Hannenhalli and Pevzner, 1995b; Jean and Nikolski, 2007; Tesler, 2002) methods, was used to compute the DCJ and HP distance measures, respectively. We obtained the same results by using the Re, HP and DCJ measures, which is in agreement with previous reports (Bergeron et al., 2008; Huang et al., 2010). Therefore, only DCJ results are included in this manuscript. In UniMoG, the input can contain GOS’s with different gene contents and also gene duplicates. If a gene is missing in one GOS, it is not considered in the analysis. Additionally, if a gene with the same identifier is found more than once, all occurrences except the first are discarded.
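The effect of the two input rules described for UniMoG (dropping genes that are missing from one GOS, and keeping only the first occurrence of a duplicated gene) can be illustrated with a short sketch. This is not UniMoG code; the function name and the example sequences are hypothetical, and unsigned genes are assumed.

```python
def to_common_permutations(gos_a, gos_b):
    """Reduce two unsigned GOS's to GPs over their shared genes.

    Genes absent from either GOS are dropped, and only the first
    occurrence of each remaining gene is kept.
    """
    shared = set(gos_a) & set(gos_b)

    def reduce(gos):
        seen, kept = set(), []
        for gene in gos:
            if gene in shared and gene not in seen:
                seen.add(gene)
                kept.append(gene)
        return kept

    return reduce(gos_a), reduce(gos_b)


# Hypothetical inputs: gene 7 and gene 8 are unshared, genes 2 and 4 repeat.
gp_a, gp_b = to_common_permutations([1, 2, 3, 2, 7, 4, 5], [3, 1, 2, 8, 5, 4, 4])
print(gp_a)  # [1, 2, 3, 4, 5]
print(gp_b)  # [3, 1, 2, 5, 4]
```

The example shows why this preprocessing suits closely related genomes better than distant ones: every unshared or duplicated gene is silently removed before the distance is computed.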

4.3. Our approach: computing evolutionary distances based on MCIs

We used the CI (Connecting Intervals) algorithm (Schmidt and Stoye, 2004) for finding maximal common intervals (MCIs) in two GOS’s. Please note that GOS’s are unsigned here. Suppose that we find $N$ MCIs with sizes $l_1, \ldots, l_N$. Based on the resulting MCIs, we consider three different measures of similarity between two GOS’s:

• $N$, the number of MCIs;
• $\sigma = \sum_{i=1}^{N} l_i$, the sum of MCI sizes; and
• $E = \sqrt{\sum_{i=1}^{N} l_i^2}$, the Euclidean norm of MCI sizes.

The similarity matrix can be simply converted to a distance matrix by subtracting each similarity value from the maximum of the observed similarity values, i.e., the maximum of the elements of the similarity matrix. We also followed another strategy for computing distance matrices. In our analysis, some genomic segments are not included in any MCI. Informally speaking, these segments represent the unmatched parts of the genomes, and therefore can show how different the genomes are. For these uncommon segments (UCS’s) in a pair of genomes, we can define several measures of discrepancy. Suppose that after finding the MCIs, in the first (resp. second) genome we find $k$ (resp. $k'$) uncommon segments, with sizes $u_1, \ldots, u_k$ (resp. $u'_1, \ldots, u'_{k'}$). Let the total number of genes in the first (resp. second) genome be $t$ (resp. $t'$). We define two measures of discrepancy, $\tilde{\sigma}$ and $\tilde{E}$, as follows:

• $\tilde{\sigma} = \frac{1}{2}\left(\frac{\sum_{i=1}^{k} u_i}{t} + \frac{\sum_{i=1}^{k'} u'_i}{t'}\right)$, the average sum of UCS sizes; and
• $\tilde{E} = \frac{1}{2}\left(\frac{\sqrt{\sum_{i=1}^{k} u_i^2}}{t} + \frac{\sqrt{\sum_{i=1}^{k'} (u'_i)^2}}{t'}\right)$, the average Euclidean norm of UCS sizes.
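The five measures of this section are simple functions of the MCI and UCS sizes, as the following sketch shows. The segment sizes are those drawn in Fig. 1, whereas the genome lengths `t_a` and `t_b` are hypothetical values chosen for illustration. The sketch also assumes that, in the average Euclidean norm, the genome length divides the norm from outside the square root.

```python
import math


def mci_similarities(mci_sizes):
    """Return (N, sigma, E): count, sum and Euclidean norm of the MCI sizes."""
    n = len(mci_sizes)
    sigma = sum(mci_sizes)
    e = math.sqrt(sum(size * size for size in mci_sizes))
    return n, sigma, e


def ucs_discrepancies(ucs_a, t_a, ucs_b, t_b):
    """Return (sigma_tilde, E_tilde) from the UCS sizes of two genomes.

    Assumption: each norm is divided by the genome length t outside the root.
    """
    sigma_tilde = 0.5 * (sum(ucs_a) / t_a + sum(ucs_b) / t_b)
    e_tilde = 0.5 * (math.sqrt(sum(u * u for u in ucs_a)) / t_a
                     + math.sqrt(sum(u * u for u in ucs_b)) / t_b)
    return sigma_tilde, e_tilde


# Sizes as drawn in Fig. 1; the genome lengths 12 and 15 are hypothetical.
print(mci_similarities([2, 3, 2, 2]))
print(ucs_discrepancies([3], 12, [4, 2], 15))
```

For these inputs N = 4 and σ = 9, and E is the square root of 21. Dividing by the genome lengths keeps the two UCS-based measures comparable between pairs of genomes of very different sizes.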

4.4. Phylogenetic tree creation methods

For creating phylogenetic trees from a distance matrix, three distance-based methods were used in this work, namely UPGMA (Sokal and Michener, 1958), neighbor joining (NJ) (Saitou and Nei, 1987) and Fitch–Margoliash (FM) (Fitch and Margoliash, 1967). PHYLIP (Retief, 1999) was used for creating the UPGMA, NJ and FM phylogenetic trees.
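Since the tree creation step takes a distance matrix as input, a small helper that writes such a matrix in the square PHYLIP distance-matrix layout (the number of taxa on the first line, then one row per taxon with a name field of ten characters) may be useful. The function name, the taxon labels and the distances below are made up for illustration, and the exact options with which PHYLIP was run are not given in the text.

```python
def to_phylip(names, dist):
    """Format a square distance matrix in the PHYLIP distance-matrix layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        # Name truncated/padded to 10 characters, then the row of distances.
        lines.append(f"{name[:10]:<10} " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"


# Made-up labels and distances for three genomes.
names = ["GenomeA", "GenomeB", "GenomeC"]
dist = [
    [0.0, 5.0, 7.0],
    [5.0, 0.0, 2.0],
    [7.0, 2.0, 0.0],
]
print(to_phylip(names, dist), end="")
```

A file in this layout is the kind of input read by the distance-based programs of the PHYLIP package, such as `neighbor` (NJ and UPGMA) and `fitch` (Fitch–Margoliash).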

4.5. Comparison of phylogenetic trees

Different strategies for phylogenetic tree reconstruction may result in different trees. In order to compare the resulting trees, we used the PhyloCore software (Nye et al., 2006). Each reconstructed tree should be provided to the software as a cladogram. Then, for each pair of input files in PHYLIP format, PhyloCore computes the percentage similarity of the two trees. Briefly, PhyloCore aligns a pair of phylogenetic trees by matching those branches which share equivalent leaf elements. In a phylogenetic tree, each branch partitions the set of leaf nodes into a pair of subsets. Consequently, by comparing the two corresponding subsets of leaf nodes, a percentage similarity score is computed for every pair of edges in the two aligned trees. A key feature of this approach is the possibility of aligning a pair of phylogenetic trees with different topological structures, as long as they contain the same set of leaves. In other words, PhyloCore can compute the similarity of a binary tree and a non-binary tree. For the visualization of the compared trees, Dendroscope 3 was used (Huson and Scornavacca, 2012). For reference trees, we decided to use three standards: (i) the phylogenetic tree obtained by the “common tree” feature of the NCBI taxonomy browser (Sayers et al., 2011) (available from: http://www.ncbi.nlm.nih.gov/Taxonomy/), which is mostly based


Table 1
Similarity of different reconstructed phylogenetic trees with the NCBI taxonomy for 63 prokaryotes. In this table, the first column shows the similarity or dissimilarity measure used for tree reconstruction. In the second to the fourth columns, the percent similarity between the reconstructed trees and the NCBI taxonomy tree is presented, when the UPGMA, NJ, or FM method is used for tree creation. Tree similarities are computed by PhyloCore software (Nye et al., 2006). For more information about the measures in this table, please see Section 4.

| Measure | % similarity, UPGMA | % similarity, NJ | % similarity, FM | Reference |
|---|---|---|---|---|
| 16S rRNA | 94.0 | 96.5 | 96.9 | N/A |
| Br | 71.7 | 72.3 | 60.5 | (Sankoff and Blanchette, 1998) |
| S | 89.0 | 92.3 | 92.0 | (Markov and Zakharov, 2009) |
| N | 89.6 | 89.7 | 91.3 | present study |
| $\sigma$ | 92.0 | 93.0 | 93.1 | present study |
| E | 89.0 | 92.7 | 89.0 | present study |
| $\tilde{\sigma}$ | 93.6 | 94.9 | 94.1 | present study |
| $\tilde{E}$ | 88.8 | 85.1 | 81.1 | present study |
| DCJ | 49.3 | 44.0 | 55.0 | (Bergeron et al., 2006) |

while the whole simulated tree can be considered as the reference tree for assessing the correctness of the reconstruction. Two strategies were considered to generate the offspring GOS’s. In the first strategy, the trees grow symmetrically, i.e., in each step we produce exactly two GOS’s from each ancestral node. Therefore, after $n$ steps, the resulting tree will have $2^n$ leaves. In the second strategy, in each step, with equal probability, we either produce one GOS or two GOS’s from each ancestral node. Therefore, the final tree is not necessarily symmetrical. We used three different parameter settings in the three runs of simulations. In our tool, several parameters are adjustable; these parameters are presented in Additional file 5. Five rearrangement events are allowed in the random generation of GOS’s, namely insertion, deletion, duplication, transposition and inversion. Additionally, the length of the ancestral GOS, the maximum number of possible different genes in each GOS, the number of GOS-generating iterations, and the tree-growing strategy (symmetrical vs. asymmetrical) can also be defined.
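A toy version of the symmetric growing strategy can be sketched as below. It is not the simulation tool used in this study: the way each of the five events is applied, the parameter names and the default values are all assumptions made for illustration, the sequences are unsigned, and only the leaves are returned rather than the full tree.

```python
import random

EVENTS = ("insertion", "deletion", "duplication", "transposition", "inversion")


def mutate(gos, n_genes, rng):
    """Apply one randomly chosen event to a copy of an unsigned GOS."""
    g = list(gos)
    event = rng.choice(EVENTS)
    i = rng.randrange(len(g))
    j = rng.randrange(i, len(g))
    if event == "insertion":
        g.insert(i, rng.randint(1, n_genes))
    elif event == "deletion" and len(g) > 2:
        del g[i]
    elif event == "duplication":
        g.insert(j + 1, g[i])          # copy the gene at i to just after j
    elif event == "transposition":
        segment = g[i:j + 1]
        del g[i:j + 1]
        k = rng.randrange(len(g) + 1)
        g[k:k] = segment               # re-insert the segment elsewhere
    elif event == "inversion":
        g[i:j + 1] = g[i:j + 1][::-1]
    return g


def simulate_symmetric(length, n_genes, steps, events_per_branch=2, seed=0):
    """Grow a symmetric tree for `steps` generations and return its leaves."""
    rng = random.Random(seed)
    generation = [[rng.randint(1, n_genes) for _ in range(length)]]
    for _step in range(steps):
        offspring = []
        for parent in generation:
            for _child in range(2):    # exactly two offspring per node
                child = parent
                for _event in range(events_per_branch):
                    child = mutate(child, n_genes, rng)
                offspring.append(child)
        generation = offspring
    return generation


leaves = simulate_symmetric(length=20, n_genes=30, steps=3)
print(len(leaves))  # 8 leaves after 3 steps
```

The asymmetric strategy would differ only in the inner loop, where one or two offspring would be produced with equal probability.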

5. Results and discussion

on Bergey’s Manual of Systematic Bacteriology (http://www.bergeys.org/); (ii) the distance-based phylogenetic trees reconstructed from the analysis of the 16S ribosomal RNA of these species. These trees were constructed as follows: first, the 16S ribosomal RNA sequences of all strains were obtained. Second, these sequences were aligned using the Mobyle server (Néron et al., 2009). Finally, based on this alignment, the phylogenetic tree was reconstructed using one of the methods for creating phylogenetic trees (i.e., UPGMA, NJ and FM). The 16S rRNA NJ and UPGMA trees were constructed as the consensus tree of 500 bootstrapping runs; and (iii) the 16S rRNA phylogenetic tree based on maximum likelihood (ML) analysis, with a total number of 500 bootstrapping runs. It should be noted that the ML, UPGMA and NJ tree creation and the bootstrapping were performed with the MEGA 6 software (Tamura et al., 2013), with its default parameter settings (see Additional file 4). Note that the NCBI taxonomy tree is not a binary tree in general. Additionally, it should be noted that neither the NCBI taxonomy tree nor the tree based on 16S ribosomal RNAs is a perfect standard tree for our analysis, since the phylogenetic trees based on common intervals may be influenced by evolutionary processes like horizontal gene transfer.
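The branch-matching idea described in Section 4.5, in which every branch is viewed as a partition of the leaf set and two branches are scored by comparing their subsets, can be sketched as follows. The score below follows the general form described by Nye et al. (2006), but the code is only an illustration and is not claimed to reproduce PhyloCore: in particular, the overall percentage is obtained here by a simple best-match average, which is an assumption, rather than by an optimal one-to-one pairing of branches. The two example trees are hypothetical.

```python
def jaccard(x, y):
    """Size of the intersection divided by the size of the union."""
    return len(x & y) / len(x | y)


def split_score(split_a, split_b):
    """Similarity of two branches, each given as a pair of leaf subsets."""
    (a0, a1), (b0, b1) = split_a, split_b
    return max(min(jaccard(a0, b0), jaccard(a1, b1)),
               min(jaccard(a0, b1), jaccard(a1, b0)))


def tree_similarity(splits_a, splits_b):
    """Average best-match score of the branches of tree A, as a percentage."""
    best = [max(split_score(a, b) for b in splits_b) for a in splits_a]
    return 100.0 * sum(best) / len(best)


def split(side, leaves):
    """Build a branch partition from one side and the full leaf set."""
    side = frozenset(side)
    return side, frozenset(leaves) - side


LEAVES = "ABCDEF"
tree_1 = [split("AB", LEAVES), split("ABC", LEAVES), split("EF", LEAVES)]
tree_2 = [split("AB", LEAVES), split("CD", LEAVES), split("EF", LEAVES)]
print(round(tree_similarity(tree_1, tree_2), 1))
```

Because the score only needs the leaf subsets induced by each branch, it can be evaluated for a binary tree against a non-binary one, provided that both trees have the same leaves.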

5.1. Analysis of the 63 prokaryotes


4.6. Experimental analysis on synthetic data

In the first part of our analysis, for each pair of the prokaryotic genomes we computed all MCIs. Based on the results, three measures of similarity, namely $N$, $\sigma$ and $E$, were computed, which were in turn used for computing pairwise genomic distances. Additionally, $\tilde{\sigma}$ and $\tilde{E}$ were used as direct measures of discrepancy between two genomes. The minimum size of MCIs and of UCS’s was assumed to be 3. We also used S, Br and DCJ as the distance measures reported in the literature (see Section 4, and also Tables 1 and 2). For tree creation, the UPGMA, NJ, and FM methods were used. In the next step, we need to know what combination of measures and tree creation methods is the most successful in


In order to further assess the validity of our findings, we developed a software tool for simulating genome evolution. In each simulation iteration we first generate a random ancestral GOS with fixed length, which is subsequently transformed using a number of defined rearrangement events. Each resulting GOS is a new “offspring” of the ancestral GOS, which can be further used to produce the next generation. At the end, we have a simulated phylogenetic tree. The leaves in this tree are simulated GOS’s. These GOS’s can be used for phylogenetic tree reconstruction,

Table 2
Similarity of different reconstructed phylogenetic trees with the tree based on 16S rRNA for 63 prokaryotes. In this table, the first column shows the similarity or dissimilarity measure used for tree reconstruction. In the second to the fourth columns, the percent similarity between the reconstructed trees and the 16S rRNA-based tree is presented, when the UPGMA, NJ or FM method is used for tree creation. Tree similarities are computed by PhyloCore software (Nye et al., 2006). For more information about the measures in this table, please see Section 4.

| Measure | % similarity, UPGMA | % similarity, NJ | % similarity, FM | Reference |
|---|---|---|---|---|
| Br | 50.5 | 51.6 | 51.0 | (Sankoff and Blanchette, 1998) |
| S | 72.1 | 75.7 | 75.5 | (Markov and Zakharov, 2009) |
| N | 76.6 | 72.9 | 73.9 | present study |
| $\sigma$ | 79.1 | 73.8 | 75.3 | present study |
| E | 75.1 | 78.2 | 73.5 | present study |
| $\tilde{\sigma}$ | 75.6 | 78.1 | 77.2 | present study |
| $\tilde{E}$ | 77.2 | 71.3 | 68.8 | present study |
| DCJ | 38.7 | 40 | 44.9 | (Bergeron et al., 2006) |


finding the phylogenetic relations of the prokaryotic species. For this purpose, we need at least a “reference tree”. The NCBI taxonomy tree and the trees based on 16S rRNAs were used as references, and each of the reconstructed trees in the present work was compared to each of these references. In contrast to the previous studies (Belda et al., 2005; Blin et al., 2005; Luo et al., 2008; Markov and Zakharov, 2009), we used a software tool, PhyloCore, for automatic comparison of the topologies of the reconstructed trees. Note that the NCBI taxonomy tree is not necessarily a binary tree. As an additional experiment, we investigated whether considering only “well-supported” branches in a tree based on 16S rRNA yields better agreement with MCI- and UCS-based trees. For this purpose, we first reconstructed a 16S rRNA-based bootstrapped tree with NJ as the tree creation method. Then, using MEGA6 (Tamura et al., 2013), we removed the branchings which are supported by less than 70 percent of the bootstrap runs. The results are presented in Additional file 6. We observed that the agreement between the two trees improved slightly, which suggests that the MCI- and UCS-based trees are more or less reliable. In Additional file 7 we present an example, where the s̃-based tree is compared with the 16S rRNA-based NJ tree including only well-supported branchings. Interestingly, those regions of the 16S rRNA-based tree which are less reliable (i.e., those regions in which the branchings are removed) are not in good agreement with the s̃-based tree. This observation suggests that the mismatching parts of the trees might result from the unreliability of the 16S rRNA-based trees. Supplementary material related to this article can be found, in the online version, at http://dx.doi.org/10.1016/j.biosystems.2014.09.002. In Tables 1 and 2, different methods for phylogenetic tree reconstruction are compared.
When UPGMA is used for tree creation, s and s̃ are the best measures for computing pairwise genome dissimilarity. Application of these measures resulted in a tree which has >92% similarity with the NCBI tree and >76.1% similarity with the 16S rRNA-based tree. We observed that s and s̃ perform slightly better than other measures. Except for Br and DCJ, the remaining measures also behave more or less similarly.
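
As a rough illustration of how a pairwise dissimilarity measure is turned into a tree, the following minimal Python sketch implements textbook UPGMA clustering (Sokal and Michener, 1958). It is not the software used in this study; the function name `upgma`, the nested-tuple tree representation and the toy distances are assumptions made only for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA topology as nested tuples.

    names: list of distinct leaf labels.
    dist:  dict mapping frozenset({a, b}) -> dissimilarity of leaves a and b.
    """
    size = {name: 1 for name in names}  # current clusters and their leaf counts
    d = dict(dist)                      # working copy, extended with merged clusters
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = size.pop(a), size.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in size:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    return next(iter(size))


# Toy dissimilarities between four hypothetical genomes (not data from the paper).
toy = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
dist = {frozenset(pair): value for pair, value in toy.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

Branch lengths are omitted in this sketch because only the tree topology enters the comparison with the reference trees.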

Comparably, when NJ is used for tree creation, s, s̃, S and E are the best measures for computing pairwise genome dissimilarity (>92.3% similarity for NCBI tree, >74.3% similarity for 16S rRNA-based tree). With FM as the tree creation method, s, s̃ and S are the best measures (>92.0% similarity for NCBI tree, >75.3% similarity for 16S rRNA-based tree), followed by N and E. In Tables 1 and 2, one can observe that in most of the cases, the overall behavior of the measures is comparable, and independent of the tree creation method (UPGMA, NJ, or FM). This observation suggests that the usefulness of these measures does not depend on the selection of the tree creation method.

Fig. 3 illustrates the comparison of two phylogenetic trees. The first tree is the NCBI taxonomy tree, while the second tree is reconstructed by considering s̃ as the distance measure and UPGMA as the tree creation method. The visualization of the compared trees is performed by Dendroscope 3 (Huson and Scornavacca, 2012). Another comparison is shown in Fig. 4, where the tree based on 16S rRNA is compared with the tree reconstructed by considering s̃ as the distance measure. These comparisons suggest that our approach successfully separates bacteria and archaea. Additionally, closely related prokaryotes (based on the NCBI taxonomy and 16S rRNA) appear close to each other in the reconstructed phylogenetic tree.

By comparing Tables 1 and 2, one can see that phylogenetic trees based on the gene order data are more similar to the NCBI taxonomy tree than to the tree based on 16S rRNAs. One possible explanation for this observation is that the similarities to a multifurcating tree are naturally higher than the similarities to a binary tree. In Table 3, we compare the ML-based phylogenetic trees with the trees reconstructed in the present work. This table is essentially similar to the previous tables, which suggests that the success of these measures does not depend on the selection of the reference tree. Fig. 5 shows the comparison of the ML-based tree (with a total number of 500 bootstrapping runs) with the tree reconstructed by considering s̃ as the distance measure and UPGMA as the tree creation method. Again, a reasonable agreement is observed.

In this study, based on the orders of genes, we presented a number of distance measures between a pair of genomes. Among

Fig. 3. Comparison of the NCBI taxonomy tree (A) with the phylogenetic tree reconstructed by considering s̃ as the distance measure and UPGMA as the tree creation method (B). The taxonomy tree is based on the NCBI taxonomy browser. Comparison and visualization of the trees are performed by Dendroscope 3 software.


Fig. 4. Comparison of the UPGMA phylogenetic tree based on 16S rRNA (A) with the phylogenetic tree reconstructed by considering s̃ as the distance measure and UPGMA as the tree creation method (B). The 16S rRNA tree is validated with a total number of 500 bootstrapping runs. Comparison and visualization of the trees are performed by Dendroscope 3 software.

these measures, s and s̃, and also E and Ẽ, result in trees which are more consistent with the reference trees. All these measures somehow summarize both the number and the size of maximal common intervals or uncommon segments. Undoubtedly, genomes which share longer maximal common intervals are expected to be closer compared to the genomes with shorter ones. Therefore, it is not surprising to observe that s, s̃, E and Ẽ generally result in better phylogenetic trees compared to measures like N and S, which only include the number of MCIs.

Among the analyzed measures in the study of the 63 genomes, Br appears not to be a suitable distance measure. For close genomes only a few differences are expected in the gene orders, while for distant genomes a lot of differences are expected. Breakpoint distance works based on the idea of finding consecutive gene pairs in the first genome which do not appear successively in the second genome. Hence, for closely related genomes in which most gene pairs are conserved, even small changes in gene positions significantly influence the Br value and are thus highly informative. In contrast, for distant genome pairs where we have too many non-conserved gene pairs, the Br value is not expected to change much when small changes occur in the order of the genes. For DCJ this effect is even more noticeable. In fact, all “uncommon” genes are removed before DCJ is computed. In the above-mentioned dataset, on average, only a limited number of common genes are found between each pair of genomes. Therefore, it is not surprising to observe that, in this case, DCJ is not a successful measure for phylogenetic tree reconstructions.
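
The saturation effect described above can be made concrete with a minimal Python sketch. It assumes unsigned, linear gene orders in which every gene occurs once, and it counts neighbouring gene pairs of the first genome that are not neighbours in the second. The function names and the toy gene orders are hypothetical, and the Br implementation used in the study may differ (for example in how circular chromosomes or gene orientation are handled).

```python
def adjacencies(order):
    """Return the set of unordered neighbouring gene pairs of a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoint_distance(first, second):
    """Count adjacencies of `first` that are not adjacencies of `second`."""
    return len(adjacencies(first) - adjacencies(second))


reference = [1, 2, 3, 4, 5, 6, 7, 8]
close = [1, 2, 5, 4, 3, 6, 7, 8]           # one reversal of the segment 3..5
distant = [4, 7, 2, 8, 5, 1, 6, 3]         # heavily scrambled order
distant_edited = [4, 7, 2, 8, 5, 1, 3, 6]  # same order with the last two genes swapped

print(breakpoint_distance(reference, close))           # 2
print(breakpoint_distance(reference, distant))         # 7
print(breakpoint_distance(reference, distant_edited))  # 7
```

In this toy case a single reversal moves the distance from 0 to 2 for the close pair, whereas an additional change in the scrambled order leaves the distance at its maximum of 7, which mirrors the loss of resolution for distant genomes.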


5.2. Analysis of the thirteen species from genus Prochlorococcus


As mentioned above, Br is expected to work best for very close genomes. Not surprisingly, this measure was originally proposed for the analysis of gene permutations (GPs) (Watterson et al., 1982). We decided to repeat our analysis on a dataset of thirteen closely-related bacteria from the genus Prochlorococcus (Luo et al., 2008). With this analysis, one can compare the trees reconstructed based on the several different distance measures which were reported in Tables 1 and 2. Table 4 summarizes the results. For the NJ and FM methods, E is still the best dissimilarity measure. However, Br and DCJ perform much better than what we observed in the case of the 63 prokaryotic genomes. This improvement is so considerable that in the case of UPGMA, Br becomes the best dissimilarity measure. This


Table 3 Similarity of different reconstructed phylogenetic trees with the ML-based tree using 16S rRNAs of the 63 prokaryotes. In this table, the first column shows the similarity or dissimilarity measure used for tree reconstruction. In the second to the fourth columns, percent similarity between the reconstructed trees and the 16S rRNA-based tree are presented, when UPGMA, NJ or FM methods are used for tree creation. Tree similarities are computed by PhyloCore software (Nye et al., 2006). For more information about the measures in this table, please see Section 4.

Measure    % similarity, UPGMA   % similarity, NJ   % similarity, FM   Reference
16S rRNA   82.5                  96.0               90.0               N/A
Br         51.1                  50.9               50.8               (Sankoff and Blanchette, 1998)
S          74.1                  75.6               77.5               (Markov and Zakharov, 2009)
N          74.4                  73.5               74.4               present study
s          75.8                  75.0               75.4               present study
E          72.7                  77.0               73.8               present study
s̃          78.5                  78.7               79.6               present study
Ẽ          72.9                  72.0               68.8               present study
DCJ        38.4                  39.7               46.2               (Bergeron et al., 2006)


Fig. 5. Comparison of the ML phylogenetic tree based on 16S rRNA (A) with the phylogenetic tree reconstructed by considering s̃ as the distance measure and UPGMA as the tree creation method (B). The 16S rRNA tree is validated with a total number of 500 bootstrapping runs. Comparison and visualization of the trees are performed by Dendroscope 3 software.

Table 4 Similarity of different reconstructed phylogenetic trees with the tree based on 16S rRNA in case of Prochlorococcus dataset. In this table, the first column shows the similarity or dissimilarity measure used for tree reconstruction. In the second to the fourth columns, percent similarity between the reconstructed trees and the 16S rRNA-based tree are presented, when UPGMA, NJ or FM methods are used for tree creation. Tree similarities are computed by PhyloCore software (Nye et al., 2006). For more information about the measures in this table, please see Section 4.

Measure   % similarity, UPGMA   % similarity, NJ   % similarity, FM   Reference
Br        93.3                  82.3               82.3               (Lin et al., 2006)
S         78.6                  85.8               81.3               (Markov and Zakharov, 2009)
N         58.3                  62.3               53.5               present study
s         70.8                  82.5               82.5               present study
E         82.5                  90.0               93.3               present study
s̃         71.2                  83.1               83.1               present study
Ẽ         64.2                  84.5               84.5               present study
DCJ       86.2                  82.3               84.3               (Bergeron et al., 2006)


observation is related to the fact that in case of closely related genomes, almost all of the genes are in common among all the genomes. Therefore, (almost) no information is lost when Br is computed for the genomes with the same gene contents, which are in fact GPs.
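
The role of shared gene content can be illustrated with a short Python sketch. It removes the genes that are not shared by two gene orders, which is the kind of reduction needed before a permutation-based distance can be applied, and reports the fraction of genes that survive. The function name, the toy gene labels and the use of the Jaccard index as a summary are assumptions for illustration only.

```python
def restrict_to_common(first, second):
    """Drop genes not shared by both gene orders.

    Returns the two reduced orders and the fraction of all observed genes
    that is retained (Jaccard index of the two gene sets).
    """
    common = set(first) & set(second)
    union = set(first) | set(second)
    reduced_first = [gene for gene in first if gene in common]
    reduced_second = [gene for gene in second if gene in common]
    return reduced_first, reduced_second, len(common) / len(union)


# Closely related pair: identical gene content, so nothing is discarded.
print(restrict_to_common(["a", "b", "c", "d", "e"], ["a", "c", "b", "d", "e"]))

# Distant pair: most genes are unique to one genome and are discarded.
print(restrict_to_common(["a", "b", "c", "d", "e"], ["x", "b", "y", "e", "z"]))
```

For the first pair the retained fraction is 1.0 and the reduced orders are true permutations of each other; for the second pair only two of eight genes remain, so most of the order information is lost before any distance is computed.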

In conclusion, breakpoint distance (Br) is a very good measure to reconstruct the phylogenetic trees for a set of closely related organisms. For a diverse set of organisms, however, we suggest using other measures like the sum of maximal common interval sizes (s) or the average sum of uncommon segment sizes (s̃).

Table 5 Similarity of different reconstructed phylogenetic trees with the tree based on a dataset of simulated random reference trees. In this table, the first column shows the similarity or dissimilarity measure used for tree reconstruction. In the second, third and fourth columns, similarity values between the reconstructed trees and the corresponding reference tree are presented, when UPGMA is used for tree creation. Tree similarities are computed by PhyloCore software (Nye et al., 2006). In case of each simulation with fixed parameters, the results are averaged over 10 repeats. For each simulation, the coefficient of variation (CV) is also presented in case of each measure. For more information about the measures in this table, please see Section 4.

          Simulation I                Simulation II               Simulation III
Measure   Average similarity   CV     Average similarity   CV     Average similarity   CV
Br        82.7                 0.12   81.1                 0.09   68.4                 0.08
S         84.1                 0.10   89.6                 0.05   91.4                 0.05
N         86.4                 0.10   86.5                 0.04   73.6                 0.07
s         94.9                 0.06   97.1                 0.04   95.1                 0.05
E         95.9                 0.05   97.1                 0.04   94.7                 0.05
s̃         75.3                 0.15   70.5                 0.15   49.8                 0.24
Ẽ         95.7                 0.05   96.8                 0.04   89.4                 0.07
DCJ       34.2                 0.20   30.3                 0.06   27.9                 0.04
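
For reference, the coefficient of variation reported in Table 5 is the standard deviation of the similarity values divided by their mean. A minimal Python sketch follows; the ten values are made up for illustration, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def average_and_cv(similarities):
    """Return the mean and the coefficient of variation of a list of values."""
    avg = mean(similarities)
    return avg, stdev(similarities) / avg  # stdev() is the sample estimator


# Hypothetical similarity values from ten simulation repeats (not data from the paper).
repeats = [94.0, 97.5, 91.0, 96.0, 99.0, 88.5, 95.0, 93.0, 98.0, 97.0]
avg, cv = average_and_cv(repeats)
print(f"average similarity = {avg:.1f}, CV = {cv:.2f}")
```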


5.3. Analysis of simulated data


To show the validity of our findings, we also tested the above-mentioned methods on simulated data. Three simulation schemes were chosen with a variety of parameter settings (see Section 4 and also Additional file 5). Our parameters were chosen in a way to generate GOS’s which are fairly different from their ancestor(s). For each simulation scheme, in each simulation repeat, we randomly generated a number of GOS’s with known phylogeny (i.e., the reference tree). Each of the reconstructed phylogenetic trees obtained by different distance measures was compared to the reference tree. The simulation was repeated 10 times for each of the simulation schemes I, II and III, and the results are averaged over the ten simulations. These findings are summarized in Table 5. Similar to the results in Section 3.1, the suggested measures in the present work are among the best distance measures for tree reconstruction. Apparently, DCJ cannot compete with the other measures. This is presumably due to the fact that this measure ignores the uncommon genes, whose number is relatively high in our simulations. On the other hand, for all simulation schemes, E and s perform well. Moreover, Ẽ is also good in simulation schemes I and II.
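
The general idea of such a simulation can be sketched in Python as follows. This is not the authors' tool: the event types (reversal, short deletion, insertion of a new gene), their equal probabilities, the complete binary topology and all parameter values are assumptions chosen for illustration, whereas the actual settings are those given in Section 4 and Additional file 5.

```python
import itertools
import random


def mutate(gos, rng, n_events, new_genes):
    """Return a copy of a gene order sequence after random rearrangement events."""
    gos = list(gos)
    for _ in range(n_events):
        event = rng.choice(["reversal", "deletion", "insertion"])
        i, j = sorted(rng.sample(range(len(gos) + 1), 2))  # 0 <= i < j <= len(gos)
        if event == "reversal":
            gos[i:j] = reversed(gos[i:j])
        elif event == "deletion" and len(gos) > 10:
            del gos[i:min(j, i + 3)]           # lose at most three genes
        elif event == "insertion":
            gos.insert(i, next(new_genes))     # gain a gene unseen elsewhere
    return gos


def simulate(rng, length=50, generations=3, events_per_branch=5):
    """Return (reference_tree, leaves) for one simulated phylogeny.

    reference_tree is a nested tuple of leaf names; leaves maps each
    leaf name to its simulated gene order sequence.
    """
    new_genes = itertools.count(length)        # identifiers for inserted genes
    ancestor = list(range(length))
    rng.shuffle(ancestor)                      # random ancestral GOS of fixed length
    leaves = {}

    def grow(gos, depth, name):
        if depth == generations:
            leaves[name] = gos
            return name
        left = grow(mutate(gos, rng, events_per_branch, new_genes), depth + 1, name + "0")
        right = grow(mutate(gos, rng, events_per_branch, new_genes), depth + 1, name + "1")
        return (left, right)

    return grow(ancestor, 0, "g"), leaves


tree, leaves = simulate(random.Random(1))
print(tree)
print(len(leaves), "simulated GOS's")
```

Because the reference topology is known by construction, the leaves can be passed to any distance measure and tree creation method, and the reconstructed tree can then be compared with `tree`.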


Acknowledgements


We would like to thank H. Luo (University of Georgia) for providing the Prochlorococcus dataset. This work was supported in part by a grant from IPM.


References


Béal, M.-P., Bergeron, A., Corteel, S., Raffinot, M., 2004. An algorithmic view of gene teams. Theor. Comput. Sci. 320, 395–418.
Belda, E., Moya, A., Silva, F.J., 2005. Genome rearrangement distances and gene order phylogeny in gamma-proteobacteria. Mol. Biol. Evol. 22, 1456–1467.
Bergeron, A., Corteel, S., Raffinot, M., 2002. The algorithmic of gene teams. Lect. Notes Comput. Sci. 2452, 464–476.
Bergeron, A., Mixtacki, J., Stoye, J., 2006. A unifying view of genome rearrangements. Lect. Notes Comput. Sci. 4175, 163–173.
Bergeron, A., Mixtacki, J., Stoye, J., 2008. HP distance via double cut and join distance. Lect. Notes Comput. Sci. 5029, 56–68.
Bernt, M., Merkle, D., Ramsch, K., Fritzsch, G., Perseke, M., Bernhard, D., Schlegel, M., Stadler, P.F., Middendorf, M., 2007. CREx: inferring genomic rearrangements based on common intervals. Bioinformatics 23, 2957–2958.
Blin, G., Rizzi, R., 2005. Conserved interval distance computation between nontrivial genomes. Lect. Notes Comput. Sci. 3595, 22–31.
Blin, G., Chauve, C., Fertin, G., 2005. Genes order and phylogenetic reconstruction: application to γ-proteobacteria. Lect. Notes Comput. Sci. 3678, 11–20.
Delgado, J., Lynce, I., Manquinho, V., 2010. Computing the summed adjacency disruption number between two genomes with duplicate genes. J. Comput. Biol. 17, 1243–1265.
Dobzhansky, T., Sturtevant, A.H., 1938. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics 23, 28–64.
El-Mabrouk, N., Sankoff, D., 2012. Analysis of gene order evolution beyond single-copy genes. Methods Mol. Biol. 855, 397–429.
Fertin, G., Labarre, A., Rusu, I., Tannier, E., Vialette, S., 2009. Combinatorics of Genome Rearrangements. MIT Press.
Fitch, W.M., Margoliash, E., 1967. Construction of phylogenetic trees. Science 155, 279–284.
Hannenhalli, S., Pevzner, P., 1995a. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. In: Leighton, F.T., Borodin, A. (Eds.), STOC '95 Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing. ACM, New York, NY, USA, pp. 178–189.
Hannenhalli, S., Pevzner, P.A., 1995b. Transforming men into mice (polynomial algorithm for genomic distance problem). Proceedings of 36th Annual Symposium on Foundations of Computer Science, Milwaukee, WI, pp. 581–592.
Hao, B.L., Gao, L., 2008. Prokaryotic branch of the tree of life: a composition vector approach. J. Syst. Evol. 46, 258–262.
He, X., Goldwasser, M.H., 2005. Identifying conserved gene clusters in the presence of homology families. J. Comput. Biol. 12, 638–656.
Heber, S., Stoye, J., 2001. Finding all common intervals of k-permutations. Lect. Notes Comput. Sci. 2089, 207–218.
Huang, Y.-L., Lu, C.L., 2010. Sorting by reversals, generalized transpositions, and translocations using permutation groups. J. Comput. Biol. 17, 685–705.
Huang, Y.L., Huang, C.C., Tang, C.Y., Lu, C.L., 2010. SoRT2: a tool for sorting genomes and reconstructing phylogenetic trees by reversals, generalized transpositions and translocations. Nucleic Acids Res. 38, W221–W227.


Huson, D.H., Scornavacca, C., 2012. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst. Biol. 61, 1061–1067.
Jahn, K., 2010. Approximate common intervals based gene cluster models. Ph.D. Thesis. Bielefeld University.
Jean, G., Nikolski, M., 2007. Genome rearrangements: a correct algorithm for optimal capping. Inf. Process. Lett. 104, 14–20.
Kováč, J., Warren, R., Braga, M.D.V., Stoye, J., 2011. Restricted DCJ model: rearrangement problems with chromosome reincorporation. J. Comput. Biol. 18, 1231–1241.
Lin, Y.C., Lu, C.L., Liu, Y.-C., Tang, C.Y., 2006. SPRING: a tool for the analysis of genome rearrangement using reversals and block-interchanges. Nucleic Acids Res. 34, W696–W699.
Ling, X., He, X., Xin, D., 2009. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics 25, 571–577.
Luc, N., Risler, J.L., Bergeron, A., Raffinot, M., 2003. Gene teams: a new formalization of gene clusters for comparative genomics. Comput. Biol. Chem. 27, 59–67.
Luo, H., Shi, J., Arndt, W., Tang, J., Friedman, R., 2008. Gene order phylogeny of the genus Prochlorococcus. PLoS One 3, e3837.
Markov, A.V., Zakharov, I.A., 2009. Evolution of gene orders in genomes of cyanobacteria. Russ. J. Genet. 45, 906–916.
Moret, B.M.E., Wang, L.S., Warnow, T., Wyman, S.K., 2001. New approaches for reconstructing phylogenies from gene order data. Bioinformatics 17, S165–S173.
Morozov, A.A., Galachyants, Y.P., Likhoshway, Y.V., 2013. Inferring phylogenetic networks from gene order data. BioMed Res. Int. 2013, 503193.
Néron, B., Ménager, H., Maufrais, C., Joly, N., Maupetit, J., Letort, S., Carrere, S., Tuffery, P., Letondal, C., 2009. Mobyle: a new full web bioinformatics framework. Bioinformatics 25, 3005–3011.
Nye, T.M.W., Liò, P., Gilks, W.R., 2006. A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22, 117–119.
Perseke, M., Fritzsch, G., Ramsch, K., Bernt, M., Merkle, D., Middendorf, M., Bernhard, D., Stadler, P.F., Schlegel, M., 2008. Evolution of mitochondrial gene orders in echinoderms. Mol. Phylogenet. Evol. 47, 855–864.
Retief, J.D., 1999. Phylogenetic analysis using PHYLIP. Methods Mol. Biol. 132, 243–258.
Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.
Sanderson, M.J., Purvis, A., Henze, C., 1998. Phylogenetic supertrees: assembling the trees of life. Trends Ecol. Evol. 13, 105–109.
Sankoff, D., Blanchette, M., 1998. Multiple genome rearrangement and breakpoint phylogeny. J. Comput. Biol. 5, 555–570.
Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B.F., Cedergren, R., 1992. Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. U. S. A. 89, 6575–6579.
Sayers, E.W., Barrett, T., Benson, D.A., Bolton, E., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Federhen, S., Feolo, M., Fingerman, I.M., Geer, L.Y., Helmberg, W., Kapustin, Y., Landsman, D., Lipman, D.J., Lu, Z., Madden, T.L., Madej, T., Maglott, D.R., Marchler-Bauer, A., Miller, V., Mizrachi, I., Ostell, J., Panchenko, A., Phan, L., Pruitt, K.D., Schuler, G.D., Sequeira, E., Sherry, S.T., Shumway, M., Sirotkin, K., Slotta, D., Souvorov, A., Starchenko, G., Tatusova, T.A., Wagner, L., Wang, Y., Wilbur, W.J., Yaschenko, E., Ye, J., 2011. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 39, D38–D51.
Schmidt, T., Stoye, J., 2004. Quadratic time algorithms for finding common intervals in two and more sequences. Lect. Notes Comput. Sci. 3109, 347–358.
Sicheritz-Pontén, T., Andersson, S.G.E., 2001. A phylogenomic approach to microbial evolution. Nucleic Acids Res. 29, 545–552.
Sokal, R., Michener, C., 1958. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38, 1409–1438.
Soltis, P.S., Soltis, D.E., 2001. Molecular systematics: assembling and using the tree of life. Taxon 50, 663–677.
Sturtevant, A.H., Dobzhansky, T., 1936. Inversions in the third chromosome of wild races of Drosophila pseudoobscura, and their use in the study of the history of the species. Proc. Natl. Acad. Sci. U. S. A. 22, 448–450.
Suyama, M., Bork, P., 2001. Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends Genet. 17, 10–13.
Swenson, K.M., Marron, M., Earnest-Deyoung, J.V., Moret, B.M.E., 2008. Approximating the true evolutionary distance between two genomes. J. Exp. Algorithmics 12, 3.5.
Tamura, K., Stecher, G., Peterson, D., Filipski, A., Kumar, S., 2013. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30, 2725–2729.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A., 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
Tesler, G., 2002. Efficient algorithms for multichromosomal genome rearrangements. J. Comput. Syst. Sci. 65, 587–609.
Uno, T., Yagiura, M., 2000. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26, 290–309.
Watterson, G.A., Ewens, W.J., Hall, T.E., Morgan, A., 1982. The chromosome inversion problem. J. Theor. Biol. 99, 1–7.
Wittler, R., 2010. Phylogeny-based analysis of gene clusters. Ph.D. Thesis. Bielefeld University, Germany.
Yancopoulos, S., Attie, O., Friedberg, R., 2005. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21, 3340–3346.
Zhang, M., Leong, H.W., 2009. Gene team tree: a hierarchical representation of gene teams for all gap lengths. J. Comput. Biol. 16, 1383–1398.
