YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx 1

Contents lists available at ScienceDirect

Molecular Phylogenetics and Evolution journal homepage: www.elsevier.com/locate/ympev

Short Communication

2

MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola

6 4 7 5 8

Q1

a

9 10

b

12 11 13 1 2 5 8 16 17 18 19 20

Francesco Nardi a,⇑, Pietro Liò b, Antonio Carapelli a, Francesco Frati a Department of Life Sciences, University of Siena, Via Aldo Moro 2, 53100 Siena, Italy Computer Laboratory, University of Cambridge. William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK

a r t i c l e Q3

21 22 23 24 25 26 27

i n f o

Article history: Received 29 November 2013 Revised 30 January 2014 Accepted 2 February 2014 Available online xxxx Keywords: Amino acid matrices k-Means clustering MtPAN3 Pancrustacea Collembola

a b s t r a c t Phylogenetic analyses of Pancrustacea have generally relied on empirical models of amino acid substitution estimated from large reference datasets and applied to the entire alignment. More recently, following the observation that different sites, or groups of sites, may evolve under different evolutionary constraints, methods have been developed to deal with site or site-class specific models. A set of three matrices has been here developed based on an alignment of complete mitochondrial pancrustacean genomes partitioned using an unsupervised clustering procedure acting over per-site physiochemical properties. The performance of the proposed matrix set – named MtPAN3 – was compared to relevant single matrix models (MtZOA, MtART, MtPAN) under ML and BI. While the application of the new model does not solve some of the topological problems frequently encountered with pancrustacean mitogenomic analyses, MtPAN3 largely outperforms its competitors based on AIC and Bayes factors, indicating a significantly improved fit to the empirical data. The applicability of the new model, as well as of multiple matrix models in general, is discussed and an R/BioPerl script that implements the procedure is provided. Ó 2014 Published by Elsevier Inc.

29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

46 47

1. Introduction

48

Modern methods of phylogenetic analysis rely on a so-called model of molecular evolution, i.e. a mathematical model describing how sequences evolve through time (Liò and Goldman, 1998). At the core of an evolutionary model is the substitution rate matrix, in fact a composition of a symmetrical matrix of exchange rates that describes the instantaneous rate of change between each amino acid pair and a vector of equilibrium frequencies. The model can also include additional components, such as ASRV, tree shape and demographic processes, that are nevertheless not directly considered here. The actual values in the matrix can be either optimized alongside the phylogenetic tree during the analysis (i.e. mechanistic models: Yang et al., 1998) or estimated from large reference datasets a priori and subsequently applied to the specific dataset under consideration (i.e. empirical models). While nucleotide

49 50 51 52 53 54 55 56 57 58 59 60 61

Q2

Abbreviations: AIC, akaike information criterion; AICc, akaike information criterion, corrected; ASRV, among site rate variation; BF, Bayes factor; BI, bayesian inference; ML, maximum likelihood. ⇑ Corresponding author. Fax: +39 0577 234476. E-mail addresses: [email protected] (F. Nardi), [email protected] (P. Liò), [email protected] (A. Carapelli), [email protected] (F. Frati).

analyses almost invariantly use mechanistic models, amino acid analyses generally rely on empirical models, as the large (190) number of substitution parameters cannot generally be estimated form small/medium datasets. A number of matrices have been determined for both general interest proteins (e.g. PAM: Dayhoff et al., 1972; WAG: Whelan and Goldman, 2001; LG: Le and Gascuel, 2008) and specific genetic systems (e.g. MtREV: Adachi and Hasegawa, 1996 for mitochondrial proteins). Following the observation that differences exist between taxonomic groups, matrices specific to a given lineage have also been developed such as, among others, MtPAN for Pancrustacean mitochondrial genomes (Carapelli et al., 2007), MtART for Arthropoda (Abascal et al., 2007) and MtZOA for the entire Metazoa (Rota-Stabelli et al., 2009). Early methods of phylogenetic analysis generally assumed that each and every alignment site evolved under the same model (rates, exchange probabilities, equilibrium frequencies), with more sophisticated implementations going in the direction of modeling such aspects of sequence evolution explicitly. While ASRV is generally accounted for (Yang, 1996; Stamatakis, 2006a), variations in the substitution process (i.e. equilibrium frequencies and exchange probability matrix) are frequently overlooked. One possible approach has been to assign sites to a number of categories defined

http://dx.doi.org/10.1016/j.ympev.2014.02.001 1055-7903/Ó 2014 Published by Elsevier Inc.

Please cite this article in press as: Nardi, F., et al. MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84

YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 2 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113

F. Nardi et al. / Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx

a priori based on protein secondary structure/solvent accessibility (Goldman et al., 1998; Liò and Goldman, 1999) and to estimate/apply a different rate matrix to each site category. More recently Lartillot and Philippe (2004) used a mixture model where the exchange probability matrix is common to all sites – in fact uniform mutation probabilities are assumed – while stationary frequencies are allowed to vary between groups of sites and sites are automatically assigned to a group based on their fit – the CAT model. The use of site specific – or group specific – exchange probability matrices has in turn proved more problematic. Le et al. (2008) have described new models where multiple matrices are used, either specific to a given subset of the alignment known a priori (e.g. exposed/buried or extended/helix/other in supervised approaches) or to site partitions that are defined during the analysis and hence cannot be associated a priori with specific protein features (a so-called unsupervised approach). Although mixture models are probably the most theoretically favorable context where such procedures can be implemented, and in fact the models described in Le et al. (2008) have been included in a specific version of the software PhyML (PhyML-mixtures), a generalized application of the procedure in an unsupervised context has proven problematic, leading the use of semi-supervised models in Le et al. (2008). In this study we assess the applicability of a non-supervised procedure of site clustering (Dunn et al., 2013) coupled with the estimation of empirical matrices for each site cluster (Dang et al., 2011). A new set of three matrices specific for Pancrustacea is provided, as well as a script that automates the procedure. The performance of the method is evaluated based on ML and BI both on the reference Pancrustacea dataset and on a single order of interest.

114

2. Materials and methods

115

2.1. Data and matrix transformation

116

130

The data analyzed here derive from the manually curated dataset of complete mitochondrial genomes described in Carapelli et al. (2007) and van der Wath et al. (2008), base to the MtPAN matrix, that was further processed to eliminate gapped (322) and ambiguous (31) characters. The entire dataset (81 Pancrustacea and 5 outgroups, 2653 aligned positions: Supplementary data 1) was used for clustering and matrix estimation. The entire dataset, as well sequences from one single Hexapod order of special interest to the authors (8 Collembola and 2 outgroups), were in turn used to evaluate the likelihood and phylogenetic performance of the new matrix set. The amino acid alignment was transformed into a matrix of per site physiochemical properties as in Dunn et al. (2013), with rows corresponding to amino acid positions and columns to each of nine physiochemical properties (e.g. bulkiness, hydrophobicity: Kidera et al., 1985) averaged over each alignment column.

131

2.2. Clustering

132

The matrix of per site physiochemical properties was used to cluster sites into groups characterized by uniform physiochemical properties by the k-means and gap-statistic procedure of Tibshirani et al. (2001), grossly sovrapponible to Dunn et al. (2013), as implemented in a custom script (Supplementary script 1, Supplementary text 1). Values of k from 1 to 10 were assessed, starting from 10,000 random seeds. The reference distribution for logWk was determined according to method (b) in Tibshirani et al. (2001; function kindly contributed by K. Dunn) based on 250 simulated datasets for each k. The optimal number of clusters was identified based on the gap statistic of Tibshirani et al. (2001) as well as a comparison of the arising clusters with those obtained by Dunn et al. (2013) for mammals and fish.

117 118 119 120 121 122 123 124 125 126 127 128 129

133 134 135 136 137 138 139 140 141 142 143 144

3

2.3. Matrix estimation

145

Four new amino acid substitutions matrices were estimated using the maximum-likelihood estimation procedure of Le and Gascuel (2008) as implemented in ReplacementMatrix (Dang et al., 2011). New matrices were optimized on the complete dataset and on each of the three site-partitions obtained from the clustering procedure. We refer to the former, that is in fact a reestimation of the original MtPAN matrix, as MtPAN2013, while the latter three are globally referred to as matrix set MtPAN3 and individually identified by the dominant amino acids: MtPANLI MtPANTA MtPANSGP.

146

2.4. Matrix evaluation

156

The relative performance of matrix set MtPAN3 was assessed in a ML as well as BI framework compared to single matrices MtZOA, MtART and MtPAN2013. Comparisons were based on the entire pancrustacean dataset as well as on the subset of Collembola sequences. Analyses performed on the two datasets are identical except for the estimation of marginal likelihoods (see below). A parallel version of RAxML 7.6.3 (Stamatakis, 2006b) running on the CIPRES webserver (Miller et al., 2010) was used to (a) estimate the ML tree and nodal support based on the four competing matrices/matrix set and (b) to estimate the likelihood of a single common tree for model comparison. While MtZOA and MtART matrices are built-in in the software, new matrices were provided using the -P and -q options. ML searches were repeated starting from 10 random trees. ASRV was modeled using a C distribution as advised by the software author and the frequency of amino acids was fixed as in the respective matrices. Nodal support was assessed based on 100 non-parametric bootstrap replicates. The fit of different matrices to the data was evaluated based on AICc (Burnham and Anderson, 2002). A modified version of MrBayes 3.2.2 (Ronquist et al., 2012) was used for BI. Given the absence of the aforementioned matrices in the canonical MrBayes release, the source code was manually modified (as in Rota-Stabelli et al., 2009; code available from F.N.) by substituting some of the original matrices with MtZOA, MtART, MtPAN2013 and matrix sets MtPAN3, mtMamR1-3 and mtFishR1-3 (see below) before compilation. All analyses, run over 8 processors on the EURORA cluster at CINECA (Bologna, Italy), consisted of two runs of four chains each continued for one million generations. A pure gamma model of ASRV was employed, with amino acid frequencies fixed as in the respective matrices. Longer runs and alternative ASRV models were evaluated with no significant improvement, and their use was discontinued. Log files were visualized using the software Tracer 1.5.0 (Rambaut and Drummond, 2007) and a relative burnin of 20% generations was applied to all runs. An initial run employed model jumping over the three-partitioned dataset, with equal prior probability given to each matrix. Subsequent runs applied either MtZOA, MtART, MtPAN2013 or matrix set MtPAN3 (this latter to the partitioned dataset with unlinked rate multiplier and alpha) to estimate the best tree and posterior probabilities at nodes. The fit of different matrices to the data was evaluated using BF. Estimation of marginal likelihood employed the harmonic mean method (Newton and Raftery, 1994) on the full dataset, as alternative methods proved computationally prohibitive, and the stepping stone procedure (Xie et al., 2011) on the Collembola dataset. The stepping stone analysis consisted in two parallel runs of 4 chains and 12 million generations, sampling every 1000th generation. An initial burnin of 300 samples was followed by 39 steps of 75 samples of burnin and 225 samples from supposed posterior distribution. The significance of Bayes factors was interpreted following Kass and Raftery (1995).

157

Please cite this article in press as: Nardi, F., et al. MtPAN : Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

147 148 149 150 151 152 153 154 155

158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207

YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 F. Nardi et al. / Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx

214

In order to investigate the stability of model parameters across phylogenetically different taxa, and hence the portability of such models, a subset of the analyses (likelihood estimation of a single common tree and estimation of Bayes factors for model comparison) were replicated on the Pancrustacea and Collembola datasets applying the two three-matrix models described in Dunn et al. (2013) for fish and mammals.

215

2.5. Script

216

223

Transformation of the amino acid alignment into the corresponding physiochemical matrix and clustering of sites have been implemented in a R/BioPerl script (Supplementary script 1, Supplementary text 1). Input is an amino acid alignment in nexus format. Output are the physiochemical matrix, graphs of logWk and Gapk, and clustering of sites for each k assessed in a nexus compatible format. The user still has to select the most appropriate k based on the graphs and/or external evidence.

224

3. Results and discussion

225

3.1. Clustering

208 209 210 211 212 213

217 218 219 220 221 222

226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246

The matrix of per site physiochemical properties is provided as Supplementary text 2. When clustering in 1–10 groups is compared, values of logWk increase sharply with k going from 1 to 3, plateaux, and then increase again, with a marked elbow on k = 3 (Fig. 1). While a strict application of the one-standard-deviation rule of Tibshirani et al. (2001) would suggest the use of problematic high values of k, in fact above 10, the observation that (a) the difference between k = 3 and k = 4 is a mere 2 standard-deviations (compared to 4–43 for other values of k), (b) the course of logWk is sovrapponible in the Pancrustacea, mammal and fish dataset (Fig. 1), and most important (c) the clusters obtained with k = 3 correspond, in terms of number of sites, base composition and characteristic high rates between amino acid pairs, to the clusters obtained by Dunn et al. (2013) for mammals as well as fish datasets (Fig. 2), we considered this latter level of clustering (k = 3) to better describe the biological significance of the variability observed in the data. The three data partitions obtained, hereby referred to as LI, TA and SGP, include 1414, 657 and 582 characters, respectively. Partition LI, the largest one including roughly 50% of sites, corresponds to Group1 and associated matrix R1 in Dunn et al. (2013; Figs. 2 and 4), partition TA to Group 2 and

Fig. 1. Comparison of clustering in k groups. Gapk, i.e. the difference between observed and expected logWk, for k ranging between 1 and 10 clusters. Black line corresponds to the Pancrustacea dataset analyzed here, bars at points indicate ±Sk intervals. Grey bar correspond to the mammals (upper) and fish (lower) datasets, re-estimated based on Dunn et al. (2013) data for comparison. 3

3

matrix R2 and partition SGP to Group 3 and matrix R3 (see Supplementary Fig. 1 for genome wide distribution of clusters).

247

3.2. New matrices

249

New matrices are provided in paml and RAxML format in Supplementary text 3. The MtPAN2013 matrix appears to be fairly similar to MtZOA and MtART in terms of characteristic mutation types (e.g. I/V, D/E) and dominant amino acids (I, L, S). Focusing on exchange probabilities, high values tend to be stable across matrices, with only mutation Q/H displaying a mild increase (2.6) in MtPAN2013 compared to MtART. More substantial variations (>100) are concentrated among lower values and/or values that correspond to amino acids observed at very low frequencies, in fact the most difficult to estimate due to the small number of occurrences. Deviation in stationary frequencies between MtPAN2013 and MtART are ampler than between MtZOA and MtART (differences up to 0.024 and 0.015, respectively). Noticeable is the increase in amino acids encoded by AT-rich codons such as I, L, M, K and the parallel decrease in amino acids encoded by GC-rich codons G, P, A, H, clearly in line with the marked increase in AT content observed in many derived Hexapod lineages. The three matrices corresponding to the three partitions in the fish and mammal datasets of Dunn et al. (2013) and in the present MtPAN3 matrix set appear to be more differentiated. In all cases it is possible to identify some mutation types that are characteristic to a given partition, e.g. D/E and D/K in partition LI, I/V and F/Y in partition TA, V/M and again F/Y in partition SGP. Nevertheless substantial differences exist, such as E/R being among the most frequent mutations in mammals and Pancrustacea while virtually absent in fish (LI partition) and R/I being very frequent in fish and not in mammals and Pancrustacea (SGP partition). Matrices for partition TA are overall more similar across the three taxonomic groups than those for partitions LI and SGP.

250

3.3. Model fit to data

279

An initial assessment of which matrix is more appropriate for each of the three data partitions comes from the application of the model jumping procedure where different matrices (here MtZOA, MtART, MtPAN2013, MtPANLI MtPANTA and MtPANSGP) are proposed with equal prior probabilities to the partitioned alignment and their fit to the data is estimated a posteriori. The analysis unequivocally – and not unexpectedly – assigned matrix MtPANLI to the LI partition, MtPANTA to the TA partition and MtPANSGP to the SGP partition with posterior probability 1 in both the Pancrustacea and Collembola datasets. Model comparison based on AICc and Bayes factors provided similar results, consistent with the model jumping analysis (Table 1). Focusing on models developed on (or including) Pancrustacean data, matrix fit to the entire Pancrustacea dataset is (worse to best) MtZOA, MtART, MtPAN2013 and matrix set MtPAN3, with differences in AICc compared to the top scoring matrix, as well as of each matrix compared to others, all above 500. Bayes factors, here estimated based on the harmonic mean procedure, recover the same order, with differences in BFs in the order of hundreds to thousands, and hence all highly significant according to Kass and Raftery (1995). Matrix fit to the Collembola dataset recovers a slightly different ranking: MtZOA, MtPAN2013, MtART and matrix set MtPAN3. The difference in AICc as well as BF, here estimated using the stepping stone procedure, are substantial between MtPAN3 and the remaining matrices. The difference between MtART, MtPAN2013 and MtZOA is more limited, yet very significant at least between the first and the latter two. The better performance of MtART compared to MtPAN2013 in Collembola may find a justification in the fact that these latter are an early offshoot of

280

Please cite this article in press as: Nardi, F., et al. MtPAN : Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

248

251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278

281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308

YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 4

F. Nardi et al. / Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx

Fig. 2. Clusters obtained with k = 3 in the Pancrustacea, mammals and fish datasets. A, B, C: amino acid frequencies in clusters LI, TA and SGP, respectively; D: fraction of sites of the overall alignment in the three clusters.

Table 1 Model comparison under ML and BI. ML: models are ordered from lower to higher likelihood, DAICc is the difference in AICc of each model compared to the model giving highest likelihood. BI: the marginal likelihood of each model is estimated as harmonic mean of likelihoods (in natural log units) for the Pancrustacea dataset and using the stepping stone procedure for the Collembola dataset. Models are ordered from lower to higher marginal likelihood. Bayes factors (in natural log units) are expressed as the difference between the log-marginal likelihood of each model and the preceding one. In italics are models not specifically developed for, or to include, Pancrustacea. Pancrustacea Model (ML) mtMamR1-3 MtFishR1-3a MtZOA MtART MtPAN2013 MtPAN3 Model (BI) mtFishR1-3a mtMamR1-3 MtZOA MtART MtPAN2013 MtPAN3 a b c d

309 310 311 312 313 314 315 316 317 318 319 320

Collembola Likelihood

DAICc 170452.97 169203.06 153742.31 153197.72 152642.17 150688.57

Marginal likelihood

Model (ML)

39528.79 37028.97 6102.9 5013.72 3902.61 – Bayes factor

172919 172139.5 167261.9 164562.5 163998 160531.2

– 779.5c 4877.6c 2699.4c 564.5c 3466.8c

MtFishR1-3a mtMamR1-3 MtZOA MtPAN2013 MtART MtPAN3 Model (BI) MtFishR1-3a,b MtZOA MtPAN2013 MtMamR1-3b MtART MtPAN3

Likelihood

DAICc 29720.12 29715.9 29081.62 29081.18 28983.67 27163.99

Marginal likelihood -29385.47 29257.76 29256.74 29216.32 29160.12 27148.31

5112.26 5103.81 3831.24 3830.38 3635.35 – Bayes factor – 127.71c 1.02d 40.42c 56.2c 2011.81c

Matrices ‘‘mol’’ in Dunn et al. (2013). The harmonic mean method was used instead of the stepping stone method. ‘‘very strong’’. ‘‘positive’’ evidence in favor of the model (Kass and Raftery, 1995).

Hexapoda, differentiating earlier than the development of hexapod derived evolutionary features (e.g. AT bias) that may be best captured by MtPAN2013. In summary, matrix set MtPAN3 largely outperforms all other single matrices regardless of the dataset evaluated. Not unexpectedly, comparisons of MtPAN3 with the other two available three-matrix models mtMamR1-3 and mtFishR1-3 derived from mammalian and fish data (Dunn et al., 2013), indicate that these latter perform significantly worse on the Pancrustacea and Collembola dataset (Table 1), with the possible exception of model MtMamR1-3 on the Collembola dataset, that nevertheless we deem not totally credible given the method used here to 3

estimate marginal likelihood. Nevertheless, ancillary analyses (data not shown) suggest that the major hurdle of these models when applied to Pancrustacea and Collembola may be in the associated stationary frequencies more than in the replacement matrix, as both models evaluated perform better in general, and better than available single matrix models in Collembola (ML only), when pancrustacean frequencies are enforced. The associated phylogenetic trees for the entire Pancrustacea dataset are comparable to the results of Carapelli et al. (2007), yet with some differences. Trees tend to be less resolved overall, possibly due at least in part to the different analytical strategy (e.g. selection of one best run among 10 trials vs combination of

Please cite this article in press as: Nardi, F., et al. MtPAN : Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

321 322 323 324 325 326 327 328 329 330 331 332

YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 F. Nardi et al. / Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx

369

two runs) or exclusion of some additional characters. In fact, basalmost nodes, i.e. above the level of orders, are not resolved in most analyses. On the other hand, nodes defining orders and lower categories are generally recovered with high support. The recovered topologies carry some of the inconsistencies and/or problematic relationships discussed in Carapelli et al. (2007). The position of Collembola and Diplura outside an Insecta s.s. + Crustacea clade (i.e. a polyphiletic status of Hexapoda) is weakly supported, yet it is recovered in the best scoring ML tree under all matrices. Similarly, the problematic clade composed of Pachypsilla, Trialeurodes (Hexapoda, Homoptera), Xenos (Hexapoda, Strepsiptera) and Armillifer (Crustacea, Pentastomida) is recovered in most analyses as monophyletic with medium/high posterior probabilities. Comparing trees in order from the least to the most appropriate model, it is possible to observe an increase in resolution in the basal nodes with the appearance of relationships that are well supported from other sources of evidence, such as the basal splits of Microcoryphia, Zygentoma and Odonata among Hexapoda. Phylogenetic relationships among Collembola are less controversial, as all analyses recover one and the same topology that is in turn in agreement with our understanding of the group and a recent analysis presented in Carapelli et al. (2013). Resolution at nodes is high under ML (most 95–100) and medium under BI (most 70–90). Limited to what can be evaluated a posteriori based on the trees produced, the new MtPAN3 performs well when applied at the order/family level (e.g. Collembola) while is not totally satisfactory when applied at the level of classes to orders (e.g. Pancrustacea). This should nevertheless not come as a surprise, as many authors have tried to resolve phylogenetic relationships among Pancrustacea based on complete mitochondrial genome sequences with mixed results that range from finding high support for some untenable nodes to limited support overall (reviewed in Simon and Hadrys, 2013). Nevertheless, if model performance is evaluated in terms of the fit to empirical data (AICc, BF) the matrix set MtPAN3 proposed here clearly outperforms all its competitors, indicating that it is capable of better describing the overall evolutionary process and encouraging its further application.

370

3.4. Applicability

371

The procedure applied by Dunn et al. (2013) to mammals and fish and here to Pancrustacea, as well as the three empirical three-matrix sets now available, may allow for a wider use of multi-matrix models of amino acid substitution that have, in turn, been shown to substantially improve the fit to empirical data over single matrix models. One limiting factor to the use of multiple matrix sets is the necessity to define alignment partitions a priori. One possibility, suggested in Dunn et al. (2013), is to align the new user dataset to the alignment used for the original matrix estimation and directly transfer partition information. While this option is clearly sensible, although not technically straightforward, the need to match the two alignments may lead to the exclusion of characters that cannot be cross-aligned with confidence. The possibility, using the script provided, to easily apply this partitioning strategy to any new user dataset suggests two additional strategies. If the user dataset is sufficiently rich, specific matrices may be estimated directly from the data after partitioning (as done here). Alternatively, considering that the resulting partitions are analogous from taxa as different as mammals to fish and Pancrustacea, suggesting that this may be a shared feature of mitochondrial genomes more than the idiosyncratic behavior of a given taxonomic group, the dataset may be partitioned using the script provided and the applicability of each one of the three available matrix sets may be evaluated – alongside others – using standard methods (AIC, BF). A further option may arise from the foreseeable implementation of

333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368

372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395

5

such matrix sets in a software that can handle mixed models (e.g. Le et al., 2008). Software that may be used for the actual phylogenetic analysis – i.e. capable of both partitioning and the use of amino acid custom matrices – include MrBayes, RaxML in standalone version (although not well documented in the manual) and, thanks to the collaboration of Mark Miller, now also through the main interface of the freely accessible cluster CIPRES (www.phylo.org).

396

Acknowledgments

404

We wish to thank K. Dunn (Dalhausie U.) for sharing her original script and contributing the function for the reference distribution in Gap_script. We also with to thank E. Trentin (U. Siena) for stimulating discussions on clustering and J. Pons (IMEDEA) for suggestions on the stepping stone procedure. Special thanks to Susana Bueno Minguez (CINECA) for use of R on EURORA and Mark Miller (CIPRES) for modifying the RAxML interface to allow the use of multiple user matrices. This work was supported by PRIN grant Prot.2009NWXMXX to F.F. and computational resources were provided by ISCRA grant MtCBias to F.N. The funding institutions had no involvement in the research or preparation of the manuscript.

405

Appendix A. Supplementary material

417

Supplementary data associated with this article can be found, Q4 in the online version, at http://dx.doi.org/10.1016/j.ympev. 2014.02.001.

418

References

421

Abascal, F., Posada, D., Zardoya, R., 2007. MtArt: a new model of amino acid replacement for Arthropoda. Mol. Biol. Evol. 24, 1–5. Adachi, J., Hasegawa, M., 1996. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr 28, 1–150. Burnham, K.P., Anderson, D.R., 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York. Carapelli, A., Liò, P., Nardi, F., van der Wath, E., Frati, F., 2007. Phylogenetic analysis of mitochondrial protein coding genes confirms the reciprocal paraphyly of Hexapoda and Crustacea. BMC Evol. Biol. 7 (Suppl. 2), S8. http://dx.doi.org/ 10.1186/1471-2148-7-S2-S8. Carapelli, A., Convey, P., Nardi, F., Frati, F., 2013. The mitochondrial genome of the Antarctic springtail Folsomotoma octooculata (Hexapoda; Collembola), and an update on the phylogeny of collembolan lineages based on mitogenomic data. Entomologia, in press. Dang, C.C., Lefort, V., Le, V.S., Le, Q.S., Gascuel, O., 2011. ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement matrices. Bioinformatics 27, 2758–2760. Dayhoff, M.O., Eck, R.V., Park, C.M., 1972. A model of evolutionary change in proteins. In: Dayhoff, M.O. (Ed.), Atlas of Protein Sequence and Structure, vol. 5. National Biomedical Research Foundation, Washington, DC, pp. 89–99. Dunn, K.A., Jiang, W., Field, C., Bielawski, J.P., 2013. Improving evolutionary models for mitochondrial protein data with site-class specific amino acid exchangeability matrices. PLoS ONE 8 (1), e55816. http://dx.doi.org/10.1371/ journal.pone.0055816. Goldman, N., Thorne, J., Jones, D., 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149, 445–458. Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Am. Stat. Assoc. 90, 773–795. Kidera, A., Konishi, Y., Oka, M., Ooi, T., Scheraga, H.A., 1985. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem. 4, 23–55. Lartillot, N., Philippe, H., 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109. Le, S.Q., Gascuel, O., 2008. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320. Le, S.Q., Lartillot, N., Gascuel, O., 2008. Phylogenetic mixture models for proteins. Phyl. Trans. R. Soc. B 363, 3965–3976. Liò, P., Goldman, N., 1998. Models of molecular evolution and phylogeny. Genome Res. 8, 1233–1244. Liò, P., Goldman, N., 1999. Using protein structural information in evolutionary inference: transmembrane proteins. Mol. Biol. Evol. 16, 1696–1710. Miller, M.A., Pfeiffer, W., Schwartz, T., 2010. Creating the CIPRES science gateway for inference of large phylogenetic trees. In: Proceedings of the Gateway

422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464

Please cite this article in press as: Nardi, F., et al. MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

397 398 399 400 401 402 403

406 407 408 409 410 411 412 413 414 415 416

Q5

419 420

YMPEV 4808

No. of Pages 6, Model 5G

13 February 2014 6 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487

F. Nardi et al. / Molecular Phylogenetics and Evolution xxx (2014) xxx–xxx

Computing Environments Workshop (GCE), 14 November, 2010. New Oreleans, LA, pp. 1–8. Newton, M., Raftery, A., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. J. R. Stat. Soc. B Stat. Methodol. 56, 3–48. Rambaut, A., Drummond, A.J., 2007. Tracer v1.4, Available from . Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D.L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M.A., Huelsenbeck, J.P., 2012. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539–542. Rota-Stabelli, O., Yang, Z., Telford, M.J., 2009. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet. Evol. 52, 268–272. Simon, S., Hadrys, H., 2013. A comparative analysis of complete mitochondrial genomes among Hexapoda. Mol. Phyl. Evol. 69, 393–403. Stamatakis, A., 2006a. Phylogenetic models of rate heterogeneity: a high performance computing perspective. In: Proceedings of IPDPS, 2006, Rhodos, Greece. Stamatakis, A., 2006b. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688– 2690. Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B 63 Part 2, pp. 411–423.

van der Wath, R.C., van der Wath, E., Carapelli, A., Nardi, F., Milanesi, L., Liò, P., 2008. Bayesian phylogeny on grid. Pp. 404–416. In: Elloumi, M., Kung, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C., Bioinfromatics Research and Development (BIRD) in Communications in Computer and Information Science (CCSIS). Vol. 13, Springer, Berlin, Heidelberg. Whelan, S., Goldman, N., 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699. Xie, W., Lewis, P.O., Fan, Y., Kuo, L., Chen, M.-H., 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. 60, 150–160. Yang, Z., 1996. Among-site rate variation and its impact on phylogenetic analyses. TREE 11, 367–372. Yang, Z., Nielsen, R., Hasegawa, M., 1998. Models of amino acid substitution and application to mitochondrial protein evolution. Mol. Biol. Evol. 15, 1600–1611.

Please cite this article in press as: Nardi, F., et al. MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola. Mol. Phylogenet. Evol. (2014), http://dx.doi.org/10.1016/j.ympev.2014.02.001

488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503

MtPAN(3): site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola.

Phylogenetic analyses of Pancrustacea have generally relied on empirical models of amino acid substitution estimated from large reference datasets and...
727KB Sizes 1 Downloads 2 Views