Journal of Theoretical Biology 380 (2015) 16–23

Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Development of quantitative interspecies toxicity relationship modeling of chemicals to fish M.H. Fatemi n, E. Mousa Shahroudi, Z. Amini Laboratory of Chemometrics, Department of Chemistry, University of Mazandaran, Babolsar, Iran

H I G H L I G H T S

G R A P H I C A L

 Quantitative interspecies-toxicity relationship methodologies were used to improve the prediction power of interspecies toxicity model.  By analyzing descriptors in the model, it was concluded that geometrical, topological and electronic interactions are important in the true estimation of the toxicity of chemicals.  Developed QSTR model successfully was used to correlate fish toxicity of chemical to their molecular structural descriptors as well as their toxicity to Daphnia magna.

Quantitative interspecies-toxicity relationship methodologies were used to improve the prediction power of interspecies toxicity model.

art ic l e i nf o

a b s t r a c t

Article history: Received 10 January 2015 Received in revised form 5 May 2015 Accepted 9 May 2015 Available online 19 May 2015

In this work, quantitative interspecies-toxicity relationship methodologies were used to improve the prediction power of interspecies toxicity model. The most relevant descriptors selected by stepwise multiple linear regressions and toxicity of chemical to Daphnia magna were used to predict the toxicities of chemicals to fish. Modeling methods that were used for developing linear and nonlinear models were multiple linear regression (MLR), random forest (RF), artificial neural network (ANN) and support vector machine (SVM). The obtained results indicate the superiority of SVM model over other models. Robustness and reliability of the constructed SVM model were evaluated by using the leave-one-out cross-validation method (Q2 ¼0.69, SPRESS ¼0.822) and Y-randomization test (R2 ¼0.268 for 30 trail). Furthermore, the chemical applicability domains of these models were determined via leverage approach. The developed SVM model was used for the prediction of toxicity of 46 compounds that their experimental toxicities to a fish were not being reported earlier from their toxicities to D. magna and relevant molecular descriptors. & 2015 Elsevier Ltd. All rights reserved.

Keywords: Toxicity Quantitative interspecies-toxicity relationship Daphnia magna Molecular descriptors

A B S T R A C T

1. Introduction Until recently medical substances (pharmaceuticals) have been exposed to the environment with very little attention (Fent et al., 2006; Jones et al., 2002; Sanderson and Thomsen, 2007). The major

n

Corresponding author. Tel.: þ 98 1135302395; fax: þ98 1135302350. E-mail address: [email protected] (M.H. Fatemi).

http://dx.doi.org/10.1016/j.jtbi.2015.05.017 0022-5193/& 2015 Elsevier Ltd. All rights reserved.

routes of entry of pharmaceuticals into the environment are from their use by individuals, either dispersed throughout the community or concentrated in medical centers of hospitals, and the disposal of unwanted or expired drugs by users. Therefore pharmaceuticals can be frequently found in rivers, streams, sewage effluents, surface water, and groundwater and even in drinking water, which is creating a big dilemma in water treatment for drinking purpose (Daughton and Jones-Lepp, 2001; Halling-Sørensen et al., 1998;

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

Heberer et al., 1997; Ternes, 1998). The first report regarding detection of pharmaceuticals in wastewater effluents and surface waters was published in the United States in the 1970s (Garrison et al., 1976; Hignite and Azarnoff, 1977; Tabak and Bunch, 1970). Then clofibric acid and diazepam were detected in treated drinking water in Milan, Italy (Zuccato et al., 2000). Moreover Heberer and colleagues have reported the presence of clofibric acid, propylphenazone, and diclofenac in the drinking water of Berlin in the concentration range of several hundreds of nanograms per liter (Heberer, 2002). Also Frick detected three widely used nonprescription drugs, caffeine, cotinine, and acetaminophenone, in samples of potable water collected near Atlanta, Georgia (Frick and Zaugg, 2003). Although the current concentrations of pharmaceutical chemicals in both surface waters and effluents are low, their possible adverse effects on aquatic life and ultimately on human health are not well understood (Cunningham et al., 2006). Determination of the toxic effects of the pharmaceuticals on aquatic life can be elicited by either ecotoxicity testing or by applicable models. The experimental studies on pharmaceuticals are in general very limited relative to the studies on industrial chemicals, therefore development of some theoretical models for estimation of the toxicological properties of these compounds is important. Today quantitative structure-toxicity relationships (QSTRs) studies have been used successfully for modeling in the field of toxicology and drug design (Kar and Roy, 2010; Länge and Dietrich, 2002). For example, Tugcu et al. (2012) developed a QSAR model to estimate the acute pharmaceutical toxicity on fish by using multiple descriptors (Tugcu et al., 2012). They concluded that the toxicity of studied chemicals mainly depends on their hydrophobicity and heteroatom-bonded carbon atom. Thomsen and Sanderson (2009) performed comparative analysis of pharmaceuticals versus industrial chemicals acute aquatic toxicity (Sanderson and Thomsen, 2009). The results of this analysis indicate that pharmaceuticals are generally not more hazardous than industrial chemicals relative to their acute aquatic toxicity. In 2001 Di Marzio et al. studied the application of WHIM molecular descriptors in QSAR modeling of the fish toxicity of some aromatic hydrocarbons (Di Marzio et al., 2001). Also Duchowicz et al. (2009) developed quantitative structure-toxicity relationship models for predicting the fish toxicity of a diverse set of 92 benzene derivatives. While there are many reports about the toxicity of chemicals against small biological species such as Daphnia magna, there is little report on bigger animals such as fishes due to ethical limitation in the experiments on bigger animals. Therefore it was very interesting to predict the toxicity of chemicals against animals from their toxicities against smaller biological species. Reviewing the literature indicates that very limited number of quantitative interspecies-toxicity relationship models have been reported on interspecies quantitative correlation of ecotoxicity of pharmaceuticals. The main aim of the present work was to develop a QSTR model for the prediction of toxicity of some organic chemicals on fish from their toxicities to D. magna and molecular structural descriptors.

2. Methodology 2.1. Data set In this study, the data set was taken from Sanderson and Thomsen report (Sanderson and Thomsen, 2009). The toxicity of chemicals was reported as LC50 (mM) in logarithmic scale and was considered as a dependent variable. In this paper the values of LC50 for 77 chemicals on a fish and D. magna were reported that were selected as the main data set. Moreover the value of LC50 on D. magna for 59 chemicals was reported in reference (Kar and Roy, 2010) where their toxicities on fish were not experimentally determined. This set was considered as an external set, and the

17

Fig. 1. PCA score plot for 75 compounds.

values of LC50 of these chemicals on fish were predicted form their toxicities on D. magna by our developed model. Two compounds were eliminated from the main data set due to the inconsistency between their names and CAS numbers. Before the division of compounds into training and test sets, principal component analysis (PCA) score plot was used to check the applicability range of data and detect the outlier compounds (Fig. 1). As can be seen in Fig. 1 six compounds were obviously located separately from other compounds and were eliminated from the data set. These compounds were Acroleine (4), Budesonide (12), Ethyl bromide (29), Famotidine (31), Finasteride (33) and Nitroglycerin (53). Therefore, the data set has 69 members. The names and specification of these compounds and those that were used as external test chemicals are shown in Table 1. The main data set was split into two independent subsets, trainings and test sets, by sample set partitioning based on joint x–y distances (SPXY) algorithm. This algorithm considers both distances in the dependent and independent variable spaces and has been expressed in detail by Galvão et al. (2005). The SPXY method reforms the Kennard–Stone algorithm by encircling both x and y differences in the computation of inter-sample distance. In this method, the selection of training set samples is performed by choosing two objects that are most distant from each other as starting points. Subsequently, at each step, the algorithm selects the samples that show the largest minimum distance with respect to any sample already selected. The algorithm adds the newly selected samples to the set of training samples until the number of samples predefined by the analyst is achieved. In this way 80% of chemicals (55) were selected as the training set and the rest (14) as the test set. The training set participated in the generation of the model, adjusting its parameters and independent set of samples in test set was used to evaluate the performance of the model. 2.2. Descriptor calculation In order to develop a QSAR model it is necessary to calculate the molecular structural features of interested chemicals. In this way the structures of compounds in data set were imported from the Pubchem website (Pubchem, 2013) and their validations were checked by their individual CAS numbers from the Lookchem website (Lookchem, 2013). Then, the structures of all compounds were sketched using Hyperchem (ver. 7.0) package and were geometrically optimized by employing the semi-empirical AM1 method by MOPAC (ver. 6.0) package. The molecular geometries corresponding to the lowest energy structure were selected for calculations of the molecular descriptors. The Hyperchem and MOPAC output files were used by Dragon (ver. 3.0) and Codessa (ver. 2.7.2) packages for obtaining the large number of descriptors (Katritzky et al., 1995). The prescreening of these descriptors were performed by eliminating: (i) those that are not available for all compounds; (ii) descriptors having a small variation in magnitude for all structures; and (iii) descriptors whose values are equal to zero for

18

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

Table 1 Data set and corresponding experimental and predicted LC50. No.

Chemical

LC50

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 40 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75

1-Methoxy-2-propanol 2,20 -Dithiobisbenzothiazole Acetylsalicylic acid Acroleine Acyclovir Amitriptyline Aniline Anisol Azosemide Benzylamine Betariboacetate Budesonide Caffeine Capecitabine Carbacystine Carvedilol Chloramphenicol Clofibrate Clofibric acid Clonazepam Cloprednol Diazepam Diethanolamine Diphenylpropanediol Dithiodianiline Epichlorohydrine Ethinylestradiol Ethoxyquin Ethyl bromide Famciclovir Famotidine Fenpropidin Finasteride Fluoxetine Glutaraldehyde Hexamethyldisilazane Ibuprofen Iopromide Isoniazid Lansoprazole Lomefloxacin Malonic acid diethylester Metaxalone Metipranolol Midazolam (base) N,N0 -dimethylaniline N,N-diethylaniline Naproxen Nicotine Nisolfipine Nitrofurantion Nitrofurazone Nitroglycerin Paracetamol Paroxetine Phenobarbital Piperidine Pivaloyl chloride Propionaldehyde Propranolol Propylthiouracil Pyrimethamine Risperidone Salicylic acid Tetracycline Tetralin Theophylline Tolcapone milled Trenbolone acetate Triamterene Triclosan Trimethylquinone Verapamil Warfarin Xylene

0.593 3.619 3.311 – 3.384 5.551 5.492 3.432 3.743 3.331 3.424 – 3.028 2.626 5.476 5.123 3.025 4.137 3.474 4.022 3.961 4.821 2.767 3.448 6.532 3.586 4.716 5.036 – 2.593 – 5.705 – 5.783 3.788 3.388 4.355 2.449 3.775 4.225 3.432 2.889 3.47 3.917 6.212 4.385 3.325 3.794 4.733 4.071 3.508 3.839 – 4.216 5.754 2.201 3.93 2.576 2.815 5.511 4.19 4.715 4.835 3.068 6.43 4.741 2.572 4.14 5.541 4.404 6.384 5.001 4.813 2.955 4.175

(D. magna)

LC50

(Fish)

1.292 3.702 3.08 – 3.375 5.551 3.944 2.955 3.63 3.688 3.424 – 3.349 2.616 5.731 5.609 3.025 4.499 3.382 2.499 4.103 4.351 2.119 3.846 6.671 3.482 5.268 4.082 – 2.513 – 5.022 – 5.645 3.951 2.985 3.076 2.915 3.75 4.312 3.315 3.341 3.914 4.491 4.879 3.565 4.386 2.614 4.608 5.112 4.377 4.297 – 2.602 5.217 2.681 3.268 2.604 3.618 5.142 3.194 4.625 4.835 3.572 3.305 3.799 2.627 5.26 5.319 4.29 6.047 5.177 4.865 4.41 4.123

Predicted LC50 1.43 3.57 2.94 – 3.51 5.42 4.08 3.09 3.76 3.55 3.65 – 3.48 2.47 5.59 5.78 3.16 4.36 3.4 3.08 4.24 4.6 2.25 4.13 6.54 3.62 5.14 4.13 – 2.65 – 5.16 5.52 3.82 3.12 3.13 3.04 3.62 4.18 3.49 3.48 3.58 4.36 4.74 4.13 4.03 2.76 4.47 4.98 4.24 4.16 – 2.94 5.26 2.54 3.13 2.74 3.48 5.46 3.54 4.34 4.97 3.54 3.44 3.93 2.76 5.12 5.23 4.15 6.19 4.18 4.73 4.27 3.37

(Fish)

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

19

Table 1 (continued ) No.

Chemical

LC50

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

External set 2-Methyl-5-ethylpyridine Acebutolol Albendazole Atenolol Benzimidazolepropylamine Carazolol Carbamazepine Chlortetracycline Cilazapril Cimetidine Ciprofloxacin Citalopram Ethyl-4-methyl-5-oxazole carboxylate Etidronic acid Fenofibrate Flumequine Fluorouracil Flutamide Griseofulvin (microsize) Ibandronate Isopropyldioxepen Ketoprofen Lincomycin Loracarbef Maleic acid Metoprolol m-Phenylenediamine Nadolol Ofloxacin Omeprazole Oxymetholone Prednisone Pyrogallolaldehyde Ranitidine Retapamolin Ropinirole Shikimic acid Sotalol Stavudine Sulfadiazine Sulfamethoxazole Sumatriptan Testosterone Thioguanine Triphenylphosphine Zalcitabine

– 3.481 3.819 7.044 3.902 3.562 4.305 4.234 3.573 2.626 2.533 3.742 4.92 2.668 2.592 3.858 3.433 3.716 5.301 2.506 5.358 2.272 3.599 5.777 2.56 2.764 4.483 4.263 3.377 4.317 3.594 4.535 3.822 3.669 2.685 4.112 2.953 3.177 2.958 2.359 3.349 4.404 3.008 5.17 4.004 4.72 2.072

more than 80% compounds, to avoid matrix calculation error. Some descriptors generated for each compound encoded similar information about the molecule of interest; therefore among the collinear descriptors it was desirable to test each pair of descriptor, which shows high correlation with each other (R40.90) and eliminate one of them that has lower correlation with toxicity. Then the stepwise multiple linear regression variable subset selection method was used for the selection of the most relevant descriptors from the remaining ones. The selected molecular descriptors were considered as inputs for development of the QSAR models. In order to improve the prediction power of interspecies toxicity model some relevance molecular descriptors were added to the model. 2.3. Support vector machine Support vector machine is a relatively new nonlinear technique in the field of chemometrics and has been introduced into the chemical community to perform nonlinear classification, multivariate function estimation or nonlinear regression (Vapnik, 1995). The original algorithm of SVM was invented by Vladimir Vapnik, and the current standard incarnation (soft margin) was proposed by Cortes and Vapnik (1995). In support vector regression (SVR), the basic idea is to map the data x into a higher-dimensional feature space F via a nonlinear

(D. magna)

LC50

(Fish)

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Predicted LC50

(Fish)

– 3.147 4.561 6.607 3.999 3.819 6.001 2.622 2.863 4.091 2.761 3.626 4.778 3.085 5.247 4.491 3.456 3.283 3.700 1.956 4.278 2.654 4.103 4.217 2.299 4.249 4.458 3.235 3.426 3.906 3.984 4.486 3.141 2.678 2.445 4.463 4.706 3.271 2.873 2.139 2.487 3.625 3.872 5.337 4.011 4.359 3.304

mapping Φ and then to do linear regression in this space. Therefore, regression approximation addresses the problem l of estimating a function based on a given data set G¼ ðxi ; di Þ i ¼ 1 (xi is the input vector, di is the desired value, and l is the total number of data patterns). SVM approximates the function in the following form: y¼

l X

wi Φðxi Þ þb

ð1Þ

i¼1

 l where Φðxi Þ i ¼ 1 are the features of inputs, and fwi gli ¼ 1 and b are the coefficients. They are estimated by minimizing the regularized risk function R(C); N   1 1X Lε di ; yi þ ‖w‖2 RðC Þ ¼ C Ni¼1 2

where 8 ε for > < j d yj   Le di ; yi ¼ > : 0

ð2Þ

j d yj Z ε ð3Þ otherwise

and ε is a prescribed parameter.   P In Eq. (2), C (1/N) N i ¼ 1 Lε di ; yi is the so-called empirical  error  (risk), which is measured by ε-insensitive loss function, Lε di ; yi ,

20

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

and indicates that it does not penalize errors below ε. The parameter of ε is called the tube size, and it is equivalent to the approximation accuracy placed on the training data points. The second term, ð1=2Þ‖w‖2 , is used as a measurement of function flatness. The parameter of C is a regularized constant parameter that determines the trade-off between the training error and the model flatness. The parameter, of C and ε and also kernel parameter need to be properly selected by the user, because the generalization performance of the SVR model depends on right setting of these parameters. The optimal value for ε depends on the type of noise present in the data, which is usually unknown. The value of ε can effect the number of support vectors used to construct the regression function. The bigger the ε, the fewer support vectors are selected. The parameter of C controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, then insufficient stress will be placed on fitting the training data. If C is too large, then the algorithm will over fit to the training data. A good choice of C and ε also prevents overtraining. The overall performances of SVM models were evaluated in terms of root mean square error (RMSE), which is defined as below: sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 Pn s  i ¼ 1 yi  yo RMSE ¼ ns

ð4Þ

where yi is the desired output, yo is the predicted value and ns is the number of samples in the analyzed set. The predictive power of the SVM models developed on the selected training set is estimated on the predictions of test set chemicals and also by cross-validation test by the calculating of SPRESS and Q2.SPRESS is the standardized predicted error sum of squares, which is a standard index to measure the accuracy of a modeling method based on the crossvalidation technique. The relative predictive ability of a model as measured by cross validation can be represented by the Q2 statistic. The value of Q2 reveals the fraction of response variation in the calibration set that is explained during cross-validation. These values are calculated as follows: sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 P yo  yi SPRESS ¼ nk1 P Q2 ¼ 1

yo  yi P

2

 2 yi  ym

ð5Þ

ð6Þ

in the above expressions ym is the mean of dependent variable, n is the number of observations and k is the number of independent variables in regression equation. All calculations in this work were carried out by using STATISTICA software (Ver. 7.1).

where Xij denotes the value of descriptor j of compound i. The collective database X¼ fX i gni¼ 1 is represented by the n  m matrix X: 2 3 … x1m x11 x12 6 7 … x2m 7 6 x21 x22 6 7 T ð8Þ X ¼ ðX1 ; X2 ; :::; Xn Þ ¼ 6 … 7 ⋱ ⋮ 6 ⋮ 7 4 5 … xnm xn1 xn2 Here the superscript T denotes the vector/matrix transpose. A distance score for two different compounds Xi and Xk can be measured by the Euclidean distance norm dik based on the compound descriptors: vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi uX u m  2 ð9Þ xij xkj dik ¼ j j X i  X k j j t j¼1

The mean distances di of one sample to the remaining ones were computed as follows: Pn d ð10Þ di ¼ k ¼ 1 ik k ¼ 1; 2; …; n n1 Then the mean distances were normalized within the interval (0, 1) and plotted against the values of dependent variable as shown in Fig. 2. Inspection to this figure indicates a good distribution of test set among the whole of the data set. Moreover the broad representation of the chemistry space by training set is adequate to ensure models stability and the diversity of test sets can prove the predictive ability of the model. 3.2. Modeling The main aim of this work is to make an attempt to develop useful QSAR models to predict the toxicities of some chemical compounds to a fish. The stepwise-multiple linear regression technique was used to select relevant descriptors from one's calculated by the Dragon and Codessa software after elimination of redundant descriptors. Names and categories of seven selected descriptors were displayed in Table 2. Some explanations about the meaning and calculation procedures of these descriptors are summarized in the Handbook of Molecular Descriptors by Todeschini and Consonni (2008). The correlation matrix among these descriptors was shown in Table 3. As can be seen in this table, there is no high correlation (R40.7) between these descriptors. The selected descriptors together with the toxicity of selected chemicals to D. magna were used to develop some linear and nonlinear models by using MLR, ANN, RF and SVM techniques. The statistical parameters of these models are shown in Table 4. As can be seen in this table the SVM model is better than the other models, therefore the details of this model were explained in the following. In developed support vector machine model, radial

3. Result and discussion 3.1. Diversity analysis The diversity problem involves defining a diverse subset of “representative” compounds so that researchers can scan only a subset of the huge database each time (Maldonado et al., 2006). The diversity analysis was performed for the data set to make sure that the structures of the training or test cases can represent those of the whole ones. We consider a database of n compounds  mgenerated from m highly correlated chemical descriptors X j j ¼ 1 . Each compound is represented as vector Xi:   X i ¼ X i1 ; X i2 ; X i3 ; :::; X ij ; :::; X im for i ¼ 1; 2; :::; n

ð7Þ

Fig. 2. The results of diversity test.

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

basis function (RBF) was used as Kernel function. Then the values of capacity (C), epsilon (ε), and the Kernel parameter (γ) were optimized by applying the grid search method. This procedure was explained in our previous work (Fatemi et al., 2008). The optimal values of these parameters were C ¼11, ε ¼0.05, and γ ¼ 1.8. Then developed support vector machine model was used to calculate the values of toxicity for chemicals in the test set as well as in the training set (Table 1). Fig. 3 shows the plot of support vector machine predicted against the experimental values of toxicity, which reveals good correlations between them (R2train ¼ 0.951, R2test ¼ 0.933). Also, as can be seen in Fig. 4 the random distribution of the residuals in both sides of zero line indicates that there is no systematic error in the developed support vector machine model. 3.3. Models validation Evaluation of the obtained model is a very important step in QSAR studies. Cross-validation (CV) test is one of the most extensively employed methods for the internal validation of a statistical model (Stone, 1974). In cross-validation procedure, the predictive ability of a model is estimated by using a reduced set of structural data. Usually, one element in the data set is extracted each time, and a new model is derived based on the reduced dataset, then this model is employed to predict the toxicity of the excluded molecule. The procedure is repeated n times until all compounds have been excluded and predicted once. This is the socalled leave-one-out (LOO) cross validation method (Cramer et al., 1988). The predictive power of developed model are evaluated by calculating the cross validated correlation coefficient (Q2) and standardized predicted residual error sum of squares (SPRESS) values. The calculated values of these parameters for leave one out cross validation test were Q2 ¼0.69 and SPRESS ¼0.822, respectively, which indicated the reliability and robustness of the developed support vector machine model. Next test was Yrandomization test. This test was performed to evaluate any chance correlation among the data matrix. In order to perform this test, dependent variable (y) is randomly scrambled and a new model is developed by using the original independent variables matrix. For our data set the average value of R2 after 30 times Y-

21

scrambling was 0.268, which revealed that the proposed model is well founded and not just the result of chance correlations. 3.4. Applicability domain analysis It needs to be emphasized that, no matter how robust, significant, and validated a QSAR model may be, it cannot be expected to reliably predict the modeled activity for the entire universe of chemicals. Therefore, before a QSAR model is put into use for screening chemicals, its domain of application (AD) must be defined (Tropsha et al., 2003). A simple measure of a chemical being too far from the applicability domain of the model is its leverage hi, which is defined as  1 Hi ¼ xTi XT X xi ði ¼ 1; :::; nÞ ð11Þ where xi is the descriptor row-vector of the query compound and X is the n  m matrix of m model descriptor values for n training set compounds. The warning leverage hn is, generally, fixed at 3m/n. To visualize the applicability domain of models, the standardized residuals versus leverage (Hat diagonal) values were plotted (William plot) for an immediate and simple graphical detection of both the response outliers (i.e., compounds with standardized residuals greater than three standard deviation units, 43σ) and structurally influential chemicals in the model (hi 4hn). Fig. 5 shows the results for the AD analysis of the QSAR models. As can be seen in Fig. 5 two chemicals (number of 30 and 32) are influential on the structural domain of the models. The anomalous behavior of these compounds could be due to one of the following reasons: (1) incorrect experimental input data, (2) the descriptors selected do not capture some Table 4 Statistical parameters of MLR, ANN, RF and SVM models. Model

R2train

SEtrain

R2test

SEtest

MLR ANN RF SVM

0.730 0.869 0.661 0.960

0.622 0.374 0.340 0.210

0.618 0.865 0.567 0.934

0.428 0.270 0.364 0.221

Table 2 Specification of multiple linear regression models. Chemical meaning

Notauion

Categories

Coefficient

SE

Moriguchioctanol-water partition coefficient Radial distribution function – 8/weighted by atomic masses Radial distribution function – 3 / weighted by atomic masses Moran autocorrelation – lag 1/weighted by atomic van der Waals volumes Lowest eigenvalue n. 1 of Burden matrix/weight by atomic masses R autocorrelation of lag 1/unweighted average valence connectivity index chi-1 LC50 Daphnia magna constant

MLOGP RDF080m RDF030m MATS1v BELm1 R1u X1Av DMT –

Molecular properties Conformational (3D) Descriptors Conformational (3D) Descriptors 2D Auto Correlation Indices Topological (2D) descriptors GETAWAY descriptors Connectivity indices – –

0.268 0.095  0.020 2.347  4.083  2.153  1.574 0.450 10.165

0.081 0.041 0.012 1.041 1.162 2.776 0.772 0.109 2.611

Table 3 Correlation matrix among descriptors.

MLOGP RDF080m RDF030m MATS1v BELm1 R1u X1Av DMT

MLOGP

RDF080m

RDF030m

MATS1v

BELm1

R1u

X1Av

DMT

1 0.279  0.017 0.230 0.675  0.415 0.039 0.504

1 0.470 0.132 0.289  0.506  0.131 0.354

1 0.162 0.040  0.345 0.098  0.053

1 0.459  0.112 0.095 0.457

1  0.348  0.250 0.429

1  0.115  0.411

1  0.089

1

22

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

Fig. 3. Plot of predicted against the experimental for all molecules. Fig. 5. Applicability domain plot for SVM model (h*¼ 0.436).

Fig. 4. Plot of residual against experimental values. Fig. 6. The result of sensitivity analysis for the SVM model.

relevant structural features present in the molecules and are absent in the others, (3) their biological mechanism of flux is different from the remaining chemicals. The developed model can be used to predict the toxicity of a new compound if they locate in applicability domain of the model. The leverage values for 46 molecules were less than the warning value of hn ¼0.436; therefore they located in the applicability domain of the model. 3.5. Descriptors interpretation Sensitivity analysis approach is used to rank the general importance of the descriptors that appear in the model. According to this method, the differences between the root mean square error (RMSE) of the complete network predictions and the RMSE obtained when the ith variable is excluded from the trained network (RMSEi) were calculated (RMDIFi) according to the following equation: RMDIFi ¼ RMSEi –RMSE

ð12Þ

Generally, the values of RMDIFi are greater than zero. The variable with the greater importance is the one that leads to a greater value of RMDIFi. The numeric values of average RMDIFi for SVM model are plotted in Fig. 6. According to Fig. 6 the order of importance of descriptors was RDF080m4RDF030m4MATS1v4MLOGP4BELm14R1u4X1Av¼DMT. The first two important descriptors (RDF080m and RDF030m) belonged to radial distribution function descriptors that are based on the distance distribution in the geometrical representation of molecules (Hemmer et al., 1999). MATS1v descriptor is a 2D autocorrelation descriptor. These types of descriptors can describe that how a considered property is distributed along a topological molecular structure (Tropsha et al., 2003). These descriptors address the topology of the molecular structure in association with a selected physicochemical property. The MLOGP represents hydrophobicity of a molecule which can encode partitioning and hydrogen bonding ability of the molecules (Hirono et al., 1994). BELm1 is a Burden-CASUniversity of Texas (BCUT) eigenvalue descriptor (Burden, 1989). The

BCUT descriptors are the eigenvalues of a modified connectivity matrix known as the Burden matrix. These types of descriptors are based on a significant extension of the Burden approach, considering three classes of matrices whose diagonal elements correspond to the following: 1) atomic charge-related values; 2) atomic polarizabilityrelated values; and 3) atomic H-bond abilities. These descriptors are derived from the positive and negative eigenvalues of the adjacency matrix of the target molecule, weighting the diagonal elements with atom weights and the first eight highest and the lowest absolute eigenvalues, respectively. The R1u is a GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) type descriptors (Consonni et al., 2002). These types of descriptors encode geometrical information given from influence matrix, topological information given by molecular graph, and chemical information from selected atomic properties and call encode information about the 3D structures of the molecules and represent some information about the structural and steric requirements of compounds. The last descriptor (X1Av) is a Kier–Hall connectivity indices (Kier and Hall, 1986). This topological descriptor represents the size of the hydrophobic segment and contains the group contributions for all non-hydrogen atoms in the molecules. Distribution of selected descriptors for SVM model in this study is shown in the radar chart (Fig. 7). A radar chart is a graphical plot that displays multivariate data in the form of a two-dimensional chart with several quantitative variables represented on axis starting from the same point. Typically, radar charts are generated in a multiplot format with many stars on each page and each star representing one observation (Natrella, 2010). The purpose of a radar chart is to compare m options across n parameters so that audience can be convinced that option A is better than option B. Radar chart is often used when neighboring variables are unrelated, creating spurious connection; so in this work instead of using a column chart for showing independent variables, we illustrated them graphically because column chart might look cluttered. Each of independent variables is in individual

M.H. Fatemi et al. / Journal of Theoretical Biology 380 (2015) 16–23

Fig. 7. Radar plot of the distribution of selected descriptors used in SVM model.

axes and each line is drawn connecting the data values for each compound. 4. Conclusion Quantitative interspecies-toxicity relationship methodologies were used to improve the prediction power of toxicity model. Both linear and nonlinear techniques including multiple linear regression, random forest, neural network and support vector machine methods were used to develop some QSTR models for predicting the toxicity of some chemical compounds. The good agreement between experimental results and the predicted values of toxicity confirms that among developed models, support vector machine model can be successfully used to correlate fish toxicity of chemical compounds to their molecular structural descriptors as well as their toxicity to D. magna. Also by analyzing descriptors that were used in developing the support vector machine model, it was concluded that geometrical, topological and electronic interactions are important in the true estimation of the toxicity of chemical. References Burden, F.R., 1989. Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 29, 225–227. Consonni, V., Todeschini, R., Pavan, M., 2002. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors. J. Chem. Inf. Comput. Sci. 42, 682–692. Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. Cramer, R.D., Bunce, J.D., Patterson, D.E., Frank, I.E., 1988. Crossvalidation, bootstrapping, and partial least squares compared with multiple regression in conventional QSAR studies. Quant. Struct. Act. Relationsh. 7, 18–25. Cunningham, V.L., Buzby, M., Hutchinson, T., Mastrocco, F., Parke, N., Roden, N., 2006. Effects of human pharmaceuticals on aquatic life: next steps. Environ. Sci. Technol. 40, 3456–3462. Daughton, C.G., Jones-Lepp, T.L., 2001. Pharmaceuticals and Personal Care Products in the Environment: Scientific and Regulatory Issues. American Chemical Society, Washington, DC. Di Marzio, W., Galassi, S., Todeschini, R., Consolaro, F., 2001. Traditional versus WHIM molecular descriptors in QSAR approaches applied to fish toxicity studies. Chemosphere 44, 401–406.

23

Duchowicz, P., Marrugo, J., Ortiz, E., Castro, E., Vivas-Reyes, R., 2009. QSAR Study for the fish toxicity of benzene derivatives. J. Argent. Chem. Soc. 97, 116–127. Fatemi, M.H., Gharaghani, S., Mohammadkhani, S., Rezaie, Z., 2008. Prediction of selectivity coefficients of univalent anions for anion-selective electrode using support vector machine. Electrochim. Acta 53, 4276–4282. Fent, K., Weston, A.A., Caminada, D., 2006. Ecotoxicology of human pharmaceuticals. Aquat. Toxicol. 76, 122–159. Frick, E.A., Zaugg, S.D., 2003. Organic wastewater contaminants in the upper Chattahoochee River basin, Georgia, 1999–2002. Atlanta 9, 8. Galvão, R.K.H., Araujo, M.C.U., José, G.E., Pontes, M.J.C., Silva, E.C., Saldanha, T.C.B., 2005. A method for calibration and validation subset partitioning. Talanta 67, 736–740. Garrison, A., Pope, J., Allen, F., 1976. GC/MS analysis of organic compounds in domestic wastewaters. Identification and Analysis of Organic Pollutants in Water, pp. 517–556. Halling-Sørensen, B., Nors Nielsen, S., Lanzky, P., Ingerslev, F., Holten Lützhøft, H., Jørgensen, S., 1998. Occurrence, fate and effects of pharmaceutical substances in the environment – a review. Chemosphere 36, 357–393. Heberer, T., 2002. Tracking persistent pharmaceutical residues from municipal sewage to drinking water. J. Hydrol. 266, 175–189. Heberer, T., Dünnbier, U., Reilich, C., Stan, H.-J., 1997. Detection of drugs and drug metabolites in ground water samples of a drinking water treatment plant. Fresenius Environ. Bull. 6, 438–443. Hemmer, M.C., Steinhauer, V., Gasteiger, J., 1999. Deriving the 3D structure of organic molecules from their infrared spectra. Vib. Spectrosc. 19, 151–164. Hignite, C., Azarnoff, D.L., 1977. Drugs and drug metabolites as environmental contaminants: chlorophenoxyisobutyrate and salicylic acid in sewage water effluent. Life Sci. 20, 337–341. Hirono, S., Nakagome, I., Hirano, H., Yoshii, F., Moriguchi, I., 1994. Non-congeneric structure-pharmacokinetic property correlation studies using fuzzy adaptive least-squares: volume of distribution. Biol. Pharm. Bull. 17, 686–690. Jones, O., Voulvoulis, N., Lester, J., 2002. Aquatic environmental assessment of the top 25 English prescription pharmaceuticals. Water Res. 36, 5013–5022. Kar, S., Roy, K., 2010. First report on interspecies quantitative correlation of ecotoxicity of pharmaceuticals. Chemosphere 81, 738–747. Katritzky, A., Lobanov, V., Karelson, M., 1995. CODESSA: Training Manual. University of Florida, Gainesville, FL. Kier, L., Hall, L., 1986. Molecular Connectivity in Structure-Activity Analysis. John WileyJohn Wiley, New York. Länge, R., Dietrich, D., 2002. Environmental risk assessment of pharmaceutical drug substances – conceptual considerations. Toxicol. Lett. 131, 97–104. Lookchem, 2013. Lookchem, look for chemical. 〈http://www.lookchem.com〉 (accessed 01.11.13.). Maldonado, A.G., Doucet, J., Petitjean, M., Fan, B.-T., 2006. Molecular similarity and diversity in chemoinformatics: from theory to applications. Mol. Divers. 10, 39–79. Natrella, M., 2010. NIST/SEMATECH e-Handbook of Statistical Methods. Pubchem, 2013. The pubchem project. 〈http://pubchem.ncbi.nlm.nih.gov〉 (accessed 01.11.13.). Sanderson, H., Thomsen, M., 2007. Ecotoxicological quantitative structure–activity relationships for pharmaceuticals. Bull. Environ. Contam. Toxicol. 79, 331–335. Sanderson, H., Thomsen, M., 2009. Comparative analysis of pharmaceuticals versus industrial chemicals acute aquatic toxicity classification according to the United Nations classification system for chemicals. Assessment of the (Q) SAR predictability of pharmaceuticals acute aquatic toxicity and their predominant acute toxic mode-of-action. Toxicol. Lett. 187, 84–93. Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B (Methodological) 36, 111–147. Tabak, H.H., Bunch, R., 1970. Steroid hormones as water pollutants. I. Metabolism of natural and synthetic ovulation-inhibiting hormones by microorganisms of activated sludge and primary settled sewage. Dev. Ind. Microbiol. 11, 367–376. Ternes, T.A., 1998. Occurrence of drugs in German sewage treatment plants and rivers. Water Res. 32, 3245–3260. Todeschini, R., Consonni, V., 2008. Handbook of Molecular Descriptors. John Wiley & Sons. Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22, 69–77. Tugcu, G., Saçan, M.T., Vracko, M., Novic, M., Minovski, N., 2012. QSTR modelling of the acute toxicity of pharmaceuticals to fish. SAR QSAR Environ. Res. 23, 297–310. Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer, New York. Zuccato, E., Calamari, D., Natangelo, M., Fanelli, R., 2000. Presence of therapeutic drugs in the environment. The lancet 355, 1789–1790.

Development of quantitative interspecies toxicity relationship modeling of chemicals to fish.

In this work, quantitative interspecies-toxicity relationship methodologies were used to improve the prediction power of interspecies toxicity model. ...
844KB Sizes 0 Downloads 8 Views